AI Regulation in Latin America: Overview and Emerging Trends in Key Proposals
The widespread adoption of artificial intelligence (AI) continues to impact societies and economies around the world. Policymakers worldwide have begun pushing for normative frameworks to regulate the design, deployment, and use of AI according to their specific ethical and legal standards. In Latin America, some countries have joined these efforts by introducing legislative proposals and establishing other AI governance frameworks, such as national strategies and regulatory guidance.
This blog post provides an overview of AI bills in Latin America through a comparative analysis of proposals from six key jurisdictions: Argentina, Brazil, Mexico, Colombia, Chile, and Peru. Except for Peru, which has already approved the first AI law in the region and is set to adopt secondary regulations, these countries have several legislative proposals at varying levels of maturity, some still in a nascent stage and others more advanced. Some of these countries have had simultaneous AI-related proposals under consideration in recent years; for example, Colombia and Mexico currently have three and two AI bills under review,1 respectively, and both countries have archived at least four AI bills from previous legislative periods.
While it is unclear which bills may ultimately be enacted, this analysis provides an overview of the most relevant bills in the selected jurisdictions and identifies emerging trends and divergences in the region. The analysis is based on at least one active proposal from each country that (i) targets AI regulation in general, rather than providing technology-specific or sector-specific regulation; (ii) has provisions and scope similar to those found in other, more advanced proposals in the region; or (iii) appears to have greater political support or is considered the ‘official’ proposal of the current administration – this is particularly the case for Colombia, where the analysis relies on the proposal introduced by the Executive. Most of these proposals share the objective of regulating AI comprehensively through a risk-tiered approach. However, they differ in key elements, such as the design of institutional frameworks and the specific obligations imposed on “AI operators.”
Overall, AI bills in Latin America:
(i) have a broad scope and application, covering AI systems introduced or producing legal effects in national territory;
(ii) rely on an ethical and principle-based framework, with a heavy focus on the protection of fundamental rights and using AI for economic and societal progress;
(iii) have a strong preference for ex ante, risk-based regulation;
(iv) introduce institutional multistakeholder frameworks for AI governance, either by creating new agencies or assigning responsibility to existing ones; and
(v) have specific provisions for responsible innovation and controlled testing of AI technologies.
1. Principles-Based and Human Rights-Centered Approaches are a Common Theme Across LatAm AI Bills
Most bills under consideration are heavily grounded in a similar set of guiding principles for the development and use of AI, focused on the protection of human dignity and autonomy, transparency and explainability, non-discrimination, safety, robustness, and accountability. Some proposals explicitly refer to the OECD’s AI Principles, focused on transparency, security, and responsibility of AI systems, and to UNESCO’s AI Ethics Recommendation, which emphasizes the need for a human-centered approach, promoting social justice and environmental sustainability in AI systems.
All bills reviewed ground the development of AI in privacy or data protection as a guiding principle to indicate that AI systems must be developed under existing privacy obligations and comply with regulations in terms of data quality, confidentiality, security, and integrity. Notably, the Mexican bill and the Peruvian proposal – the draft implementing regulations for its framework AI law – also include privacy-by-design as a guiding principle for the design and development of AI.
The inclusion of a principle-based approach is flexible and provides room for future regulations and standards, considering the evolution of AI technologies. Based on these guiding principles, most bills authorize secondary regulation by a competent authority to expand on the provisions related to AI user rights and obligations.
In addition, most bills concur on key elements of the definitions of “AI system” and “AI operators.” Brazil’s and Chile’s proposals define an AI system similarly to the European Union’s Artificial Intelligence Act (EU AI Act): a ‘machine-based system’ with varying levels of autonomy that, with implicit or explicit objectives, can generate outputs such as recommendations, decisions, predictions, and content. Both countries’ bills also define AI operators as the “supplier, implementer, authorized representative, importer, and distributor” of an AI system.
Other bills include a more general definition of AI as a ‘software’ or ‘scientific discipline’ that can perform operations similar to human intelligence, such as learning and logical reasoning – an approach reminiscent of the definition of AI in Japan’s new law. Peru’s regulation lacks a definition for AI operators but includes one for AI developers and implementers; Colombia refers to “AI operators” in terms similar to those found in Brazil and Peru, though it also includes users within its definition.
A common feature of the bills covered is their grounding in the protection of fundamental rights, particularly the rights to human dignity and autonomy, protection of personal data, privacy, non-discrimination, and access to information. Some bills go so far as to introduce a new set of AI-related rights to specifically protect users from harmful interactions and impacts created by AI systems.
Brazil’s proposal offers a salient example for this structure, introducing a chapter for the rights of individuals and groups affected by AI systems, regardless of their risk classification. For AI systems in general, Brazil’s proposal includes:
The right to prior information about an interaction with an AI system, in an accessible, free-of-charge, and understandable format;
The right to privacy and the protection of personal data, following the Lei Geral de Proteção de Dados Pessoais (LGPD) and relevant legislation;
The right to human determination and participation in decisions made by AI systems, taking into account the context, level of risk, and state-of-the-art technological development;
The right to non-discrimination and correction of direct, indirect, unlawful, or abusive discriminatory bias.
Concerning “high-risk” systems or systems that produce “relevant legal effects” to individuals and groups, Brazil’s proposal includes:
The right to an explanation of a decision, recommendation, or prediction made by an AI system;
Subject to commercial and industrial secrecy, the required explanation must contain sufficient information on the operating characteristics; the degree and level of contribution of the AI to decision-making; the data processed and its source; the criteria for decision-making, considering the situation of the individual affected; the mechanisms through which the person can challenge the decision; and the level of human supervision.
The right to challenge and review the decision, recommendation, or prediction made by the system;
The right to human intervention or review of decisions, taking into account the context, risk, and state-of-the-art technological development;
Human intervention will not be required if it is demonstrably impossible or involves a disproportionate effort. The AI operator will implement effective alternative measures to ensure the re-examination of a contested decision.
Brazil’s proposal also includes an obligation that AI operators must provide “clear and accessible information” on the procedures to exercise user rights, and establishes that the defense of individual or collective interests may be brought before the competent authority or the courts.
Mexico’s bill also introduces a chapter on “digital rights”. While these are not as detailed as those in the Brazilian proposal, the chapter includes innovative ideas, such as the “right to interact and communicate through AI systems”. The proposed set of rights also incorporates the right to access one’s data processed by AI; the right to be treated equally; and the right to data protection. The inclusion of these rights in the AI bill arguably makes little practical difference, considering most of them are already explicitly recognized at the constitutional and legal level. Furthermore, the Mexican bill introduces a catalog of rights and principles but lacks specific safeguards or mechanisms for their exercise in the context of AI. Nonetheless, their inclusion signals the intention of policymakers to govern and regulate AI primarily through a human-rights-based perspective.
2. Most Countries in LatAm Already Have Comprehensive Data Protection Laws, Which Include AI-relevant Provisions
All countries analyzed have adopted comprehensive data protection laws that apply to any processing of personal data regardless of the technology involved – some for decades, like Argentina, and some more recently, like Brazil and Chile. Except for Colombia, the data protection laws in these countries recognize an individual’s right not to be subject to decisions based solely on automated processing. Argentina, Peru, Mexico, and Chile recognize rights related to automated decision-making, prohibiting such activity without human intervention if it produces unwanted legal effects or significantly impacts individuals’ interests, rights, and freedoms, and is intended for profiling. These laws focus on the potential for profiling through automation: the data protection laws in Peru, Mexico, and Colombia include a specific right prohibiting such activity, while Argentina prohibits profiling by courts or administrative authorities.
In contrast, Brazil’s LGPD recognizes the right to request the review of decisions made solely on automated processing that affect an individual’s interests, including profiling. While the intended purpose may be similar, the right under the Brazilian framework appears to be more limited, where individuals have the right to request review after the profiling occurs, but not necessarily to prevent or oppose this type of processing. Nonetheless, a significant aspect of the right proposed under Brazil’s AI bill is the explicit reference to human intervention in the review, an element absent from the same right under the LGPD.
While AI can enable outcomes other than profiling, it is noteworthy that, whether or not the AI bills under consideration in the region are ultimately adopted, most of the data protection laws in these countries already regulate AI-powered automated decision-making (ADM) and profiling to some degree.
3. Risk-Based Regulation is Gaining Traction
All of the reviewed proposals adopt a risk-based approach to regulating AI, seemingly drawing at least some influence from the EU AI Act. These frameworks generally classify AI systems along a gradient of risk, from minimal to unacceptable, and introduce obligations proportional to the level of risk. While the specific definitions and regulatory mechanisms vary, the proposals articulate similar goals of ensuring safe, ethical, and trustworthy development and use of AI.
Brazil’s proposal is one of the most detailed in this respect, mandating a preliminary risk assessment for all systems before their introduction to the market, deployment, or use. The initial assessment must evaluate the system’s purpose, context, and operational impacts to determine its risk level. Similarly, Argentina’s bill requires a pre-market assessment to identify ‘potential biases, risks of discrimination, transparency, and other relevant factors to ensure compliance’.
Notably, most proposals converge on the definition and classification of AI systems posing “unacceptable” or “excessive” risk and prohibit their development, commercialization, or deployment. Except for Mexico, whose proposal does not contain an explicit ban, the bills expressly prohibit AI systems posing “unacceptable” (Argentina, Chile, Colombia, and Peru) or “excessive” (Brazil) risks. The proposals examined generally characterize systems under this classification as “incompatible with the exercise of fundamental rights” or as posing a “threat to the safety, life, and integrity” of individuals.
For instance, Mexico’s bill defines AI systems with “unacceptable” risk as those that pose a “real, possible, and imminent threat” and involve “cognitive manipulation of behavior” or “classification of individuals based on their behavior and socioeconomic status, or personal characteristics”. Similarly, Colombia’s bill further defines these systems as those “capable of overriding human capacity, designed to control or suppress a person’s physical or mental will, or used to discriminate based on characteristics such as race, gender, orientation, language, political opinion, or disability”.
Brazil’s proposal also prohibits AI systems with “excessive” risk, and sets similar criteria to those found in other proposals in the region and the EU AI Act. In that sense, the proposal refers to AI systems posing “excessive” risk as any with the following purposes:
Manipulating individual or group behavior in a way that causes harm to health, safety, or fundamental rights;
Exploiting vulnerabilities of individuals or groups to influence behavior with harmful consequences;
Profiling individuals’ characteristics or behaviors, including past criminal behavior, to assess the likelihood of committing offenses;
Producing, disseminating, or facilitating material that depicts or promotes sexual exploitation or abuse of minors;
Enabling public authorities to assess or classify individuals through universal scoring systems based on personality or social behavior in a disproportionate or illegitimate manner;
Operating as autonomous weapon systems;
Conducting real-time remote biometric identification in public spaces, unless strictly limited to scenarios of criminal investigation or search of missing persons, among other listed exceptions.
Concerning the classification of “high-risk” systems, some AI bills define them based on certain domains or sectors, while others have a more general or principle-based approach. Generally, high-risk systems are left to be classified by a competent authority, allowing flexibility and discretion from regulators, but subject to specific criteria, such as evaluating a system’s likelihood and severity of creating adverse consequences.
For instance, Brazil’s bill includes at least ten criteria2 for the classification of high-risk systems, such as whether the system unlawfully or abusively produces legal effects that impair access to public or essential services; whether it lacks the transparency, explainability, or auditability needed for oversight; or whether it endangers human health – physical, mental, or social, whether individually or collectively.
Meanwhile, the Peruvian draft regulations include a list of specific uses or sectors where the deployment of any AI system is automatically considered high-risk, such as biometric identification and categorization; security of critical national infrastructure; educational admissions and student evaluations; or employment decisions.3 Under the draft regulations, the classification of “high-risk” systems and their corresponding obligations may be evaluated and reassessed by the competent authority, consistent with the “risk-based security standards principle” under the country’s brief AI law, which mandates the adoption of ‘security safeguards in proportion to a system’s level of risk’.
Colombia’s bill incorporates a mixed approach for high-risk classification. It includes general criteria such as those systems that may “significantly impact fundamental rights”, particularly the rights to privacy, freedom of expression, or access to public information; while also including sensitive or domain-based applications, such as any system “enabling automated decision-making without human oversight that operate in the sectors of healthcare, justice, public security, or financial and social services”.
Mexico’s proposal defines “high-risk” systems as those with the potential to significantly affect public safety, human rights, legality, or legal certainty, but omits additional criteria for their classification. A striking distinction of Mexico’s proposal is that it appears to restrict the use and deployment of these systems to public security entities and the Armed Forces (see Article 48 of the Bill).
The Brazilian bill and Peruvian draft implementing regulations have chapters covering governance measures, describing specific obligations for developers, deployers, and distributors of all AI systems, regardless of their risk level. In addition, most bills include specific obligations for entities operating “high-risk” systems, such as performing comprehensive risk assessments and ethical evaluations; assuring data quality and bias detection; extensive documentation and record-keeping obligations; and guiding users on the intended use, accuracy, and robustness of these systems. Brazil’s bill indicates the competent authority will have discretion to determine cases under which some obligations may be relaxed or waived, according to the context in which the AI operator acts within the value chain of the system.
Under Brazil’s AI bill, entities deploying high-risk systems must also submit an Algorithmic Impact Assessment (AIA) along with the preliminary assessment, which must be conducted following best practices. In certain regulated sectors, the Brazilian authority may require the AIA to be independently verified by an external auditor.
Chile’s proposal outlines mandatory requirements for high-risk systems, which must implement a risk management system grounded in a “continuous and iterative process”. This process must span the entire lifecycle of the system and be subject to periodic review, ensuring failures, malfunctions, and deviations from intended purpose are detected and minimized.
Argentina’s proposal requires all public and private entities that develop or use AI systems to register in a National Registry of Artificial Intelligence Systems, regardless of the level of risk. The registration must include detailed information on the system’s purpose, intended use, field of application, algorithmic structure, and implemented security safeguards. Similarly, Colombia’s bill includes an obligation to conduct fundamental rights impact assessments and create a national registry for high-risk AI systems.
Fewer proposals have specific, targeted provisions for “limited-risk” systems. For instance, Colombia’s bill defines these systems as those that, ‘without posing a significant threat to rights or safety, may have indirect effects or significant consequences on individuals’ personal or economic decisions’. Examples of these systems include AI commonly used for personal assistance, recommendation engines, synthetic content generation, or systems that simulate human interaction. Under Mexico’s proposal, “limited-risk” systems are those that ‘allow users to make informed decisions; require explicit user consent; and allow users to opt out under any circumstances’.
In addition, the Colombian proposal explicitly indicates that AI operators employing these systems must meet transparency obligations, including disclosure of interaction with an AI tool; provide clear information about the system to users; and allow for opt-out or deactivation. Similarly, under the Chilean proposal, a transparency obligation for “limited-risk” AI systems includes informing users exposed to the system in a timely, clear, and intelligible manner that they are interacting with an AI, except in situations where this is “obvious” due to the circumstances and context of use.
Finally, Colombia’s bill describes low-risk systems as those that pose minimal risk to the safety or rights of individuals and thus are subject to general ethical principles, transparency requirements, and best practices. Such systems may include those used for administrative or recreational purposes without ‘direct influence on personal or collective decisions’; systems used by educational institutions and public entities to facilitate activities which do not fall within the scope of any of the other risk levels; and systems used in video games, productivity tools, or simple task automation.
4. Pluri-institutional and Multistakeholder Governance Frameworks are Preferred
A key element shared across the AI legislative proposals reviewed is the establishment of multistakeholder AI governance structures aimed at ensuring responsible oversight, regulatory clarity, and policy coordination.
Notably, Brazil, Chile, and Colombia reflect a shared commitment to institutionalize AI governance frameworks that engage public authorities, sectoral regulators, academia, and civil society. However, they differ in the level of institutional development, the distribution of oversight functions, and the legal authority vested in enforcement bodies. All three countries envision coordination mechanisms that integrate diverse actors to promote coherence in national AI strategies. For instance, Brazil proposes the creation of the National Artificial Intelligence Regulation and Governance System (SIA). This system would be coordinated by the National Data Protection Authority (ANPD) and composed of sectoral regulators, a Permanent Council for AI Cooperation, and a Committee of AI Specialists. The SIA would be tasked with issuing binding rules on transparency obligations, defining general principles for AI development, and supporting sectoral bodies in developing industry-specific regulations.
Chile outlines a governance model centered around a proposed AI Technical Advisory Council, responsible for identifying “high-risk” and “limited-risk” AI systems and advising the Ministry of Science, Technology, Knowledge, and Innovation (MCTIC) on compliance obligations. While the Council’s role is essentially advisory, regulatory oversight and enforcement are delegated to the future Data Protection Authority (DPA), whose establishment is pending under Chile’s recently enacted personal data protection law.
Colombia’s bill designates the Ministry of Science, Technology, and Innovation as the lead authority responsible for regulatory implementation and inter-institutional coordination. The Ministry is tasked with aligning the law’s execution with national AI strategies and developing supporting regulations. Additionally, the bill grants the Superintendency of Industry and Commerce (SIC) specific powers to inspect and enforce AI-related obligations, particularly concerning the processing of personal data, through audits, investigations, and preventive measures.
5. Fostering Responsible Innovation Through Sandboxes, Innovation Ecosystems, and Support for SMEs
Some proposals emphasize the dual objectives of regulatory oversight and the promotion of innovation. A notable commonality is their inclusion of controlled testing environments and regulatory sandboxes for AI systems aimed at facilitating innovation, promoting responsible experimentation, and supporting market access, particularly for startups and small-scale developers.
The bills generally empower competent and sectoral authorities to operate AI regulatory sandboxes, either on their own initiative or through public-private partnerships. Sandboxes operate under pre-agreed testing plans; some offer temporary exemptions from administrative sanctions, while others maintain liability for harms resulting from sandbox-based experimentation.
Proposals in Brazil, Chile, Colombia, and Peru also include relevant provisions to support small-to-medium enterprises (SMEs) and mandate the operation of “innovation ecosystems.” For instance, Brazil’s bill requires sectoral authorities to follow differentiated regulatory criteria for AI systems developed by micro-enterprises, small businesses, and startups, including their market impact, user base, and sectoral relevance.
Similarly, Chile complements its proposed sandbox regime with priority access for smaller companies, capacity-building initiatives, and their representation in the AI Technical Advisory Council. This inclusive approach aims to reduce entry barriers and ensure that small-scale innovators have both voice and access within the AI regulatory ecosystem.
Colombia’s bill includes public funding programs to support AI-related research, technological development, and innovation, with a focus on inclusion and accessibility. Although not explicitly targeted at SMEs, these incentives create indirect benefits for emerging actors and academia-led startups.
Lastly, Peru promotes the development of open-source AI technologies to reduce systemic entry barriers and foster ecosystem efficiency. The regulation also mandates the promotion and financing of AI research and development through national programs, universities, and public administration programs that directly benefit small developers and innovators.
6. The Road Ahead for Responsible AI Governance in LatAm
Latin America is experiencing a wave of proposed legislation to govern AI. While some countries have several proposals under consideration, with some seemingly making more progress towards their adoption than others,4 a comparative review shows they share common elements and objectives. The proposed legislative landscape reveals a shared regional commitment to regulate AI in a manner that is ethical, human-centered, and aligned with fundamental rights. Most of the bills examined lay the groundwork for comprehensive AI governance frameworks based on principles and new AI-related rights.
In addition, all proposals classify AI systems based on their level of risk – with every country proposing a scaled classification that ranges from minimal or low risk up to systems posing “unacceptable” or “excessive” risk – and introduce concrete mechanisms and obligations proportional to that classification, with varying but similar requirements to perform risk and impact assessments and to meet transparency obligations. Most bills also designate an enforcement authority to act in coordination with sectoral agencies in issuing further regulations, especially to extend the criteria for, or designate the types of, systems considered “high-risk”.
Along this normative and institutional framework, most AI bills in Latin America also reflect a growing recognition of the need to balance regulatory oversight with flexibility, reflected in the adoption of controlled testing environments and tailored provisions for startups and SMEs.
Except for Brazil and Peru, much of the legislative activity in the countries covered remains at an early stage. However, the AI bills reviewed offer insight into how key jurisdictions in the region are approaching AI governance, framing it as both a regulatory challenge and an opportunity for inclusive digital development. As these initiatives evolve, key questions around institutional capacity, enforcement, and stakeholder participation will shape how effectively Latin America can build trusted and responsible AI frameworks.
In Mexico, two proposals concerning AI regulation have been introduced, one in the Senate and another in the Chamber of Deputies. Both were put forth by representatives of MORENA, the political party holding a supermajority in Congress. Additionally, the Senate is considering five proposals to amend the Federal Constitution, aiming to grant Congress the authority to legislate on AI matters. Similarly, in Colombia, there are two proposals under the Senate’s consideration and one recently introduced in the Chamber of Deputies. ↩︎
1) The system unlawfully or abusively produces legal effects that impair access to public or essential services; 2) It has a high potential for material or moral harm or for unlawful discriminatory bias; 3) It significantly affects individuals from vulnerable groups; 4) The harm it causes is difficult to reverse; 5) There is a history of damage linked to the system or its context of use; 6) The system lacks transparency, explainability, or auditability, impairing oversight; 7) It poses systemic risks, such as to cybersecurity or safety of vulnerable groups; 8) It presents elevated risks despite mitigation measures, especially in light of anticipated benefits; 9) It endangers integral human health — physical, mental, or social — either individually or collectively; 10) It may negatively affect the development or integrity of children and adolescents. ↩︎
Other uses or sectors included in the high-risk category are: access to and prioritization within social programs and emergency services; credit scoring; judicial assistance; health diagnostics and patient care; and criminal profiling, victimization risk analysis, emotional state detection, evidence verification, or criminal investigation by law enforcement. ↩︎
Highlights from FPF’s July 2025 Technologist Roundtable: AI Unlearning and Technical Guardrails
On July 17, 2025, the Future of Privacy Forum (FPF) hosted the second in a series of Technologist Roundtables with the goal of convening an open dialogue on complex technical questions that impact law and policy, and assisting global data protection and privacy policymakers in understanding the relevant technical basics of large language models (LLMs). In this event, we invited a range of academic technical experts and data protection regulators from around the world to explore machine unlearning and technical guardrails.
A. Feder Cooper, Incoming Assistant Professor, Department of Computer Science, Yale University; Postdoctoral Researcher, Microsoft Research; Postdoctoral Affiliate, Stanford University
Ken Ziyu Liu, Ph.D. Candidate, Department of Computer Science, Stanford University; Researcher, Stanford Artificial Intelligence Laboratory (SAIL)
Weijia Shi, Ph.D. Candidate, Department of Computer Science, University of Washington; Visiting Researcher, Allen Institute for Artificial Intelligence
Pratyush Maini, Ph.D. Candidate, Machine Learning Department, Carnegie Mellon University; Founding member of DatologyAI
In emerging literature, the topic of “machine unlearning” and its related technical guardrails concerns the extent to which information can be “removed” or “forgotten” from an LLM or similar generative AI model or from an overall generative AI system. The topic is relevant to a range of policy goals, including complying with individual data subject deletion requests, respecting copyrighted information, building safety and related content protections, and overall performance. Depending on the goal at hand, different technical guardrails and means of operationalizing “unlearning” have different levels of effectiveness.
In this post-event summary, we highlight the key takeaways from three parts of the Roundtable on July 17:
Machine Unlearning: Overview and Policy Considerations
Core “Unlearning” Methods: Exact vs. Approximate
Technical Guardrails and Risk Mitigation
If you have any questions, comments, or wish to discuss any of the topics related to the Roundtable and Post-Event Summary, please do not hesitate to reach out to FPF’s Center for AI at [email protected].
A Price to Pay: U.S. Lawmaker Efforts to Regulate Algorithmic and Data-Driven Pricing
“Algorithmic pricing,” “surveillance pricing,” “dynamic pricing”: in states across the U.S., lawmakers are introducing legislation to regulate a range of practices that use large amounts of data and algorithms to routinely inform decisions about the prices and products offered to consumers. These bills—targeting what this analysis collectively calls “data-driven pricing”—follow the Federal Trade Commission (FTC)’s 2024 announcement that it was conducting a 6(b) investigation to study how firms are engaging in so-called “surveillance pricing,” and the release of preliminary insights from this study in early 2025. With new FTC leadership signaling that continuing the study is not a priority, state lawmakers have stepped in to scrutinize certain pricing schemes involving algorithms and personal data.
The practice of vendors changing their prices based on data about consumers and market conditions is by no means a new phenomenon. In fact, “price discrimination”—the term in economics literature for charging different buyers different prices for largely the same product—has been documented for at least a century, and has likely played a role since the earliest forms of commerce.1 What is unique, however, about more recent forms of data-driven pricing is the granularity of data available, the ability to more easily target individual consumers at scale, and the speed at which prices can be changed. This ecosystem is enabled by the development of tools for collecting large amounts of data, algorithms that analyze this data, and digital and physical infrastructure for easily adjusting prices.
Key takeaways
Data-driven pricing legislation generally focuses on three key elements: the use of algorithms to set prices, the individualization of prices based on personal data, and the context or sector in which the pricing occurs.
Lawmakers are particularly concerned about the potential for data-driven pricing to cause harm to consumers or markets in housing, food establishments, and retail, echoing broader interest in the impact of AI in “high-risk” or “consequential” decisions.
Legislation varies in the scope of pricing practices covered, depending on how key terms are defined. Prohibiting certain practices deemed inappropriate, while maintaining certain practices that consumers find beneficial like loyalty programs or personalized discounts, is a challenge lawmakers are attempting to address.
Beyond legislation, regulators have signaled interest in investigating certain data-driven pricing practices. The Federal Trade Commission, Department of Justice, Department of Transportation, and state Attorneys General have all stated their intentions to enforce against particular instances of algorithmic pricing.
Trends in data-driven pricing legislation
As discussed in the FPF issue brief Data-Driven Pricing: Key Technologies, Business Practices, and Policy Implications, policymakers are generally concerned with a few particular aspects of data-driven pricing strategies: the potential for unfair discrimination, a lack of transparency around pricing practices, the processing and sharing of personal data, and possible anti-competitive behavior or other market distortions. While these policy issues may also be the domain of existing consumer protection, competition, and civil rights laws, lawmakers have made a concerted effort to proactively address them explicitly with new legislation. Crucially, these bills implicate three elements of data-driven pricing practices, raising a series of distinct but related questions for each:
Algorithms: Was an algorithm used to set prices? Are consumers able to understand how the algorithm works? How was the algorithm trained, and how might training data implicate the model’s outputs? What impact does the algorithm have on different market segments and demographic groups, as well as markets overall?
Personal data: Was personal data used to set prices, and are prices personalized to individuals? What kind of personal data is used? Is sensitive data or protected characteristics included? Are inferences made about individuals based on their personal data for the sake of market segmentation?
Context: Is the pricing being implemented in a particular sector, or in regard to particular goods, that might be especially sensitive or consequential? For example, is data-driven pricing being used in the housing market, or in groceries and restaurants?
These elements generally correspond to the different terms used in legislation to refer to data-driven pricing practices. For example, a number of bills use terms such as “algorithmic pricing,” including New York S 3008, an enacted law requiring a disclosure when “personalized algorithmic pricing” is used to set prices,2 and California SB 384, which would prohibit the use of “price-setting algorithms” under certain market conditions. A number of other bills use terms like “surveillance pricing,” such as California AB 446, which would prohibit setting prices based on personal information obtained through “electronic surveillance technology,” and Colorado HB 25-1264, which would make it an unfair trade practice to use “surveillance data” to set individualized prices or workers’ wages. Finally, some bills seek to place limits on the use of “dynamic pricing” in certain circumstances, including Maine LD 1597 and New York A 3437, which would prohibit the practice in the context of groceries and other food establishments. Each of these framings, while distinct, often covers similar kinds of practices.
Given that certain purchases such as housing and food are necessary for survival, the use of data-driven pricing strategies in these contexts is of particular concern to lawmakers. Many states already have laws banning or restricting price gouging, which typically focus on products that are necessities, and specifically during emergencies or disasters. Data-driven pricing bills, on the other hand, are less prescriptive about how much sellers are allowed to change prices, but apply beyond just emergency situations. While many apply uniformly across the economy, some are focused on particular sectors, including:
Food establishments: eg, Massachusetts S 2515 (applies to grocery stores), Hawaii HB 465 (applies to the sale of food qualifying for federal SNAP and WIC benefits programs)
In addition to bills focused on data-driven pricing, legislation regulating artificial intelligence (AI) and automated decision making more generally often applies specifically to “high-risk AI” and AI used to make “consequential decisions,” including educational opportunities, employment, finance or lending, healthcare, housing, insurance, and other critical services. The use of a pricing algorithm in one of these contexts may therefore trigger the requirements of certain AI regulations. For example, the Colorado AI Act defines “consequential decision” to mean “a decision that has a material legal or similarly significant effect on the provision or denial to any consumer of, or the cost or terms of…” the aforementioned categories.
Because certain data-driven pricing strategies are widespread and appeal to many consumers, there is some concern—particularly among retailers and advertisers—that overly broad restrictions could actually end up harming consumers and businesses alike. For example, widely popular and commonplace happy hours could, under certain definitions, be considered “dynamic pricing.” As such, data-driven pricing legislation often contains exemptions, which generally fall into a few categories:
General discounts: Deals that are available to the general public, such as coupons, sales, or bona fide loyalty programs (eg, California SB 259).
Cost-based price differentials: Pricing differences or changes due to legitimate disparities in input or production costs across areas (eg, Georgia SB 164).
Insurers or financial institutions: Highly regulated entities that may engage in data-driven pricing strategies in compliance with other existing laws (eg, Illinois SB 2255).
Key remaining questions
A number of policy and legal issues will be important to keep an eye on as policymakers continue to learn about the range of existing data-driven pricing strategies and consider potential regulatory approaches.
The importance of definitions
As policymakers attempt to articulate the contours of what they consider to be fair pricing strategies, the definitions they adopt play a major role in the scope of practices that are allowed. Crafting rules that prohibit certain undesirable practices without eliminating others that consumers and businesses rely on and enjoy is challenging, requiring policymakers to identify what specific acts or market conditions they’re trying to prevent. For example, Maine LD 1597, which is intended to stop the use of most dynamic pricing by food establishments, includes an incredibly broad definition of “dynamic pricing”:
“Dynamic pricing” means the practice of causing a price for a good or a product to fluctuate based upon demand, the weather, consumer data or other similar factors including an artificial intelligence-enabled pricing adjustment.
While the bill would exempt discounts, time-limited special prices such as happy hours, and goods that “traditionally [have] been priced based upon market conditions, such as seafood,” prohibiting price changes based on “demand” could undermine a fundamental principle of the market economy. Even with exceptions that carve out sales and other discounts—and not all bills contain such exemptions—legislation might still inadvertently capture other accepted practices such as specials aligned with seasonal changes, bulk purchase discounts, deals on goods nearing expiration, or promotions to clear inventory.
Lawmakers must also consider how any new definitions interact with definitions in existing law. For example, an early version of California AB 446, which would prohibit “surveillance pricing” based on personally identifiable information, included “deidentified or aggregated consumer information” within the definition of “personally identifiable information.” However, deidentified and aggregated information is not considered “personal information” as defined by the California Consumer Privacy Act (CCPA). In later versions, the bill authors aligned the definition in AB 446 with the text of the CCPA.
The role of AI
In line with policymakers’ increased focus on AI, and a shift towards industry use of algorithms in setting prices, a significant amount of data-driven pricing legislation applies explicitly to algorithmic pricing. Some bills, such as California SB 52 and California SB 384, are intended to address potential algorithmically-driven anticompetitive practices, while many others are geared towards protecting consumers from discriminatory practices. Though consumer protection may be the goal, some bills focus not on preventing specific impacts, but on eliminating the use of AI in pricing at all, at least in real time. For example, Minnesota HF 2452 / SF 3098 states:
A person is prohibited from using artificial intelligence to adjust, fix, or control product prices in real time based on market demands, competitor prices, inventory levels, customer behavior, or other factors a person may use to determine or set prices for a product.
This bill would prohibit all use of AI for price setting, even when based on typical product pricing data and applied equally to all consumers. Such a ban would have a significant impact on the practice of surge pricing, and any sector that is highly reactive to market fluctuations. On the other hand, other bills focus on the use of personal data—including sensitive data like biometrics—to set prices that are personalized to each consumer. For example, Colorado HB 25-1264 would prohibit the practice of “surveillance-based price discrimination,” defined as:
Using an automated decision system to inform individualized prices based on surveillance data regarding a consumer.
…
“Surveillance data” means data obtained through observation, inference, or surveillance of a consumer or worker that is related to personal characteristics, behaviors, or biometrics of the individual or a group, band, class, or tier in which the individual belongs.
These bills are concerned not necessarily with the use of AI in pricing per se, but how the use of AI in conjunction with personal data could have a detrimental effect on individual consumers.
The impact on consumers
While data-driven pricing legislation is generally intended to protect consumers, some approaches may unintentionally block practices that consumers enjoy and rely on. There is a large delta between common and beneficial price-adjusting practices like sales on one hand, and exploitative practices like price gouging on the other, and writing a law that draws the proper cut-off point between the two is difficult. For example, Illinois SB 2255 contains the following prohibition:
A person shall not use surveillance data as part of an automated decision system to inform the individualized price assessed to a consumer for goods or services.
The bill would exempt persons assessing price based on the cost of providing a good or service, insurers in compliance with state law, and credit-extending entities in compliance with the Fair Credit Reporting Act. However, it would not exempt bona fide loyalty programs, a popular consumer benefit that is excluded from other similar legislation (such as the enacted New York S 3008, which carves out deals provided under certain “subscription-based agreements”). While lawmakers likely intended just to prevent exploitative pricing schemes that disempower consumers, they may inadvertently restrict some favorable practices as well. As a result, if statutes aren’t clear, some businesses may forgo offering discounts for fear of noncompliance.
Legal challenges to legislation
When New York S 3008 went into effect on July 8, 2025, the National Retail Federation filed a lawsuit to block the law, alleging that it would violate the First Amendment by including the following requirement, amounting to compelled speech:
Any entity that sets the price of a specific good or service using personalized algorithmic pricing … shall include with such statement, display, image, offer or announcement, a clear and conspicuous disclosure that states: “THIS PRICE WAS SET BY AN ALGORITHM USING YOUR PERSONAL DATA”.
The New York Office of the Attorney General, in response, said it would pause enforcement until 30 days after the judge in the case makes a decision on whether to grant a preliminary injunction. Other data-driven pricing bills would not face this challenge, as they don’t contain specific language requirements, instead focusing on prohibiting certain practices.
Beyond legislation
Regulators have also been scrutinizing certain data-driven pricing strategies, particularly for potentially anticompetitive conduct. While the FTC has seemingly deprioritized the 6(b) study of “surveillance pricing” it announced in July 2024—canceling public comments after releasing preliminary insights from the report in January 2025—it could still take up actions regarding algorithmic pricing in the future under its competition authority. In fact, the FTC’s new leadership has not retracted a joint statement the Commission made in 2024 along with the Department of Justice (DOJ), European Commission, and UK Competition and Markets Authority, which affirmed “a commitment to protecting competition across the artificial intelligence (AI) ecosystem.” The FTC, along with 17 state attorneys general (AGs), also still has a pending lawsuit against Amazon, accusing the company of using algorithms to deter other sellers from offering lower prices.
Even if the FTC refrains from regulating data-driven pricing, other regulators may be interested in addressing the issue. In particular, in 2024 the DOJ, alongside eight state AGs, used its antitrust authority to sue the property management software company RealPage for allegedly using an algorithmic pricing model and nonpublic housing rental data to collude with other landlords. Anticompetitive use of algorithmic pricing tools is also a DOJ priority under new leadership, with the agency filing a statement of interest regarding the “application of the antitrust laws to claims alleging algorithmic collusion and information exchange” in a March 2025 case, and the agency’s Antitrust Division head promising an increase in probes of algorithmic pricing. Additionally, in response to reports claiming that Delta Air Lines planned to institute algorithmic pricing for tickets—and a letter to the company from Senators Gallego (D-AZ), Blumenthal (D-CT), and Warner (D-VA)—the Department of Transportation Secretary signaled that the agency would investigate such practices.
Conclusion
Policymakers are turning their attention towards certain data-driven pricing strategies, concerned about the impact—on consumers and markets—of practices that use large amounts of data and algorithms to set and adjust prices. Focused on practices such as “algorithmic,” “surveillance,” and “dynamic” pricing, these bills generally address pricing that involves the use of personal data, the deployment of AI, and/or frequent changes, particularly in critical sectors like food and housing. As access to consumer data grows, and algorithms are implemented in more domains, industry may increasingly rely on data-driven pricing tools to set prices. As such, legislators and regulators will likely continue to scrutinize their potential harmful impacts.
While some forms of price discrimination are illegal, many are not. The term “discrimination” as used in this context is distinct from how it’s used in the context of civil rights.
The New York Attorney General’s office said, as of July 14, 2025, that it would pause enforcement of the law while a federal judge decides on a motion for preliminary injunction, following a lawsuit brought by the National Retail Federation.
The “Neural Data” Goldilocks Problem: Defining “Neural Data” in U.S. State Privacy Laws
Co-authored by Chris Victory, FPF Intern
As of halfway through 2025, four U.S. states have enacted laws regarding “neural data” or “neurotechnology data.” These laws, all of which amend existing state privacy laws, signify growing lawmaker interest in regulating what’s being considered a distinct, particularly sensitive kind of data: information about people’s thoughts, feelings, and mental activity. Created in response to the burgeoning neurotechnology industry, neural data laws in the U.S. seek to extend existing protections for the most sensitive of personal data to the newly conceived legal category of “neural data.”
Each of these laws defines “neural data” in related but distinct ways, raising a number of important questions: just how broad should this new data type be? How can lawmakers draw clear boundaries for a data type that, in theory, could apply to anything that reveals an individual’s mental activity? Is mental privacy actually separate from all other kinds of privacy? This blog post explores how Montana, California, Connecticut, and Colorado define “neural data,” how these varying definitions might apply to real-world scenarios, and some challenges with regulating at the level of neural data.
“Neural” and “neurotechnology” data definitions vary by state.
While just four states (Montana, California, Connecticut, and Colorado) currently have neural data laws on the books, legislation has rapidly expanded over the past couple of years. Following the emergence of sophisticated deep learning models and other AI systems, which gave a significant boost to the neurotechnology industry, media and policymaker attention turned to the nascent technology’s privacy, safety, and other ethical considerations. Proposed regulation—both in the U.S. and globally—varies in its approach to neural data, with some strategies creating new “neurorights” or mandating entities minimize the neural data they collect or process.
In the U.S., however, laws have coalesced around an approach in which covered entities must treat neural data as “sensitive data” or other data with heightened protections under existing privacy law, above and beyond the protections granted by virtue of being personal information. The requirements that attach to neural data by virtue of being “sensitive” vary by underlying statute, as illustrated in the accompanying comparison chart. In fact, even the way that “neural data” is defined varies by law, placing different data types within scope depending on the state. The following definitions are organized roughly from the broadest conception of neural data to the narrowest.
California
Generally speaking, the broadest conception of “neural data” in the U.S. laws is California SB 1223, which amends the state’s existing consumer privacy law, the California Consumer Privacy Act (CCPA), to clarify that “sensitive personal information” includes “neural data.” The law, which went into effect January 1, 2025, defines “neural data” as:
Information that is generated by measuring the activity of a consumer’s central or peripheral nervous system, and that is not inferred from nonneural information.
Notably, however, the CCPA as amended by the California Privacy Rights Act (CPRA) treats “sensitive personal information” no differently than personal information except when it’s used for “the purpose of inferring characteristics about a consumer”—in which case it is subject to heightened protections. As such, the stricter standard for sensitive information will only apply when neural data is collected or processed for making inferences.
Montana
Montana SB 163 takes a slightly different approach than the other laws in two ways: one, it applies to “neurotechnology data,” an even broader category of data that includes the measurement of neural activity; and two, it amends Montana’s Genetic Information Privacy Act (GIPA) rather than a comprehensive consumer privacy law. The law, which goes into effect October 1, 2025, will define “neurotechnology data” as:
Information that is captured by neurotechnologies, is generated by measuring the activity of an individual’s central or peripheral nervous systems, or is data associated with neural activity, which means the activity of neurons or glial cells in the central or peripheral nervous system, and that is not nonneural information. The term does not include nonneural information, which means information about the downstream physical effects of neural activity, including but not limited to pupil dilation, motor activity, and breathing rate.
The law will define “neurotechnology” as:
Devices capable of recording, interpreting, or altering the response of an individual’s central or peripheral nervous system to its internal or external environment and includes mental augmentation, which means improving human cognition and behavior through direct recording or manipulation of neural activity by neurotechnology.
However, the law’s affirmative requirements will only apply to “entities” handling genetic or neurotechnology data, with “entities” defined narrowly—as in the original GIPA—as:
…a partnership, corporation, association, or public or private organization of any character that: (a) offers consumer genetic testing products or services directly to a consumer; or (b) collects, uses, or analyzes genetic data.
While the lawmakers may not have intended to limit its application to consumer genetic testing companies, and may have inadvertently carried over GIPA’s definition of “entities,” the text of the statute may significantly narrow the range of companies subject to it.
Connecticut
Similarly, Connecticut SB 1295, most of which goes into effect July 1, 2026, will amend the Connecticut Data Privacy Act to clarify that “sensitive data” includes “neural data,” defined as:
Any information that is generated by measuring the activity of an individual’s central nervous system.
In contrast to other definitions, the Connecticut law will apply only to central nervous system activity, rather than central and peripheral nervous system activity. However, it also does not explicitly exempt inferred data or nonneural information as California and Montana do, respectively.
Colorado
Colorado HB 24-1058, which went into effect August 7, 2024, amends the Colorado Privacy Act to clarify that “sensitive data” includes “biological data,” which itself includes “neural data.” “Biological data” is defined as:
Data generated by the technological processing, measurement, or analysis of an individual’s biological, genetic, biochemical, physiological, or neural properties, compositions, or activities or of an individual’s body or bodily functions, which data is used or intended to be used, singly or in combination with other personal data, for identification purposes.
The law defines “neural data” as:
Information that is generated by the measurement of the activity of an individual’s central or peripheral nervous systems and that can be processed by or with the assistance of a device.
Notably, “biological data” only applies to such data when used or intended to be used for identification, significantly narrowing the potential scope.
* While only Montana explicitly covers data captured by neurotechnologies, and excludes nonneural information, the other laws may implicitly do so as well.
The Goldilocks Problem: The nature of “neural data” makes it challenging to get the definition just right.
Given that each state law defines neural data differently, there may be significant variance in what kinds of data are covered. Generally, these differences cut across three elements:
Central vs. peripheral nervous system data: Does the law cover data from both the central and peripheral nervous system, or just the central nervous system?
Treatment of inferred and nonneural data: Does the law exclude neural data that is inferred from nonneural activity?
Identification: Does the law exclude neural data that is not used, or intended to be used, for the purpose of identification?
Central vs. peripheral nervous system data
The nervous system comprises the central nervous system (CNS) and the peripheral nervous system (PNS). The CNS—made up of the brain and spinal cord—carries out higher-level functions including thinking, emotions, and coordinating motor activity. The PNS—the network of nerves that connects the CNS to the rest of the body—receives signals from the CNS and transmits this information to the rest of the body instructing it on how to function, and transfers sensory information back to the CNS in a cyclical process. Some of this activity is conscious and deliberate on the part of the individual (voluntary nervous system), while some involves unconscious, involuntary functions like digestion and heart rate (autonomic nervous system).
What this means practically is that the nervous system is involved in just about every human bodily function. Some of this data is undoubtedly particularly sensitive, as it can reveal information about an individual’s health, sexuality, emotions, identity, and more. It may also provide insight into an individual’s “thoughts,” either by accessing brain activity directly or by measuring other bodily data that in effect reveals what the individual is thinking (eg, increased heart and breathing rate at a particular time can reveal stress or arousal). It also means that an incredibly broad swath of data could be considered neural data: the movement of a computer mouse or use of a smartwatch may technically constitute, under certain definitions, neural data.
As such, there is a significant difference between laws that cover both CNS and PNS data, and those that only cover CNS data. Connecticut SB 1295 is the lone current law that applies solely to CNS data, which narrows its scope considerably and likely only covers data collected from tools such as brain-computer interfaces (BCIs), electroencephalograms (EEGs), and other similar devices. However, other data types that would be excluded by virtue of not relating to the CNS could, in theory, provide the same or similar information. For example, signals from the PNS—such as pupillometry (pupil dilation), respiration (breathing patterns), and heart rate—could also indicate the nervous system’s response to stimuli, despite not technically being a direct measurement of the CNS.
Treatment of inferred and nonneural data
Defining “neural data” in a way that covers particular data of concern without being overinclusive is challenging, and lawmakers have added carveouts in an attempt to make their legislation more workable. However, focusing regulation on the nervous system in the first place raises a few potential issues. First, it reinforces neuroessentialism, the idea that the nervous system and neural data are unique and separate from other types of sensitive data, as well as neurohype, the inflation or exaggeration of neurotechnologies’ capabilities. There is not currently—and may never be, as such—a technology for “reading a person’s mind.” What may be possible are tools that measure neural activity to provide clues about what an individual might be thinking or feeling, much the same as measuring their other bodily functions, or even just gaining access to their browsing history. This doesn’t make the data less sensitive, but challenges the idea that “neural data” itself—whether referring to the central, peripheral, or both nervous systems—is the most appropriate level for regulation.
This creates one of two problems for lawmakers. On one hand, defining “neural data” too broadly could create a scenario in which all bodily data is covered. Typing on a keyboard involves neural data, as the central nervous system sends signals through the peripheral nervous system to the hands in order to type. Yet, regulating all data related to typing as sensitive neural data could be unworkable. On the other hand, defining “neural data” too narrowly could result in regulations that don’t actually provide the protections that lawmakers are seeking. For example, if legislation only applies to neural data that is used for identification purposes, it may cover very few situations, as this is not a way that neural data is typically used. Similarly, only covering CNS data, rather than both CNS and PNS data, may be difficult to implement because it’s not clear that it’s possible to truly separate the data from these two systems, as they are interlinked.
One way lawmakers seek to get around the first problem is by narrowing the scope, clarifying that the legislation doesn’t apply to “nonneural information” such as downstream physical bodily effects, or neural data that is “inferred from nonneural information.” For example, Montana SB 163 excludes “nonneural information” such as pupil dilation, motor activity, and breathing rate. However, if the concern is that certain information is particularly sensitive and should be protected (eg, data potentially revealing an individual’s thoughts or feelings), then scoping out this information just because it’s obtained in a different way doesn’t address the underlying issue. For example, if data about an individual’s heart rate, breathing, perspiration, and speech pattern is used to infer their emotional state, this is functionally no different—and potentially even more revealing—than data collected “directly” from the nervous system. Similarly, California SB 1223 carves out data that is “inferred from nonneural information,” leaving open the possibility for the same kind of information to be inferred through other bodily data.
Identification
Another way lawmakers, specifically in Colorado, have sought to avoid an unmanageably broad conception of neural data is to only cover such data when used for identification. Colorado HB 24-1058, which regulates “biological data”—of which “neural data” is one component—only applies when the data “is used or intended to be used, singly or in combination with other personal data, for identification purposes.” Given that neural data, at least currently, is not used for identification, it’s not clear that such a definition would cover many, if any, instances of consumer neural data.
Conclusion
Each of the four U.S. states currently regulating “neural data” defines the term differently, varying around elements such as the treatment of central and peripheral nervous system data, exclusions for inferred or nonneural data, and the use of neural data for identification. As a result, the scope of data covered under each law differs depending on how “neural data” is defined. At the same time, attempting to define “neural data” reveals more fundamental challenges with regulating at the level of nervous system activity. The nervous system is involved in nearly all bodily functions, from innocuous movements to sensitive activities. Legislating around all nervous system activity may render physical technologies unworkable, while certain carveouts may, conversely, scope out information that lawmakers want to protect. While many are concerned about technologies that can “read minds,” such a tool does not currently exist per se, and in many cases nonneural data can reveal the same information. As such, focusing too narrowly on “thoughts” or “brain activity” could exclude some of the most sensitive and intimate personal characteristics that people want to protect. In finding the right balance, lawmakers should be clear about which potential uses or outcomes they intend to address.
FPF at PDP Week 2025: Generative AI, Digital Trust, and the Future of Cross-Border Data Transfers in APAC
Authors: Darren Ang Wei Cheng and James Jerin Akash (FPF APAC Interns)
From July 7 to 10, 2025, the Future of Privacy Forum (FPF)’s Asia-Pacific (APAC) office was actively engaged in Singapore’s Personal Data Protection Week 2025 (PDP Week) – a week of events hosted by the Personal Data Protection Commission of Singapore (PDPC) at the Marina Bay Sands Expo and Convention Centre in Singapore.
Alongside the PDPC’s events, PDP Week also included a two-day industry conference organized by the International Association of Privacy Professionals (IAPP) – the IAPP Asia Privacy Forum and AI Governance Global.
This blog post presents key takeaways from the wide range of events and engagements that FPF APAC led and participated in throughout the week. Key themes that emerged from the week’s discussions included:
AI governance has moved beyond principles to practice, policies, and passed laws: Organizations are now focused on the practical steps to be taken for developing and deploying AI responsibly. This requires a cross-functional approach within organizations and thorough due diligence when procuring third-party AI solutions.
Digital trust has become a greater imperative: As AI systems and other digital technologies become more complex, building digital trust and ensuring that technology aligns with consumer and societal expectations are critical to maximizing the potential benefits.
The future of the digital economy will be shaped by the trajectory of cross-border data transfers: There is a tension, both in the APAC region and globally, between the rise of restrictive data transfer requirements, and the fact that data transfers are essential for the digital economy and the development of high-quality AI systems.
Technical and legal solutions for privacy are gaining ground: In response to the complex landscape of data transfer rules, stakeholders are actively exploring practical solutions such as Privacy Enhancing Technologies (PETs), internationally-recognized certifications, or mechanisms such as the ASEAN Model Contractual Clauses (MCCs).
In the paragraphs below, we elaborate on some of these themes, as well as other interesting observations that came up over the course of FPF’s involvement in PDP Week.
1. FPF’s and IMDA’s co-hosted workshop shared practical perspectives for companies navigating the waters of generative AI governance.
On Monday, July 7, 2025, FPF joined the Infocomm Media Development Authority of Singapore (IMDA) in hosting a workshop for Singapore’s data protection community, titled “AI, AI, Captain!: Steering your organisation in the waters of Gen AI by IMDA and FPF.” The highly-anticipated event provided participants with practical knowledge about AI governance at the organizational level.
The event was hosted by Josh Lee Kok Thong, Managing Director of FPF APAC, and was attended by around 200 representatives from industry, including data protection officers (DPOs) and chief technology officers (CTOs). FPF’s segment of the workshop had two parts: an informational segment featuring presentations from FPF and IMDA, followed by a multi-stakeholder, practice-focused panel discussion.
FPF at “AI, AI, Captain! – Steering your organisation in the waters of Gen AI by IMDA and FPF”, July 7, 2025.
1.1 AI governance in APAC is neither unguided nor ungoverned, as policymakers are actively working to develop both soft and hard regulations for AI and to clarify how existing data protection laws apply to its use.
Josh presented on global AI governance, highlighting the rapid legislative changes in the APAC region over the past six months, and comparing developments in South Korea, Japan, and Vietnam with those in the EU, US, and Latin America. He then discussed how data protection laws – especially provisions on consent, data subject rights, and breach management – impact AI governance and how data protection regulators in Japan, South Korea, and Hong Kong (among others) have provided guidance on this. Josh’s presentation was followed by one from Darshini Ramiah, Senior Manager of AI Governance and Safety at IMDA. Darshini provided an overview of Singapore’s approach to AI governance, which is built on three key pillars:
Creating practical tools, such as the AI Verify toolkit and Project Moonshot, which enable benchmarking of traditional AI systems and red teaming of large language models (LLMs), respectively;
Engaging closely with international partners, such as through the ASEAN Working Group on AI Governance and the publication of the AI Playbook for Small States under the Digital Forum of Small States; and
Collaborating with industry in the development of principles and tools around AI governance.
FPF presenting at “AI, AI, Captain! – Steering your organisation in the waters of Gen AI by IMDA and FPF”, July 7, 2025.
1.2 FPF moderated a panel session that focused on key aspects of AI governance and featured industry experts and regulators.
The panel session of the workshop, moderated by Josh, included the following experts:
Darshini Ramiah, Senior Manager, AI Governance and Safety at IMDA;
Derek Ho, Deputy Chief Privacy, AI and Data Responsibility Officer at Mastercard; and
Patrick Chua, Senior Principal Digital Strategist at Singapore Airlines (SIA).
The experts discussed AI governance from both an industry and regulatory perspective.
The panelists highlighted that AI governance is cross-functional and requires collaborative effort from the various teams in the organization to be successful.
One of the panelists suggested looking at the Principles, People, Process and Technology (“3Ps and a T”) when considering AI governance. The panelists agreed on the importance of clearly defining values that serve as a “North Star” to guide their organization’s cross-functional AI governance efforts and to build strong support from senior management for related initiatives.
For small and medium enterprises (SMEs), the panelists emphasized that a structured but scalable governance model could help SMEs to manage AI risk effectively. SMEs can start by referring to existing resources like IMDA’s guidelines, such as the Model AI Governance Framework.
Recognizing that many organizations in Singapore will be procuring ready-made AI solutions rather than developing their own models in-house, panelists highlighted the need for strong due diligence. This includes examining model cards, which disclose a model’s key metrics, securing contractual safeguards from third-party vendors, and deploying the technology in stages to further limit risk.
Singapore is also working to standardize AI transparency for industry. IMDA is exploring several areas, including the introduction of standardized disclosure formats for AI model developers, such as standardized model cards.
FPF moderating the panel session at “AI, AI, Captain! – Steering your organisation in the waters of Gen AI by IMDA and FPF”, July 7, 2025.
2. FPF facilitated deep conversations at PDPC’s PETs Summit, including on the use of PETs in cross-border data transfers and within SMEs.
2.1 FPF moderated a fireside chat on PETs use cases during the opening Plenary Session.
On Tuesday, July 8, 2025, FPF APAC participated in a day-long PETs Summit, organized by the PDPC and IMDA. During the opening plenary session, Josh moderated a fireside chat with Fabio Bruno, Assistant Director of Applied Innovation at INTERPOL, titled “Solving Big Problems with PETs.” Following panels that covered use cases for PETs and policies that could increase their adoption, this fireside chat looked at how PETs could present fresh solutions to long-standing data protection issues (such as cross-border data transfers).
In this regard, Fabio shared how law enforcement bodies around the world have been exploring PETs to streamline investigations. He highlighted ongoing exploration of certain PETs, such as zero-knowledge proofs (a cryptographic method that allows one party to prove to another party that a particular piece of information is true without revealing any additional information beyond the validity of the claim) and homomorphic encryption (a family of encryption schemes allowing for computations to be performed directly on encrypted data without having to first decrypt it). In a law enforcement context, these PETs enable preliminary validation that can help to reduce delays and lower the cost of investigations, while also helping to protect individuals’ privacy.
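To make the homomorphic-encryption concept above concrete, the following is a toy sketch of the additively homomorphic Paillier scheme in Python. The key sizes are deliberately tiny and insecure, purely for illustration; real deployments rely on vetted cryptographic libraries and much larger keys.

```python
import math
import random

def is_prime(n: int) -> bool:
    """Trial division; fine for the tiny toy primes used here."""
    if n < 2:
        return False
    return all(n % d for d in range(2, math.isqrt(n) + 1))

def next_prime(n: int) -> int:
    while not is_prime(n):
        n += 1
    return n

# Toy Paillier keypair: two ~16-bit primes (insecure, illustration only)
p = next_prime(random.randrange(1 << 15, 1 << 16))
q = next_prime(random.randrange(1 << 15, 1 << 16))
while q == p:
    q = next_prime(random.randrange(1 << 15, 1 << 16))

n, n2 = p * q, (p * q) ** 2
g = n + 1                                # standard generator choice
lam = math.lcm(p - 1, q - 1)             # private key
mu = pow((pow(g, lam, n2) - 1) // n, -1, n)

def encrypt(m: int) -> int:
    """Encrypt m (0 <= m < n) with fresh randomness r."""
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c: int) -> int:
    return ((pow(c, lam, n2) - 1) // n) * mu % n

# The homomorphic property: multiplying ciphertexts adds the plaintexts,
# so a third party can compute on data it cannot read.
c1, c2 = encrypt(41), encrypt(1)
assert decrypt((c1 * c2) % n2) == 41 + 1
```

This additive property is what lets one party aggregate or pre-screen encrypted records held by another, without either side revealing the underlying data.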
Notwithstanding the potential of PETs for cross-border data transfers (even for commercial, non-law enforcement contexts), challenges exist. These include: (1) enhancing and harmonizing the understanding and acceptability of PETs among data protection regulators globally; and (2) obtaining higher management support to invest in PETs. Nevertheless, the fireside chat concluded with optimism about the prospect of the greater use of PETs for data transfers, and left the audience with plenty of food for thought.
FPF moderating the fireside chat at PETs Summit Plenary Session, July 8, 2025
2.2 FPF Members facilitated an engaging PETs Deep Dive Session that explored business use cases for PETs.
After the plenary session, FPF APAC teammates Dominic Paulger, Sakshi Shivhare, and Bilal Mohamed facilitated a practical workshop titled the “PETs Deep Dive Session,” organized by the IMDA. Drawing on the IMDA’s draft PETs Adoption Guide, the workshop aimed to help Chief Data Officers, DPOs, and AI and data product teams understand which PETs best fit their business use cases.
FPF APAC Team at PETs Summit, July 8, 2025
3. On Wednesday, FPF joined a discussion at IAPP Asia Privacy Forum on how regulators and major tech companies in the APAC region are fostering “digital trust” in AI by aligning technology with societal expectations.
On Wednesday, July 9, 2025, FPF APAC participated in an IAPP Asia Privacy Forum panel titled “Building Digital Trust in AI: Perspectives from APAC.” Josh joined Lanah Kammourieh Donnelly, Global Head of Privacy Policy at Google, and Lee Wan Sie, Cluster Director for AI Governance and Safety at the IMDA, for a panel moderated by Justin B. Weiss, Senior Director at Crowell Global Advisors.
A key theme from the panel was that, given the opacity of many digital technologies, the concept of digital trust is essential to ensure that these technologies work in ways that protect important societal interests. Accordingly, the panel discussed strategies that could foster digital trust.
Wan Sie provided the regulator’s perspective and acknowledged that given the rapid pace of AI development, regulation would always be “playing catch-up.” Thus, instead of implementing a horizontal AI law, she shared how Singapore is focusing on making the industry more capable of using AI responsibly. Wan Sie pointed to AI Verify, Singapore’s AI governance testing framework and toolkit, and the IMDA’s new Global AI Assurance Sandbox, as mechanisms that help organizations demonstrate the trustworthiness of their AI systems to users.
Josh focused on trends from across the APAC region, sharing how regulators in Japan and South Korea have been actively considering amendments to their data protection laws to expand the legal bases for processing personal data, in order to facilitate greater availability of data for training high-quality AI systems.
Lanah highlighted Google’s approach of developing AI responsibly in accordance with certain core privacy values, such as those in the Fair Information Practice Principles (FIPPs). For example, she shared how Google is actively researching technological solutions like training its models on synthetic data instead of using publicly-available datasets from the Internet which may contain large amounts of personal data.
Overall, the panel noted that APAC is taking its own distinct approach to AI governance – one in which industry and regulators collaborate actively to ensure principled development of technology.
FPF and the “Building Digital Trust in AI: Perspectives from APAC” panel at IAPP, 9 July 2025.
4. On Thursday, FPF staff moderated two panels at IAPP AI Governance Global on cross-border data transfers and regulatory developments in Australia
4.1 While cross-border data transfers are fragmented and restrictive, there is cautious optimism that APAC will pursue interoperability.
On Thursday, July 10, 2025, FPF organized a panel titled “Shifting Sands: The Outlook for Cross Border Data Transfers in APAC,” which featured Emily Hancock, Vice President and Chief Privacy Officer at Cloudflare; Arianne Jimenez, Head of Privacy and Data Policy and Engagement for APAC at Meta; and Zee Kin Yeong, Chief Executive of the Singapore Academy of Law and FPF Senior Fellow. Moderated by Josh, the panel discussed evolving regulatory frameworks for cross-border data transfers in APAC.
The panel first observed that the landscape for cross-border data transfers across APAC remains fragmented. Emily elaborated that restrictions on data transfer were a global phenomenon and attributable to how data is increasingly viewed as a national security matter, making governments less willing to lower restrictions and pursue interoperability.
Despite this challenging landscape, the panel members were cautiously optimistic that transfer restrictions could be managed effectively. Zee Kin highlighted how the increasing integration of economies through supranational organizations like ASEAN is driving a push in APAC towards recognizing more business-friendly data transfer mechanisms, such as the ASEAN MCCs. He also noted that regulators often relax restrictions once local businesses start to expand operations overseas and need to transfer data across borders.
Arianne suggested that businesses communicate to regulators the challenges they face with restrictive data transfer frameworks. She acknowledged that SMEs are often not as well-resourced as multi-national corporations (MNCs) and thus face difficulties in navigating the complex patchwork of regulations across the region. She explained that since regulators in APAC are generally open to consultation, businesses should take the opportunity to advocate for more interoperability.
The panel concluded by highlighting the importance of data transfers to AI development. Cross-border data transfers are crucial to fostering diverse datasets, accessing advanced computing infrastructure, combating global cyber-threats by enabling worldwide threat sharing, and reducing the environmental impact by limiting the need for additional data centers. Overall, the panel expressed hope that despite the legal fragmentation and complicated state of play, the clear benefits of cross-border data transfers would encourage jurisdictions to pursue greater interoperability.
FPF and the “Shifting Sands: The Outlook for Cross Border Data Transfers in APAC” panel at IAPP July 10, 2025.
4.2 With updates to Australia’s Privacy Act, privacy is non-negotiable, and businesses can benefit from improving their privacy compliance processes and systems ahead of increased enforcement.
FPF’s APAC Deputy Director Dominic Paulger moderated a panel titled “Navigating the Impact of Australia’s Privacy Act Amendments in the Asia-Pacific.” The panelists included Dora Amoah, Global Privacy Office Lead at the Boeing Company, Rachel Baker, Senior Corporate Counsel for Privacy, JAPAC, at Salesforce, and Annelies Moens, the former Managing Director of Privcore. The panel discussed the enactment of the Privacy and Other Legislation Amendment Bill 2024 following a multiyear review of Australia’s Privacy Act, and the potential impact of these reforms on businesses.
Annelies shared an overview of the reforms, including:
new transparency requirements for automated decision-making (ADM);
revised cross-border data transfer mechanisms;
new enforcement powers for the Office of the Australian Information Commissioner (OAIC); and
a new statutory tort for serious invasions of privacy.
She mentioned that more changes could be coming, but some proposals – such as removing the small business exception – were facing resistance in Australia. However, irrespective of how the law develops, businesses can expect enforcement to increase.
The industry panelists shared their insights and experiences complying with the new amendments. Dora explained that despite the increased litigation risk from the new statutory tort for serious invasions of privacy, the threshold for liability was rather high, as the tort required intent. She also noted that companies could avoid liability by implementing proper processes that prevent intentional or reckless misconduct.
Rachel noted that the Privacy Act’s new ADM provisions would improve consumer rights in Australia. She observed how Australians have been facing serious privacy intrusions that have drawn the OAIC’s attention, such as the Cambridge Analytica scandal and the misuse of facial recognition technology. She considered that since data subjects in Australia are increasingly expecting more rights, such as the right to deletion, businesses should go beyond compliance and actively adopt best practices.
Overall, the panel expressed the view that with this new reality, the role of the privacy professional in Australia, much like the rest of the world, is evolving to not just interpret and comply with the law but also to build robust systems through privacy by design.
FPF and the panelists of “Navigating the Impact of Australia’s Privacy Act Amendments in the Asia-Pacific” at IAPP July 10, 2025.
5. FPF organized exclusive side events to foster deeper engagements with key stakeholders.
A key theme of FPF’s annual PDP Week experience has always been about bringing our global FPF community – members, fellows, and friends – together for deep and meaningful conversations about the latest developments. This year, FPF APAC organized two events for its members: a Privacy Leaders’ Luncheon (an annual staple), and for the first time, an India Luncheon co-organized alongside Khaitan & Co.
5.1 On July 8, 2025, FPF hosted an invite-only Privacy Leaders’ Luncheon.
This closed-door event provided a platform for senior stakeholders of FPF APAC to discuss pressing challenges at the intersection of AI and privacy, with a particular focus on the APAC region. During the session, the attendees discussed key topics such as the emerging developments in data protection laws, AI governance, and children’s privacy.
FPF’s Privacy Leaders Luncheon, July 8, 2025.
5.2 On July 10, FPF co-hosted an India Roundtable Luncheon with Khaitan & Co.
FPF APAC also collaborated with an Indian law firm, Khaitan & Co, to co-host a lunch roundtable focusing on pressing challenges in India, such as the development of implementing rules for the Digital Personal Data Protection Act, 2023 (DPDPA). The event brought together experts from both India and Singapore for fruitful discussions around the DPDPA and the draft Digital Personal Data Protection Rules. FPF APAC is grateful to have partnered with Khaitan & Co for the Luncheon, which saw active discussion amongst attendees on key issues in India’s emerging data protection regime.
FPF’s India Luncheon co-hosted with Khaitan & Co, July 10, 2025.
6. Conclusion
In all, it has been another deeply fruitful and meaningful year for FPF at Singapore’s PDP Week 2025. Through our panels, engagements, and curated roundtable sessions, FPF is proud to have been able to continue to drive thoughtful and earnest dialogue on data protection, AI, and responsible innovation across the APAC region. These engagements reflect our ongoing commitment to fostering greater collaboration and understanding among regulators, industry, academia, and civil society.
Looking ahead, FPF remains focused on shaping thoughtful approaches to privacy and emerging technologies. We are grateful for the continued support of the IMDA, IAPP, as well as our members, partners, and participants, who helped make these events a memorable success.
Balancing Innovation and Oversight: Regulatory Sandboxes as a Tool for AI Governance
Thanks to Marlene Smith for her research contributions.
As policymakers worldwide seek to support beneficial uses of artificial intelligence (AI), many are exploring the concept of “regulatory sandboxes.” Broadly speaking, regulatory sandboxes are legal oversight frameworks that offer participating organizations the opportunity to experiment with emerging technologies within a controlled environment, usually combining regulatory oversight with reduced enforcement. Sandboxes often encourage organizations to use real-world data in novel ways, with companies and regulators learning how new data practices are aligned – or misaligned – with existing governance frameworks. The lessons learned can inform future data practices and potential regulatory revisions.
In recent years, regulatory sandboxes have gained traction, in part due to a requirement under the EU AI Act that regulators in the European Union adopt national sandboxes for AI. Jurisdictions across the world, such as Brazil, France, Kenya, Singapore, and the United States (Utah) have introduced AI-focused regulatory sandboxes, offering current, real-life lessons for the role they can play in supporting beneficial use of AI while enhancing clarity about how legal frameworks apply to nascent AI technologies. More recently, in July 2025, the United States’ AI Action Plan recommended that federal agencies in the U.S. establish regulatory sandboxes or “AI Centers of Excellence” for organizations to “rapidly deploy and test AI tools while committing to open sharing of data and results.”
As AI systems grow more advanced and widespread, their complexity poses significant challenges for legal compliance and effective oversight. Regulatory sandboxes can potentially address these challenges. The probabilistic nature of advanced AI systems, especially generative AI, can make AI outputs less certain, and legal compliance therefore less predictable. Simultaneously, the rapid global expansion of AI technologies and the desire to “scale up” AI use within organizations has outpaced the development of traditional legal frameworks. Finally, the global regulatory landscape is increasingly fragmented, which can cause significant compliance burdens for organizations. Depending on how they are structured and implemented, regulatory sandboxes can address or mitigate some of these issues by providing a controlled and flexible environment for AI testing and experimentation, under the guidance and oversight of policymakers. This framework can help ensure responsible development, reduce legal uncertainty, and inform more adaptive and forward-looking AI regulations.
1. Key Characteristics of a Regulatory Sandbox
A regulatory sandbox is an adaptable framework that can allow organizations to test out innovative new products, services, or business models with reduced regulatory requirements. Typically supervised by a regulatory body, these “testbeds” encourage experimentation and innovation in a real-world setting while managing potential risks.
The concept of a regulatory sandbox was first introduced in the financial technology (fintech) sector, with the United Kingdom launching the first one in 2015. Since then, the concept has gained global traction, especially in sectors with rapid technological advancement, such as healthcare. According to a 2025 report by the Datasphere Initiative, there are over 60 sandboxes related to data, AI, or technology in the world. Of those, 31 are national sandboxes that focus on AI innovation, including areas such as machine learning, AI development, and data-driven solutions. Over a dozen sandboxes are currently in development and expected to launch in the coming years.
Generally, a regulatory sandbox includes the following characteristics:
Established by a legal authority: Regulatory sandboxes are typically established by a regulatory authority or a specific law (sometimes part of a broader law) that also provides limited waivers or protection against enforcement.
Regulatory oversight: Supervision often falls under an existing regulator, agency, or oversight body, usually the same one that is responsible for enforcement of the relevant sector or technology. At times, a supervisory body is expressly created for oversight purposes. The supervising body typically has some discretion as to how to implement its sandbox, such as the focus (technology or sector-specific), the vetting process, the number of accepted applicants, and evaluation metrics for success.
Application and selection: As part of the vetting process, participating organizations must explain to the regulatory body why they would be a good fit for that sandbox (e.g., establishing that they have sufficient technological maturity, operate in a sector of public interest, and are willing to share practical insights). In some sandboxes, priority is given to startups and small- and medium-sized enterprises (SMEs).
Cohorts and time limits: Organizations are usually grouped into small cohorts, often ranging from four to twenty organizations. These cohorts often focus on a specific technology (e.g., generative AI) or sector (e.g., healthcare). The sandbox usually has a defined testing time period, which could be as short as three months or as long as two years. During that period, there is regular engagement between the regulator and sandbox participants, although the cadence and scope depends on the sandbox and the supervisory body.
Post-sandbox reporting: At the end of the sandbox period, the supervisory body often compiles a report of best practices, lessons learned, technical guidance, and/or compliance tools. This report may be shared publicly, or shared only with government stakeholders, such as the national legislature or other agencies.
Depending on their design, regulatory sandboxes can offer a number of benefits to different stakeholders:
For regulators, sandboxes can encourage data-informed policy by raising concerns or opportunities that legislators can address in real-time during the legislative process. Agencies and other regulatory bodies can also build capacity as they work with industry to understand the latest developments in technology and how industry is using them. Sandboxes also help regulators develop best practices for applying their existing authorities to these organizations or sectors, especially since it might not be clear to regulators (and organizations) how new technologies or practices could interact with established laws.
For organizations (especially businesses), sandboxes can provide regulatory certainty, reduce time to market, foster knowledge sharing both with regulators and with other organizations, and allow organizations to position themselves as forerunners in AI development and governance. These benefits are particularly salient for startups and SMEs, who might not have the funds or capacity to ensure their organization is complying with complex regulations when those rules intersect with rapidly developing technologies.
For consumers and the public, sandboxes can provide assurance that participating AI services and products are tested under real-world conditions. As regulators publicly report on sandbox takeaways, both the public and private sector can learn from participants’ best practices.
2. Notable Jurisdictions with AI-Focused Regulatory Sandboxes
Across the globe, a growing number of governments are exploring AI-focused regulatory sandboxes. In the European Union, this growth has been partly driven by a requirement in the EU Artificial Intelligence Act (EU AI Act), passed in 2024 as part of the EU digital strategy. The EU AI Act requires all EU Member States to establish a national or regional regulatory sandbox for AI, with a particular emphasis on annual reporting, tailored training, and priority access for startups and SMEs. In doing so, Member States have taken a variety of different approaches in how they develop, structure, and implement regulatory sandboxes. Beyond the EU, global jurisdictions have similarly taken a broad range of approaches.
Among the approximately thirty jurisdictions with AI-related sandboxes, a few notable examples can offer a useful review of the landscape. In this section, we describe five jurisdictions from a cross-section of global geographies, representing a range of goals and legal approaches: Brazil, France, Kenya, Singapore, and the United States (Utah). Each offers unique lessons for the timing of sandboxes relative to regulation, regulatory requirements for participants, and policy goals.
Brazil: A Sandbox Launched Before Legislation
Brazil is one of the few countries that launched a national AI regulatory sandbox before enacting an AI law. Brazil’s sandbox focuses on machine learning-driven technologies, including generative AI, with the Brazilian Data Protection Authority (ANPD) overseeing selected projects alongside a variety of stakeholders, including academics and civil society organizations. In recent years, regulators have emphasized several goals for the sandbox, including nurturing innovation while implementing best practices “to ensure compliance with personal data protection rules and principles.” Brazil’s AI bill establishes sandboxes as a tool in its compliance regime: organizations that violate the proposed Act may be barred from participating in the AI sandbox program for up to five years.
France: An Annual Sandbox Focused on Specific Policy Issues
In France, the French Data Protection Authority (La Commission nationale de l’informatique et des libertés or CNIL), has run an annual regulatory sandbox for the last three years, with each year focused on a different national digital policy goal. This past year, the sandbox focused on “AI and public services,” exploring how AI can be responsibly deployed in sectors such as employment, utilities, and transportation. CNIL provided advice on issues such as automated decision-making, data minimization, and bias mitigation. This year, the sandbox will focus on the “silver [elderly] economy,” exploring AI solutions to support aging populations. Out of over fifteen applications, CNIL selected six projects, three of which include a data-sharing system to improve home care (O₂), an AI-based acoustic monitoring tool for care homes (OSO-AI), and a mobile app that tracks seniors’ autonomy and alerts families or caregivers (Neural Vision).
Kenya: Multiple Sandboxes to Address Different Markets
Kenya operates two regulatory sandboxes in AI: (1) the Communications Authority of Kenya (CA) oversees a sandbox that focuses on Information and Communications Technology (ICT), including e-learning and e-health platforms that deploy AI. Participants may be local or international, and must submit regular reports that detail performance indicators and other metrics; and (2) the Capital Markets Authority (CMA) oversees a second regulatory sandbox that focuses on innovative technologies in the finance and capital markets sector. Participants can receive feedback and guidance from the CMA and other stakeholders on AI products such as robo-advisory services, blockchain applications, and crowdfunding platforms.
Singapore: A Collaboration-Focused Sandbox Model
Singapore’s “Generative AI Evaluation Sandbox” brings together key stakeholders, including model developers, app deployers and third party “testers,” to evaluate generative AI products and develop common standardized evaluation approaches. Participants collaboratively assess generative AI technologies through an “Evaluation Catalogue,” which compiles common technical testing tools and recommends a baseline set of evaluation tests for generative AI products. The Generative AI Evaluation Sandbox is overseen by the Infocomm Media Development Authority (IMDA), a statutory board that regulates Singapore’s infocommunications, media and data sectors and oversees private-sector AI governance in Singapore, and the AI Verify Foundation, a not-for-profit subsidiary wholly owned by the IMDA that drives Singapore’s AI governance testing efforts, including an AI governance testing framework and toolkit. More recently, in July 2025, Singapore announced the launch of another sandbox, the “Global AI Assurance Sandbox,” to address agentic AI and risks such as data leakage and vulnerability to prompt injections.
Utah (United States): The First AI-Focused Regulatory Sandbox in the U.S.
In the United States, Utah is the first state to operate an AI-focused regulatory sandbox (although it may not be the last, with the enactment of the 2025 Texas Responsible AI Governance Act and Delaware’s House Joint Resolution 7). In 2024, Utah passed the Utah AI Policy Act (UAIP), which established the Office of Artificial Intelligence Policy to oversee the Utah AI laboratory program (AI Lab). Utah’s office has broad authority to grant entities up to two years of “regulatory mitigation” while they develop pilot AI programs and receive feedback from key stakeholders, including industry experts, academics, regulators, and community members. Mitigation measures include exemptions from applicable state regulations and laws, capped penalties for civil fines, and cure periods to address compliance issues. The AI Lab’s first half-year focused on mental health, and resulted in a bill that regulates AI mental health chatbot use (HB 452).
3. Policy Considerations for AI
Modern AI systems, particularly generative AI systems, can behave unpredictably or in ways that can be challenging to explain. This can lead to uncertain outcomes and make compliance with data protection laws, such as the California Consumer Privacy Act (CCPA) and the General Data Protection Regulation (GDPR), harder to assess before a system is deployed. Scalability is also a distinct issue for AI, as it presents both technical and legal hurdles, requiring organizations to manage evolving data, outdated models, and regulatory risks. Finally, the fragmented legal landscape for global AI regulation increases compliance burdens and uncertainty for organizations, especially for startups and SMEs. While regulatory sandboxes are not a panacea for AI governance, each of these issues can be potentially mitigated or addressed by sandboxes.
Machine Learning and Generative AI Can Create Unpredictable Results
As AI systems become increasingly advanced, they can present a challenge for legal compliance due to their lack of deterministic outcomes.1 Modern AI systems, particularly those powered by machine learning or transformer architecture, involve vast numbers of parameters and are trained on very large, sometimes poorly documented, datasets. When deployed in real-world settings, these systems can exhibit behaviors that are difficult to predict, explain, or control. This can include issues like data shifts (when training fails to produce a good model because the data or conditions do not match real-world examples) or underspecification (when models pass internal tests but fail to perform as well in the real world). This unpredictability can arise from many factors, including the scale and complexity of AI systems, reliance on opaque training data, and the accelerating pace of AI development. Generative AI, in particular, relies on transformer architecture that behaves probabilistically.
As a result, the non-deterministic nature of such AI systems can make it difficult to align them with existing legal frameworks and compliance obligations. For example, under CCPA, consumers have the right to know what personal information is collected and how it is used, and access, delete, or correct their personal information. Similarly, the GDPR provides individuals with rights regarding automated decision-making, including the right to an explanation of decisions made solely by automated processes. Under both CCPA and the GDPR, it can be difficult to apply rules that assume deterministic outcomes to AI-driven decisions because some AI results (outputs) can vary even with the same or similar inputs.
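The variability described above can be sketched in a few lines. The snippet below uses made-up token scores (they are not from any real model) to show how a language model converts scores into a probability distribution and then samples from it, so the same input can yield different outputs on different runs:

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert raw model scores into a probability distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical next-token scores a model might assign after the
# prompt "The applicant's request was ..." (illustrative values only).
tokens = ["approved", "denied", "deferred", "incomplete"]
logits = [2.1, 1.9, 0.7, 0.2]

probs = softmax(logits)

# Sampling from the distribution: identical input, varying output.
rng = random.Random()
draws = [rng.choices(tokens, weights=probs)[0] for _ in range(5)]
print(draws)  # e.g. a mix of 'approved' and 'denied' across runs
```

Because the final step is a random draw rather than an argmax, two identical requests can produce different answers, which is precisely the property that complicates compliance assessments premised on deterministic behavior.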
In the face of these challenges, regulatory sandboxes can offer a structured solution by allowing AI systems to be tested in real-world environments under regulatory supervision. This enables regulators to observe how AI behaves with unforeseen variables and to identify and address those risks early; it also provides information the organization can use to update or iterate its model. For example, in France, CNIL worked with the company France Travail as part of their 2024 regulatory sandbox program to assess how its generative AI tool for jobseekers could provide effective results while ensuring adherence to GDPR’s data minimization principles. Because the tool is based on a large language model (LLM), it includes the inherent risk of generating results that are unpredictable or challenging to explain. Following the sandbox program, CNIL issued recommendations for generative AI systems to implement “harmonized and standardized prompts” directing users to enter as little personal data as possible, and filters to block terms related to sensitive personal data. Through this iterative process, France’s regulators were able to refine their legal approach to a complex emerging technology, while organizations (including France Travail) were able to benefit from early guidance, increasing legal certainty and reducing the likelihood of harmful outcomes or regulatory violations.
AI Scalability Poses Technical and Legal Challenges
AI scalability, or expanding the use of AI technologies to match the pace of business demand, has emerged as both a driver of innovation and a costly business challenge. Organizations must navigate a range of technical issues, such as evolving complex data sets, obsolete models, and security issues, which can delay product delivery timelines or result in financial penalties for non-compliance with an applicable law. Beyond the technical issues, scaling AI also requires the organization to regularly review and maintain internal standards for security, legal and regulatory compliance, and ethics.
By participating in a regulatory sandbox, organizations can address these challenges and stay aligned with the global patchwork of AI governance through the opportunity to test AI products with regular oversight, minimizing the risks of market delays, product recalls, or regulatory fines. Kenya is an example of how many organizations and governments seek to harness AI’s potential with the specific goal of enabling scalability. The Kenya National Artificial Intelligence Strategy 2025-2030 seeks to align its policy ambitions with broader digital policy trends across sub-Saharan Africa and beyond, while staying grounded in local data and market ecosystems. Kenya’s two AI sandboxes reflect its desire to take advantage of domestic priority AI markets and global trends in AI scalability.
The AI Regulatory Landscape Continues to Rapidly Evolve
Global AI regulation is constantly evolving, with jurisdictions taking diverse approaches that reflect different regions’ unique priorities and challenges. In Europe, the EU AI Act has multiple compliance deadlines through 2030; African countries are testing a phased implementation approach to AI; Latin America is launching a variety of strategies and sandboxes; and in the Asia-Pacific region, several key jurisdictions have adopted regulatory frameworks that are generally limited to voluntary ethical principles and guidelines.
In the United States, the absence of a comprehensive federal AI or privacy framework has led to a patchwork of state-level efforts. In 2024, nearly 700 AI or AI-adjacent bills were introduced in state legislatures. These efforts vary widely in scope and focus. Some states have proposed relatively broad laws aimed at consumer protection and high-impact areas, while others have proposed more targeted rules or sector-specific regulation (e.g., legislation that would protect children, regulate AI hiring tools, or address deepfakes).
As a result, navigating the evolving landscape without regulatory certainty has become a practical challenge for organizations. Innovation typically outpaces law, and as differing legal standards emerge and evolve, organizations must navigate conflicting or overlapping requirements. This can increase compliance costs and delay product development, especially in situations where regulations remain ambiguous or are still under consideration. Startups and SMEs are particularly impacted by compliance costs, as they may not have the financial support or infrastructure to weather a long period of legal uncertainty.
Depending on the relevant jurisdiction, regulatory sandboxes can offer greater legal certainty by providing a degree of legal immunity for liability or penalties, similar to a “safe harbor.” In doing so, they can reduce time to market and reduce costs associated with uncertainty. Some jurisdictions, such as France (under the EU AI Act), explicitly require sandboxes to support and accelerate market access for SMEs and start-ups.
In many cases, a sandbox can lead to stronger relationships between lawmakers and other stakeholders, and an opportunity for experts to shape policymaking directly while organizations await regulatory guidance. For example, Utah’s sandbox, the “AI Lab,” focused on mental health in its first year, and state legislators subsequently passed a law that regulates mental health AI chatbots in Utah. In a similar vein, Brazil launched a national AI regulatory sandbox before enacting an AI law, and findings from the sandbox could inform the final version of legislation. Many other sandboxes, most notably in Singapore, take a “light touch” approach that prioritizes iterative guidance, rather than hard law.
At the same time, regulatory sandboxes can offer legal protections only within their own jurisdictional scope of authority. As a result, sandboxes may vary in their practical ability to offer legal certainty. In other words, a company that receives a regulatory waiver from laws in one jurisdiction (such as Utah) is not protected against liability arising under other jurisdictions (such as California, federal, or global laws). As a result, regulator collaboration across jurisdictions can have significant impact, with many opportunities for legal reciprocity and knowledge sharing.
4. Looking Ahead
The use of regulatory sandboxes continues to expand as global policymakers recognize their value in fostering innovation while ensuring responsible AI governance. Just recently, in July 2025, Singapore launched a new sandbox to address emerging challenges in AI, including the deployment of AI agents. Lessons learned from each of these five jurisdictions showcase that sandboxes can stimulate AI development, enhance consumer protections, and help regulators develop more effective policies.
As policymakers consider different approaches to regulating AI, it is crucial to integrate the lessons learned from these sandboxes. By offering flexible regulatory frameworks that prioritize real-world testing, multi-stakeholder cooperation, and iterative feedback, sandboxes can help balance the need for AI innovation with safeguarding the public interest.
These non-deterministic outcomes, in which an AI system produces different results under the same conditions, make it difficult to assign responsibility when AI-driven decisions lead to unintended results. By way of illustration, a deterministic AI would make the same chess move every time, given the same board setup, whereas a probabilistic (or non-deterministic) model would learn from previous experiences and adapt its move accordingly. ↩︎
Practical Takeaways from FPF’s Privacy Enhancing Technologies Workshop
In April, the Future of Privacy Forum and the Mozilla Foundation hosted an all-day workshop with technology, legal, and policy experts to explore Privacy Enhancing Technologies (PETs). During the workshop, multiple companies presented technologies they developed and implemented to preserve individuals’ privacy. In addition, the participants discussed steps for broadening the adoption of these technologies and their intersection with data protection laws.
Mastercard’s Chief Privacy Officer, Caroline Louveaux, presented the first PET: a privacy-preserving technology tested in a new cross-border fraud detection system. Louveaux explained how the system employs Fully Homomorphic Encryption (FHE), a technique that enables analysis of encrypted data, and the participants discussed the privacy benefits and broader compliance advantages this technique offers.
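Production FHE schemes are mathematically involved, but the core idea, performing computation directly on ciphertexts so that the decrypted result matches the computation on the plaintexts, can be illustrated with the much simpler Paillier scheme (which is only partially homomorphic, supporting addition). The sketch below uses deliberately insecure toy primes and has no connection to Mastercard's actual system:

```python
import math
import random

# Toy Paillier key generation (tiny primes; insecure, illustration only).
p, q = 17, 19
n = p * q                      # public modulus
n_sq = n * n
g = n + 1                      # standard choice of generator
lam = math.lcm(p - 1, q - 1)   # private key
mu = pow(lam, -1, n)           # precomputed inverse used in decryption

def encrypt(m):
    """Encrypt plaintext m (0 <= m < n) with fresh randomness r."""
    r = random.randrange(2, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(2, n)
    return (pow(g, m, n_sq) * pow(r, n, n_sq)) % n_sq

def decrypt(c):
    """Recover m as L(c^lam mod n^2) * mu mod n, where L(x) = (x - 1) / n."""
    x = pow(c, lam, n_sq)
    return ((x - 1) // n * mu) % n

# Homomorphic property: multiplying ciphertexts adds the plaintexts.
c1, c2 = encrypt(20), encrypt(22)
c_sum = (c1 * c2) % n_sq
print(decrypt(c_sum))  # 42, computed without ever decrypting c1 or c2
```

A third party holding only the public key could perform the ciphertext multiplication, which is the essence of analyzing data (here, summing values) without seeing it.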
The second PET presentation by Robert Pisarczyk, CEO and Co-Founder of Oblivious, was an overview of how Oblivious implemented a privacy-preserving technology in partnership with an insurance company to tackle a common tension between data privacy and utility. The companies applied differential privacy techniques to retain information from personal data while complying with legal requirements to delete it. By anonymizing data before deletion, differential privacy allows businesses to generate summaries, trends, and patterns that do not compromise individual privacy. The participants discussed this new technique through the lens of data deletion, and whether differential privacy meets the requirements for it under existing data protection laws like the GDPR.
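As a rough sketch of the general technique discussed above (not Oblivious's actual implementation), differential privacy releases an aggregate statistic after adding calibrated noise, so the summary can be retained even after the underlying records are deleted. The values below are hypothetical:

```python
import random

def dp_mean(values, lower, upper, epsilon):
    """Differentially private mean via the Laplace mechanism.

    Values are clipped to [lower, upper]; the mean of n such values
    changes by at most (upper - lower) / n when one record changes,
    so Laplace noise of scale sensitivity / epsilon masks any single
    individual's contribution.
    """
    clipped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clipped) / len(clipped)
    scale = (upper - lower) / (len(clipped) * epsilon)
    # Laplace(0, scale) sampled as the difference of two exponentials.
    noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return true_mean + noise

# Hypothetical claim amounts; the released figure preserves the trend
# without revealing (or requiring retention of) any individual record.
claims = [1200, 950, 4300, 780, 2100, 1650, 3900, 880]
print(round(dp_mean(claims, lower=0, upper=5000, epsilon=1.0), 2))
```

Smaller epsilon values add more noise and give stronger privacy; the choice of epsilon is exactly the kind of policy question the workshop participants debated.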
Common themes that arose during the workshop included:
PETs may assist companies with broader financial regulation compliance in addition to data protection regulations;
The lack of clear regulatory guidance forces organizations to rely on best practices, trust, and adherence to local laws between/across jurisdictions;
No single PET or technology can address every risk or threat; in many cases, multiple PETs and privacy protections are beneficial and can help reduce risk;
Synthetic data and regulatory sandboxes were essential for PETs’ development and proof of concept; and
Companies need more incentives to develop and implement PETs, especially given their high cost and regulatory uncertainty.
The Research Coordination Network (RCN) for Privacy-Preserving Data Sharing and Analytics is supported by the U.S. National Science Foundation under Award #2413978 and the U.S. Department of Energy, Office of Science under Award #DE-SC0024884.
Data-Driven Pricing: Key Technologies, Business Practices, and Policy Implications
In the U.S., state lawmakers are seeking to regulate various pricing strategies that fall under the umbrella of “data-driven pricing”: practices that use personal and/or non-personal data to continuously inform decisions about the prices and products offered to consumers. Using a variety of terms—including “surveillance,” “algorithmic,” and “personalized” pricing—legislators are targeting a range of practices that often look different from one another, and carry different benefits and risks. Generally speaking, these practices fall under one of four categories:
Reward or loyalty program: A company offers a discount, reward, or other incentive to repeat customers who sign up for the program. In return, the company receives additional customer data.
Dynamic pricing: Rapidly changing the price of a particular product or service based on real-time analysis of market conditions and consumer behavior.
Consumer segmentation or profiling: A profile is created for a customer based on their personal data, including behavior and/or characteristics, and they are placed within a particular audience segment. Based on the profile or segment, they receive particular advertisements, prices, or promotions.
Search or product ranking: Altering the order in which search results or products appear, to give more prominence to certain results, based on general consumer data or specific customer behavioral data. This could potentially include changing the prominence of given products based on their price.
This resource distinguishes between these different pricing strategies in order to help lawmakers, businesses, and consumers better understand how these different practices work.
Tech to Support Older Adults and Caregivers: Five Privacy Questions for Age Tech
Introduction
As the U.S. population ages, technologies that can help support older adults are becoming increasingly important. These tools, often called “AgeTech”, exist at the intersection of health data, consumer technology, caregiving relationships, and, increasingly, artificial intelligence, and they are drawing significant investment. Hundreds of well-funded start-ups have launched. Many are of major interest to governments, advocates for aging populations, and researchers who are concerned about the impact on the U.S. economy when a smaller workforce supports a large aging population.
AgeTech may include everything from fall detection wearables and remote vital sign monitors to AI-enabled chatbots and behavioral nudging systems. These technologies promise greater independence for older adults, reduced burden on caregivers, and more continuous, personalized care. But that promise brings significant risks, especially when these tools operate outside traditional health privacy laws like HIPAA and instead fall under a shifting mix of consumer privacy regimes and emerging AI-specific regulations.
A recent review by FPF of 50 AgeTech products reveals a market increasingly defined by data-driven insights, AI-enhanced functionality, and personalization at scale. Yet despite the sophistication of the technology, privacy protections remain patchy and difficult to navigate. Many tools were not designed with older adults or caregiving relationships in mind, and few provide clear information about how AI is used or how sensitive personal data feeds into machine learning systems.
Without frameworks for trustworthiness and subsequent trust from older adults and caregivers, the gap between innovation and accountability will continue to grow, placing both individuals and companies at risk. Further, low trust may result in barriers to adoption at a time when these technologies are urgently needed as the aging population grows and care shortages continue.
A Snapshot of the AgeTech Landscape
AgeTech is being deployed across both consumer and clinical settings, with tools designed to serve four dominant purposes:
Health Monitoring or managing a specific health condition such as dementias, heart conditions, or other long-term conditions.
Remote Monitoring which may include location-tracking by a family member or other caregiver, bodily mobility monitoring by a provider, or other general monitoring not related to a specific condition or diagnosis.
Daily Task Support including appointments, medication adherence, meals, and errands.
Emergency Use such as fall detection and prevention, emergency communications, and alerts.
Clinical applications are typically focused on enabling real-time oversight and remote data collection, while consumer-facing products are aimed at supporting safety, independence, and quality of life at home. Regardless of setting, these tools increasingly rely on combinations of sensors, mobile apps, GPS, microphones, and notably, AI used for everything from fall detection and cognitive assistance to mood analysis and smart home adaptation.
AI is becoming central to how AgeTech tools operate and how they’re marketed. But explainability remains a challenge and disclosures around AI use can be vague or missing altogether. Users may not be told when AI is interpreting their voice, gestures, or behavior, let alone whether their data is used to refine predictive models or personalize future content.
For tools that feel clinical but aren’t covered by HIPAA, this creates significant confusion and risk. A proliferation of consumer privacy laws, particularly emerging state-level privacy laws with health provisions, is starting to fill the gap, leading to complex and fragmented privacy policies. For all stakeholders seeking to improve and support aging through AI and other technologies, harmonious policy-based and technical privacy protections are essential.
AgeTech Data is Likely in Scope of Many States’ Privacy Laws
Compounding the issue is the reality that these tools often fall into regulatory gray zones. If a product isn’t offered by a HIPAA-covered entity or used in a reimbursed clinical service, it may not be protected under federal health privacy law at all. Instead, protections depend on the state where a user lives, or whether the product falls under one of a growing number of state-level privacy laws or consumer health privacy laws.
Laws like New York’s S929/NY HIPA, which remains in legislative limbo, reflect growing state interest in regulating sensitive and consumer health data that would likely be collected by AgeTech devices and apps. These laws are a step toward closing a gap in privacy protections, but they’re not consistent. Some focus narrowly on specific types of health data, individually or in tandem with AI or other technologies: for example, mental health chatbots (Utah HB452), reproductive health data (Virginia SB754), or AI disclosures in clinical settings (California AB3030). Other bills and laws have broad definitions that include location, movement, and voice data, all common types of data in our survey of AgeTech. Regulatory obligations may vary not just by product type, but by geography, payment model (where insurance may cover a product or service), and user relationship.
Consent + Policy is Key to AgeTech Growth and Adoption
In many cases, it is not the older adult but a caregiver, whether a family member, home health aide, or neighbor, who initiates AgeTech use and agrees to data practices. These caregiving relationships are diverse, fluid, and often informal. Yet most technologies assume a static one-to-one dynamic and offer few options for nuanced role-based access or changing consent over time.
For this reason, AgeTech is a good example of why consent should not be the sole pillar of data privacy. While important, relying on individual permissions can obscure the need for deeper infrastructure and policy solutions that relieve consent burdens while ensuring privacy. What is needed are devices and services that align privacy protections with contextual uses and create pathways for evidence-based, science-backed innovation benefiting older adults and their care communities.
Five Key Questions for AgeTech Privacy
To navigate this complexity and build toward better, more trustworthy systems, privacy professionals and policymakers can start by asking the following key questions:
Is the technology designed to reflect caregiving realities?
Caregiving relationships are rarely linear. Tools must accommodate shared access, changing roles, and the reality that caregivers may support multiple people, or that multiple people may support the same individual. Regulatory standards should reflect this complexity, and product designs should allow for flexible access controls that align with real-world caregiving.
Does the regulatory classification reflect the sensitivity of the data, not just who offers the tool?
Whether a fall alert app is delivered through a clinical care plan or bought directly by a consumer, it often collects the same data and has the same impact on a person’s autonomy. Laws should apply based on function and risk. Laws should also consider the context and use of data in addition to sensitivity. Emerging state laws are beginning to take this approach, but more consistent federal leadership is needed.
Are data practices accessible, not just technically disclosed?
Especially in aging populations, accessibility is not just about font size; it’s about cognitive load, clarity of language, and decision-making support. Tools should offer layered notices, explain settings in plain language, and support revisiting choices as health or relationships change. Future legislation could require transparency standards tailored to vulnerable populations and caregiving scenarios.
Does the technology reinforce autonomy and dignity?
The test for responsible AgeTech is not just whether it works, but whether it respects. Does the tool allow older adults to make choices about their data, even when care is shared or delegated? Can those preferences evolve over time? Does it reinforce the user’s role as the central decision-maker, or subtly replace their agency with automation?
If a product uses or integrates AI, is it clearly indicated if and how data is used for AI?
AI is powering an increasing share of AgeTech’s functionality—but many tools don’t disclose whether data is used to train algorithms, personalize recommendations, or drive automated decisions. Privacy professionals should ask: Is AI use clearly labeled and explained to users? Are there options to opt out of certain AI-driven features? Is sensitive data (e.g., voice, movement, mood) being reused for model improvement or inference? In a rapidly advancing field, transparency is essential for building trustworthy AI.
A Legislative and Technological Path Forward
Privacy professionals are well-positioned to guide both product development and policy advocacy. As AgeTech becomes more central to how we deliver and experience care, the goal should not be to retrofit consumer tools into healthcare settings without safeguards. Instead, we need to modernize privacy frameworks to reflect the reality that sensitive, life-impacting technologies now exist outside the clinic.
This will require:
Consistent legislation that protects sensitive data regardless of who collects it;
Design standards that account for caregiving dynamics and decision-making capacity;
AI frameworks, crafted in collaboration with stakeholders, that foster transparency and safety;
Infrastructure for shared access and evolving consent, not just checkbox compliance.
The future of aging with dignity will be shaped by whether we can build privacy into the systems that support it. That means moving beyond consent and toward real protections, at the policy level, in the technology stack, and in everyday relationships that make care possible.
Nature of Data in Pre-Trained Large Language Models
The following is a guest post to the FPF blog by Yeong Zee Kin, the Chief Executive of the Singapore Academy of Law and FPF Senior Fellow. The guest blog reflects the opinion of the author only. Guest blog posts do not necessarily reflect the views of FPF.
The phenomenon of memorisation has fomented significant debate over whether Large Language Models (LLMs) store copies of the data that they are trained on.1 In copyright circles, this has led to lawsuits such as the one by the New York Times against OpenAI that alleges that ChatGPT will reproduce NYT articles nearly verbatim.2 In the privacy space, meanwhile, much ink has been spilt over the question of whether LLMs store personal data.
This blog post commences with an overview of what happens to data that is processed during LLM training3: first, how data is tokenised, and second, how the model learns and embeds contextual information within the neural network. Next, it discusses how LLMs store data and contextual information differently from classical information storage and retrieval systems, and examines the legal implications that arise from this. Thereafter, it attempts to demystify the phenomenon of memorisation, to gain a better understanding of why partial regurgitation occurs. This blog post concludes with some suggestions on how LLMs can be used in AI systems for fluency, while highlighting the importance of providing grounding and the safeguards that can be considered when personal data is processed.
While this is not a technical paper, it aims to be sufficiently technical so as to provide an accurate description of the relevant internal components of LLMs and an explanation of how model training changes them. By demystifying how data is stored and processed by LLMs, this blog post aims to provide guidance on where technical measures can be most effectively applied in order to address personal data protection risks.
What are the components of a Large Language Model?
LLMs are causal language models that are optimised for predicting the next word based on previous words.4 An LLM comprises a parameter file, a runtime script and configuration files.5 The LLM’s algorithm resides in the script, which is a relatively small component of the LLM.6 Configuration and parameter files are essentially text files (i.e. data).7 Parameters are the learned weights and biases,8 expressed as numerical values, that are crucial for the model’s prediction: they represent the LLM’s pre-trained state.9 In combination, the parameter file, runtime script and configuration files form a neural network.
There are two essential stages to model training. The first stage is tokenisation. This is when training data is broken down into smaller units (i.e. segmented) and converted into tokens. For now, think of each token as representing a word (we will discuss subword tokenisation later). Each token is assigned a unique ID. The mapping of each token to its unique ID is stored in a lookup table, which is referred to as the LLM’s vocabulary. The vocabulary is one of the LLM’s configuration files. The vocabulary plays an important role during inference: it is used to encode input text for processing and decode output sequences back into human-readable text (i.e. the generated response).
Figure 1. Sample vocabulary list from GPT-Legal; each token is associated with an ID (the vocabulary size of GPT-Legal is 128,256 tokens).
The next stage is embedding. This is a mathematical process that distills contextual information about each token (i.e. word) from the training data and encodes it into a numerical representation known as a vector. A vector is created for each token: this is known as the token vector. During LLM training, the mathematical representations of tokens (their vectors) are refined as the LLM learns from the training data. When LLM training is completed, token vectors are stored in the trained model. The mapping of the unique ID and token vector is stored in the parameter file as an embedding matrix. Token vectors are used by LLMs during inference to create the initial input vector that is fed through the neural network.
Figure 2. Sample embedding matrix from GPT-Legal: each row is one token vector, each value is one dimension (GPT-Legal has 128,256 token vectors, each with 4,096 dimensions)
LLMs are neural networks that may be visualised as layers of nodes with connections between them.10 Adjustments to embeddings also take place in the neural network during LLM training. Model training adjusts the weights and biases of the connections between these nodes, which changes how input vectors are transformed as they pass through the layers of the neural network during inference. The result is an output vector that the LLM uses to compute a probability score for each potential token that may follow. The LLM uses these probability scores to select the next token through various sampling methods.11 This is how LLMs predict the next token when generating responses.
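The pipeline just described can be sketched in miniature. The vocabulary, embedding values, and single weight layer below are made up for illustration (a real LLM has tens of thousands of tokens, thousands of dimensions, and many layers), but the flow, embedding lookup, transformation, softmax over the vocabulary, then sampling, is the same:

```python
import math
import random

# Toy vocabulary with 3-dimensional token vectors (illustrative values).
vocab = ["the", "court", "held", "dismissed"]
embeddings = {
    "the":       [0.1, 0.3, -0.2],
    "court":     [0.7, -0.1, 0.4],
    "held":      [0.2, 0.6, 0.1],
    "dismissed": [-0.3, 0.5, 0.6],
}

# Learned weights of a single layer, standing in for the whole network;
# one row per vocabulary token.
weights = [[0.5, -0.2, 0.1],
           [0.3, 0.8, -0.4],
           [-0.1, 0.4, 0.9],
           [0.6, 0.2, 0.3]]

def next_token_probs(token):
    """Transform a token's vector into a probability per vocabulary entry."""
    v = embeddings[token]
    logits = [sum(w * x for w, x in zip(row, v)) for row in weights]
    m = max(logits)  # stabilise the exponentials
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = next_token_probs("court")
for tok, p in zip(vocab, probs):
    print(f"{tok}: {p:.3f}")

# Finally, the model samples the next token from this distribution.
print(random.choices(vocab, weights=probs)[0])
```

Note that everything up to the sampling step is deterministic arithmetic on stored numbers; no text is retrieved from anywhere, which foreshadows the discussion of how LLM storage differs from classical retrieval systems.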
In the following sections, we dive deeper into each of these stages to better understand how data is processed and stored in the LLM.
Stage 1: Tokenisation of training data
During the tokenisation stage, text is converted into tokens. This is done algorithmically by applying the chosen tokenisation technique. There are different methods of tokenisation, each with its benefits and limitations. Depending on the tokenisation method used, each token may represent a word or a subword (i.e. segments of the word).
The method that is commonly used in LLMs is subword tokenisation.12 It provides benefits over word-level tokenisation, such as a smaller vocabulary, which can lead to more efficient training.13 Subword tokenisation analyses the training corpus to identify subword units based on the frequency with which a set of characters occurs. For example, “pseudonymisation” may be broken up into “pseudonym” and “isation”; while, “reacting” may be broken up into “re”, “act” and “ing”. Each subword forms its own token.
Taking this approach results in a smaller vocabulary since common prefixes (e.g. “re”) and suffixes (e.g. “isation” and “ing”) have their own tokens that can be re-used in combination with other stem words (e.g. combining with “mind” to form “remind” and “minding”). This improves efficiency during model training and inference. Subword tokens may also contain white space or punctuation marks. This enables the LLM to learn patterns, such as which subwords are usually prefixes, which are usually suffixes, and how frequently certain words are used at the start or end of a sentence.
Subword tokenisation also enables the LLM to handle out-of-vocabulary (OOV) words. This happens when the LLM is provided with a word during inference that it did not encounter during training. By segmenting the new word into subwords, there is a higher chance that the subwords of the OOV word are found in its vocabulary. Each subword token is assigned a unique ID. The mapping of a token with its unique ID is stored in a lookup table in a configuration file, known as the vocabulary, which is a crucial component of the LLM. It should be noted that this is the only place within the LLM where human-readable text appears. The LLM uses the unique ID of the token in all its processing.
The training data is encoded by replacing subwords with their unique ID before processing.14 This process of converting the original text into a sequence of IDs corresponding to tokens is referred to as tokenisation. During inference, input text is also tokenised for processing. It is only at the decoding stage that human-readable words are formed when the output sequence is decoded by replacing token IDs with the matching subwords in order to generate a human-readable response.
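The encoding and decoding steps can be illustrated with a toy vocabulary (the subwords and IDs here are assumed for illustration, not learnt from a corpus):

```python
# Hypothetical subword vocabulary: the lookup table mapping token -> unique ID.
vocab = {"re": 0, "act": 1, "ing": 2, "mind": 3, "pseudonym": 4, "isation": 5}
id_to_token = {i: t for t, i in vocab.items()}

def encode(subwords):
    # Tokenisation: replace each subword with its unique ID.
    return [vocab[s] for s in subwords]

def decode(ids):
    # Decoding: replace token IDs with the matching subwords to form text.
    return "".join(id_to_token[i] for i in ids)

ids = encode(["re", "act", "ing"])
text = decode(ids)
```

Real tokenisers such as Byte Pair Encoding learn the segmentation itself from the training corpus; this sketch assumes the subwords are already given.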
Stage 2: Embedding contextual information
Complex contextual information can be reflected as patterns in high-dimensional vectors. The greater the complexity, the higher the number of features that are needed. These are reflected as parameters of the high-dimensional vectors. Conversely, low-dimensional vectors contain fewer features and have lower representational capacity.
The embedding stage of LLM training captures the complexities of semantics and syntax as high-dimensional vectors. The semantic meaning of words, phrases and sentences and the syntactic rules of grammar and sentence structure are converted into numbers. These are reflected as values in a string of parameters that form part of the vector. In this way, the semantic meaning of words and relevant syntactic rules are embedded in the vector: i.e. embeddings.
During LLM training, a token vector is created for each token. The token vector is adjusted to reflect contextual information about the token as the LLM learns from the training corpus. With each iteration of LLM training, the LLM learns about the relationships of the token, e.g. where it appears and how it relates to the tokens before and after. In order to embed all this contextual information, the token vector has a large number of parameters, i.e. it is a high-dimensional vector. At the end of LLM training, the token vector is fixed and stored in the pre-trained model. Specifically, the mapping of unique ID and token vector is stored as an embedding matrix in the parameter file.
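A minimal sketch of an embedding matrix and the ID-to-vector lookup follows. The sizes are toy values for illustration; production models use vocabularies of tens of thousands of tokens and vectors with thousands of dimensions, and the values below are random rather than learnt.

```python
import random

random.seed(42)
vocab_size, dim = 6, 8   # toy sizes for illustration only

# One row of parameters per token ID. In a real model, training would adjust
# these values so that each row embeds contextual information about its token.
embedding_matrix = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(vocab_size)]

def embed(token_ids):
    # Look up the fixed token vector for each unique ID.
    return [embedding_matrix[i] for i in token_ids]

vectors = embed([0, 1, 2])   # three high-dimensional vectors, one per token
```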
Model training also embeds contextual information in the layers of the neural network by adjusting the connections between nodes. As the LLM learns from the training corpus during model training, the weights of connections between nodes are modified. These adjustments encode patterns from the training corpus that reflect the semantic meaning of words and the syntactic rules governing their usage.15 Training may also increase or decrease the biases of nodes. Adjustments to model weights and bias affect how input vectors are transformed as they pass through the layers of the neural network. These are reflected in the model’s parameters. Thus, contextual information is also embedded in the layers of the neural network during LLM training. Contextual embeddings form the deeper layers of the neural network.
Contextual embeddings increase or decrease the likelihood that one token will follow another when the LLM is generating a response. During inference, the LLM converts the input text into tokens and looks up the corresponding token vector from its embedding matrix. The model also generates contextual representations that capture how the token relates to other tokens in the sequence. Next, the LLM creates an input vector by combining the static token vector and the contextual vector. As input vectors pass through the neural network, they are transformed by the contextual embeddings in its deeper layers. Output vectors are used by the LLM to compute probability scores for the tokens, which reflect the likelihood that one subword (i.e. token) will follow another. LLMs generate responses using the computed probability scores. For instance, based on these probabilities, it is more likely that the subword that follows “re” is going to be “mind” or “turn” (since “remind” and “return” are common words), less likely to be “purpose” (unless the training dataset contains significant technical documents where “repurpose” is used); and extremely unlikely to be “step” (since “restep” is not a recognised word).
Thus, LLMs capture the probabilistic relationships between tokens based on patterns in the training data and as influenced by training hyperparameters. LLMs do not store the entire phrase or textual string that was processed during the training phase in the same way that this would be stored in a spreadsheet, database or document repository. While LLMs do not store specific phrases or strings, they are able to generalise and create new combinations based on the patterns they have learnt from the training corpus.
2. Do LLMs store personal data?
Personal data is information about an individual who can be identified or is identifiable from the information on its own (i.e. direct) or in combination with other accessible information (i.e. indirect).16 From this definition, several pertinent characteristics of personal data may be identified. First, personal data is information in the sense that it is a collection of several datapoints. Second, that collection must be associated with an individual. Third, that individual must be identifiable from the collection of datapoints alone or in combination with other accessible information. This section examines whether data that is stored in LLMs retains these qualities.
An LLM does not store personal data in the way that a spreadsheet, database or document repository stores personal data. Billing and shipping information about a customer may be stored as a row in a spreadsheet; the employment details, leave records, and performance records of an employee may be stored as records in the tables of a relational database; and the detailed curriculum vitae of prospective, current and past employees may be contained in separate documents stored in a document repository. In these information storage and retrieval systems, personal data is stored intact and its association with the individual is preserved: the record may also be retrieved in its entirety or partially. In other words, each collection of datapoints about an individual is stored as a separate record; and if the same datapoint is common to multiple records, it appears in each of those records.17
Additionally, information storage and retrieval systems are designed to allow structured queries to select and retrieve specific records, either partially or in their entirety. The integrity of storage and retrieval underpins data protection obligations such as accuracy and data security (to prevent unauthorised alteration or deletion), and data subject rights such as correction and erasure.
For the purpose of this discussion, imagine that the training dataset comprises billing and shipping records that contain names, addresses and contact information such as email addresses and telephone numbers. During training, subword tokens are created from names in the training corpus. These may be used in combination to form names and may also be used to form email addresses (since many people use a variation of their names for their email address) and possibly even street names (since streets are often named after famous individuals). The LLM is able to generate billing and shipping information that conforms to the expected patterns, but the information will likely be incorrect or fictitious. This explains the phenomenon of hallucinations.
During LLM training, personal data is segmented into subwords during tokenisation. This adaptation or alteration of personal data amounts to processing, which is why a legal basis must be identified for model training. The focus of this discussion is the nature of the tokens and embeddings that are stored within the LLM after model training: are they still in the nature of personal data? The first observation that may be made is that many words that make up names (or other personal information) may be segmented into subwords. For example, “Edward” may not be stored in the vocabulary as is but segmented into the subwords “ed” and “ward”. Both these subwords can be used during decoding to form other words, such as “edit” and “forward”. This example shows how a word that started as part of a name (i.e. personal data), after segmentation, produces subwords that can be reused to form other types of words (some of which may be personal data, some of which may not be personal data).
Next, while the vocabulary may contain words that correspond to names or other types of identifiers, the way they are stored in the lookup table as discrete tokens removes the quality of identification from the word. A lookup table is essentially that: a table. It may be sorted in alphanumeric or chronological order (e.g. recent entries are appended to the end of the table). The vocabulary stores datapoints but not the association between datapoints that enables them to form a collection which can relate to an identifiable individual. By way of illustration, having the word “Coleman” in the vocabulary as a token is neither here nor there, since it could equally be the name of Hong Kong’s highest-ranked male tennis player (Coleman Wong) or the street where the Singapore Academy of Law is located (Coleman Street). The vocabulary does not store any association of this word to either Coleman Wong (as part of his name) or to the Chief Executive of the Singapore Academy of Law (as part of his office address).
Furthermore, subword tokenisation enables a token to be used in multiple combinations during inference. Keeping with this illustration, the token “Coleman” may be used in combination with either “Wong” or “Street” when the LLM is generating a response. The LLM does not store “Coleman Wong” as a name or “Coleman Street” as a street name. The association of datapoints to form a collection is not stored. What the LLM stores are learned patterns about how words and phrases typically appear together, based on what it observed in the training data. Hence, if there are many persons named “Coleman” in the training dataset but with different surnames, and no one else whose address is “Coleman Street”, then the LLM is likely to predict a different word after “Coleman” during inference.
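This point can be made concrete with a deliberately crude bigram sketch. The counts below are invented for illustration, and a real LLM learns far richer patterns in its neural network; but the principle is the same: what survives training is co-occurrence statistics, not records linking a token to any individual.

```python
# Hypothetical co-occurrence counts learnt from a toy training corpus in which
# several different people are named "Coleman" and one address is "Coleman Street".
bigram_counts = {
    ("coleman", "tan"): 5,
    ("coleman", "wong"): 2,
    ("coleman", "street"): 1,
}

def next_token_probs(prev):
    # Probability of each continuation, derived purely from frequency counts.
    follows = {b: c for (a, b), c in bigram_counts.items() if a == prev}
    total = sum(follows.values())
    return {tok: c / total for tok, c in follows.items()}

probs = next_token_probs("coleman")   # "tan" is the most probable continuation here
```

Notice that nothing in this structure records that “Coleman Wong” is a person or that “Coleman Street” is an address; only the relative frequencies of the pairings are retained.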
Thus, LLMs do not store personal data in the same manner as traditional information storage and retrieval systems; more importantly, they are not designed to enable query and retrieval of personal data. To be clear, personal data in the training corpus is processed during tokenisation. Hence, a legal basis must be identified for model training. However, model training does not learn the associations of datapoints inter se nor the collection of datapoints with an identifiable individual, such that the data that is ultimately stored in the LLM loses the quality of personal data.18
3. What about memorisation?
A discussion of how LLMs store and reproduce data is incomplete without a discussion of the phenomenon of memorisation. This is a characteristic of LLMs that reflects the patterns of words that are found in sufficiently large quantities in the training corpus. When certain combinations of words or phrases appear consistently and frequently in the training corpus, the probability of predicting that combination of words or phrases increases.
Memorisation in LLMs is closely related to two key machine learning concepts: bias and overfitting. Bias occurs when training data overrepresents certain patterns, causing models to develop a tendency toward reproducing those specific sequences. Overfitting occurs when a model learns training examples too precisely, including noise and specific details, rather than learning generalisable patterns. Both phenomena exacerbate memorisation of training data, particularly personal information that appears frequently in the dataset. For example, Lee Kuan Yew was Singapore’s first prime minister post-Independence and a figure of significant global influence; he lived at 38 Oxley Road. LLMs trained on a corpus of data from the Internet would have learnt this. Hence, ChatGPT is able to produce a response (without searching the Web) about who he is and where he lived. It is able to reproduce (as opposed to retrieve) personal data about him because it appeared in the training corpus in significant volume. Because this sequence of words appeared so frequently in the training corpus, when the LLM is given the sequence of words “Lee Kuan”, the probability of predicting “Yew” is significantly higher than that of any other word; and in the context of the name and address of Singapore’s first prime minister, the probability of predicting “Lee Kuan Yew” and “38 Oxley Road” is significantly higher than any alternative.
This explains the phenomenon of memorisation. Memorisation occurs when the LLM learns frequent patterns and reproduces closely related datapoints. It should be highlighted that this reproduction is probabilistic. This is not the same as query and retrieval of data stored as records in deterministic information systems.
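The frequency effect behind memorisation can be illustrated with a toy corpus (the 98:2 split below is invented for illustration):

```python
from collections import Counter

# Toy corpus: one continuation of "lee kuan" overwhelmingly dominates.
corpus = ["lee kuan yew"] * 98 + ["lee kuan wong"] * 2
continuations = Counter(phrase.split()[2] for phrase in corpus)
total = sum(continuations.values())

p_yew = continuations["yew"] / total    # dominant pairing: effectively memorised
p_wong = continuations["wong"] / total  # rare pairing: seldom predicted
```

When one continuation dominates the counts in this way, the model will almost always reproduce that pairing, even though the prediction remains probabilistic rather than a retrieval of a stored record.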
The first observation to be made is that whilst this is acceptable for famous figures, the same cannot be said for private individuals. Knowing that this phenomenon reflects the training corpus, the obvious thing to avoid is the use of personal data for training of LLMs. This exhortation applies equally to developers of pre-trained LLMs and deployers who may fine-tune LLMs or engage in other forms of post-training, such as reinforcement learning. There are ample good practices for this. Techniques may be applied to the training corpus before model training to remove, reduce or hide personal data: e.g. pseudonymisation (to de-identify individuals in the training corpus), data minimisation (to exclude unnecessary personal data) and differential privacy (adding random noise to obfuscate personal data). When inclusion of personal data in the training corpus is unavoidable, there are mitigatory techniques that can be applied to the trained model.
One such example is machine unlearning, a technique currently under active research and development, that has the potential of removing the influence of specific data points from the trained model. This technique may be applied to reduce the risk of reproducing personal data.
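Of the corpus-level techniques mentioned earlier, pseudonymisation can be sketched as follows. The regex patterns here are simplistic assumptions for illustration; a production pipeline would use far more robust detection of personal data.

```python
import hashlib
import re

# Illustrative patterns for two common identifier types.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s-]{7,}\d")

def pseudonymise(text):
    # Replace each identifier with a stable placeholder derived from its hash,
    # so the same identifier always maps to the same pseudonym.
    def repl(match):
        digest = hashlib.sha256(match.group().encode()).hexdigest()[:8]
        return f"[REDACTED-{digest}]"
    return EMAIL.sub(repl, PHONE.sub(repl, text))

clean = pseudonymise("Contact Jane at jane.doe@example.com or +65 9123 4567.")
```

Note that hashing identifiers in this way is pseudonymisation, not anonymisation: the mapping is stable, and re-identification may remain possible with additional information.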
Another observation that may be made is that the reproduction of personal data is not verbatim but paraphrased, hence it is also referred to as partial regurgitation. This underscores the fact that the LLM does not store the associations between datapoints necessary to make them a collection of information about an individual. Even if personal data is reproduced, it is because of the high probability scores for that combination of words, and not the output of a query and retrieval function. Paraphrasing may introduce distortions or inaccuracies when reproducing personal data, such as variations in job titles or appointments. Reproduction is also inconsistent and oftentimes incomplete. This is unsurprising, since the predictions are probabilistic after all.
Finally, it bears reiterating that personal data is not stored as is but segmented into subwords, and reproduction of personal data is probabilistic, with no absolute guarantee that a collection of datapoints about an individual will always be reproduced completely or accurately. Thus, reproduction is not the same as retrieval. Parenthetically, it may also be reasoned that if the tokens and embeddings do not possess the quality of personal data, their combination during inference is processing of data, just not processing of personal data. Be that as it may, the risk of reproducing personal data, however incomplete and inaccurate, can and must still be addressed. Technical measures such as output filters can be implemented as part of the AI system. These are directed at the responses generated by the model and not the model itself.
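A minimal output filter might look like the following. The patterns are illustrative assumptions only; real data loss prevention tooling is considerably more sophisticated.

```python
import re

# Applied to the generated response (not the model itself): spans that look
# like personal data are withheld before the response is released.
FILTERS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),   # email addresses
    re.compile(r"\b[STFG]\d{7}[A-Z]\b"),      # Singapore NRIC/FIN-style IDs
]

def filter_output(response):
    for pattern in FILTERS:
        response = pattern.sub("[WITHHELD]", response)
    return response

safe = filter_output("Her ID is S1234567D and her email is a@b.com.")
```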
4. How should we use LLMs to process personal data?
LLMs are not designed or intended to store and retrieve personal data in the same way that traditional information storage and retrieval systems are; but they can be used to process personal data. In AI systems, LLMs provide fluency during the generation of responses. LLMs can incorporate personal data in their responses when personal data is provided, e.g., personal data provided as part of user prompts, or when user prompts cause the LLM to reproduce personal data as part of the generated response.
When LLMs are provided with user prompts that include reference documents that provide grounding for the generated response, the documents may also contain personal data. For example, a prompt to generate a curriculum vitae (CV) in a certain format may contain a copy of an outdated resume, a link to a more recent online bio and a template the LLM is to follow when generating the CV. The LLM can be constrained by well-written prompts to generate an updated CV using the personal information provided and formatted in accordance with the template. In this example, the personal data that the LLM uses will likely be from the sources that have been provided by the user and not from the LLM’s vocabulary.
Further, the LLM will paraphrase the information in the CV that it generates. The randomness of the predicted text is controlled by adjusting the temperature of the LLM. A higher temperature setting will increase the chance that a lower probability token will be selected as the prediction, thereby increasing the creativity (or randomness) of the generated response. Even at its lowest temperature setting, the LLM may introduce mistakes by paraphrasing job titles and appointments or combining information from different work experiences. These errors occur because the LLM generates text based on learned probabilities rather than factual accuracy. For this reason, it is important to vet and correct generated responses, even if proper grounding has been provided.
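The effect of temperature on the probability distribution can be sketched as follows (the logit values are invented for illustration):

```python
import math

def temperature_probs(logits, temperature=1.0):
    # Divide each logit by the temperature before softmax: a low temperature
    # sharpens the distribution (near-greedy selection), while a high
    # temperature flattens it, giving low-probability tokens a better chance.
    scaled = {t: v / temperature for t, v in logits.items()}
    m = max(scaled.values())
    exps = {t: math.exp(v - m) for t, v in scaled.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

logits = {"mind": 4.0, "turn": 3.8, "purpose": 1.0}
cold = temperature_probs(logits, temperature=0.2)   # "mind" dominates
hot = temperature_probs(logits, temperature=2.0)    # probabilities far more even
```

Even the "cold" distribution remains probabilistic, which is why vetting generated responses remains necessary at any temperature setting.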
A more systematic way of providing grounding is through Retrieval Augmented Generation (RAG) whereby the LLM is deployed in an AI system that includes a trusted source, such as a knowledge management repository. When a query is provided, it is processed by the AI system’s embedding model which converts the entire query into an embedding vector that captures its semantic meaning. This embedding vector is used to conduct a semantic search. This works by identifying embeddings in the vector database (i.e. a database containing document embeddings precomputed from the trusted source) that have the closest proximity (e.g. via Euclidean or cosine distance).19 These distance metrics measure how similar the semantic meanings are. Embeddings that are close together (e.g. nearest neighbour) are said to be semantically similar.20 Semantically similar passages are retrieved from the repository and appended to the prompt that is sent to the LLM for the generation of a response. The AI system may generate multiple responses and select the most relevant one based on either semantic similarity to the query or in accordance with a re-ranking mechanism (e.g. heuristics to improve alignment with intended task).
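The semantic search step can be sketched with cosine similarity over a toy "vector database". The three-dimensional embeddings and document names below are invented for illustration; real embedding models produce vectors with hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: closer to 1 means more similar.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical precomputed document embeddings from the trusted source.
docs = {
    "leave_policy": [0.9, 0.1, 0.0],
    "expense_rules": [0.1, 0.8, 0.2],
    "office_map": [0.0, 0.2, 0.9],
}

def retrieve(query_vec, k=1):
    # Rank stored embeddings by similarity to the query embedding and return
    # the k nearest passages, to be appended to the prompt as grounding.
    ranked = sorted(docs, key=lambda d: cosine_similarity(query_vec, docs[d]), reverse=True)
    return ranked[:k]

top = retrieve([0.85, 0.15, 0.05])   # nearest neighbour to the query embedding
```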
5. Concluding remarks
LLMs are not designed to store and retrieve information (including personal data). From the foregoing discussion, it may be said that LLMs do not store personal data in the same manner as information storage and retrieval systems. Data stored in the LLM’s vocabulary do not retain the relationships necessary for the retrieval of personal data completely or accurately. The contextual information embedded in the token vectors and neural network reflects patterns in the training corpus. Given how tokens are stored and re-used, the contextual embeddings are not intended to provide the ability to store the relationships between datapoints such that the collection of datapoints is able to describe an identifiable individual.
By acquiring a better understanding of how LLMs store and process data, we are able to design better trust and safety guardrails in the AI systems that they are deployed in. LLMs play an important role in providing fluency during inference, but they are not intended to perform query and retrieval functions. These functions are performed by other components of the AI system, such as the vector database or knowledge management repository in a RAG implementation.
Knowing this, we can focus our attention on those areas that are most efficacious in preventing the unintended reproduction of personal data in generated responses. During model development, steps may be taken to address the risk of the reproduction of personal data. These are steps for developers who undertake post-training, such as fine-tuning and reinforcement learning.
(a) First, technical measures may be applied to the training corpus to remove, minimise, or obfuscate personal data. This reduces the risk of the LLM memorising personal data.
(b) Second, new techniques like model unlearning may be applied to reduce the influence of specific data points when the trained model generates a response.
When deploying LLMs in AI systems, steps may also be taken to protect personal data. The measures are very dependent on intended use cases of the AI system and the assessed risks. Crucially, these are measures that are within the ken of most deployers of LLMs (by contrast, a very small number of deployers will have the technical wherewithal to modify LLMs directly through post-training).
(a) First, remove or reduce personal data from trusted sources if personal data is unnecessary for the intended use case. Good data privacy practices such as pseudonymisation and data minimisation should be observed.
(b) Second, if personal data is necessary, store and retrieve them from trusted sources. Use information storage and retrieval systems that are designed to preserve the confidentiality, integrity and accuracy of stored information. Personal data from trusted sources can thus be provided as grounding for prompts to the LLM.
(c) Third, consider implementing data loss prevention measures in the AI system. For example, prompt filtering reduces the risk of including unauthorised personal data in user prompts. Likewise, output filtering reduces the risk of unintended reproduction of personal data in responses generated by the AI system.
Taking a holistic approach enables deployers to introduce appropriate levels of safeguards to reduce the risks of unintended reproduction of personal data.21
Memorisation is often also known as partial regurgitation, which does not require verbatim reproduction; regurgitation, on the other hand, refers to the phenomenon of LLMs reproducing verbatim excerpts of text from their training data. ↩︎
The Times Sues OpenAI and Microsoft Over A.I. Use of Copyrighted Work (27 Dec 2023) New York Times; see also, Audrey Hope “NYT v. OpenAI: The Times’s About-Face” (10 April 2024) Harvard Law Review. ↩︎
This paper deals with the processing of text for training LLMs. It does not deal with other types of foundation models, such as multimodal models that can handle text as well as images and audio. ↩︎
LLM model packages contain different components depending on their intended use. Inference models like ChatGPT are optimized for real-time conversation and typically share only the trained weights, tokenizer, and basic configuration files—while keeping proprietary training data, fine-tuning processes, system prompts, and foundation models private. In contrast, open source research models like LLaMA 2 often include comprehensive documentation about training datasets, evaluation metrics, reproducibility details, complete model weights, architecture specifications, and may release their foundation models for further development, though the raw training data itself is rarely distributed due to size and licensing constraints. See, e.g., https://huggingface.co/docs/hub/en/model-cards (accessed 26 June 2025). ↩︎
Configuration files are usually stored as readable text files, while parameter files are stored in compressed binary formats to save space and improve processing speed. ↩︎
An LLM that is ready for developers to use for inference is referred to as pre-trained. Developers may deploy the pre-trained LLM as is, or they may undertake further training using their private datasets. An example of such post-training is fine-tuning. ↩︎
LLMs are made up of the parameter file, runtime script and configuration files which together form a neural network: supra, fn 5 and the discussion in the accompanying main text. ↩︎
While it could pick the token with the highest probability score, this would produce repetitive, deterministic outputs. Instead, modern LLMs typically use techniques like temperature scaling or top-p sampling to introduce controlled randomness, resulting in more diverse and natural responses. ↩︎
Yekun Chai, et al, “Tokenization Falling Short: On Subword Robustness in Large Language Models” arXiv:2406.11687, section 2.1. ↩︎
Word-level tokenisation results in a large vocabulary as every word stemming from a root word is treated as a separate word (e.g. consider, considering, consideration). It also has difficulties handling languages that do not use white spaces to establish word boundaries (e.g. Chinese, Korean, Japanese) or languages that use compound words (e.g. German). ↩︎
WordPiece and Byte Pair Encoding are two common techniques used for subword tokenisation. ↩︎
To be clear, the LLM learns relationships and not explicit semantics or syntax. ↩︎
Definition of personal data in Singapore’s Personal Data Protection Act 2012, s 2 and UK GDPR, s 4(1). ↩︎
Depending on the information storage and retrieval system used, common data points could be stored as multiple copies (eg XML database) or in a code list (eg, spreadsheet or relational database). ↩︎
Note from the editor: This statement should be read primarily within the framework of Singapore’s Personal Data Protection Act. ↩︎
Masked language models (eg, BERT) are used for this, as these models are optimised to capture the semantic meaning of words and sentences better (but not textual generation). Masked language models enable semantic searches. ↩︎
The choice of distance metric can affect the results of the search. ↩︎
This paper benefited from reviewers who commented on earlier drafts. I wish to thank Pavandip Singh Wasan, Prof Lam Kwok Yan, Dr Ong Chen Hui and Rob van Eijk for their technical insight and very instructive comments; and Ms Chua Ying Hong, Jeffrey Lim and Dr Gabriela Zanfir-Fortuna for their very helpful suggestions. ↩︎