Consent for Processing Personal Data in the Age of AI: Key Updates Across Asia-Pacific

This Issue Brief summarizes key developments in data protection laws across the Asia-Pacific region since 2022, when the Future of Privacy Forum (FPF) and the Asian Business Law Institute (ABLI) published a series of reports examining 14 jurisdictions in the region. We found that while many offer alternative legal bases for data processing, consent remains the most widely used, often due to its familiarity, despite known limitations.

This Issue Brief provides an updated view of evolving consent requirements and alternative legal bases for data processing across key APAC jurisdictions: India, Vietnam, Indonesia, the Philippines, South Korea, and Malaysia.

In August 2023, India passed the Digital Personal Data Protection Act (DPDPA). Once in force, the DPDPA will provide a comprehensive framework for processing personal data. It affirms consent as the primary basis for processing but introduces structured obligations around notice, purpose limitation, and consent withdrawal, while enabling future flexibility for alternative legal bases.

Vietnam’s Decree on Personal Data Protection took effect in July 2023. It sets clearer standards for consent while formally recognizing alternative legal bases, including for contractual necessity and legal obligations. This marks a key step in broadening lawful processing options for businesses.

Indonesia’s Personal Data Protection Law (PDPL), enacted in October 2022, introduces a unified national privacy law with an extended transition period. It affirms consent but also allows processing based on legitimate interest, public duties, and contract performance, bringing Indonesia closer to global privacy frameworks.

In November 2023, the Philippines’ National Privacy Commission issued a Circular on Consent, clarifying valid consent standards and promoting transparency. The guidance aims to reduce consent fatigue by encouraging layered, contextual consent interfaces and outlines when consent may not be strictly necessary.

South Korea’s amended PIPA (in force since September 2023) and related guidelines promote easy-to-understand consent practices and recognize additional legal grounds, especially in the context of AI. A 2025 bill is under consideration to expand the use of non-consent bases for AI-related processing.

The Personal Data Protection (Amendment) Act 2024, published in October 2024, introduces stronger enforcement tools and administrative penalties in Malaysia. While the amendments do not change the legal bases for processing, they enhance the compliance environment and signal stricter oversight.

The Issue Brief also explores how the rise of AI is shaping lawmaking and policymaking across the region with respect to lawful grounds for processing personal data.

As the APAC region shifts from fragmented, sector-specific rules to unified legal frameworks, understanding the evolving role of consent and the growing adoption of alternative legal bases is essential. From improving user-friendly consent mechanisms to strengthening enforcement and expanding lawful processing grounds, these changes highlight a more flexible and accountable approach to data protection across the region.

The Curse of Dimensionality: De-identification Challenges in the Sharing of Highly Dimensional Datasets

The 2006 release by AOL of search queries linked to individual users and the re-identification of some of those users is one of the best known privacy disasters in internet history. Less well known is that AOL had released the data to meet intense demand from academic researchers who saw this valuable data set as essential to understanding a wide range of human behavior. 

As the executive appointed as AOL’s first Chief Privacy Officer as part of a strategy to help prevent further privacy lapses, I made the benefits as well as the risks of sharing data a priority in my work. At FPF, our teams have worked on every aspect of enabling privacy-safe data sharing for research and social utility, including de-identification1, the ethics of data sharing, privacy-enhancing technologies2 and more3. Despite the skepticism of critics who maintain that reliable de-identification is a myth4, I maintain that it is hard, but for many data sets it is feasible, with the application of significant technical, legal and organizational controls. However, for highly dimensional data sets, or complex data sets that are made public or shared with multiple parties, providing strong guarantees at scale or without extensive impact on utility is far less feasible.

1. Introduction

The Value and Risk of Search Query Data

Search query logs constitute an unparalleled repository of collective human interest, intent, behavior, and knowledge-seeking activities. As one of the most common activities on the web, searching generates data streams that paint intimate portraits of individual lives, revealing interests, needs, concerns, and plans over time5. This data holds immense potential value for a wide range of applications, including improving search relevance and functionality, understanding societal trends, advancing scientific research (e.g., in public health surveillance or social sciences), developing new products and services, and fueling the digital advertising ecosystem. 

However, the very richness that makes search data valuable also makes it exceptionally sensitive and fraught with privacy risks. Search queries frequently contain explicit personal information such as names, addresses, phone numbers, or passwords, often entered inadvertently by users. Beyond direct identifiers, queries are laden with quasi-identifiers (QIs): pieces of information that, while not identifying in isolation, can be combined with other data points or external information to single out individuals. These can include searches related to specific locations, niche hobbies, medical conditions, product interests, or unique combinations of terms searched over time. Furthermore, the integration of search engines with advertising networks, user accounts, and other online services creates opportunities for linking search behavior with other extensive user profiles, amplifying the potential for privacy intrusions. The longitudinal nature of search logs, capturing behavior over extended periods, adds another layer of sensitivity, as sequences of queries can reveal evolving life circumstances, intentions, and vulnerabilities. The database reconstruction theorem, referred to as the fundamental law of information reconstruction, posits that publishing too much data derived from a confidential data source, at too high a degree of accuracy, will, after a finite number of queries, certainly result in the re-identification of the confidential data6. Extensive and extended releases of search data are a model example of this problem.

The De-identification Imperative and Its Inherent Challenges

Faced with the dual imperatives of leveraging valuable data and protecting user privacy, organizations rely heavily on data de-identification. De-identification encompasses a range of techniques aimed at removing or obscuring identifying information from datasets, thereby reducing the risk that the data can be linked back to specific individuals. The goal is to enable data analysis, research, and sharing while mitigating privacy harms and complying with legal and ethical obligations.

Despite its widespread use and appeal, de-identification is far from a perfected solution. Decades of research and numerous real-world incidents have demonstrated that supposedly “de-identified” or “anonymized” data have been re-identified, sometimes with surprising ease. This re-identification potential stems from several factors: the residual information left in the data after processing, the increasing availability of external datasets (auxiliary information) that can be linked to the de-identified data, and the continuous development of sophisticated analytical techniques. In some of these cases, a more rigorous de-identification process could have provided more effective protections, albeit with an impact on the availability of the data needed. In other cases, the impact of the re-identification might “only” be a threat to public figures7. In my experience, expert technical and legal teams can collaborate to support reasonable de-identification efforts for data that is well structured or closely held, but for complex, high-dimensional datasets or data shared broadly, the risks multiply.

Furthermore, the terminology itself is fraught with ambiguity. “De-identification” is often used as a catch-all term, but it can range from simple masking of direct identifiers (which offers weak protection) to more rigorous attempts at achieving true anonymity, where the risk of re-identification is negligible. This ambiguity can foster a false sense of security, as techniques that merely remove names or obvious identifiers have too often been labeled as “de-identified” while still leaving individuals vulnerable. Achieving a state where individuals genuinely cannot be reasonably identified is significantly harder, especially given the inherent trade-off between privacy protection and data utility: more aggressive de-identification techniques reduce re-identification risk but also diminish the data’s value for analysis. The concept of true, irreversible anonymization, where re-identification is effectively impossible, represents a high standard that is particularly challenging to meet for rich behavioral datasets, especially when data is shared with additional parties or made public. For more limited data sets that can be kept private and secure, or shared with extensive controls and legal and technical oversight, effective de-identification that maintains utility while reasonably managing risk can be feasible. This gap between the promise of de-identification and the persistent reality of re-identification risk for rich data sets that are shared lies at the heart of the privacy challenges discussed in this article.

Report Objectives and Structure

This article provides an analysis of the challenges associated with de-identifying massive datasets of search queries. It aims to review the technical, practical, legal, and ethical complexities involved. The analysis will cover:

  1. General De-identification Concepts and Techniques: Defining the spectrum of data protection methods and outlining common technical approaches.
  2. Unique Characteristics of Search Data: Examining the properties of search logs (dimensionality, sparsity, embedded identifiers, longitudinal nature) that make de-identification particularly difficult.
  3. The Re-identification Threat: Reviewing the mechanisms of re-identification attacks and landmark case studies (AOL, Netflix, etc.) where de-identification failed.
  4. Limitations of Techniques: Assessing the vulnerabilities and shortcomings of various de-identification methods when applied to search data.
  5. Harms and Ethics: Identifying the potential negative consequences of re-identification and exploring the ethical considerations surrounding user expectations, transparency, and consent.

The report concludes by synthesizing these findings to summarize the core privacy challenges, risks, and ongoing debates surrounding the de-identification of massive search query datasets.

2. Understanding Data De-identification

To analyze the challenges of de-identifying search queries, it is essential first to establish a clear understanding of the terminology and techniques involved in de-identification. The landscape includes various related but distinct concepts, each carrying different technical implications and legal weight.

Defining the Spectrum: De-identification, Anonymization, Pseudonymization8

The terms used to describe processes that reduce the linkability of data to individuals are often employed inconsistently, leading to confusion. 

Key De-identification Techniques and Mechanisms

A variety of techniques can be employed, often in combination, to achieve different levels of de-identification or anonymization. Each has distinct mechanisms, strengths, and weaknesses:

The following table provides a comparative overview of these techniques:

Table 1: Comparison of Common De-identification Techniques

| Technique Name | Mechanism Description | Primary Goal | Key Strengths | Key Weaknesses/Limitations | Applicability to Search Logs |
| --- | --- | --- | --- | --- | --- |
| Suppression/Redaction | Remove specific values or records | Remove specific identifiers/sensitive data | Simple; Effective for targeted removal | High utility loss if applied broadly; Doesn’t address linkage via remaining data | Low (Insufficient alone; high utility loss for QIs) |
| Masking | Obscure parts of data values (e.g., XXXX) | Obscure direct identifiers | Simple; Preserves format | Limited privacy protection; Can reduce utility; Hard for free text | Low (Insufficient for QIs in queries) |
| Generalization | Replace specific values with broader categories | Reduce identifiability via QIs | Basis for k-anonymity | Significant utility loss, especially in high dimensions (“curse of dimensionality”) | Low (Requires extreme generalization, destroying query meaning) |
| Aggregation | Combine data into summary statistics | Hide individual records | Simple; Useful for high-level trends | Loses individual detail; Vulnerable to differencing attacks; Low utility for user-level analysis | Low (Loses essential query sequence/context) |
| Noise Addition | Add random values to data/results | Obscure true values; Enable DP | Basis for DP; Provable guarantees possible | Reduces accuracy/utility; Requires careful calibration | Low (Core of DP, but utility trade-off is key challenge; application to non-numeric fields like query text uncertain) |
| Swapping | Exchange values between records | Preserve aggregates while perturbing records | Maintains marginal distributions | Introduces record-level inaccuracies; Complex implementation; Limited privacy guarantee | Low (Disrupts relationships within user history) |
| Hashing (Salted) | Apply one-way function with unique salt per record | Create non-reversible identifiers | Can prevent simple lookups if salted properly | Vulnerable if salt/key compromised; Doesn’t prevent linkage if hash is used as QI | Low (Hash of query text loses semantics; Hash of user ID is just pseudonymization) |
| Pseudonymization | Replace identifiers with artificial codes | Allow tracking/linking without direct IDs | Enables longitudinal analysis; Reversible | Still personal data; High risk of pseudonym reversal/linkage; QIs remaining in data set create major risks | Low (Allows user tracking, but privacy relies on pseudonym security/unlinkability) |
| k-Anonymity | Ensure record indistinguishable among k based on QIs | Prevent linkage via QIs | Intuitive concept | Fails in high dimensions; High utility loss; Vulnerable to homogeneity/background attacks; Not compositional | Medium (Impractical due to data characteristics) |
| l-Diversity / t-Closeness | k-Anonymity variants adding sensitive attribute constraints | Prevent attribute disclosure within k-groups | Stronger attribute protection than k-anonymity | Inherits k-anonymity issues; Adds complexity; Further utility reduction | Low (Impractical due to k-anonymity’s base failure) |
| Differential Privacy (DP) | Mathematical framework limiting inference about individuals via noise | Provable privacy guarantee against inference/linkage | Strongest theoretical guarantees; Composable; Robust to auxiliary info | Utility/accuracy trade-off; Implementation complexity; Can be hard for complex queries | Low (Theoretically strongest, but practical utility for granular search data is a major hurdle) |
| Synthetic Data | Generate artificial data mimicking original statistics | Provide utility without real records | Can avoid direct disclosure of real data | Hard to ensure utility & privacy simultaneously; Risk of memorization/inference if model overfits; Bias amplification | Medium (Promising, but technically demanding for complex behavioral data like search; research still early) |
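To make the noise-addition and differential privacy rows more concrete, the following is a minimal sketch of the Laplace mechanism applied to a single count. It assumes each user contributes at most once to the count (sensitivity 1); the count of 1200 and the epsilon values are hypothetical, and the helper names are illustrative rather than taken from any particular library.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_count(true_count, epsilon, rng):
    """Release a count with Laplace noise calibrated to sensitivity 1."""
    # Assumption: adding or removing one user changes the count by at most 1,
    # so the noise scale is 1 / epsilon.
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(0)  # fixed seed only so the sketch is repeatable
true_count = 1200       # hypothetical: users who issued a given query

for epsilon in (0.1, 1.0):
    releases = [noisy_count(true_count, epsilon, rng) for _ in range(5)]
    print(epsilon, [round(r, 1) for r in releases])
```

A smaller epsilon gives stronger protection but noisier releases, which is the utility trade-off the table refers to. The sketch covers one numeric statistic only; it does not address free-text query fields.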

3. The Unique Nature and Privacy Sensitivity of Search Query Data

Search query data possesses several intrinsic characteristics that make it particularly challenging to de-identify effectively while preserving its analytical value. These properties distinguish it from simpler, structured datasets often considered in introductory anonymization examples.

High Dimensionality, Sparsity, and the “Curse of Dimensionality”

Search logs are inherently high-dimensional datasets. Each interaction potentially captures a multitude of attributes associated with a user or session: the query terms themselves, the timestamp of the query, the user’s IP address (providing approximate location), browser type and version, operating system, language settings, cookies or other identifiers linking sessions, the rank of clicked results, the URL or domain of clicked results, and potentially other contextual signals. When viewed longitudinally, the sequence of these interactions adds further dimensions representing temporal patterns and evolving interests.

Simultaneously, individual user data within this high-dimensional space is typically very sparse. Any single user searches for only a tiny fraction of all possible topics or keywords, clicks on a minuscule subset of the web’s pages, and exhibits specific patterns of activity at particular times17.

This combination of high dimensionality and sparsity poses a fundamental challenge known as the “curse of dimensionality18” in the context of data privacy. In high-dimensional spaces, data points tend to become isolated; the concept of a “neighbor” or “similar record” becomes less meaningful because points are likely to differ across many dimensions19. Consequently, even without explicit identifiers, the unique combination of attributes and behaviors across many dimensions can act as a distinct “fingerprint” for an individual user. This uniqueness makes re-identification through linkage or inference significantly easier.

The curse of dimensionality challenges traditional anonymization techniques like k-anonymity20. Since k-anonymity relies on finding groups of at least k individuals who are identical across all quasi-identifying attributes, the sparsity and uniqueness inherent in high-dimensional search data make finding such groups highly improbable without resorting to extreme measures. To force records into equivalence classes, one would need to apply such broad generalization (e.g., reducing detailed query topics to very high-level categories) or suppress so much data that the resulting dataset loses significant analytical value. 
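The effect described above can be illustrated with a small simulation. This is a sketch on purely synthetic data: 5,000 hypothetical users, each with 12 attributes drawn uniformly from 5 values. It measures how many records are unique, and whether the data is k-anonymous, as more attributes are treated as quasi-identifiers. Real search logs are far less uniform, so the numbers are illustrative only.

```python
import random
from collections import Counter

def class_sizes(records, columns):
    """Count how many records share each combination of values in `columns`."""
    return Counter(tuple(rec[c] for c in columns) for rec in records)

def is_k_anonymous(records, columns, k):
    """True if every quasi-identifier combination occurs at least k times."""
    return min(class_sizes(records, columns).values()) >= k

def fraction_unique(records, columns):
    """Share of records whose quasi-identifier combination occurs exactly once."""
    sizes = class_sizes(records, columns)
    return sum(1 for n in sizes.values() if n == 1) / len(records)

rng = random.Random(0)  # fixed seed only so the sketch is repeatable
N_USERS, N_ATTRS, N_VALUES = 5000, 12, 5  # hypothetical sizes
records = [[rng.randrange(N_VALUES) for _ in range(N_ATTRS)]
           for _ in range(N_USERS)]

for d in (1, 2, 4, 6, 8, 12):
    cols = list(range(d))
    print(d, round(fraction_unique(records, cols), 3),
          is_k_anonymous(records, cols, k=5))
```

Because each added attribute can only split existing groups, the share of unique records never decreases as dimensions are added. Restoring k-anonymity would require generalizing or suppressing attributes, which is where the utility loss comes from.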

Implicit Personal Identifiers and Quasi-Identifiers in Queries

Beyond the metadata associated with a search (IP, timestamp, etc.), the content of the search queries themselves is a major source of privacy risk.

Firstly, users frequently, though often unintentionally, include direct personal information within their search queries. This could be their own name, address, phone number, email address, social security number, account numbers, or similar details about others. The infamous AOL search log incident provided stark evidence of this, where queries directly contained names and location information that facilitated re-identification.

Secondly, and perhaps more pervasively, search queries are rich with quasi-identifiers (QIs). These are terms, phrases, or concepts that, while not uniquely identifying on their own, become identifying when combined with each other or with external auxiliary information. Examples abound in the search context:

The challenge lies in the unstructured, free-text nature of search queries. Unlike structured databases where QIs like date of birth, gender, and ZIP code often reside in well-defined columns, the QIs in search queries are embedded within the semantic meaning and contextual background of the text string itself. Identifying and removing or generalizing all such potential QIs is an extremely difficult task, particularly at large scale and by automated means. Standard natural language processing techniques might identify common entities like names or locations, but would struggle with the vast range of potentially identifying combinations and context-dependent sensitivities. Passwords or coded unique URLs of private documents may be entered by users and be impossible for automated redaction to recognize. This inherent difficulty in scrubbing QIs from unstructured query text makes search data significantly harder to de-identify reliably compared to structured data.

Temporal Dynamics and Longitudinal Linkability

Search logs are not static snapshots; they are longitudinal records capturing user behavior as it unfolds over time. A user’s search history represents a sequence of actions, reflecting evolving interests, ongoing tasks, changes in location, and shifts in life circumstances. This temporal dimension adds significant identifying power beyond that of individual, isolated queries.

Even if session-specific identifiers like cookies are removed or periodically changed, the continuity of a user’s behavior can allow for linking queries across different sessions or time periods. Consistent patterns (e.g., regularly searching for specific technical terms related to one’s profession), evolving interests (e.g., searches related to pregnancy progressing over months), or recurring needs (e.g., checking commute times) can serve as anchors to connect seemingly disparate query records back to the same individual. The sequence itself becomes a quasi-identifier.

This poses a significant challenge for de-identification. Techniques applied cross-sectionally (treating each query or session independently) may fail to protect against longitudinal linkage attacks that exploit these behavioral trails. Effective de-identification of longitudinal data requires considering the entire user history, or at least sufficiently long windows of activity, to assess and mitigate the risk of temporal linkage. This inherently increases the complexity of the de-identification process and potentially necessitates even greater data perturbation or suppression to break these temporal links, further impacting utility. Anonymization techniques that completely sever links between records over time would prevent valuable longitudinal analysis altogether.
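A toy sketch of longitudinal linkage follows. It assumes two hypothetical monthly logs in which pseudonyms were rotated, and it links them by the overlap (Jaccard similarity) of the query terms each pseudonym used. The data, pseudonyms, and function names are invented for illustration; a real attack would use much richer features.

```python
def jaccard(a, b):
    """Overlap between two sets of query terms (0 = disjoint, 1 = identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def link_pseudonyms(period_a, period_b):
    """For each pseudonym in period A, find the most similar pseudonym in period B."""
    links = {}
    for pid_a, terms_a in period_a.items():
        best = max(period_b, key=lambda pid_b: jaccard(terms_a, period_b[pid_b]))
        links[pid_a] = (best, round(jaccard(terms_a, period_b[best]), 2))
    return links

# Hypothetical toy logs: pseudonyms were rotated between the two months.
january = {
    "u1": {"kubernetes", "ingress", "commute", "i-85", "prenatal", "vitamins"},
    "u2": {"sourdough", "starter", "marathon", "training", "plan"},
}
february = {
    "x7": {"marathon", "taper", "sourdough", "hydration", "plan"},
    "x9": {"kubernetes", "helm", "commute", "i-85", "crib", "reviews"},
}

print(link_pseudonyms(january, february))
# {'u1': ('x9', 0.33), 'u2': ('x7', 0.43)}
```

Even though no identifier is shared between the two months, recurring professional terms and routine searches are enough to reconnect each history in this toy case.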

The Uniqueness and Re-identifiability Potential of Search Histories

The combined effect of high dimensionality, sparsity, embedded quasi-identifiers, and temporal dynamics results in search histories that are often highly unique to individual users. Research has repeatedly shown that even limited sets of behavioral data points can uniquely identify individuals within large populations. Latanya Sweeney’s seminal work demonstrated that 87% of the US population could be uniquely identified using just three quasi-identifiers: 5-digit ZIP code, gender, and full date of birth21. Search histories contain far more dimensions and potentially identifying attributes than this minimal set.

Studies on analogous high-dimensional behavioral datasets confirm this potential for uniqueness and re-identification. The successful de-anonymization of Netflix users based on a small number of movie ratings linked to public IMDb profiles is a prime example. Similarly, research has shown high re-identification rates for mobile phone location data and credit card transactions, purely based on the patterns of activity. Su and colleagues showed that de-identified web browsing histories can be linked to social media profiles using only publicly available data22. Given that search histories encapsulate a similarly rich and diverse set of user actions and interests over time, it is highly probable that many users possess unique or near-unique search “fingerprints” even after standard de-identification techniques (like removing IP addresses and user IDs) are applied. This inherent uniqueness makes search logs exceptionally vulnerable to re-identification, particularly through linkage attacks that correlate the de-identified search patterns with other available data sources. The simple assumption that removing direct identifiers is sufficient to protect privacy is demonstrably false for this type of rich, behavioral data. The very detail that makes search logs valuable for understanding behavior also makes them inherently difficult to anonymize effectively.

4. The Re-identification Threat: Theory and Practice

The potential for re-identification is not merely theoretical; it is a practical threat demonstrated through various attack methodologies and real-world incidents. Understanding these mechanisms is crucial for appreciating the limitations of de-identification for search query data.

Mechanisms of Re-identification: Linkage, Inference, and Reconstruction Attacks

Re-identification attacks exploit residual information in de-identified data or leverage external knowledge to uncover identities or sensitive attributes. Key mechanisms include:

The threat landscape for re-identification is diverse and evolving. While linkage attacks relying on external data remain a primary concern, inference and reconstruction attacks, potentially powered by advanced AI/ML techniques, pose growing risks even to datasets processed with sophisticated methods. This necessitates robust privacy protections that anticipate a wide range of potential attack vectors.

Landmark Case Study: The AOL Search Log Release (2006)

In August 2006, AOL publicly released a dataset containing approximately 20 million search queries made by over 650,000 users during a three-month period. The data was intended for research purposes and was presented as “anonymized.” The primary anonymization step involved replacing the actual user identifiers with arbitrary numerical IDs. However, the dataset retained the raw query text, query timestamps, and information about clicked results (rank and domain URL). Later statements suggest IP address and cookie information were also altered, though potentially insufficiently.

The attempt at anonymization failed dramatically and rapidly. Within days, reporters Michael Barbaro and Tom Zeller Jr. of The New York Times were able to re-identify one specific user, designated “AOL user No. 4417749,” as Thelma Arnold, a 62-year-old widow living in Lilburn, Georgia23. They achieved this by analyzing the sequence of queries associated with her user number. The queries contained a potent mix of quasi-identifiers, including searches for “landscapers in Lilburn, Ga,” searches for individuals with the surname “Arnold,” and searches for “homes sold in shadow lake subdivision gwinnett county georgia,” alongside other personally revealing (though not directly identifying) queries like “numb fingers,” “60 single men,” and “dog that urinates on everything.” The combination of these queries created a unique pattern easily traceable to Ms. Arnold through publicly available information.

The AOL incident became a watershed moment in data privacy. It starkly demonstrated several critical points relevant to search data de-identification:

  1. Removing explicit user IDs is fundamentally insufficient when the underlying data itself contains rich identifying information.
  2. Search queries, even seemingly innocuous ones, are laden with Personally Identifiable Information (PII) and powerful quasi-identifiers embedded in the text.
  3. The temporal sequence of queries provides crucial context and significantly increases identifiability.
  4. Linkage attacks using query content combined with publicly available information are feasible and effective.
  5. Simple anonymization techniques fail to account for the identifying power of combined attributes and behavioral patterns.

The incident led to significant public backlash, the resignation of AOL’s CTO, and a class-action lawsuit. It remains a canonical example of the pitfalls of naive de-identification and the unique sensitivity of search query data.

Landmark Case Study: The Netflix Prize De-anonymization (2007-2008)

In 2006, Netflix launched a public competition, the “Netflix Prize,” offering $1 million to researchers who could significantly improve the accuracy of its movie recommendation system. To facilitate this, Netflix released a large dataset containing approximately 100 million movie ratings (1-5 stars, plus date) from nearly 500,000 anonymous subscribers, collected between 1998 and 2005. User identifiers were replaced with random numbers, and any other explicit PII was removed.

In 2007, researchers Arvind Narayanan and Vitaly Shmatikov published a groundbreaking paper demonstrating how this supposedly anonymized dataset could be effectively de-anonymized24. Their attack relied on linking the Netflix data with a publicly available auxiliary dataset: movie ratings posted by users on the Internet Movie Database (IMDb).

They developed statistical algorithms that could match users across the two datasets based on shared movie ratings and the approximate dates of those ratings. Their key insight was that while many users might rate popular movies similarly, the combination of ratings for less common movies, along with the timing, created unique signatures. They showed that an adversary knowing only a small subset (as few as 2, but more reliably 6-8) of a target individual’s movie ratings and approximate dates could, with high probability, uniquely identify that individual’s complete record within the massive Netflix dataset. Their algorithm was robust to noise, meaning the adversary’s knowledge didn’t need to be perfectly accurate (e.g., dates could be off by weeks, ratings could be slightly different).

Narayanan and Shmatikov successfully identified the Netflix records corresponding to several non-anonymous IMDb users, thereby revealing their potentially private Netflix viewing histories, including ratings for sensitive or politically charged films that were not part of their public IMDb profiles.
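The general shape of such a linkage attack can be sketched as follows. This is a simplified illustration and not the authors’ exact algorithm: it scores each pseudonymous record against an adversary’s approximate auxiliary knowledge, weights rare titles more heavily, and accepts the best match only if it clearly stands out from the runner-up. The titles, tolerances, weighting formula, and `min_gap` threshold are all assumptions made for the example.

```python
import math
from datetime import date

def score(aux, record, support, rating_tol=1, day_tol=14):
    """Sum rarity-weighted matches between auxiliary knowledge and one record."""
    total = 0.0
    for movie, (aux_rating, aux_date) in aux.items():
        if movie not in record:
            continue
        rating, rated_on = record[movie]
        close_rating = abs(rating - aux_rating) <= rating_tol
        close_date = abs((rated_on - aux_date).days) <= day_tol
        if close_rating and close_date:
            # Rare titles count for more: the weight shrinks as more users rated it.
            total += 1.0 / math.log(1 + support[movie])
    return total

def best_match(aux, dataset, min_gap=1.5):
    """Return the top-scoring pseudonym, or None if it does not stand out.

    Assumes the dataset holds at least two records.
    """
    support = {}
    for record in dataset.values():
        for movie in record:
            support[movie] = support.get(movie, 0) + 1
    ranked = sorted(((score(aux, rec, support), pid)
                     for pid, rec in dataset.items()), reverse=True)
    (top, pid), (runner_up, _) = ranked[0], ranked[1]
    return pid if top > 0 and top >= min_gap * runner_up else None

# Hypothetical released data: pseudonym -> {title: (stars, date rated)}.
dataset = {
    "1001": {"Blockbuster A": (4, date(2005, 3, 1)),
             "Obscure Doc": (5, date(2005, 3, 9)),
             "Indie Film": (2, date(2005, 4, 2))},
    "1002": {"Blockbuster A": (4, date(2005, 3, 3)),
             "Blockbuster B": (3, date(2005, 5, 1))},
    "1003": {"Blockbuster A": (5, date(2005, 2, 27)),
             "Blockbuster B": (4, date(2005, 5, 5))},
}
# Adversary's noisy knowledge, e.g. gleaned from public reviews.
aux = {"Blockbuster A": (4, date(2005, 3, 4)),
       "Obscure Doc": (5, date(2005, 3, 12)),
       "Indie Film": (3, date(2005, 4, 10))}

print(best_match(aux, dataset))  # 1001
```

Knowing only the popular title would not single anyone out, since all three records match it equally. The less common titles are what make the record distinctive, mirroring the insight described above.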

The Netflix Prize de-anonymization study had significant implications:

  1. It demonstrated the vulnerability of high-dimensional, sparse datasets (characteristic of much behavioral data, including search logs) to linkage attacks.
  2. It proved that even seemingly non-sensitive data (movie ratings) can become identifying when combined with auxiliary information.
  3. It highlighted the inadequacy of simply removing direct identifiers and replacing them with pseudonyms when dealing with rich datasets.
  4. It underscored the power of publicly available auxiliary data in undermining anonymization efforts.

The research led to a class-action lawsuit against Netflix alleging privacy violations and the subsequent cancellation of a planned second Netflix Prize competition due to privacy concerns raised by the Federal Trade Commission (FTC). It remains a pivotal case study illustrating the fragility of anonymization for behavioral data.

Other Demonstrations of Re-identification Across Data Types

The AOL and Netflix incidents are not isolated cases. Numerous studies and breaches have demonstrated the feasibility of re-identifying individuals from various types of supposedly de-identified data, reinforcing the systemic nature of the challenge, especially for rich, individual-level records.

The following table summarizes some of these key incidents:

Table 2: Summary of Notable Re-identification Incidents

| Incident Name/Year | Data Type | “Anonymization” Method Used | Re-identification Method | Auxiliary Data Used | Key Finding/Significance |
| --- | --- | --- | --- | --- | --- |
| MA Governor Weld (1990s) | Hospital Discharge Data | Removal of direct identifiers (name, address, SSN) | Linkage Attack | Public Voter Registration List (ZIP, DoB, Gender) | Early demonstration that QIs in supposedly de-identified data allow linkage to identified data. |
| AOL Search Logs (2006) | Search Queries | User ID replaced with number; Query text, timestamps retained | Linkage/Inference from Query Content | Public knowledge, location directories | Search queries themselves contain rich PII/QIs enabling re-identification. Simple ID removal is insufficient. |
| Netflix Prize (2007-8) | Movie Ratings (user, movie, rating, date) | User ID replaced with number | Linkage Attack | Public IMDb User Ratings | High-dimensional, sparse behavioral data is vulnerable. Small amounts of auxiliary data can enable re-id. |
| NYC Taxis (2014) | Taxi Trip Records (incl. hashed medallion/license) | Weak (MD5) hashing of identifiers | Pseudonym Reversal (Hash cracking) | Knowledge of hashing algorithm | Poorly chosen pseudonymization (weak hashing) is easily reversible. |
| Australian Health Records (MBS/PBS) (2016) | Medical Billing Data | Claimed de-identification (details unclear) | Linkage Attack | Publicly available information (e.g., birth year, surgery dates) | Government-released health data, claimed anonymous, was re-identifiable. |
| Browsing History / Social Media | Web Browsing History | Assumed de-identified (focus on linking) | Linkage Attack | Social Media Feeds (e.g., Twitter) | Unique patterns of link clicking in browsing history mirror unique social feeds, enabling linkage. |
| Genomic Beacons (Various studies) | Aggregate Genomic Data (allele presence/absence) | Query interface limits information release | Membership Inference Attack (repeated queries, linkage) | Individual’s genome sequence, Genealogical databases | Even aggregate or restricted-query genomic data can leak membership information. |
| Credit Card Data (de Montjoye et al. 2015) | Transaction Records (merchant, time, amount) | Assumed de-identified | Uniqueness Analysis / Linkage | (Implicit) External knowledge correlating purchases/locations | Sparse transaction data is highly unique; few points needed for re-identification. |
| Location Data (Various studies) | Mobile Phone Location Traces | Various (often simple ID removal or aggregation) | Uniqueness Analysis / Linkage Attack | Maps, Points of Interest, Public Records | Human mobility patterns are highly unique; location data is easily re-identifiable. |

These examples collectively illustrate that re-identification is not a niche problem confined to specific data types but a systemic risk inherent in sharing or releasing granular data about individuals, especially when that data captures complex behaviors over time or across multiple dimensions. Search query logs share many characteristics with these vulnerable datasets (high dimensionality, sparsity, behavioral patterns, embedded QIs, longitudinal nature), strongly suggesting they face similar, if not greater, re-identification risks.
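The pseudonym-reversal pattern from the NYC taxi incident can be illustrated with a minimal sketch. It assumes a simplified, hypothetical identifier format (digit, letter, digit, digit) chosen only for illustration; the point is that when the space of possible identifiers is small, every candidate can be hashed in advance and the pseudonym looked up directly.

```python
import hashlib
import string
from itertools import product


def md5_hex(value: str) -> str:
    """Return the hex MD5 digest used here as an unsalted 'pseudonym'."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()


def build_reverse_table() -> dict:
    """Hash every ID in the assumed digit-letter-digit-digit format."""
    table = {}
    for d1, letter, d2, d3 in product(
        string.digits, string.ascii_uppercase, string.digits, string.digits
    ):
        candidate = f"{d1}{letter}{d2}{d3}"
        table[md5_hex(candidate)] = candidate
    return table


# 10 * 26 * 10 * 10 = 26,000 candidates: trivial to enumerate.
reverse_table = build_reverse_table()

# "7B42" is a made-up identifier, not a real medallion number.
pseudonym = md5_hex("7B42")
recovered = reverse_table[pseudonym]
```

A keyed hash or a random lookup table held only by the data custodian would avoid this particular weakness, although it would not address linkage through the trip data itself.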

The Critical Role of Auxiliary Information

A recurring theme across nearly all successful re-identification demonstrations is the crucial role played by auxiliary information. This refers to any external data source or background knowledge an attacker possesses or can obtain about individuals, which can then be used to bridge the gap between a de-identified record and a real-world identity.

The sources of auxiliary information are vast and continuously expanding in the era of Big Data.

The critical implication is that the privacy risk associated with a de-identified dataset cannot be assessed in isolation. Its vulnerability depends heavily on the external data ecosystem and what information might be available for linkage. De-identification performed today might be broken tomorrow as new auxiliary data sets become available or linkage techniques improve. This makes robust anonymization a moving target. Any assessment of re-identification risk must therefore be contextual, considering the specific data being released, the intended recipients or release environment, and the types of auxiliary information reasonably available to potential adversaries. Relying solely on removing identifiers without considering this broader context creates a fragile and likely inadequate privacy protection strategy.
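A linkage attack of the kind described above can be sketched in a few lines. The records, names, and quasi-identifier columns below are invented for illustration; the sketch treats a match as a re-identification only when a quasi-identifier combination is unique in both the de-identified release and the auxiliary source.

```python
from collections import defaultdict

# Hypothetical quasi-identifier columns shared by both datasets.
QUASI_IDENTIFIERS = ("zip", "birth_date", "sex")

# Invented "de-identified" release: direct identifiers removed.
deidentified = [
    {"zip": "02138", "birth_date": "1950-01-15", "sex": "M", "diagnosis": "condition A"},
    {"zip": "02139", "birth_date": "1972-06-02", "sex": "F", "diagnosis": "condition B"},
    {"zip": "02139", "birth_date": "1972-06-02", "sex": "F", "diagnosis": "condition C"},
]

# Invented auxiliary data, e.g. a public list that includes names.
auxiliary = [
    {"name": "Person One", "zip": "02138", "birth_date": "1950-01-15", "sex": "M"},
    {"name": "Person Two", "zip": "02139", "birth_date": "1972-06-02", "sex": "F"},
]


def key_of(row, keys):
    return tuple(row[k] for k in keys)


def link(deid_rows, aux_rows, keys=QUASI_IDENTIFIERS):
    """Return (name, record) pairs whose quasi-identifiers match uniquely."""
    deid_groups = defaultdict(list)
    aux_groups = defaultdict(list)
    for row in deid_rows:
        deid_groups[key_of(row, keys)].append(row)
    for row in aux_rows:
        aux_groups[key_of(row, keys)].append(row)

    matches = []
    for key, records in deid_groups.items():
        people = aux_groups.get(key, [])
        if len(records) == 1 and len(people) == 1:
            matches.append((people[0]["name"], records[0]))
    return matches


matches = link(deidentified, auxiliary)
```

In this toy data only the first record links to a single named person; the other two share their quasi-identifiers, so the attacker learns a candidate set rather than a unique match. Whether a release is vulnerable therefore depends on which auxiliary columns exist, which is the contextual point made above.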

5. Limitations of De-identification Techniques on Search Data

Given the unique characteristics of search query data and the demonstrated power of re-identification attacks, it is essential to critically evaluate the limitations of specific de-identification techniques when applied to this context.

The Fragility of k-Anonymity in High-Dimensional, Sparse Data

As established in Section 3.1, k-anonymity aims to protect privacy by ensuring that any individual record in a dataset is indistinguishable from at least k-1 other records based on their quasi-identifier (QI) values. This is typically achieved through generalization (making QI values less specific) and suppression (removing records or values).
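As a minimal sketch of the definition, the k of a dataset can be computed as the size of its smallest equivalence class over the chosen quasi-identifiers. The records, column names, and the decade-wide age generalization below are illustrative assumptions, not taken from any real dataset.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same QI values."""
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(classes.values())


def generalize_age(record, width=10):
    """Generalization step: replace an exact age with a range."""
    low = (record["age"] // width) * width
    return {**record, "age": f"{low}-{low + width - 1}"}


# Invented records for illustration.
records = [
    {"age": 31, "zip": "02138", "query": "knee pain running"},
    {"age": 34, "zip": "02138", "query": "cheap flights lisbon"},
    {"age": 38, "zip": "02139", "query": "side effects of lisinopril"},
    {"age": 36, "zip": "02139", "query": "used bikes near me"},
]

k_raw = k_anonymity(records, ("age", "zip"))  # 1: every record is unique
generalized = [generalize_age(r) for r in records]
k_generalized = k_anonymity(generalized, ("age", "zip"))  # 2 after generalization
k_with_query = k_anonymity(generalized, ("age", "zip", "query"))  # back to 1
```

Adding a single free-text column as a quasi-identifier returns k to 1 in this toy example, which hints at why search logs are hard to fit into this model.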

However, k-anonymity proves fundamentally ill-suited for high-dimensional and sparse datasets like search logs. The core problem lies in the “curse of dimensionality”:

  1. Uniqueness: In datasets with many attributes (dimensions), individual records tend to be unique or nearly unique across the combination of those attributes. Finding k search users who have matching patterns across numerous QIs (specific query terms, timestamps, locations, click behavior, etc.) is highly improbable.
  2. Utility Destruction: To force records into equivalence classes of size k, massive amounts of generalization or suppression are required. Generalizing query terms might mean reducing specific searches like “side effects of lisinopril” to a broad category like “health query,” destroying the semantic richness crucial for analysis. Suppressing unique or hard-to-group records could eliminate vast portions of the dataset. This results in an unacceptable level of information loss, potentially rendering the data useless for its intended purpose.
  3. Vulnerability to Attacks: Even if k-anonymity is technically achieved, it remains vulnerable. The homogeneity attack occurs if all k records in a group share the same sensitive attribute (e.g., all searched for the same sensitive topic), revealing that attribute for anyone linked to the group. Background knowledge attacks can allow adversaries to further narrow down possibilities within a group.

Refinements like l-diversity and t-closeness attempt to address attribute disclosure vulnerabilities by requiring diversity or specific distributional properties for sensitive attributes within each group. However, they inherit the fundamental problems of k-anonymity regarding high dimensionality and utility loss, while adding implementation complexity. Furthermore, k-anonymity lacks robust compositionality; combining multiple k-anonymous releases does not guarantee privacy. Therefore, k-anonymity and its derivatives face challenges when used for de-identifying massive, complex search logs. They force difficult choices between retaining minimal utility or providing inadequate privacy protection against linkage and inference attacks.
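The homogeneity problem that l-diversity targets can be made concrete with a small check. This sketch uses the simplest variant (distinct l-diversity: the minimum number of distinct sensitive values in any equivalence class); the records and column names are invented for illustration.

```python
from collections import defaultdict


def distinct_l_diversity(records, quasi_identifiers, sensitive):
    """Minimum number of distinct sensitive values within any QI group."""
    groups = defaultdict(set)
    for r in records:
        groups[tuple(r[q] for q in quasi_identifiers)].add(r[sensitive])
    return min(len(values) for values in groups.values())


def homogeneous_groups(records, quasi_identifiers, sensitive):
    """QI groups in which every record shares one sensitive value."""
    groups = defaultdict(set)
    for r in records:
        groups[tuple(r[q] for q in quasi_identifiers)].add(r[sensitive])
    return [key for key, values in groups.items() if len(values) == 1]


# Invented, already-generalized records: each QI group has two members,
# so the table is 2-anonymous on (age, zip).
records = [
    {"age": "30-39", "zip": "0213*", "topic": "oncology"},
    {"age": "30-39", "zip": "0213*", "topic": "oncology"},
    {"age": "40-49", "zip": "0214*", "topic": "travel"},
    {"age": "40-49", "zip": "0214*", "topic": "sports"},
]

l_value = distinct_l_diversity(records, ("age", "zip"), "topic")
leaky = homogeneous_groups(records, ("age", "zip"), "topic")
```

Here the first group is 2-anonymous yet reveals the sensitive topic for anyone known to be in it, so the table's l is 1 despite k being 2.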

Differential Privacy: The Utility-Privacy Trade-off and Implementation Hurdles

Differential Privacy (DP) offers a fundamentally different approach, providing mathematically rigorous, provable privacy guarantees [29]. Instead of modifying data records directly to achieve indistinguishability, DP focuses on the output of computations (queries, analyses, models) performed on the data. It ensures that the result of any computation is statistically similar whether or not any single individual’s data is included in the input dataset. This is typically achieved by adding carefully calibrated random noise to the computation’s output.
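A minimal sketch of the noise-adding idea is the Laplace mechanism applied to a counting query. It assumes each individual contributes exactly one record, so the count's sensitivity is 1; the function names and data are illustrative, and plain floating-point sampling like this is not hardened for production use.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon, rng):
    """Epsilon-DP count, assuming one record per individual."""
    true_count = sum(1 for v in values if predicate(v))
    sensitivity = 1.0  # adding or removing one record changes the count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)  # fixed seed only so the sketch is repeatable
ages = [23, 35, 41, 29, 62, 57, 33, 48]  # invented data
noisy = private_count(ages, lambda a: a >= 40, epsilon=1.0, rng=rng)
```

The released value is the true count (4 in this toy data) plus noise whose scale is `sensitivity / epsilon`, so halving epsilon doubles the typical error.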

DP’s strengths are significant: its guarantees hold regardless of an attacker’s auxiliary knowledge, and privacy loss (quantified by ε and δ) composes predictably across multiple analyses. However, applying DP effectively to massive search logs presents substantial challenges:
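Predictable composition can be sketched as a simple budget ledger. This uses basic sequential composition for pure ε-DP, under which the ε values of successive analyses on the same data add up; the class and its names are hypothetical, and tighter accounting methods exist.

```python
class PrivacyBudget:
    """Track cumulative epsilon under basic sequential composition."""

    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0

    @property
    def remaining(self):
        return self.total - self.spent

    def spend(self, epsilon):
        """Reserve epsilon for one analysis, or refuse if it would overspend."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon


budget = PrivacyBudget(total_epsilon=1.0)
budget.spend(0.4)  # first analysis
budget.spend(0.4)  # second analysis; a third at 0.4 would be refused
```

A ledger like this only helps if every release against the same underlying data is charged to it, which is one reason budget management across many analysts is difficult in practice.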

  1. Applicability to Complex Queries and Data Types: DP is well-understood for basic aggregate queries (counts, sums, averages, histograms) on numerical or categorical data. Applying it effectively to the complex structures and query types relevant to search logs (such as analyzing free-text query semantics, mining sequential patterns in user sessions, building complex machine learning models for ranking or recommendations, or analyzing graph structures such as click graphs) is more challenging and an active area of research. Standard DP mechanisms might require excessive noise or simplification for such tasks. Techniques like DP-SGD (Differentially Private Stochastic Gradient Descent) exist for training models, but again involve utility trade-offs [30].
  2. The Utility-Privacy Trade-off [31]: This is the most fundamental challenge. A stronger privacy guarantee (lower ε) requires more noise: for standard mechanisms such as the Laplace mechanism, the noise scale is inversely proportional to ε. More noise provides better privacy but reduces the accuracy and utility of the results. For the complex, granular analyses often desired from search logs (e.g., understanding rare query patterns, analyzing specific user journeys, training accurate prediction models), the amount of noise required to achieve a meaningful level of privacy (a small ε) might overwhelm the signal, rendering the results unusable. While DP performs better on larger datasets where individual contributions are smaller, the sensitivity of queries on sparse, high-dimensional data can still necessitate significant noise. Finding an acceptable balance between privacy and utility for diverse use cases remains a major hurdle.
  3. Implementation Complexity and Correctness: Implementing DP correctly requires significant expertise in both the theory and the practical nuances of noise calibration, sensitivity analysis (bounding how much one individual can affect the output), and privacy budget management. Errors in implementation, such as underestimating sensitivity or mismanaging the privacy budget across multiple queries (due to composition rules), can silently undermine the promised privacy guarantees. Defining the “privacy unit” (e.g., user, query, session) appropriately is critical; choosing the wrong unit can lead to unintended disclosures. Auditing DP implementations for correctness is also non-trivial.
  4. Local vs. Central Models: DP can be implemented in two main models. In the central model, a trusted curator collects raw data and then applies DP before releasing results. This generally allows for higher accuracy (less noise for a given ε) but requires users to trust the curator with their raw data. In the local model (LDP), noise is added on the user’s device before data is sent to the collector. This offers stronger privacy guarantees as the collector never sees raw data, but typically requires significantly more noise to achieve the same level of privacy, often leading to much lower utility. The choice of model impacts both trust assumptions and achievable utility.
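The local model can be illustrated with randomized response, one of the simplest local mechanisms. In this sketch each user holds a single yes/no attribute and reports it truthfully with probability e^ε / (1 + e^ε), otherwise flipping it; the collector then corrects for the known flip rate. The population, rate, and function names are invented for illustration.

```python
import math
import random


def truth_probability(epsilon):
    """Probability of reporting the true bit under epsilon-LDP randomized response."""
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))


def randomize(bit, epsilon, rng):
    """Runs on the user's device: keep the bit with probability p, else flip it."""
    return bit if rng.random() < truth_probability(epsilon) else (not bit)


def estimate_rate(reports, epsilon):
    """Runs at the collector: unbiased estimate of the true 'yes' rate."""
    p = truth_probability(epsilon)
    observed = sum(reports) / len(reports)
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)


rng = random.Random(0)  # fixed seed only so the sketch is repeatable
true_bits = [i < 15_000 for i in range(50_000)]  # 30% of users hold "yes"
reports = [randomize(b, epsilon=1.0, rng=rng) for b in true_bits]
estimated = estimate_rate(reports, epsilon=1.0)
```

The division by `2p - 1` in the estimator amplifies sampling error as ε shrinks, which is one way to see why local DP tends to need far more data than the central model for the same accuracy.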

In essence, while DP provides the gold standard in theoretical privacy guarantees, its practical application to the scale and complexity of search logs involves significant compromises in data utility and faces non-trivial implementation hurdles. It is not a simple “plug-and-play” solution for making granular search data both private and fully useful.

Inadequacies of Aggregation, Masking, and Generalization for Search Logs

Simpler, traditional de-identification techniques prove largely insufficient for protecting privacy in search logs while preserving meaningful utility.

These foundational techniques, while potentially useful as components within a more sophisticated strategy (e.g., aggregation combined with differential privacy), are individually incapable of addressing the complex privacy challenges posed by massive search query datasets without sacrificing the data’s core value. As we discuss further, even combined they fall short.

Challenges with Synthetic Data Generation for Complex Behavioral Data

Generating synthetic data (artificial data designed to mirror the statistical properties of real data without containing actual individual records) has emerged as a promising privacy-enhancing technology. It offers the potential to share data insights without sharing real user information. However, creating high-quality, privacy-preserving synthetic search logs faces significant hurdles [32]:

  1. Utility Preservation: Search logs capture complex patterns: semantic relationships between query terms, sequential dependencies in user sessions, temporal trends, correlations between queries and clicks, and vast individual variability. Training a generative model (e.g., a statistical model or a deep learning model like an LLM) to accurately capture all these nuances without access to the original data is extremely challenging. If the synthetic data fails to replicate these properties faithfully, it will have limited utility for downstream tasks like training accurate machine learning models or conducting reliable behavioral research. Generating realistic sequences of queries that maintain semantic coherence and plausible user intent is particularly difficult.
  2. Privacy Risks (Memorization and Inference): Generative models, especially large and complex ones like LLMs, run the risk of “memorizing” or “overfitting” to their training data. If this happens, the model might generate synthetic examples that are identical or very close to actual records from the sensitive training dataset, thereby leaking private information. This risk is often higher for unique or rare records (outliers) in the original data. Even if exact records aren’t replicated, the synthetic data might still be vulnerable to membership inference attacks, where an attacker tries to determine if a specific person’s data was used to train the generative model. Ensuring the generation process itself is privacy-preserving, for example by using DP during model training, is crucial but adds complexity and can impact the fidelity (utility) of the generated data. Evaluating the actual privacy level achieved by synthetic data is also a complex task.
  3. Bias Amplification: Generative models learn patterns from the data they are trained on. If the original search log data contains societal biases (e.g., stereotypical associations, skewed representation of demographic groups), the synthetic data generated is likely to replicate, and potentially even amplify, these biases. This can lead to unfair or discriminatory outcomes if the synthetic data is used for training downstream applications.

Therefore, while synthetic data holds promise, generating truly useful and private synthetic search logs is a frontier research problem. The very complexity that makes search data valuable also makes it incredibly difficult to synthesize accurately without inadvertently leaking information or perpetuating biases. It requires sophisticated modeling techniques combined with robust privacy-preserving methods like DP integrated directly into the generation workflow.
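One basic sanity check for the memorization risk described above is to measure how many synthetic sessions exactly reproduce a training session. The sketch below is an illustrative assumption about how such a check might look (sessions modeled as tuples of query strings, with light normalization); passing it does not establish privacy, since it ignores near-duplicates and membership inference.

```python
def normalize(session):
    """Lowercase each query and collapse whitespace so trivial edits still match."""
    return tuple(" ".join(query.lower().split()) for query in session)


def exact_match_rate(synthetic_sessions, training_sessions):
    """Fraction of synthetic sessions that replicate some training session."""
    if not synthetic_sessions:
        return 0.0
    training_set = {normalize(s) for s in training_sessions}
    hits = sum(1 for s in synthetic_sessions if normalize(s) in training_set)
    return hits / len(synthetic_sessions)


# Invented sessions for illustration.
training = [
    ("flu symptoms", "pharmacy near me"),
    ("rare disease x clinic",),
]
synthetic = [
    ("Flu  symptoms", "pharmacy near me"),  # replicated apart from case and spacing
    ("weather today",),
]

rate = exact_match_rate(synthetic, training)
```

In this toy data half of the synthetic sessions are copies of training sessions. A non-zero rate on rare sessions is a warning sign; a zero rate is only the absence of one specific failure.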

6. Harms, Ethics, and Societal Implications

The challenges of de-identifying search query data are not merely technical or legal; they extend into architectural and organizational domains that fundamentally shape privacy outcomes. How data is released (through what mechanisms, under what controls, and with what oversight) is an architectural problem bound by organizational principles and norms. The key architectural building block lies in the design of APIs (Application Programming Interfaces), which can act as critical shields between raw data and external access. Re-identification attempts can be partially mitigated at the API level through strict query limits, access controls, auditing mechanisms, and purpose restrictions, complementing the privacy-enhancing technologies discussed throughout this paper. These architectural choices embed ethical values and reflect organizational commitments to privacy beyond mere technical implementation. They carry significant weight and potential for real-world harm if privacy is compromised. These controls can perhaps be observed and managed at the level of an individual organization, with extensive oversight and a data protection legal regime (including enforcement) in place, but they are challenging to envision for ongoing large-scale access to data by multiple unrelated, independent parties. Once data is released, it is beyond the control of the API. Cutting off future API access when multiple releases create a re-identification risk may not be feasible. A further limitation is that the API operator cannot readily know whether multiple API users collaborate or combine data.
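The API-level controls mentioned above (query limits, auditing, and refusing small groups) can be sketched as a thin gateway in front of the raw logs. Everything here is hypothetical: the class name, the thresholds, and the substring-matching query are illustrative choices, not a description of any real system.

```python
from collections import defaultdict


class AggregateQueryGateway:
    """Answer only aggregate counts, with per-client limits and an audit trail."""

    def __init__(self, logs, max_queries_per_client=3, min_group_size=5):
        self._logs = logs  # list of (user_id, query_text) pairs, never returned
        self._max_queries = max_queries_per_client
        self._min_group_size = min_group_size
        self._used = defaultdict(int)
        self.audit_log = []

    def count_users_matching(self, client_id, term):
        """Number of distinct users whose queries contain `term`, or None if too few."""
        if self._used[client_id] >= self._max_queries:
            self.audit_log.append((client_id, term, "denied"))
            raise PermissionError("query limit reached")
        self._used[client_id] += 1

        users = {user for user, query in self._logs if term.lower() in query.lower()}
        if len(users) < self._min_group_size:
            self.audit_log.append((client_id, term, "suppressed"))
            return None
        self.audit_log.append((client_id, term, "answered"))
        return len(users)


# Invented log data for illustration.
logs = [(f"u{i}", "weather tomorrow") for i in range(6)]
logs.append(("u0", "rare condition clinic"))

gateway = AggregateQueryGateway(logs)
common = gateway.count_users_matching("client-a", "weather")
rare = gateway.count_users_matching("client-a", "rare condition")
```

A gateway like this cannot see whether two clients pool their answers, and small-count suppression alone does not stop differencing across overlapping queries, which mirrors the limitations noted in the paragraph above.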

Potential Harms from Re-identified Search Data: From Embarrassment to Discrimination

If supposedly de-identified search query data is successfully re-linked to individuals, the consequences can range from personal discomfort to severe, tangible harms. Search histories can reveal extremely sensitive aspects of a person’s life, including:

The exposure of such information through re-identification can lead to a spectrum of harms:

These potential harms underscore the high stakes involved in handling search query data. The impact extends beyond individual privacy violations to potential societal harms, such as reinforcing existing inequalities through discriminatory profiling or undermining trust in digital services. Critically, legal systems often struggle to recognize and provide remedies for many of these harms, particularly those that are non-financial, cumulative, or relate to future risks.

7. Conclusion: Synthesizing the Challenges and Risks

The de-identification of massive search query datasets presents a complex and formidable challenge, sitting at the intersection of immense data value and profound privacy risk. While the potential benefits of analyzing search behavior for societal good, service improvement, and innovation are undeniable, the inherent nature of this data makes achieving meaningful privacy protection through de-identification exceptionally difficult.

The Core Privacy Paradox of Search Data De-identification

The fundamental paradox lies in the richness of the data itself. Search logs capture a high-dimensional, sparse, and longitudinal record of human intent and behavior. This richness, containing myriad explicit and implicit identifiers and quasi-identifiers embedded within unstructured query text and temporal patterns, creates unique individual fingerprints. Consequently, techniques designed to obscure identity often face a stark trade-off: either they fail to adequately protect against re-identification attacks (especially linkage attacks leveraging the vast ecosystem of auxiliary data), or they must apply such aggressive generalization, suppression, or noise addition that the data’s analytical utility is severely compromised.

Traditional methods like k-anonymity are fundamentally crippled by the “curse of dimensionality” inherent in this data type. More advanced techniques like differential privacy offer stronger theoretical guarantees but introduce significant practical challenges related to the privacy-utility balance, implementation complexity, and applicability to the diverse analyses required for search data. Synthetic data generation, while promising, faces similar difficulties in capturing complex behavioral nuances without leaking information or amplifying bias.

Summary of Key Risks and Vulnerabilities

The analysis presented in this report highlights several critical risks associated with attempts to de-identify search query data:

  1. High Re-identification Risk: Due to the data’s uniqueness and the power of linkage attacks using auxiliary information, the risk of re-identifying individuals from processed search logs remains substantial. Landmark failures like the AOL and Netflix incidents serve as potent warnings.
  2. Inadequacy of Simple Techniques: Basic methods like removing direct identifiers, masking, simple aggregation, or naive generalization are insufficient to protect against sophisticated attacks on this type of data.
  3. Limitations of Advanced Techniques: Even state-of-the-art methods like differential privacy and synthetic data generation face significant hurdles in balancing provable privacy with practical utility for complex, granular search data analysis.
  4. Evolving Threat Landscape: The continuous growth of available data and the increasing sophistication of analytical techniques, including AI/ML-driven attacks, mean that re-identification risks are dynamic and likely increasing over time.
  5. Potential for Serious Harm: Re-identification can lead to tangible harms, including discrimination, financial loss, reputational damage, psychological distress, and chilling effects on free expression and inquiry.

The Ongoing Debate

The challenges outlined fuel an ongoing debate about the viability and appropriate role of de-identification in the context of large-scale behavioral data. While organizations invest in Privacy Enhancing Technologies (PETs) and implement policies aimed at protecting user privacy, the demonstrable risks and technical limitations suggest that achieving true, robust anonymity for granular search query data, while maintaining high utility, remains an elusive goal.

During the preparation of this work the author used ChatGPT to reword and rephrase text and for a first draft of the two charts in the document. After using this tool/service, the author reviewed and edited the content as needed and takes full responsibility for the content of the publication.

  1. https://fpf.org/issue/deid/
  2. https://fpf.org/tag/privacy-enhancing-technologies/
  3. https://fpf.org/issue/research-and-ethics/
  4. Ohm: https://heinonline.org/HOL/LandingPage?handle=hein.journals/uclalr57&div=48&id=&page=
  5. Cooper: https://citeseerx.ist.psu.edu/document?
  6. Dinur, Nissim: https://weizmann.elsevierpure.com/en/publications/revealing-information-while-preserving-privacy
  7. Barth-Jones: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2076397
  8. Polonetsky, Tene and Finch: https://digitalcommons.law.scu.edu/cgi/viewcontent.cgi?article=2827&context=lawreview
  9. We note the European Court of Justice Breyer decision and subsequent EU court decisions that may open up a legal argument that it may be possible to consider a party that does not reasonably have potential access to the additional data to be in possession of non-personal data. https://curia.europa.eu/juris/document/document.jsf?docid=184668&doclang=EN
  10. Sweeney: https://www.hks.harvard.edu/publications/k-anonymity-model-protecting-privacy
  11. Aggarwal, Charu C. (2005). “On k-Anonymity and the Curse of Dimensionality”. VLDB ’05 – Proceedings of the 31st International Conference on Very Large Data Bases. Trondheim, Norway. CiteSeerX 10.1.1.60.3155
  12. Marcus Olsson: https://marcusolsson.dev/k-anonymity-and-l-diversity/
  13. Ninghui Li, Tiancheng Li, and Suresh Venkatasubramanian, “t-Closeness: Privacy Beyond k-Anonymity and ℓ-Diversity,” Proceedings of the 23rd IEEE International Conference on Data Engineering (2007)
  14. Dwork, C. (2006). Differential Privacy. In: Bugliesi, M., Preneel, B., Sassone, V., Wegener, I. (eds) Automata, Languages and Programming. ICALP 2006. Lecture Notes in Computer Science, vol 4052. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11787006_1
  15. Simson Garfinkel NIST SP 800
  16. https://research.google/blog/protecting-users-with-differentially-private-synthetic-training-data/
  17. https://sparktoro.com/blog/who-sends-traffic-on-the-web-and-how-much-new-research-from-datos-sparktoro/
  18. Mitigating the Curse of Dimensionality in Data Anonymization – CRISES / URV, https://crises-deim.urv.cat/web/docs/publications/lncs/1084.pdf
  19. Bellman: https://link.springer.com/referenceworkentry/10.1007/978-0-387-39940-9_133
  20. On k-anonymity and the curse of dimensionality, https://www.vldb.org/archives/website/2005/program/slides/fri/s901-aggarwal.pdf
  21. Latanya Sweeney, “Uniqueness of Simple Demographics in the U.S. Population,” Carnegie Mellon University, Data Privacy Working Paper 3, 2000
  22. Su, Goel, Shukla, Narayanan: https://www.cs.princeton.edu/~arvindn/publications/browsing-history-deanonymization.pdf
  23. Michael Barbaro and Tom Zeller Jr., “A Face Is Exposed for AOL Searcher No. 4417749,” The New York Times, August 9, 2006
  24. Shmatikov: How To Break Anonymity of the Netflix Prize Dataset. arXiv cs/0610105
  25. Systematic Review of Re-Identification Attacks on Health Data – PMC, https://pmc.ncbi.nlm.nih.gov/articles/PMC3229505/
  26. https://medium.com/vijay-pandurangan/of-taxis-and-rainbows-f6bc289679a1
  27. https://dspace.mit.edu/handle/1721.1/96321
  28. https://www.cs.princeton.edu/~arvindn/publications/browsing-history-deanonymization.pdf
  29. Cynthia Dwork, “Differential Privacy,” in Automata, Languages and Programming, 33rd International Colloquium, ICALP 2006, Proceedings, Part II, ed. Michele Bugliesi et al., Lecture Notes in Computer Science 4052 (Berlin: Springer, 2006)
  30. https://research.google/blog/generating-synthetic-data-with-differentially-private-llm-inference/
  31. Guidelines for Evaluating Differential Privacy Guarantees – NIST Technical Series Publications, https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-226.pdf
  32. Privacy Tech-Know blog: When what is old is new again – The reality of synthetic data, https://www.priv.gc.ca/en/blog/20221012/

FPF Launches Major Initiative to Study Economic and Policy Implications of AgeTech

FPF and University of Arizona Eller College of Management Awarded Grant by Alfred P. Sloan Foundation to Address Privacy Implications and Data Uses of Technologies Aimed at Aging at Home

The Future of Privacy Forum (FPF), a global non-profit focused on data protection, AI and emerging technologies, has been awarded a grant from the Alfred P. Sloan Foundation to lead a two-year research project entitled Aging at Home: Caregiving, Privacy, and Technology, in partnership with the University of Arizona Eller College of Management. The project, which launched on April 1, will explore the complex intersection of privacy, economics, and the use of emerging technologies designed to support aging populations (“AgeTech”). AgeTech includes a wide range of applications and technologies, from fall detection devices and health monitoring apps to artificial intelligence (AI)-powered assistants.

As of 2024, older adults outnumber children in almost half of U.S. counties, with projections that about one in five Americans will be age 65 or older by 2034 (a year sooner than originally estimated). This rapidly aging population presents complex challenges and opportunities, particularly in the increased demand for resources necessary for senior care and the use of AgeTech to promote improved autonomy and independence.

FPF will lead rigorous, independent research into these issues, with a particular focus on the privacy expectations of seniors and caregivers, cost barriers to adoption, and the policy gaps surrounding AgeTech. The research will include experimental surveys, roundtables with industry and policy leaders, and a systematic review of economic and privacy challenges facing AgeTech solutions.

The project will be led by co-principals Jules Polonetsky, CEO of FPF, and Dr. Laura Brandimarte, Associate Professor of Management Information Systems at the University of Arizona Eller College of Management. Polonetsky is an internationally recognized privacy expert and co-editor of the Cambridge Handbook on Consumer Privacy. Brandimarte’s work focuses on the ethics of technology, with an emphasis on privacy and security, and uses quantitative methods including survey and experimental design and econometric data analysis.

Jordan Wrigley, a data and policy analyst who leads FPF health data research, will play a lead role for FPF along with members of FPF’s U.S., Global, and AI Policy teams. Jordan is a recognized and awarded health meta-analytic methodologist and researcher, whose work has informed medical care guidelines and AI data practices.

“The privacy aspects of AgeTech, such as consent and authorization, data sensitivity, and cost, need to be studied and considered holistically to create sustainable policies and build trust with seniors and caregivers as the future of aging becomes the present,” said Wrigley. “This research will seek to do just that.”

“At FPF, we believe that technology and data can benefit society and improve lives when the right laws, policies, and safeguards are in place,” added Polonetsky. “The goal of AgeTech – to assist seniors in living independently while reducing healthcare costs and caregiving burdens – impacts us all. As this field grows, it’s essential that we have the right rules in place to protect privacy and preserve dignity.”

“Technology has the potential to increase the autonomy and overall wellbeing of an ageing population, but for that to happen there has to be trust on the part of users – both that the technology will effectively be of assistance and that it will not constitute another source of data privacy and security intrusions,” added Brandimarte. “We currently know very little about the level of trust the elderly place in AgingTech and the specific needs of this at-risk population when they interact with it, including data accessibility by family members or caregivers.”

Dr. Daniel Goroff, Vice President and Program Director for Sloan, agrees, “As AgeTech evolves, it brings enormous promise—along with pressing questions about equity, access, and privacy. This initiative will provide insights about how innovations can ethically and responsibly enhance the autonomy and dignity of older adults. We’re excited to see FPF and the University of Arizona leading the way on this timely research.”

Key project outputs will include:

Sign up for our mailing list to stay informed about future progress, and reach out to Jordan Wrigley ([email protected]) if you are interested in learning more about the project.

Aging at Home: Caregiving, Privacy, and Technology is supported by the Alfred P. Sloan Foundation under Grant No. G-2025-25191.

About The Alfred P. Sloan Foundation

The ALFRED P. SLOAN FOUNDATION is a not-for-profit, mission-driven grantmaking institution dedicated to improving the welfare of all through the advancement of scientific knowledge. Established in 1934 by Alfred Pritchard Sloan Jr., then-President and Chief Executive Officer of the General Motors Corporation, the Foundation makes grants in four broad areas: direct support of research in science, technology, engineering, mathematics, and economics; initiatives to increase the quality, equity, diversity, and inclusiveness of scientific institutions and the science workforce; projects to develop or leverage technology to empower research; and efforts to enhance and deepen public engagement with science and scientists.
sloan.org | @SloanFoundation

About Future of Privacy Forum (FPF)

FPF is a global non-profit organization that brings together academics, civil society, government officials, and industry to evaluate the societal, policy, and legal implications of data use, identify the risks, and develop appropriate protections. FPF believes technology and data can benefit society and improve lives if the right laws, policies, and rules are in place. FPF has offices in Washington D.C., Brussels, Singapore, and Tel Aviv. Follow FPF on X and LinkedIn.

About the University of Arizona Eller College of Management

The Eller College of Management at The University of Arizona offers highly ranked undergraduate (BSBA and BSPA), MBA, MPA, master’s, and doctoral (Ph.D.) degrees in accounting, economics, entrepreneurship, finance, marketing, management and organizations, management information systems (MIS), and public administration and policy in Tucson, Arizona and Phoenix, Arizona.

FPF and OneTrust publish the Updated Guide on Conformity Assessments under the EU AI Act

The Future of Privacy Forum (FPF) and OneTrust have published an updated version of their Conformity Assessments under the EU AI Act: A Step-by-Step Guide, along with an accompanying Infographic. This updated Guide reflects the text of the EU Artificial Intelligence Act (EU AIA), adopted in 2024.  

Conformity Assessments (CAs) play a significant role in the EU AIA’s accountability and compliance framework for high-risk AI systems. The updated Guide and Infographic provide a step-by-step roadmap for organizations seeking to understand whether they must conduct a CA. Both resources are designed to support organizations as they navigate their obligations under the AIA and build internal processes that reflect the Act’s overarching accountability. However, they do not constitute legal advice for any specific compliance situation. 

Key highlights from the Updated Guide and Infographic:

You can also view the previous version of the Conformity Assessment Guide here.

South Korea’s New AI Framework Act: A Balancing Act Between Innovation and Regulation

On 21 January 2025, South Korea became the first jurisdiction in the Asia-Pacific (APAC) region to adopt comprehensive artificial intelligence (AI) legislation. Taking effect on 22 January 2026, the Framework Act on Artificial Intelligence Development and Establishment of a Foundation for Trustworthiness (AI Framework Act or simply, Act) introduces specific obligations for “high-impact” AI systems in critical sectors, including healthcare, energy, and public services, and mandatory labeling requirements for certain applications of generative AI. The Act also includes substantial public support for private sector AI development and innovation through its support for AI data centers, as well as projects that create and provide access to training data, and encouragement of technological standardization to support SMEs and start-ups in fostering AI innovation. 

In the broader context of public policies in South Korea that are designed to allow the advancement of AI, the Act is notable for its layered, transparency-focused approach to regulation, moderate enforcement approach compared to the EU AI Act, and significant public support intended to foster AI innovation and development. We cover these in Parts 2 to 4 below. 

Key features of the law include:

In Part 5, we compare the Act with the European Union (EU)’s AI Act (EU AI Act). We note that while the AI Framework Act shares some common elements with the EU AI Act, including tiered classification and transparency mandates, South Korea’s regulatory approach differs in its simplified risk categorization, including the absence of prohibited AI practices, comparatively lower financial penalties, and the establishment of initiatives and government bodies aimed at promoting the development and use of AI technologies. The intent of this comparison is to assist practitioners in understanding and analyzing key commonalities and differences between both laws.

Finally, Part 6 of this article places the Act within South Korea’s broader AI innovation strategy and discusses the challenges of regulatory alignment between the Ministry of Science and ICT (MSIT) and South Korea’s data protection authority, the Personal Information Protection Commission (PIPC), in South Korea’s evolving AI governance landscape.

1. Background 

On 26 December 2024, South Korea’s National Assembly passed the Framework Act on Artificial Intelligence Development and Establishment of a Foundation for Trustworthiness (AI Framework Act or Act). 

The AI Framework Act was officially promulgated on 21 January 2025 and will take effect on 22 January 2026, following a one-year transition period to prepare for compliance. During this period, MSIT will assist with the issuance of Presidential Decrees and other sub-regulations and guidelines to clarify implementation details.

South Korea was the first country in the Asia-Pacific region to introduce a comprehensive AI bill, doing so in 2021 with the Bill on Fostering Artificial Intelligence and Creating a Foundation of Trust. However, the legislative process faced significant hurdles, including political uncertainty surrounding the April 2024 general elections, raising concerns that the bill could be scrapped entirely.

By November 2024, South Korea’s AI policy landscape had grown increasingly complex, with 20 separate AI governance bills introduced since the National Assembly began its new term in June 2024, each independently proposed by different members. That month, the Information and Communication Broadcasting Bill Review Subcommittee conducted a comprehensive review of these AI-related bills and consolidated them into a single framework, leading to the passage of the AI Framework Act.

At its core, the AI Framework Act adopts a risk-based approach to AI regulation. In particular, it introduces specific obligations for high-impact AI systems and generative AI applications. The AI Framework Act also has extraterritorial reach: it applies to AI activities that impact South Korea’s domestic market or users.

This blog post examines the key provisions of the Act, including its scope, regulatory requirements, and implications for organizations developing or deploying AI systems.

2. The Act establishes a layered approach to AI regulation

2.1 Definitions lay the foundation for how different AI systems will be regulated under the Act

Article 2 of the Act provides three AI-related definitions. 

At the core of the Act’s layered approach is its definition of “high-impact AI” (which is subject to more stringent requirements). “High-impact AI” refers to AI systems “that may have a significant impact on or pose a risk to human life, physical safety, and basic rights,” and that are utilized in critical sectors identified under the AI Framework Act, including energy, healthcare, nuclear operations, biometric data analysis, public decision-making, education, or other areas that have a significant impact on the safety of human life and body and the protection of basic rights as prescribed by Presidential Decree.

The Act also introduces specific provisions for “generative AI.” The Act defines generative AI as AI systems that create text, sounds, images, videos, or other outputs by imitating the structure and characteristics of the input data. 

The Act also defines an “AI Business Operator” as corporations, organizations, government agencies, or individuals conducting business related to the AI industry. The Act subdivides AI Business Operators into two sub-categories (which effectively reflect a developer-deployer distinction): 

Currently, as will be covered in more detail below, the obligations under the Act apply to both categories of AI Business Operators, regardless of their specific roles in the AI lifecycle. For example, transparency-related obligations apply to all AI Business Operators, regardless of whether they are involved in the development and/or deployment phases of AI systems. It remains to be seen if forthcoming Presidential Decrees to implement the Act will introduce more differentiated obligations for each type of entity.

While the Act expressly excludes AI used solely for national defense and security from its scope, the Act applies to both government agencies and public bodies when they are involved in the development, provision, or use of AI technology in a business-related context. More broadly, the Act also assigns the government a significant role in shaping AI policy, providing support, and overseeing the development and use of AI.

2.2 The AI Framework Act has broad extraterritorial reach

Under Article 4(1), the Act applies not only to acts conducted within South Korea but also to those conducted abroad that impact South Korea’s domestic market, or users in South Korea. This means that foreign companies providing AI systems or services to users in South Korea will be subject to the Act’s requirements, even if they lack a physical presence in the country. 

However, Article 4(2) of the Act introduces a notable exemption for AI systems developed and deployed exclusively for national defense or security purposes. These systems, which will be designated by Presidential Decree, fall outside the Act’s regulatory framework.

For global organizations, the Act’s jurisdictional scope raises key compliance considerations. Companies will likely need to assess whether their AI activities fall under South Korea’s regulatory reach, particularly if they:

This last criterion appears to be a novel policy proposition and differentiates the AI Framework Act from the EU AI Act, potentially making it broader in reach. This is because it does not seem necessary for an AI system to be placed on the South Korean market for the condition to be triggered, but simply for the AI-related activity of a covered entity to “indirectly impact” the South Korean market. 

2.3 The Act establishes a multi-layered approach to AI safety and trustworthiness requirements

(i) The Act emphasizes oversight of high-impact AI but does not prohibit particular AI uses 

For most AI Business Operators, compliance obligations under the AI Framework Act are minimal. There are, however, noteworthy obligations – relating to transparency, safety, risk management and accountability – that apply to AI Business Operators deploying high-impact AI systems. 

Under Article 33, AI Business Operators providing AI products and services must “review in advance” (this presumably means before the relevant product or service is released into a live environment or goes to market) whether their AI system is considered “high-impact AI.” Businesses may request confirmation from the MSIT on whether their AI system is to be considered “high-impact AI.”

Under Article 34, organizations that offer high-impact AI, or products or services using high-impact AI, must meet much stricter requirements, including:

1. Establishing and operating a risk management plan.

2. Establishing and operating a plan to provide explanation for AI-generated results within technical limits, including key decision criteria and an overview of training data.

3. Establishing and operating “user protection measures.”

4. Ensuring human oversight and supervision of high-impact AI.

5. Preserving and storing documents that demonstrate measures taken to ensure AI safety and reliability.

6. Following any additional requirements imposed by the National AI Committee (established under the Act) to enhance AI safety and reliability.

Under Article 35, AI Business Operators are also encouraged to conduct impact assessments for high-impact AI systems to evaluate their potential effects on fundamental rights. While the language of the Act (i.e., “shall endeavor to conduct an impact assessment”) suggests that these assessments are not mandatory, the Act introduces an incentive: where a government agency intends to use a product or service using high-impact AI, the agency is to prioritize AI products or services that have undergone impact assessments in public procurement decisions. Legislatively stipulating the use of public procurement processes to incentivize businesses to conduct impact assessments appears to be a relatively novel move and arguably reflects the innovation-risk duality seen across the Act.

(ii) The Act prioritizes user awareness and transparency for generative AI products and services 

The AI Framework Act introduces specific transparency obligations for generative AI providers. Under Article 31(1), AI Business Operators offering high-impact or generative AI-powered products or services must notify users in advance that the product or service utilizes AI. Further, under Article 31(2), AI Business Operators providing generative AI as a product or service must also indicate that the output was generated by generative AI.

Beyond general disclosure, Article 31(3) of the Act mandates that where an AI Business Operator uses an AI system to provide virtual sounds, images, video or other content that are “difficult to distinguish from reality,” the AI Business Operator must “notify or display the fact that the result was generated by an (AI) system in a manner that allows users to clearly recognize it.” 

However, the provision also provides flexibility for artistic and creative expressions. It permits notifications or labelling to be displayed in ways intended to not hinder creative expression or appreciation. This approach appears aimed at balancing the creative utility of generative AI with transparency requirements. Technical details, such as how notification or labelling should be implemented, will be prescribed by Presidential Decree.
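For readers who prefer to see the three tiers of Article 31 side by side, the following is a minimal, illustrative Python sketch of how the obligations described above might be mapped to a given product. It is based only on the summary in this post and is not a statement of the law: the `AIService` class, its field names, and the `transparency_duties` function are hypothetical, and the forthcoming Presidential Decree may change how these duties apply.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AIService:
    """Hypothetical description of an AI product or service (not a statutory term)."""
    is_high_impact: bool
    is_generative: bool
    # Output that is "difficult to distinguish from reality" (Article 31(3)).
    produces_realistic_synthetic_content: bool
    is_artistic_or_creative: bool = False


def transparency_duties(service: AIService) -> list[str]:
    """Return the Article 31 duties that appear to apply, per the summary above."""
    duties = []
    # Article 31(1): advance notice for high-impact or generative AI.
    if service.is_high_impact or service.is_generative:
        duties.append("31(1): notify users in advance that AI is used")
    # Article 31(2): generative AI output must be indicated as such.
    if service.is_generative:
        duties.append("31(2): indicate that output was generated by generative AI")
    # Article 31(3): realistic synthetic content needs a clearly recognizable notice.
    if service.produces_realistic_synthetic_content:
        duty = "31(3): clearly notify or display that the result was AI-generated"
        if service.is_artistic_or_creative:
            # Assumption: the flexibility changes how the notice is shown, not whether.
            duty += " (in a manner that does not hinder creative expression)"
        duties.append(duty)
    return duties
```

For example, calling `transparency_duties(AIService(False, True, True, True))` for a generative tool that produces realistic artistic video returns all three duties, with the third carrying the creative-expression qualifier.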

(iii) The Act establishes other requirements that apply when certain thresholds are met

The following requirements focus on safety measures and operational oversight, including specific provisions for foreign AI providers.

Under Article 32, AI Business Operators that operate AI systems whose computational learning capacity exceeds prescribed thresholds are required to identify, assess, and mitigate risks throughout the AI lifecycle, and establish a risk management system to monitor and respond to AI-related safety incidents. AI Business Operators must document and submit their findings to the MSIT. 

For accountability, Article 36 provides that AI Business Operators that lack a domestic address or place of business and that cross certain user number or revenue thresholds (to be prescribed) must appoint a “domestic representative” with an address or place of business in South Korea. The details of the domestic representative must be provided to the MSIT. 

These domestic representatives take on significant responsibilities, including:

3. The Act grants the MSIT significant investigative and enforcement powers

3.1 The legislation empowers the MSIT with broad authority to investigate potential violations of the Act 

Under Article 40 of the Act, the MSIT is empowered to investigate businesses that it suspects of breaching any of the following requirements under the Act:

When potential breaches are identified, the MSIT may carry out necessary investigations, including on-site investigations, and may compel AI Business Operators to submit relevant data. During these inspections, authorized officials can examine business records, operational documents, and other critical materials, following established administrative investigation protocols.

If violations are confirmed, the MSIT can issue corrective orders, requiring businesses to immediately halt non-compliant practices and implement necessary remediation measures. 

3.2 The Act takes a relatively moderate approach to penalties compared to other global AI regulations 

Under Article 43 of the Act, administrative fines of up to KRW 30 million (approximately USD 20,707) may be imposed for:

This enforcement structure caps fines at lower amounts than other global AI regulations. 

4. The Act promotes the development of AI technologies through strategic support for data infrastructure and learning resources

The MSIT is responsible for developing comprehensive policies to support the entire lifecycle of AI training data, ensuring that businesses have access to high-quality datasets essential for AI development. To achieve this, the Act mandates government-led initiatives to:

A key initiative under the Act can be found in Article 25, which provides for the promotion of policies to establish and operate AI Data Centers. Under Article 25(2), the South Korean government may provide administrative and financial support to facilitate the construction and operation of data centers. These centers will provide infrastructure for AI model training and development, ensuring that businesses of all sizes – including small and medium-sized enterprises (SMEs) – have access to these resources.

The Act also promotes the advancement and safe use of AI by encouraging technological standardization (Articles 13 and 14), supporting SMEs and start-ups, and fostering AI-driven innovation. It also facilitates international collaboration and market expansion while establishing a framework for AI testing and verification (Articles 13 and 14). Together, these measures aim to strengthen South Korea’s broader AI ecosystem and ensure its responsible development and deployment.

5. Comparing the approaches of South Korea’s AI Framework Act and the EU’s AI Act reveals both convergences and divergences

As South Korea is only the second jurisdiction globally to enact comprehensive national AI regulation, comparing its AI Framework Act with the EU AI Act helps illuminate both its distinctive features and its place in the emerging landscape of global AI governance. As many companies will need to navigate both frameworks, understanding their similarities and differences is essential for global compliance strategies.

Table 1. Comparison of Key Aspects of the South Korea AI Framework Act and EU AI Act

6. Looking ahead

South Korea’s AI Framework Act is the first omnibus AI regulation in the APAC region. The South Korean model is notable for establishing an alternative approach to AI regulation: one that seeks to balance the promotion of AI innovation, development, and use with safeguards for high-impact aspects.

6.1 Though the Act establishes a framework for direct regulation of AI, several critical areas require further definition through Presidential Decree

The areas that are expected to be clarified through Presidential Decree include:

The interpretation and implementation of these provisions will significantly shape compliance expectations, influencing how AI businesses—both domestic and international—navigate the regulatory landscape.

6.2 The Act must also be considered in the context of South Korea’s broader efforts to position the country as a leader in AI innovation 

The first – and arguably most significant – of these efforts is a bill recently introduced by members of the National Assembly, which seeks to amend the Personal Information Protection Act (PIPA) by creating a new legal basis for the processing of personal information specifically for the development and use of AI. The bill introduces a new Article 28-12, which would permit the use of personal information beyond its original purpose of collection, specifically for the development and improvement of AI systems. This amendment would allow such processing provided that:

Second, South Korea’s government is also reportedly exploring other legal reforms to its data protection law to facilitate the development of AI. According to PIPC Chairman Haksoo Ko’s recent interview with a global regulatory news outlet, these reforms could potentially include reforming the “legitimate interests” basis for processing personal information under the PIPA.

South Korea’s Minister for Science and ICT Yoo Sang-im has also reportedly urged the National Assembly to swiftly pass a law on the management and use of government-funded research data to advance scientific and technological development in the AI era.

Third, while creating these pathways for innovation, the PIPC has simultaneously been developing mechanisms to provide oversight over AI systems. For instance, the PIPC’s comprehensive policy roadmap for 2025 (Policy Roadmap) announced in January 2025 outlines an ambitious regulatory framework for AI governance and data protection. In particular, the Policy Roadmap envisions the implementation of specialized regulatory and oversight provisions for the use of unmodified personal data in AI development. 

The Policy Roadmap is supplemented by the PIPC’s Work Direction for Investigations in 2025 (Work Direction). Published in January 2025, the Work Direction includes measures intended to provide additional oversight over AI services, including conducting preliminary onsite inspections of AI-powered services, such as AI agents, and reviewing the use of personal information in AI-based legal and human resources services.

A possible instance of this additional emphasis on providing oversight arose in February 2025, when the PIPC announced a temporary suspension of new downloads of the Chinese generative AI application DeepSeek over concerns about potential breaches of the PIPA.

Fourth, South Korea is seeking to strengthen the accountability of foreign organizations. The PIPC expressed its support for a bill amending the PIPA’s domestic representative system for foreign organizations; the PIPA was subsequently amended, with the changes taking effect on 1 April 2025. This amendment bill addresses a significant gap in the current system, which has allowed foreign companies to designate unrelated third parties as their domestic agents in South Korea, often resulting in what one lawmaker described as “formal” compliance without meaningful accountability.

The new requirements would mandate that foreign companies with established business units in South Korea designate those local entities as their representatives, while imposing explicit obligations on foreign headquarters to properly manage and supervise these domestic agents. The bill also establishes sanctions for violations of these requirements, including fines of up to KRW 20 million (approximately USD 14,000). 

Fifth, South Korea is seeking to position itself as a global leader in privacy and AI governance through international cooperation and thought leadership. As South Korea prepares to host the annual Global Privacy Assembly in September 2025 – an event involving participants from 95 countries – the PIPC is positioning itself as a bridge between different regional approaches to data protection and AI governance.

6.3 However, these efforts highlight the persistent challenge of ensuring clear alignment between key regulatory authorities in South Korea’s AI governance landscape

Whilst the MSIT was working to finalize the AI Framework Act, the PIPC, like its counterparts in many other jurisdictions globally, has been assuming a de facto regulatory role for AI applications involving personal data.

However, while the AI Framework Act assigns primary responsibility for AI governance to the MSIT, it does not appear to address or acknowledge the PIPC’s role in the regulatory landscape. This creates a potential situation where two parallel AI regulators – one de jure and the other de facto – will likely continue to operate: the MSIT overseeing general AI system safety and trustworthiness under the AI Framework Act, and the PIPC maintaining its oversight of personal data processing in AI systems under the PIPA.

As a result, organizations developing or deploying AI systems in South Korea may need to navigate compliance requirements from both authorities, particularly when their AI systems process personal data. How this dual regulatory structure evolves and whether a more unified governance approach emerges will be a critical factor in determining the success of South Korea’s ambitious AI strategy in the coming years.

Despite these practical challenges, South Korea’s approach to AI regulation offers a potential governance model for other APAC jurisdictions. Ultimately, the success of the Act will depend on how effectively it balances its dual objectives: fostering AI innovation while ensuring responsible deployment. As AI governance evolves globally, the South Korean experience will provide valuable insights for policymakers, regulators, and industry stakeholders worldwide.

Note: The summary of the AI Framework Act above is based on an English machine translation, which may contain inaccuracies. Additionally, the information should not be considered legal advice. For specific legal guidance, kindly consult a qualified lawyer practicing in South Korea.

The authors would like to thank Josh Lee Kok Thong, Dominic Paulger, and Vincenzo Tiani for their contributions to this post.

Little Rock, Minor Rights: Arkansas Leads with COPPA 2.0-Inspired Law

With thanks to Daniel Hales and Keir Lamont for their contributions.

Shortly before the close of its 2025 session, the Arkansas legislature passed HB 1717, the Arkansas Children and Teens’ Online Privacy Protection Act, with unanimous votes. As the name suggests, Arkansas modeled this legislation after Senator Markey’s federal “COPPA 2.0” proposal, which passed the U.S. Senate as part of a broad child online safety package last year. Presuming enactment by Governor Sarah Huckabee Sanders, HB 1717 will take effect on July 1, 2026. The Arkansas law, or “Arkansas COPPA 2.0,” establishes privacy protections for teens aged 13 to 16, introduces substantive data minimization requirements including prohibitions on targeted advertising, and provides new rights to access, delete, and correct personal information for teens. The legislature also considered an Arkansas version of the federal Kids Online Safety Act, but this proposal ultimately failed, with the bill’s sponsor noting some uncertainties about its constitutionality.

What to know about Arkansas HB 1717: 

The substantive data minimization trend continues

While the federal COPPA framework is largely focused on consent, former Commissioner Slaughter noted in 2022 that people “may be surprised to know that COPPA provides for perhaps the strongest, though under-enforced, data minimization rule in US privacy law.” Arkansas builds on these requirements and follows the recent shift towards substantive data minimization with a complex web of layered requirements that operators must satisfy to use both child and teen data:

In practice, the interaction between these distinct requirements may raise difficult questions of statutory interpretation.

Differences from federal COPPA 2.0

As originally introduced, Arkansas’s bill was nearly identical to last year’s federal COPPA 2.0 bill. Arkansas’s framework went through various, largely business-friendly amendments (and one bill number switch) during its legislative journey. Though HB 1717 maintains the same general framework as COPPA 2.0, it includes several important divergences:

Could COPPA preempt the Arkansas law?

One question likely to emerge from Arkansas COPPA 2.0 is whether certain provisions, or the entire law, may be subject to federal preemption under the existing COPPA statute. COPPA includes an express preemption clause that prohibits state laws from imposing requirements that are inconsistent with COPPA. This is relevant in two ways as the Arkansas law will both (1) extend protections to teens and (2) introduce new substantive limitations on the use of children’s and teens’ data, such as limits on targeted advertising and strict data minimization requirements, that go beyond COPPA’s scope. 

The question of COPPA preemption was recently explored in Jones v. Google, with the FTC filing an amicus brief arguing that state laws that “supplement” or “require the same thing” as COPPA are not inconsistent. The FTC references the Congressional record from when COPPA was contemplated, arguing that “Congress viewed ‘the States as partners’. . . rather than as potential intruders on an exclusively federal arena,” and that “the state law protections at issue ‘complement–rather than obstruct–Congress’ ‘full purposes and objectives in enacting the statute.’” It is also worth keeping in mind that the FTC has been in the process of finalizing an update to the COPPA Rule, which could introduce additional inconsistencies, or at least compliance confusion, between the new final Rule and Arkansas COPPA 2.0 when it comes to key terms like the definition of personal information or whether targeted advertising is allowed with consent. 

A trend to watch?

The passage of Arkansas COPPA 2.0 may signal an emerging trend towards a potentially more constitutionally resilient approach to protecting children and teens online. Unlike age-appropriate design codes or social media age verification mandates, which have faced significant First Amendment challenges, Arkansas COPPA 2.0 takes a more targeted approach focused on privacy and data governance, rather than access, online safety, or content. Questions of preemption and drafting quirks aside, this approach may be on firmer ground by focusing on data protection practices and building on a longstanding federal privacy framework. As states explore new ways to safeguard youth online without triggering constitutional pitfalls, privacy-focused legislation modeled on COPPA standards could become a popular path forward. 

Chatbots in Check: Utah’s Latest AI Legislation

With the close of Utah’s short legislative session, the Beehive State is once again an early mover in U.S. tech policy. In March, Governor Cox signed several bills related to the governance of generative Artificial Intelligence systems into law. Among them, SB 332 and SB 226 amend Utah’s 2024 Artificial Intelligence Policy Act (AIPA) while HB 452 establishes new regulations for mental health chatbots.

The Future of Privacy Forum has released a chart detailing key elements of these new laws.

Amendments to the Artificial Intelligence Policy Act

SB 332 and SB 226 update Utah’s Artificial Intelligence Policy Act (SB 149), which took effect May 1, 2024. The AIPA requires entities using consumer-facing generative AI services to interact with individuals within regulated professions (those requiring a state-granted license such as accountants, psychologists, and nurses) to disclose that individuals are interacting with generative AI, not a human. The Act was initially set to automatically repeal on May 7, 2025. 

SB 332 extends the AIPA’s expiration date by two years, ensuring its provisions remain in effect until July 2027, while SB 226 narrows the law’s scope by limiting generative AI disclosure requirements only to instances when directly asked by a consumer or supplier, or during a “high-risk” interaction. The bill defines “high-risk” interactions to include instances where a generative AI system collects sensitive personal information and involves significant decisionmaking, such as in financial, legal, medical, and mental health contexts. SB 226 includes a safe harbor for AI suppliers if they provide clear disclosures at the start or throughout an interaction, ensuring users are aware they are engaging with AI. 
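As a rough illustration of the narrowed disclosure triggers described above, the logic can be sketched in a few lines of Python. This is a minimal sketch based solely on this post's summary, not on the statutory text: the `Interaction` class, its field names, and both functions are hypothetical, and the sketch assumes the conjunctive reading suggested by the summary (a "high-risk" interaction both collects sensitive personal information and involves significant decisionmaking).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    """Hypothetical record of one generative AI interaction (illustrative only)."""
    asked_if_ai: bool                    # a consumer or supplier directly asks
    collects_sensitive_info: bool        # sensitive personal information is collected
    involves_significant_decision: bool  # e.g. financial, legal, medical, mental health


def is_high_risk(interaction: Interaction) -> bool:
    # Assumption: both elements must be present, following the "and" in the summary.
    return (interaction.collects_sensitive_info
            and interaction.involves_significant_decision)


def disclosure_required(interaction: Interaction) -> bool:
    # Disclosure is triggered by a direct question or by a high-risk interaction.
    return interaction.asked_if_ai or is_high_risk(interaction)
```

Under this reading, a routine customer-service chat in which nobody asks whether they are talking to AI would not trigger disclosure, while a session that gathers health details to inform a medical decision would. The safe harbor for suppliers who disclose clearly at the start or throughout an interaction is not modeled here.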

Mental Health Chatbots

Though HB 452 does not directly amend the AIPA, it is closely linked to the broader AI governance framework established by the law. As part of AIPA, Utah established a regulatory sandbox program and created the Office of Artificial Intelligence Policy to oversee AI governance and innovation in the state. One of the AI Office’s early priorities has been assessing the role of AI-driven mental health chatbots in licensed medical practice.

To address concerns surrounding these chatbots, the AI Office convened stakeholders to explore potential regulatory approaches. These discussions, along with the state’s first regulatory mitigation agreement under the AIPA’s sandbox program involving a student-focused mental health chatbot, helped shape the passage of HB 452. The bill establishes new rules governing the use of AI-driven mental health chatbots in Utah, including:

Utah’s latest round of legislation reflects a continued focus on targeted and risk-based regulation for emerging AI systems. Building on the foundation set by the 2024 Artificial Intelligence Policy Act, the new laws reflect an emerging national trend towards affirmatively supporting AI development and innovation while focusing regulatory interventions on particularly high-risk sectors such as healthcare. Utah’s approach to balancing innovation, regulation, and consumer protection in the AI space may produce lessons and influence legislators in other states.

FPF Publishes Infographic, Readiness Checklist To Support Schools Responding to Deepfakes

Today, the Future of Privacy Forum (FPF) released an infographic and readiness checklist to help schools better understand and prepare for the risks posed by deepfakes. Deepfakes are realistic, synthetic media, including images, videos, audio, and text, created using a type of Artificial Intelligence (AI) called deep learning. By manipulating existing media, deepfakes can make it appear as though someone is doing or saying something that they never actually did. 

Deepfakes, while relatively new, are quickly becoming prevalent in K-12 schools. Schools have a responsibility to create a safe learning environment, and a deepfake incident – even if it happens outside of school – poses real risks to that environment, including through bullying and harassment, the spread of misinformation and disinformation, personal safety and privacy concerns, and broken trust.

FPF’s infographic describes the different types of deepfakes – video, text, image, and audio – and the varied risks and considerations posed by each in a school setting, from the potential for fabricated phone calls and voice messages impersonating teachers to sharing forged, non-consensual intimate imagery (NCII).

“Deepfakes create complicated ethical and security challenges for K-12 schools that will only grow as the technology becomes more accessible and sophisticated, and the resulting images harder to detect,” said Jim Siegl, Senior Technologist with FPF’s Youth & Education Privacy team. “Schools should understand the risks, their responsibilities and protocols in place to respond, and how they will protect students, staff, and administrators while addressing an incident.”

FPF has also developed a readiness checklist to support schools in assessing and preparing response plans. The checklist outlines a series of considerations for school leaders, from the need for education and training, to determining how existing technology, policies, and procedures might apply, to engaging legal counsel and law enforcement.

The infographic maps out the various stages of a school’s response to an example scenario – a student reporting that they received a sexually explicit photo of a friend and that the image is circulating among a group of students – inviting school leaders to consider the following:

As an additional resource for school leaders and policymakers navigating the rapid deployment of AI and related technologies in schools, FPF has developed an infographic highlighting the varied use cases of AI in an educational setting. While deepfakes are a new and evolving challenge, edtech tools using AI have been in schools for years.

FPF Privacy Papers for Policymakers: A Celebration of Impactful Privacy Research and Scholarship

The Future of Privacy Forum (FPF) hosted its 15th Privacy Papers for Policymakers (PPPM) event at its Washington, D.C., headquarters on March 12, 2025. This prestigious event recognized six outstanding research papers that offer valuable insights for policymakers navigating the ever-evolving landscape of privacy and technology. The evening featured engaging discussions and a shared commitment to advancing informed policymaking in digital privacy.

FPF Board President Alan Raul

Daniel Hales, FPF Policy Fellow, kicked off the event as the emcee and recognized the contributions of FPF Board President Alan Raul and Board Secretary-Treasurer Debra Berlyn, along with the FPF staff who helped organize the gathering. Alan Raul, in his opening remarks, emphasized the significance of privacy scholarship and its relevance to policymakers worldwide. He noted that the PPPM event has, for 15 years, successfully brought together scholars, regulators, and industry leaders to discuss privacy research with real-world implications.

Daniel Hales

Lee Matheson, FPF Deputy Director for Global Privacy, opened the discussion by introducing Professor Mark Jia (Georgetown University Law Center), who explored the evolution of privacy law in China. His paper, Authoritarian Privacy, challenges the notion that privacy is solely a Western concept and argues that China’s privacy framework has been shaped not only by state interests but also by public concerns. Professor Jia discussed the role of the Cyberspace Administration of China (CAC) and how privacy regulations have been influenced by social unrest and legitimacy concerns within the government. He emphasized that China’s Personal Information Protection Law (PIPL) is enforceable and not merely symbolic. Their discussion also touched on public “flashpoints” that have prompted government responses and the broader implications for understanding regulatory trends in authoritarian regimes.

Professor Mark Jia and Lee Matheson

Professor Mark MacCarthy (Georgetown University) introduced Alice Xiang (Sony AI) to discuss her paper Mirror, Mirror, on the Wall, Who’s the Fairest of Them All?, which examines algorithmic bias in artificial intelligence models. Ms. Xiang’s research critiques the assumption that fair data sets automatically lead to fair AI outcomes and highlights the challenges in defining fairness. She noted that while engineers often bear the responsibility of addressing bias, broader policy frameworks are needed. Their discussion explored the tension between AI neutrality and the necessity for companies to engage with ethical and social justice considerations. Ms. Xiang argued that AI systems mirror existing societal inequalities rather than solve them and called for stronger regulatory oversight to ensure transparency and accountability in AI decision-making.

Alice Xiang and Professor Mark MacCarthy

Next, Jocelyn Aqua (PwC) conversed with Miranda Bogen (Center for Democracy and Technology), whose paper Navigating Demographic Measurement for Fairness and Equity addresses the paradox of measuring fairness in AI while protecting individuals’ privacy. Ms. Bogen categorized fairness assessment into three key areas: measuring disparities, selecting appropriate metrics, and implementing mitigation strategies. She pointed out that privacy laws like GDPR and CCPA create barriers to demographic data collection, complicating efforts to assess bias in AI systems. The conversation emphasized the need for alternative privacy-preserving methods, such as statistical inference and qualitative analysis, to reconcile fairness assessments with privacy protections. Ms. Bogen called for policymakers to establish clearer guidelines that allow for responsible demographic measurement while ensuring compliance with privacy laws.

Miranda Bogen and Jocelyn Aqua

The discussion then turned to Brenda Leong (ZwillGen), who introduced Tom Zick (Orrick, Herrington & Sutcliffe LLP) and Tobin South (Stanford University), two of the co-authors of the paper, Personhood Credentials: Artificial intelligence and the value of privacy-preserving tools to distinguish who is real online. Their paper explores the concept of “personhood credentials,” proposing a decentralized approach to verifying online identities while balancing security and privacy. The authors highlighted the risks posed by AI-driven identity fraud and the need for robust authentication mechanisms that protect user privacy. The conversation covered potential issuers of personhood credentials, including governments and private organizations, and the challenges of industry-wide adoption. Ultimately, the paper argues for the importance of developing privacy-first verification solutions that minimize data exposure while maintaining trust in digital interactions.

Tobin South, Tom Zick, and Brenda Leong

Turning to another critical issue, Professor Daniel J. Solove (George Washington University Law School) discussed his paper (co-authored by Boston University Professor Woodrow Hartzog) The Great Scrape: The Clash Between Scraping and Privacy with Jennifer Huddleston (Cato Institute). Professor Solove examined the legal and ethical complexities of data scraping, arguing that while scraping has long existed in a legal gray area, the rise of AI has heightened privacy concerns. He challenged the perception that publicly available data is free for unrestricted use, noting that privacy laws are evolving to address these issues. The discussion explored potential regulatory solutions, emphasizing the importance of distinguishing between beneficial scraping and harmful practices that exploit personal data. Professor Solove advocated for a public interest standard to determine when scraping should be permissible and called for clearer legal frameworks to protect individuals from data misuse.

Professor Daniel J. Solove and Jennifer Huddleston

In the last discussion, Professor James C. Cooper (Antonin Scalia Law School – George Mason University) joined Professor Alicia Solow-Niederman (George Washington University Law School) to discuss her paper The Overton Window and Privacy Enforcement. Professor Solow-Niederman explained how internal norms, congressional oversight, judicial rulings, and public sentiment collectively shape the Federal Trade Commission’s (FTC) approach to privacy enforcement. The conversation also highlighted recent cases where the FTC has expanded its enforcement scope, including actions against data brokers and algorithmic decision-making. The paper argues that policymakers need to balance their legal authority with evolving public expectations to ensure effective privacy enforcement.

Professor Alicia Solow-Niederman and Professor James C. Cooper

John Verdi, FPF’s Senior Vice President for Policy, closed the event by thanking the winning authors, discussants, event team, and FPF’s Daniel Hales for their contributions. He highlighted FPF’s role in bringing together academia, policy, and industry experts to promote meaningful discussions on privacy.

Read the 15th Annual Privacy Papers for Policymakers Digest

FPF Releases Report on the Adoption of Privacy Enhancing Technologies by State Education Agencies

The Future of Privacy Forum (FPF) released a landscape analysis of the adoption of Privacy Enhancing Technologies (PETs) by State Education Agencies (SEAs). As agencies face increasing pressure to leverage sensitive student and institutional data for analysis and research, PETs offer a potential solution: they are advanced technologies designed to protect data privacy while preserving the utility of the results that analyses yield.

FPF worked with AEM Corporation to conduct a landscape analysis, including an overview of current PETs adoption, current challenges, and considerations for enhancing data protection measures. The landscape analysis, first previewed in a late 2024 webinar and expert panel discussion, evaluated the organizational readiness and critical use cases for PETs within SEAs and the broader education sector. It ultimately highlights the need to raise awareness of what PETs are and what they are not, the range of available types of PETs, their potential use cases, and considerations for the effective adoption and sustainable implementation of these technologies.

“Intentional PETs implementation can boost community trust, enhance data analysis, and effectively ensure critical privacy protections,” said Jim Siegl, FPF Senior Technologist for Youth & Education Privacy. “But as our landscape analysis highlights, despite the advances PETs offer to SEAs in utilizing the data they steward, a gap persists in applying these technologies and realizing their potential benefits.”

Key findings outlined in the report include:

The report also outlines a series of recommendations to support PET adoption at scale, including establishing a shared vocabulary, creating trusted introductory resources, and curating relevant use cases to raise collective awareness about the capabilities and limitations of PETs. Additional recommendations include developing a PETs readiness model, focusing on core capabilities, and providing targeted technical assistance to support sustainable PET adoption and implementation. 

Recognizing the need for a deeper understanding of the potential and limitations of these technologies, FPF has actively contributed to shaping policymaking around PETs through discussion papers, reports, and stakeholder engagement. FPF’s PETs Repository, launched in November 2024, is a centralized, trusted, and up-to-date resource where individuals and organizations interested in these technologies can find practical and useful information.