Cross-Border Data Flows in Africa: Examining Policy Approaches and Pathways to Regulatory Interoperability
Cross-border data flows are critical to Africa’s digital economy, enabling trade, innovation, and access to continental and global markets. As the drive towards data-driven technologies among businesses and governments grows, the ability to transfer personal data across borders efficiently and securely has become a key policy concern on the continent, a position echoed by the African Union (AU) and its Member States. This Issue Brief provides an overview of the current policy landscape for inter-African cross-border data flows, and proposes possible paths toward regulatory cooperation.
The Issue Brief begins by highlighting ongoing sub-regional efforts to shape frameworks for cross-border data flows, including through the work of the African Union, the Economic Community of West African States (ECOWAS), the East African Community (EAC), and the Southern African Development Community (SADC). These efforts show early alignment toward shared standards, but also underline the diversity of legal frameworks and enforcement capacity across jurisdictions.
The Brief introduces a taxonomy of cross-border data regimes in Africa, identifying two common approaches: the first encompasses countries with no cross-border data flow provisions, either because such provisions are omitted from the law or because the country lacks a comprehensive data protection law altogether; the second includes countries that impose restrictions on transferring personal data to other African countries.
To operationalize inter-African cross-border data flows, legal frameworks on the continent increasingly reference data transfer tools. The Issue Brief explores the use and implementation of mechanisms such as adequacy decisions, certification mechanisms, standard contractual clauses (SCCs), binding corporate rules (BCRs), and derogations, currently in use across Kenya, Nigeria, South Africa, Rwanda, and Ivory Coast. This comparative analysis highlights that the practical implementation of transfer tools remains uneven across the continent, and that many countries lack clear guidance or infrastructure to support their use.
In the final section of the Issue Brief, we outline policy considerations and opportunities for convergence on cross-border data flows across the continent, encouraging African countries to work toward interoperable data transfer frameworks that reflect shared values.
FPF Unveils Paper on State Data Minimization Trends
Today, the Future of Privacy Forum (FPF) published a new paper—Data Minimization’s Substantive Turn: Key Questions & Operational Challenges Posed by New State Privacy Legislation. Data minimization is a bedrock principle of privacy and data protection law, with origins in the Fair Information Practice Principles (FIPPs) and the Privacy Act of 1974. At a high level, data minimization prohibits a covered entity from collecting, using, or retaining more personal data than is necessary to accomplish an identified, lawful purpose.
In recent years, data minimization has emerged as a contested and priority issue in privacy legislation. Under many existing state privacy laws, companies have been subject to “procedural” data minimization requirements whereby collection and use of personal data is permitted so long as it is adequately disclosed or consent is obtained. As privacy advocates have pushed to shift away from notice-and-choice, some policymakers have begun to embrace new “substantive” data minimization rules that aim to place default restrictions on the purposes for which personal data can be collected, used, or shared, typically requiring some connection between the personal data and the provision or maintenance of a requested product or service. This white paper explores this ongoing trend towards substantive data minimization, with a focus on the unresolved questions and policy implications of this new language.
Part I of the paper identifies the relevant standards: procedural data minimization (the majority rule); substantive data minimization (the rule that is currently law in Maryland and several sectoral laws); and reasonable expectations (the approach taken by California). This rise of substantive data minimization rules raises a number of challenges and unresolved questions, which are explored in Part II. Some of these questions include the role of consent, what is a “requested” product or service, and what is “necessary” to provide a requested product or service.
For its proponents, this substantive turn promises to better align companies’ collection and use of personal data with consumers’ reasonable expectations. For its detractors, however, the trend threatens to upend longstanding business practices, introduce legal uncertainty, and jeopardize socially beneficial uses of data. At the core of this debate is the societal value of different uses of data, and whether certain data uses should be allowed, encouraged, discouraged, or prohibited by default, which is itself a proxy for major economic and political decisions with vast societal implications. How these questions are resolved will have significant implications for economic activity and data-intensive business practices, including advertising, artificial intelligence, and product improvement generally. The paper concludes by briefly outlining several options for constructing a substantive data minimization rule that is forward-looking and flexible.
Vermont and Nebraska: Diverging Experiments in State Age-Appropriate Design Codes
In May 2025, Nebraska and Vermont passed Age-Appropriate Design Code Acts (AADCs), continuing the bipartisan trend of states advancing protections for youth online. While these new bills arrived within the same week and share a common name and general purpose, they take two very different approaches – in scope, applicability, and substance – to a common goal: crafting a design code that can withstand First Amendment scrutiny.
Much like the divergence in “The Road Not Taken,” each state has taken its version of the path less traveled in crafting an AADC, informed by different assumptions about risks to minors online, risks of constitutional challenges, and enforcement priorities. As states grapple with legal challenges to earlier AADCs (California’s law remains blocked and a lawsuit was filed against Maryland’s law earlier this year), Nebraska and Vermont demonstrate how policymakers are experimenting with divergent frameworks in hopes of creating constitutionally sound models for youth online privacy and safety.
See our comparison chart for a full side-by-side comparison between the Nebraska Age-Appropriate Design Code Act (LB 504) and Vermont Age-Appropriate Design Code Act (S.69).
Each AADC’s scope turns on two key provisions – business thresholds tied to revenue and number of affected users, and an applicability standard based on either audience composition or “knowledge” of minor users on the service.
Business thresholds
Both the Nebraska and Vermont AADCs have narrower applicability than prior child online safety bills, though they adopt different approaches to determining which businesses are in scope.
Nebraska’s law applies only to businesses that derive more than half their revenue from selling or sharing personal data. This is an unusually high bar that could exclude many common services used by minors, including many platforms and services that are primarily supported by advertising revenue and subscriptions. Additionally, Nebraska includes a carveout for services that can demonstrate fewer than 2% of their users are minors. In contrast, the Vermont AADC likely has broader applicability, but still applies only to businesses that derive a majority of their revenue from online services generally, regardless of how they monetize.
When a service must apply protections for minors
Another major divergence between the two AADCs lies in the circumstances under which covered businesses are deemed to know that a user is a child and required to provide heightened protections and controls.
Nebraska adopts an “actual knowledge” standard. However, the law defines “actual knowledge” as all information and inferences known to the covered business, including marketing data. Given that marketing segmentation can be as broad as “Gen Z,” covering anyone born from the late 90s to early 2010s, Nebraska’s law demonstrates an intent to construe actual knowledge broadly. Nevertheless, the law explicitly states that businesses are not required to collect age data to comply, which has been a hotly contested requirement under other state laws, as age verification requirements are historically not the least restrictive means of protecting children online and often impact the protected speech of adults.
Vermont takes a different path, triggering obligations when a service is “reasonably likely” to be accessed by minors, establishing a multifactor test that includes internal research and overall audience composition. Vermont’s approach is more akin to an audience assessment like COPPA’s “directed to children” standard for children under age 13. From a practical standpoint, however, most websites are likely to be accessed by at least some minors under the age of 18 and would thus be in scope of the Vermont AADC. Vermont’s Attorney General is also tasked with developing age assurance rules, including privacy-preserving techniques and guardrails; however, it is not clear whether the AG may seek to compel businesses to affirmatively conduct age assurance through this rulemaking, and when questioned, the AG’s office said it was a matter of legislative intent.
In short, Nebraska seeks to explicitly avoid requiring age verification altogether, while Vermont seems to set the stage for proactive assessment and regulation on age estimation.
Designing around harm without regulating content
Vermont’s AADC contains a duty of care to protect minors in the design of online products but adds important disclaimers in a nod to First Amendment concerns that have plagued similar requirements in other state laws. Covered businesses must design services to avoid reasonably foreseeable emotional distress, compulsive use, or discrimination. However, the bill clarifies that the content a minor views cannot, by itself, constitute harm. Nebraska, by contrast, does not create a duty of care.
To date, most Age-Appropriate Design Code bills have exclusively focused on tools and protections for covered minors. Nebraska breaks from this mold by requiring businesses to build tools for parents to help them monitor and limit their child’s use of online services. This section likely draws inspiration from the federal Kids Online Safety Act, which earlier versions of the Nebraska framework more closely resembled.
Both states require covered services to set strong default privacy settings, but Vermont takes a more granular approach. It explicitly prohibits providing users with a single “less protective” setting that would override others, expressly limiting the use of all-in-one privacy toggles. Furthermore, a number of its default setting requirements apply only to social media platforms, a divergence from prior AADCs, whose requirements have generally been agnostic to the type of online service. For example, Vermont prohibits allowing known adults to like, comment, or otherwise provide feedback on a covered minor’s media on social media; this would remain allowed to the extent any non-social-media platforms have this type of functionality. In contrast to Vermont’s default settings approach to safer design, Nebraska requires covered businesses to develop various tools for minors. In some instances, these tools overlap with the default settings called for in Vermont and are simply a different statutory route to the same goal, such as tools for restricting the collection of geolocation data or for limiting communication with unknown adults. Other tools are unique to Nebraska, such as a tool that allows a minor to “opt out of all unnecessary features.” Businesses in scope of both frameworks will need to do a close read to determine what new features, settings, and tools must be implemented.
Both frameworks omit requirements for businesses to complete data protection impact assessments, which emerged as one of the key issues with the California AADC due to that law’s requirement to assess and limit the exposure of children to “potentially” harmful content. While the Ninth Circuit did not hold that risk assessments are per se unconstitutional, and the primary issue in California lay with requiring companies to opine on content-based harms, both Nebraska and Vermont steer away from this issue altogether. Instead, Vermont’s framework would require businesses to issue detailed public transparency reports, including on their use of algorithmic recommendation systems, with disclosure of inputs and how they influence results.
When it comes to targeted advertising, Nebraska is explicit: it prohibits facilitating targeted ads to minors, while allowing exceptions for first-party and contextual advertising. Vermont is less direct, but forbids the use of personal data to prioritize media for viewing unless requested by the minor, which may effectively ban both personalized advertising and certain practices for organizing content based on user interests (though the framework’s algorithmic disclosure requirements suggest an intent that many such systems remain in use).
Nebraska prohibits the use of so-called “dark patterns” outright – an unusually broad ban that goes beyond previous state privacy laws, which have focused on manipulative practices in obtaining consent or collecting personal information. Instead, Nebraska seeks to prohibit any user interface with the effect of subverting or impairing autonomy, decision-making, or choice. A strict reading of this provision could arguably reach a broad range of design choices, including a video game that restricts access to certain areas until you defeat a boss, a button asking whether you would like to continue, or the content of advertisements (though remember – the set of businesses subject to Nebraska’s law appears incredibly narrow). In contrast, Vermont defers to future rulemaking, authorizing its Attorney General to define and prohibit manipulative design practices by 2027.
Effective dates and next steps
Governor Pillen signed the Nebraska AADC within days of its passage and the law is slated to go into effect on January 1, 2026. However, the Act gives companies some leeway, as the Attorney General is not able to bring actions to recover civil penalties until July 1, 2026. The Vermont AADC would establish a longer onramp for coming into compliance, with an effective date of January 1, 2027. Governor Scott is still considering the bill, though he vetoed a similar effort last year that was included as part of a broader comprehensive privacy package. Assuming the Vermont AADC is enacted, the Attorney General is expected to complete rulemaking on manipulative design practices and methods for conducting age estimation by the effective date.
Conclusion
With courts signaling that speech-based online safety rules are unlikely to survive First Amendment scrutiny, Nebraska and Vermont are two distinct experiments in how to protect children online in constitutionally resilient ways. NetChoice, the litigant challenging the California and Maryland AADCs, has already raised First Amendment concerns with both the Nebraska and Vermont frameworks.
Each legislature has taken its own “road less traveled” to children’s online safety. Nebraska has opted for a limited scope, feature-driven approach with no rulemaking and an emphasis on actual knowledge. Vermont has chosen a broader duty-of-care model, backed by a robust rulemaking directive and novel transparency requirements. Both paths attempt to avoid the pitfalls of California’s and Maryland’s laws, but take radically diverging routes in doing so. Which, if either, road “has made all the difference” will ultimately depend on courts, compliance practices, and the experience of minors navigating these services in the years to come.
FPF Experts Take The Stage at the 2025 IAPP Global Privacy Summit
By FPF Communications Intern Celeste Valentino
Earlier this month, FPF participated in the IAPP’s annual Global Privacy Summit (GPS) at the Convention Center in Washington, D.C. The Summit convened top privacy professionals for a week of expert workshops, engaging panel discussions, and exciting networking opportunities on issues ranging from understanding U.S. state and global privacy governance to the future of technological innovation, policy, and professions.
FPF started out the festivities by hosting its annual Spring Social with a night full of great company, engaging discussions, and new connections. A special thank you to our sponsors FTI Consulting, Perkins Coie, Qohash, Transcend, and TrustArc!
The IAPP conference started with FPF Senior Director for U.S. Legislation Keir Lamont, who led an informative workshop, “US State Privacy Crash Course – What Is New and What Is Next” with Lothar Determann (Partner, Baker McKenzie) and David Stauss (Partner, Husch Blackwell). The workshop provided an overview of recent U.S. state privacy legislation developments and a lens into how these laws fit into the existing landscape.
The next day, FPF Senior Fellow Doug Miller hosted an insightful discussion with Jocelyn Aqua (Principal, PwC), providing guidance and tools for privacy professionals to avoid workplace burnout. Both began the discussion by arguing that because privacy professionals face different organizational and positional pressures from other business professionals, they experience varying types of burnout that require alternative remedies. The experts then detailed each kind of burnout and provided solutions for how individuals, teams, and leaders can provide support to avoid them. “Giving your team transparency about a decision gives them control, and feeling better about a decision,” Doug explained, highlighting leaders’ vital role in mitigating workplace burnout. You can find additional resources from Doug’s full presentation here.
Next, FPF Vice President for Global Privacy Gabriela Zanfir-Fortuna moderated a compelling conversation amongst European legislators and regulators, including Brando Benifei (Member of European Parliament, co-Rapporteur of the AI Act), John Edwards (Information Commissioner, U.K. Information Commissioner’s Office), and Louisa Specht-Riemenschneider (Federal Commissioner for Data Protection and Freedom of Information, Germany), on Cross-regulatory Cooperation Between Digital Regulators.
Their panel began by painting a detailed portrait of how the proliferation of digital regulations has created a necessity for cross-regulatory collaboration between differing authorities. Using the EU Artificial Intelligence (AI) Act as an example, the panelists argued that the success of cross-regulation hinges on cooperation and knowledge sharing between data protection agencies of different countries. “It’s important to see how the authority of the data protection authority remains relevant and at the center of regulation around AI. One interesting point in the AI Act is that in the Netherlands, there were around 20 authorities appointed as having competence to enforce and regulate to a certain extent under the AI Act; this speaks to how complex the landscape is,” observed Zanfir-Fortuna.
The panel also dissected concrete ways regulators can work together to enable cross-regulation, including a mandatory collaboration mechanism, supervisory authorities, and a more unified approach from governments and regulators alike.
FPF CEO Jules Polonetsky served as a moderator of a timely dialogue among high-ranking leaders, including Kate Charlet (Director, Privacy, Safety, and Security; Government Affairs and Public Policy, Google), Kate Goodloe (Managing Director, Policy, BSA, The Software Alliance), and Amanda Kane Rapp (Head of Legal, U.S. Government, Palantir Technologies), covering tech in an evolving political era.
The panel highlighted recent and expected shifts in technology, cybersecurity, privacy, AI governance, and online safety under a new U.S. executive administration. Jules opened the panel by posing, “We’ve seen increasing clashes between privacy and competition, privacy and kids’ issues, etc. Has anything changed in the current environment?” The panelists agreed that, regardless of government dynamics, privacy issues remain relevant for technology companies to address in order to protect and foster consumer trust in the digital ecosystem. The panel also offered perspective on how tech leaders approach digital governance now and in the future, through promoting interoperability, model transparency, and government experimentation with and implementation of IT tools and procurement.
On the second day of the conference, FPF Managing Director for Asia-Pacific (APAC) Josh Lee Kok Thong spoke on a panel with Darren Grayson Chng (Regional Data Protection Director, Asia Pacific, Middle East, and Africa, Electrolux), Haksoo Ko (Chairperson, Personal Information Protection Commission, Republic of Korea), and Angela Xu (Senior Privacy Counsel, APAC Head, Google) exploring the nuanced landscape of AI regulation in Asia-Pacific.
Through the panel, the discussants highlighted the differing AI regulatory approaches across the Asia-Pacific region, noting that most APAC jurisdictions have preferred not to enact hard AI laws. Instead, these jurisdictions focus on regulating elements of AI systems, such as the use of personal data (Singapore), addressing risk in AI systems (Australia), promoting industry development (South Korea), fostering international cooperation and responsible AI practices (Japan), government oversight of the deployment of AI systems (India), and regulating misinformation and personal information protection (China). “The APAC region is like a huge experimental lens for AI regulation, with different jurisdictions trying out different approaches, so do pay attention to this region because it will be very influential going forward. There will be increasing diversity and regulation,” Josh noted, providing valuable insider insight about where audience members should focus their attention.
Throughout the week, FPF’s booth in the Exhibition Hall was a popular stop for IAPP GPS attendees. Policymakers, industry leaders, and privacy scholars stopped by to connect with FPF staff, learn more about FPF memberships, and explore FPF’s ongoing work, ranging from the future of regulating AI agents to helping schools defend against deepfakes in the classroom. Visitors left with a collection of infographics, membership resources, and an “I Love Privacy” sticker.
FPF hosted two roundtable discussions early in the week, with Vice President for Global Privacy, Gabriela Zanfir-Fortuna, leading conversations on “Navigating Transatlantic Affairs and the EU-US Digital Regulatory Landscape” and “India’s new Data Protection law and what to expect from its implementation phase.” FPF’s U.S. Legislation team also hosted an event at our D.C. office for members to connect with the team and each other to discuss the U.S. legislative landscape.
FPF also hosted two Privacy Executives Network breakfasts and a lunch during the Summit week featuring peer-to-peer discussions of top-of-mind issues in data protection, privacy, and AI governance. We discussed the current EU privacy landscape with Commissioner for Data Protection and Chairperson of the Irish Data Protection Commission Des Hogan, and we spoke with Stevie DeGroff, First Assistant Attorney General in the Colorado Attorney General Office’s Technology & Privacy Protection Unit. These roundtable discussions allowed our members to discuss critical topics with one another in a private and dynamic setting.
In partnership with the Mozilla Foundation, we also hosted a PETs Workshop featuring short, expert panels exploring new and emerging privacy-enhancing technology (PETs) applications. Technology and policy experts presented several leading PETs use cases, analyzed how PETs work alongside other privacy protections, and discussed how PETs may intersect with data protection rules. This workshop was the first time that several of the use cases were shared in detail with independent experts.
We hope you enjoyed this year’s IAPP Global Privacy Summit as much as we did! If you missed us at our booth, visit FPF.org for all our reports, publications, and infographics. Follow us on X, LinkedIn, Instagram, and YouTube, and subscribe to our newsletter for the latest.
Lessons Learned from FPF “Deploying AI Systems” Workshop
On May 7, 2025, the Future of Privacy Forum (FPF) hosted a “Deploying AI Systems” workshop at the Privacy + Security Academy’s Spring Academy, which took place at The George Washington University in Washington, DC. Workshop participants included students and privacy lawyers from firms, companies, data protection authorities, and regulatory agencies around the world.
Pictured left to right: Daniel Berrick, Anne Bradley, Bret Cohen, Brenda Leong, and Amber Ezzell
The two-part workshop explored the emerging U.S. and global legal requirements for AI deployers, and attendees engaged in exercises involving case studies and demos on managing third-party vendors, agentic AI, and red teaming. The workshop was facilitated by FPF’s Amber Ezzell, Policy Counsel for Artificial Intelligence, who was joined by Anne Bradley (Luminos.AI), Brenda Leong (ZwillGen), Bret Cohen (Hogan Lovells), and Daniel Berrick (FPF).
From the workshop, a few key takeaways emerged:
When vetting third-party AI tools, deployers agreed that it is necessary to independently test the tools using their own data, rather than relying on representations made by third-party vendors – especially for “high risk” use cases. This is due to growing regulatory interest in unfair and deceptive practices pertaining to AI deployment (e.g., misleading statements about the capabilities, nature of implementation, and data collection and management practices of AI tools). Regulators are also concerned with whether organizations are monitoring and testing outputs for accuracy, discrimination, or bias.
Most deployer organizations feel they face significant resource constraints for AI risk management and are having to “do more with less.” By comparison, organizations are investing more resources in AI adoption and innovation; nevertheless, participants agreed on the importance of a risk-based approach to AI deployment for mitigating risk and avoiding regulatory pitfalls.
Despite the buzz about “AI agents,” agentic systems are not yet a main focus of risk governance for most participants. Nonetheless, agentic systems may soon begin to pressure test or amplify governance questions relevant to more widely deployed forms of AI (e.g. general purpose LLMs or automated decisionmaking tools).
As organizations, policymakers, and regulators grapple with the rapidly evolving landscape of AI development and deployment, FPF will continue to explore a range of issues at the intersection of AI governance.
If you have any questions, comments, or wish to discuss any of the topics related to the Deploying AI Systems workshop, please do not hesitate to reach out to FPF’s Center for Artificial Intelligence at [email protected].
Amendments to the Montana Consumer Data Privacy Act Bring Big Changes to Big Sky Country
On May 8, Montana Governor Gianforte signed SB 297, amending the Montana Consumer Data Privacy Act (MCDPA). The amendment was sponsored by Senator Zolnikov, who also championed the underlying law’s enactment in 2023. Much has changed in the state privacy law landscape since the MCDPA was first enacted, and SB 297 incorporates elements of further-reaching state laws into the MCDPA while declining to break new ground. For example, SB 297 adopts heightened protections for minors like those in Connecticut and Colorado, as well as privacy notice requirements and a narrowed right of access like those in Minnesota’s law. The bill does not include an effective date for these new provisions, so by default the amendments should take effect on October 1, 2025.
This blog post highlights the important changes made by SB 297 and some key takeaways about what this means for the comprehensive consumer privacy landscape. Changes to the law include (1) a duty of care with respect to minors, (2) new requirements for processing minors’ personal data, (3) a disclaimer that the law does not require age verification, (4) lowered applicability thresholds and narrowed exemptions, (5) a narrowed right of access that prohibits controllers from disclosing certain sensitive information, (6) expanded privacy notice requirements, and (7) modifications to the law’s enforcement provisions. With these changes, Montana yet again reminds us that privacy remains a bipartisan issue as SB 297, like its underlying law, was passed with overwhelmingly bipartisan votes.
1. New Connecticut- and Colorado-style duty of care with respect to minors.
The biggest changes to the MCDPA concern protections for children and teenagers. Like legislation enacted by Connecticut in 2023 and Colorado in 2024, SB 297 amends the MCDPA to add privacy protections for consumers under the age of 18 (“minors”). These new provisions apply more broadly than the rest of the law, covering entities that conduct business in Montana without any small business exceptions (i.e., there are no numerical applicability thresholds, although the law’s entity-level and data-level exemptions still apply).
Under these new provisions, any controller that offers an online service, product, or feature to a consumer whom the controller actually knows or willfully disregards is a minor must use “reasonable care” to avoid a “heightened risk of harm to minors” caused by the online service, product, or feature (“online service”). Heightened risk of harm to minors is defined as processing a minor’s personal data in a manner that presents a “reasonably foreseeable risk” of: (a) unfair or deceptive treatment of, or unlawful disparate impact on, a minor; (b) financial, physical, or reputational injury; (c) unauthorized disclosure of personal data as a result of a security breach (as described in Mont. Code Ann. § 30-14-1704); or (d) intrusion upon the solitude or seclusion or private affairs or concerns of a minor, whether physical or otherwise, that would be offensive to a reasonable person. This definition largely aligns with some of the existing triggers for conducting a data protection assessment under the MCDPA.
At a time when many youth privacy and online safety bills, such as the California Age-Appropriate Design Code (AADC), are mired in litigation over their constitutionality, it is notable that three states—Connecticut, Colorado, and Montana—have now opted for this framework. Given that neither Connecticut’s nor Colorado’s law has been subject to any constitutional challenge as of yet, this approach could be a more constitutionally resilient way than the AADC model to impose a duty of care with respect to minors. Specifically, the duties of care in Connecticut’s, Colorado’s, and now Montana’s laws are rooted in traditional privacy harms and torts (e.g., intrusion upon seclusion), whereas other frameworks that have been challenged rely on more amorphous concepts of harm that are more likely to implicate protected speech (e.g., the enjoined California AADC requires addressing whether an online service’s design could harm children by exposing them to “harmful, or potentially harmful, content”).
2. Controllers are entitled to a rebuttable presumption of having exercised reasonable care if they comply with statutory requirements.
Under Montana’s new duty of care to minors, a controller is entitled to a rebuttable presumption that it used reasonable care if it complies with certain statutory requirements related to design and personal data processing. With respect to design, controllers are prohibited from using consent mechanisms that are designed to impair user autonomy, they are required to establish easy-to-use safeguards to limit unsolicited communications from unknown adults, and they must provide a signal indicating when they are collecting precise geolocation data. For processing, controllers must obtain a minor’s consent before: (a) Processing a minor’s data for targeted advertising, sale, and profiling in furtherance of decisions that produce legal or similarly significant effects; (b) “us[ing] a system design feature to significantly increase, sustain, or extend a minor’s use of the online service, product, or feature”; or (c) collecting precise geolocation data, unless doing so is “reasonably necessary” to provide the online service, or retaining that data for longer than “necessary” to provide the online service.
Controllers subject to these provisions must also conduct data protection assessments for an online service “if there is a heightened risk of harm to minors.” These data protection assessments must comply with all existing requirements under the MCDPA and must provide additional information such as the online service’s purpose, the categories of personal data processed, and the processing purposes. Data protection assessments should be reviewed “as necessary” to account for material changes, and documentation should be retained for either 3 years after the processing operations cease, or the date on which the controller ceases offering the online service, whichever is longer. If a controller conducts an assessment and determines that a heightened risk of harm to minors exists, it must “establish and implement a plan to mitigate or eliminate the heightened risk.”
Although the substantive requirements of the minor protections are similar across Connecticut’s, Colorado’s, and Montana’s laws, these states are not fully aligned with respect to the rebuttable presumption of reasonable care. Montana follows Colorado’s approach, whereby a controller is entitled to the rebuttable presumption if it complies with the processing and design restrictions described above. Connecticut’s law, in contrast, provides that a controller is entitled to the rebuttable presumption of having used reasonable care if the controller complies with the data protection assessment requirements.
3. The bill clarifies that Montana’s privacy law does not require age verification.
In addition to adding a duty of care and design and processing restrictions with respect to minors, SB 297 makes a small change to existing adolescent privacy protections. The existing requirement that a controller obtain a consumer’s consent before engaging in targeted advertising or selling personal data for consumers aged 13–15 now applies when a controller willfully disregards the consumer’s age, not just when the controller has actual knowledge of their age. This knowledge standard aligns with that in similar opt-in requirements for adolescents in California, Connecticut, Delaware, New Hampshire, New Jersey, and Oregon. It also aligns with the broader duty of care protections in SB 297, which apply when a controller “actually knows or willfully disregards” that a consumer is a minor. This change may be negligible, however, as the amendment already requires any controller that offers an online service, product, or feature to a consumer whom the controller actually knows or willfully disregards is a minor (under 18) to obtain consent before processing the minor’s data for targeted advertising, sale, and profiling in furtherance of decisions that produce legal or similarly significant effects.
These new protections and the introduction of a “willfully disregards” knowledge standard for minors implicate a broad, contentious policy debate over age verification, the process by which an entity affirmatively determines the age of individual users, often through the collection of personal data. Across the country, courts are litigating the constitutionality of such requirements under other laws. Presumably to head off any such constitutional challenges, SB 297 explicitly provides that nothing in the law shall require a controller to engage in age-verification or age-gating. However, it also provides that if a controller chooses to conduct commercially reasonable age estimation to determine which consumers are minors, then the controller is not liable for erroneous age estimation.
Such a clarification is arguably necessary if “willfully disregards” is implied to require some level of affirmative action on a controller’s part to estimate users’ ages under certain circumstances. For example, the Florida Digital Bill of Rights regulations provide that a controller willfully disregards a consumer’s age if it “should reasonably have been aroused to question whether a consumer was a child and thereafter failed to perform reasonable age verification,” and it incentivizes age verification by providing that a controller will not be found to have willfully disregarded a consumer’s age if it used “a reasonable age verification method with respect to all of its consumers” and determined that the consumer was not a child. Montana takes a different approach, explicitly disclaiming any requirement to engage in age verification, but still incentivizing age estimation.
4. Changed applicability requirements expand the law’s reach.
Owing to Montana’s relatively low population, the MCDPA had the lowest numerical applicability thresholds of any state comprehensive privacy law when it was enacted in 2023. At that time, prior comprehensive privacy laws in Virginia, Colorado, Utah, Connecticut, Iowa, and Indiana all applied to controllers that either (1) control or process the personal data of at least 100,000 consumers (“the general threshold”), or (2) control or process the personal data of at least 25,000 consumers if the controller derived a certain percentage of its gross revenue from the sale of personal data. Montana broke that mold by lowering the general threshold to 50,000 affected consumers. Several states—Delaware, New Hampshire, Maryland, and Rhode Island—have since surpassed Montana’s low-water mark. Accordingly, SB 297 lowers the law’s applicability thresholds. The law will now apply to controllers that either (1) control or process the personal data of at least 25,000 consumers, or (2) control or process the personal data of at least 15,000 consumers (down from 25,000) if the controller derives at least 25% of gross revenue from the sale of personal data.
Following a broader legislative trend in recent years, this bill also narrows or eliminates several entity-level exemptions. Most notably, the entity-level exemption for financial institutions and affiliates governed by the Gramm-Leach-Bliley Act has been narrowed to a data-level exemption, aligning with the approach taken by Oregon and Minnesota. To counterbalance this change, SB 297 adds new entity-level exemptions for certain chartered banks, credit unions, insurers, and third-party administrators of self-insurance engaged in financial activities. SB 297 also narrows the non-profit exemption to apply only to non-profits that are “established to detect and prevent fraudulent acts in connection with insurance.” Thus, Montana’s law now joins those of Colorado, Oregon, Delaware, New Jersey, Maryland, and Minnesota in broadly applying to non-profits.
5. The newly narrowed right to access now prohibits controllers from disclosing certain types of highly sensitive information, such as social security numbers.
The consumer right to access one’s personal data carries a tension between the ability to access the specific data that an entity has collected concerning oneself and the risk that one’s data, especially one’s sensitive data, could be either erroneously or surreptitiously disclosed to a third party or even a bad actor. Responsive to that risk, SB 297 follows Minnesota’s approach by narrowing the right to access to prohibit disclosure of certain types of sensitive data. As amended, a controller now may not, in response to a consumer exercising their right to access their personal data, disclose the following information: social security number; government issued identification number (including driver’s license number); financial account number; health insurance account number or medical identification number; account password, security questions, or answer; or biometric data. If a controller has collected this information, rather than disclosing it, the controller must inform the consumer “with sufficient particularity” that it has collected the information.
SB 297 also slightly expands one of the law’s opt-out rights. Consumers can now opt out of profiling in furtherance of “automated decisions” that produce legal or similarly significant effects, rather than only “solely automated decisions.”
6. The MCDPA now includes more prescriptive privacy notice requirements.
SB 297 significantly expands the requirements for privacy notices and related disclosures, largely aligning with the more prescriptive provisions in Minnesota’s law. Changes made by SB 297 include—
Content: Privacy notices must now include an explanation of the law’s consumer rights and the date that the notice was updated. Controllers must now also include a “clear and conspicuous” method outside of the privacy notice for consumers to exercise their opt-out rights.
Form: A controller is required to provide a privacy notice in each language in which it provides products or services, and the privacy notices must be “reasonably accessible to and usable by individuals with disabilities.” Privacy notices must now be posted online on a controller’s website homepage through a “conspicuous hyperlink using the word ‘privacy.’” For mobile device applications, this hyperlink must be included in either the application’s store page or download page, and the application must include the hyperlink “in the application’s settings menu or in a similarly conspicuous and accessible location.”
Updates: Controllers are required to take “all reasonable electronic measures” to notify consumers of material changes to privacy notices or practices and to provide a “reasonable opportunity for consumers to withdraw consent to any further materially different collection, processing, or transfer of previously collected personal data.”
The law provides that controllers do not need to provide a separate, Montana-specific privacy notice or section of a privacy notice so long as the controller’s general privacy notice includes all information required by the MCDPA.
7. The Attorney General now has increased investigatory power.
Finally, SB 297 reworks the law’s enforcement provisions. The amendments build out the Attorney General’s (AG) investigatory powers by allowing the AG to exercise powers provided by the Montana Consumer Protection Act and Unfair Trade Practices laws, to issue civil investigative demands, and to request that controllers disclose any data protection assessments that are relevant to an investigation. Furthermore, the AG is no longer required to offer an opportunity to cure before bringing an enforcement action, in effect closing the cure period six months before its previously scheduled expiration date. The statute of limitations is five years after a cause of action accrues.
Consent for Processing Personal Data in the Age of AI: Key Updates Across Asia-Pacific
This Issue Brief summarizes key developments in data protection laws across the Asia-Pacific region since 2022, when the Future of Privacy Forum (FPF) and the Asian Business Law Institute (ABLI) published a series of reports examining 14 jurisdictions in the region. We found that while many jurisdictions offer alternative legal bases for data processing, consent remains the most widely used, often due to its familiarity, despite known limitations.
This Issue Brief provides an updated view of evolving consent requirements and alternative legal bases for data processing across key APAC jurisdictions: India, Vietnam, Indonesia, the Philippines, South Korea, and Malaysia.
In August 2023, India passed the Digital Personal Data Protection Act (DPDPA). Once in force, the DPDPA will provide a comprehensive framework for processing personal data. It affirms consent as the primary basis for processing but introduces structured obligations around notice, purpose limitation, and consent withdrawal, while enabling future flexibility for alternative legal bases.
Vietnam’s Decree on Personal Data Protection took effect in July 2023. It sets clearer standards for consent while formally recognizing alternative legal bases, including contractual necessity and legal obligations. This marks a key step in broadening lawful processing options for businesses.
Indonesia’s Personal Data Protection Law (PDPL), enacted in October 2022, introduces a unified national privacy law with an extended transition period. It affirms consent but also allows processing based on legitimate interest, public duties, and contract performance, bringing Indonesia closer to global privacy frameworks.
In November 2023, the Philippines’ National Privacy Commission issued a Circular on Consent, clarifying valid consent standards and promoting transparency. The guidance aims to reduce consent fatigue by encouraging layered, contextual consent interfaces and outlines when consent may not be strictly necessary.
South Korea’s amended PIPA (in force since September 2023) and related guidelines promote easy-to-understand consent practices and recognize additional legal grounds, especially in the context of AI. A 2025 bill under consideration would expand the use of non-consent bases for AI-related processing.
In Malaysia, the Personal Data Protection (Amendment) Act 2024, published in October 2024, introduces stronger enforcement tools and administrative penalties. While the amendments do not change the legal bases for processing, they enhance the compliance environment and signal stricter oversight.
The Issue Brief also explores how the rise of AI is shaping lawmaking and policymaking across the region when it comes to lawful grounds for processing personal data.
As the APAC region shifts from fragmented, sector-specific rules to unified legal frameworks, understanding the evolving role of consent and the growing adoption of alternative legal bases is essential. From improving user-friendly consent mechanisms to strengthening enforcement and expanding lawful processing grounds, these changes highlight a more flexible and accountable approach to data protection across the region.
The Curse of Dimensionality: De-identification Challenges in the Sharing of Highly Dimensional Datasets
The 2006 release by AOL of search queries linked to individual users and the re-identification of some of those users is one of the best known privacy disasters in internet history. Less well known is that AOL had released the data to meet intense demand from academic researchers who saw this valuable data set as essential to understanding a wide range of human behavior.
As the executive appointed to serve as AOL’s first Chief Privacy Officer as part of a strategy to help prevent further privacy lapses, I came to see the benefits as well as the risks of sharing data as a priority in my work. At FPF, our teams have worked on every aspect of enabling privacy-safe data sharing for research and social utility, including de-identification1, the ethics of data sharing, privacy-enhancing technologies2 and more3. Despite the skepticism of critics who maintain that reliable de-identification is a myth4, I maintain that it is hard but, for many data sets, feasible with the application of significant technical, legal, and organizational controls. However, for highly dimensional data sets, or complex data sets that are made public or shared with multiple parties, providing strong guarantees at scale or without extensive impact on utility is far less feasible.
1. Introduction
The Value and Risk of Search Query Data
Search query logs constitute an unparalleled repository of collective human interest, intent, behavior, and knowledge-seeking activities. As one of the most common activities on the web, searching generates data streams that paint intimate portraits of individual lives, revealing interests, needs, concerns, and plans over time5. This data holds immense potential value for a wide range of applications, including improving search relevance and functionality, understanding societal trends, advancing scientific research (e.g., in public health surveillance or social sciences), developing new products and services, and fueling the digital advertising ecosystem.
However, the very richness that makes search data valuable also makes it exceptionally sensitive and fraught with privacy risks. Search queries frequently contain explicit personal information such as names, addresses, phone numbers, or passwords, often entered inadvertently by users. Beyond direct identifiers, queries are laden with quasi-identifiers (QIs) – pieces of information that, while not identifying in isolation, can be combined with other data points or external information to single out individuals. These can include searches related to specific locations, niche hobbies, medical conditions, product interests, or unique combinations of terms searched over time. Furthermore, the integration of search engines with advertising networks, user accounts, and other online services creates opportunities for linking search behavior with other extensive user profiles, amplifying the potential for privacy intrusions. The longitudinal nature of search logs, capturing behavior over extended periods, adds another layer of sensitivity, as sequences of queries can reveal evolving life circumstances, intentions, and vulnerabilities. The database reconstruction theorem, referred to as the fundamental law of information reconstruction, posits that publishing too much data derived from a confidential data source, at too high a degree of accuracy, will certainly, after a finite number of queries, result in the reconstruction of the confidential data6. Extensive and extended releases of search data are a model example of this problem.
The De-identification Imperative and Its Inherent Challenges
Faced with the dual imperatives of leveraging valuable data and protecting user privacy, organizations rely heavily on data de-identification. De-identification encompasses a range of techniques aimed at removing or obscuring identifying information from datasets, thereby reducing the risk that the data can be linked back to specific individuals. The goal is to enable data analysis, research, and sharing while mitigating privacy harms and complying with legal and ethical obligations.
Despite its widespread use and appeal, de-identification is far from a perfected solution. Decades of research and numerous real-world incidents have demonstrated that supposedly “de-identified” or “anonymized” data can be re-identified, sometimes with surprising ease. This re-identification potential stems from several factors: the residual information left in the data after processing, the increasing availability of external datasets (auxiliary information) that can be linked to the de-identified data, and the continuous development of sophisticated analytical techniques. In some of these cases, a more rigorous de-identification process could have provided more effective protections, albeit with an impact on the availability of the data needed. In other cases, the impact of the re-identification might “only” be a threat to public figures7. In my experience, expert technical and legal teams can collaborate to support reasonable de-identification efforts for data that is well structured or closely held, but for complex, high-dimensional datasets or data shared broadly, the risks multiply.
Furthermore, the terminology itself is fraught with ambiguity. “De-identification” is often used as a catch-all term, but it can range from simple masking of direct identifiers (which offers weak protection) to more rigorous attempts at achieving true anonymity, where the risk of re-identification is negligible. This ambiguity can foster a false sense of security, as techniques that merely remove names or obvious identifiers have too often been labeled as “de-identified” while still leaving individuals vulnerable. Achieving a state where individuals genuinely cannot be reasonably identified is significantly harder, especially given the inherent trade-off between privacy protection and data utility: more aggressive de-identification techniques reduce re-identification risk but also diminish the data’s value for analysis. The concept of true, irreversible anonymization, where re-identification is effectively impossible, represents a high standard that is particularly challenging to meet for rich behavioral datasets, especially when data is shared with additional parties or made public. For more limited data sets that can be kept private and secure, or shared with extensive controls and legal and technical oversight, effective de-identification that maintains utility while reasonably managing risk can be feasible. This gap between the promise of de-identification and the persistent reality of re-identification risk for rich data sets that are shared lies at the heart of the privacy challenges discussed in this article.
Report Objectives and Structure
This article provides an analysis of the challenges associated with de-identifying massive datasets of search queries. It aims to review the technical, practical, legal, and ethical complexities involved. The analysis will cover:
General De-identification Concepts and Techniques: Defining the spectrum of data protection methods and outlining common technical approaches.
Unique Characteristics of Search Data: Examining the properties of search logs (dimensionality, sparsity, embedded identifiers, longitudinal nature) that make de-identification particularly difficult.
The Re-identification Threat: Reviewing the mechanisms of re-identification attacks and landmark case studies (AOL, Netflix, etc.) where de-identification failed.
Limitations of Techniques: Assessing the vulnerabilities and shortcomings of various de-identification methods when applied to search data.
Harms and Ethics: Identifying the potential negative consequences of re-identification and exploring the ethical considerations surrounding user expectations, transparency, and consent.
The report concludes by synthesizing these findings to summarize the core privacy challenges, risks, and ongoing debates surrounding the de-identification of massive search query datasets.
2. Understanding Data De-identification
To analyze the challenges of de-identifying search queries, it is essential first to establish a clear understanding of the terminology and techniques involved in de-identification. The landscape includes various related but distinct concepts, each carrying different technical implications and legal weight.
Defining the Spectrum: De-identification, Anonymization, Pseudonymization8
The terms used to describe processes that reduce the linkability of data to individuals are often employed inconsistently, leading to confusion.
De-identification: This is often used as a broad, umbrella term referring to any process aimed at removing or obscuring personal information to reduce privacy risk. It encompasses a collection of methods and algorithms applied to data with the goal of making it harder, though not necessarily impossible, to link data back to specific individuals. De-identification is fundamentally an exercise in risk management rather than risk elimination.
Anonymization: While sometimes used interchangeably with de-identification, “anonymization” often implies a stricter standard, aiming for a state where the risk of re-identifying individuals is negligible or the process is effectively irreversible.
Pseudonymization: This specific technique involves replacing direct identifiers (like names or ID numbers) with artificial identifiers or pseudonyms. Because re-identification remains possible, pseudonymized data is explicitly considered personal data and remains subject to data protection rules. It is, however, recognized as a valuable security measure that can reduce risks9.
Key De-identification Techniques and Mechanisms
A variety of techniques can be employed, often in combination, to achieve different levels of de-identification or anonymization. Each has distinct mechanisms, strengths, and weaknesses:
Suppression/Omission/Redaction: This involves removing entire records or specific data fields (e.g., direct identifiers like names, specific quasi-identifiers deemed too risky). While highly effective at removing specific information, it can significantly reduce the dataset’s completeness and utility, especially if many fields or records are suppressed.
Masking: This technique obscures parts of data values without removing them entirely (e.g., showing only the first few digits of an IP address, replacing middle digits of an account number with ‘X’). It preserves data format but reduces precision. Its effectiveness depends on how much information remains.
Generalization: Specific values are replaced with broader, less precise categories. Examples include replacing an exact birth date with just the birth year or an age range, a specific ZIP code with a larger geographic area, or a specific occupation with a broader job category. This is a core technique used to achieve k-anonymity. While it reduces identifiability, excessive generalization can severely degrade data utility.
Aggregation: Data from multiple individuals is combined to produce summary statistics (e.g., counts, sums, averages, frequency distributions). This inherently hides individual-level data but can still be vulnerable to inference attacks (like differencing attacks, where comparing aggregates from slightly different groups reveals individual contributions) unless carefully implemented, for example with added noise. It also prevents analyses that require individual records.
Noise Addition: Random values are deliberately added to the original data points or to the results of aggregate queries. The goal is to obscure the true values enough to protect individual privacy while preserving the overall statistical distributions and patterns in the data. The amount and type of noise must be carefully calibrated. This is the fundamental mechanism behind differential privacy.
Swapping (Permutation): Values for certain attributes are exchanged between different records in the dataset. For example, the locations of two users might be swapped. This preserves the marginal distributions (overall counts for each location) but introduces inaccuracies at the individual record level, potentially breaking links between attributes within a record.
Hashing: One-way cryptographic functions are applied to identifiers, transforming them into fixed-size hash values. While seemingly secure because hashes are hard to reverse directly, unsalted hashes are vulnerable to dictionary or rainbow table attacks (precomputed hash lookups). Even salted hashes can be vulnerable to brute-force attacks if the original input space is small or if salts or keys are compromised. Secure implementation requires strong, unique salts per record and careful key management. (A minimal sketch of the dictionary-attack risk appears after this list of techniques.)
Pseudonymization: As defined earlier, identifiers are replaced with artificial codes or pseudonyms. The link between the pseudonym and the real identity is maintained (often separately), allowing potential re-identification.
k-Anonymity: This is a formal privacy model, not just a technique. It requires that each record in the released dataset be indistinguishable from at least k-1 other records based on a set of defined quasi-identifiers. It is typically achieved using generalization and suppression10. While preventing exact matching on QIs, it has known weaknesses:
Homogeneity Attack: If all k records in an equivalence class share the same sensitive attribute value, the attacker learns that attribute for anyone they can place in that class.
Background Knowledge Attack: An attacker might use external information to narrow down possibilities within an equivalence class.
Curse of Dimensionality: Becomes impractical for datasets with many QIs, requiring excessive generalization/suppression and utility loss11.
Compositionality: Combining multiple k-anonymous datasets does not guarantee k-anonymity for the combined data.
l-Diversity and t-Closeness: These are refinements of k-anonymity designed to address the homogeneity attack. l-diversity requires that each equivalence class (group of k indistinguishable records) contains at least l “well-represented” values for each sensitive attribute12. t-closeness imposes a stricter constraint, requiring that the distribution of sensitive attribute values within each equivalence class be close (within a threshold t) to the distribution of the attribute in the overall dataset13. While providing stronger protection against attribute disclosure, these models can be more complex to implement and may further reduce data utility compared to basic k-anonymity.
Differential Privacy (DP): A rigorous mathematical framework that provides provable privacy guarantees14. The core idea is that the output of a DP algorithm (e.g., an aggregate statistic, a machine learning model) should be statistically similar whether or not any particular individual’s data was included in the input dataset. This limits what an adversary can infer about any individual from the output. Privacy loss is quantified by parameters ε (epsilon) and sometimes δ (delta), where lower values mean stronger privacy. DP guarantees are robust against arbitrary background knowledge and compose predictably (the total privacy loss from multiple DP analyses can be calculated). Implementation typically involves adding carefully calibrated noise (e.g., Laplace or Gaussian) to outputs. The main challenge is the inherent trade-off between privacy (low ε) and utility/accuracy (more noise reduces accuracy). Each additional data release adds to the cumulative privacy loss, which must be accounted for and ultimately limits how much data can be released. The application of DP to unstructured, non-numeric data is also less well developed.
Synthetic Data Generation: This approach involves creating an entirely artificial dataset that mimics the statistical properties and structure of the original sensitive dataset, but does not contain any real individual records15. Models (often statistical or machine learning models) are trained on the original data and then used to generate the synthetic data. If the generation process itself incorporates privacy protections like DP (e.g., training the generative model with DP-SGD16), the resulting synthetic data can inherit these privacy guarantees. Challenges include ensuring the synthetic data accurately reflects the nuances of the original data (utility) while avoiding the model memorizing and replicating sensitive information or outliers from the training set (privacy risk).
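To make the hashing weakness described above concrete, the following minimal Python sketch (using a hypothetical four-digit identifier space, not any real dataset) shows how unsalted hashes over a small input space can be reversed with a simple precomputed dictionary, and why per-record salts block that particular shortcut without eliminating brute-force risk.

```python
# Illustrative sketch only: hypothetical 4-digit identifiers, not real data.
import hashlib
import os

def unsalted_hash(identifier: str) -> str:
    return hashlib.sha256(identifier.encode()).hexdigest()

def salted_hash(identifier: str) -> tuple[str, str]:
    salt = os.urandom(16).hex()  # strong, unique salt per record
    return salt, hashlib.sha256((salt + identifier).encode()).hexdigest()

# Suppose pseudonyms are unsalted hashes of 4-digit IDs: only 10,000 possibilities.
published = unsalted_hash("7421")

# Dictionary attack: hash every possible identifier and look the published value up.
lookup = {unsalted_hash(f"{n:04d}"): f"{n:04d}" for n in range(10_000)}
print("Recovered identifier:", lookup[published])  # -> 7421, pseudonym reversed

# A per-record salt defeats a single precomputed table, though per-record brute force
# remains feasible when the input space is this small, so salting alone is not enough.
salt, digest = salted_hash("7421")
print("Salted digest (differs on every run):", digest[:16], "...")
```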
The following table provides a comparative overview of these techniques:
Table 1: Comparison of Common De-identification Techniques
| Technique Name | Mechanism Description | Primary Goal | Key Strengths | Key Weaknesses/Limitations | Applicability to Search Logs |
| --- | --- | --- | --- | --- | --- |
| Suppression/Redaction | Remove specific values or records | Remove specific identifiers/sensitive data | Simple; Effective for targeted removal | High utility loss if applied broadly; Doesn’t address linkage via remaining data | Low (Insufficient alone; high utility loss for QIs) |
| Masking | Obscure parts of data values (e.g., XXXX) | Obscure direct identifiers | Simple; Preserves format | Limited privacy protection; Can reduce utility; Hard for free text | Low (Insufficient for QIs in queries) |
| Generalization | Replace specific values with broader categories | Reduce identifiability via QIs | Basis for k-anonymity | Significant utility loss, especially in high dimensions (“curse of dimensionality”); Inherits k-anonymity issues; Adds complexity; Further utility reduction | Low (Impractical due to k-anonymity’s base failure) |
| Differential Privacy (DP) | Mathematical framework limiting inference about individuals via noise | Provable privacy guarantee against inference/linkage | Strongest theoretical guarantees; Composable; Robust to auxiliary info | Utility/accuracy trade-off; Implementation complexity; Can be hard for complex queries | Low (Theoretically strongest, but practical utility for granular search data is a major hurdle) |
| Synthetic Data | Generate artificial data mimicking original statistics | Provide utility without real records | Can avoid direct disclosure of real data | Hard to ensure utility & privacy simultaneously; Risk of memorization/inference if model overfits; Bias amplification | Medium (Promising, but technically demanding for complex behavioral data like search; future potential, but research still early) |
3. The Unique Nature and Privacy Sensitivity of Search Query Data
Search query data possesses several intrinsic characteristics that make it particularly challenging to de-identify effectively while preserving its analytical value. These properties distinguish it from simpler, structured datasets often considered in introductory anonymization examples.
High Dimensionality, Sparsity, and the “Curse of Dimensionality”
Search logs are inherently high-dimensional datasets. Each interaction potentially captures a multitude of attributes associated with a user or session: the query terms themselves, the timestamp of the query, the user’s IP address (providing approximate location), browser type and version, operating system, language settings, cookies or other identifiers linking sessions, the rank of clicked results, the URL or domain of clicked results, and potentially other contextual signals. When viewed longitudinally, the sequence of these interactions adds further dimensions representing temporal patterns and evolving interests.
Simultaneously, individual user data within this high-dimensional space is typically very sparse. Any single user searches for only a tiny fraction of all possible topics or keywords, clicks on a minuscule subset of the web’s pages, and exhibits specific patterns of activity at particular times17.
This combination of high dimensionality and sparsity poses a fundamental challenge known as the “curse of dimensionality18” in the context of data privacy. In high-dimensional spaces, data points tend to become isolated; the concept of a “neighbor” or “similar record” becomes less meaningful because points are likely to differ across many dimensions19. Consequently, even without explicit identifiers, the unique combination of attributes and behaviors across many dimensions can act as a distinct “fingerprint” for an individual user. This uniqueness makes re-identification through linkage or inference significantly easier.
The curse of dimensionality challenges traditional anonymization techniques like k-anonymity20. Since k-anonymity relies on finding groups of at least k individuals who are identical across all quasi-identifying attributes, the sparsity and uniqueness inherent in high-dimensional search data make finding such groups highly improbable without resorting to extreme measures. To force records into equivalence classes, one would need to apply such broad generalization (e.g., reducing detailed query topics to very high-level categories) or suppress so much data that the resulting dataset loses significant analytical value.
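The effect is easy to demonstrate with a small simulation. The sketch below uses randomly generated, independent attributes (an optimistic simplification; real search attributes are correlated but far more numerous) to show how quickly users become unique, and the smallest equivalence class collapses to k=1, as quasi-identifiers are combined.

```python
# Simulated illustration: uniqueness rises sharply as quasi-identifiers are combined.
import random
from collections import Counter

random.seed(0)
N_USERS = 100_000

# Hypothetical quasi-identifiers; real search logs contain far more dimensions.
attribute_generators = [
    ("hour_of_day",   lambda: random.randint(0, 23)),
    ("coarse_region", lambda: random.randint(0, 49)),
    ("device_type",   lambda: random.randint(0, 9)),
    ("query_topic_1", lambda: random.randint(0, 499)),
    ("query_topic_2", lambda: random.randint(0, 499)),
]

users = [tuple(gen() for _, gen in attribute_generators) for _ in range(N_USERS)]

for dims in range(1, len(attribute_generators) + 1):
    classes = Counter(u[:dims] for u in users)  # equivalence classes on the first `dims` QIs
    unique_share = sum(1 for c in classes.values() if c == 1) / N_USERS
    names = ", ".join(name for name, _ in attribute_generators[:dims])
    print(f"{dims} QIs ({names}): {unique_share:.1%} unique, smallest class k={min(classes.values())}")
```

In this simulation, adding the two topic attributes makes nearly every user unique; forcing every record into a class of size k would require generalization so aggressive that little analytical value would remain.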
Implicit Personal Identifiers and Quasi-Identifiers in Queries
Beyond the metadata associated with a search (IP, timestamp, etc.), the content of the search queries themselves is a major source of privacy risk. Firstly, users frequently, though often unintentionally, include direct personal information within their search queries. This could be their own name, address, phone number, email address, social security number, account numbers, or similar details about others. The infamous AOL search log incident provided stark evidence of this, where queries directly contained names and location information that facilitated re-identification. Secondly, and perhaps more pervasively, search queries are rich with quasi-identifiers (QIs). These are terms, phrases, or concepts that, while not uniquely identifying on their own, become identifying when combined with each other or with external auxiliary information. Examples abound in the search context:
Queries about specific, non-generic locations (“restaurants near 123rd St,” “best plumber in zip code 90210,” “landscapers in Lilburn, Ga”).
Searches for rare medical conditions, treatments, or specific doctors/clinics.
Queries related to niche hobbies, specialized professional interests, or obscure products.
Searches including names of family members, friends, colleagues, or personal contacts.
Use of unique jargon, personal acronyms, or idiosyncratic phrasing.
Combinations of seemingly unrelated queries over a short period that reflect a specific user’s context or multi-faceted task (e.g., searching for a specific flight number, then a hotel near the destination airport, then restaurants in that area).
The challenge lies in the unstructured, free-text nature of search queries. Unlike structured databases, where QIs like date of birth, gender, and ZIP code often reside in well-defined columns, the QIs in search queries are embedded within the semantic meaning and contextual background of the text string itself. Identifying and removing or generalizing all such potential QIs is an extremely difficult task, particularly at large scale and by automated means. Standard natural language processing techniques might identify common entities like names or locations, but they struggle with the vast range of potentially identifying combinations and context-dependent sensitivities. Users may also enter passwords or unique coded URLs of private documents, which automated redaction cannot reliably recognize. This inherent difficulty in scrubbing QIs from unstructured query text makes search data significantly harder to de-identify reliably than structured data.
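A minimal sketch illustrates the gap. Pattern-based scrubbing of the kind below (the patterns and queries are illustrative, not a production scrubber) catches some direct identifiers but leaves contextual quasi-identifiers untouched:

```python
# Illustrative patterns and queries only; not a production scrubber.
import re

PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "ZIP":   re.compile(r"\b\d{5}(?:-\d{4})?\b"),
}

def scrub(query: str) -> str:
    for label, pattern in PATTERNS.items():
        query = pattern.sub(f"[{label}]", query)
    return query

queries = [
    "email jane.doe@example.com about the contract",
    "best plumber in zip code 90210",
    "landscapers in Lilburn, Ga",             # location QI: no pattern fires
    "side effects of lisinopril",             # sensitive topic: untouched
    "homes sold in shadow lake subdivision",  # identifying in combination with other queries
]

for q in queries:
    print(scrub(q))
# The last three queries pass through unchanged even though, in combination,
# they can narrow a user down to a single person, which is the AOL failure mode.
```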
Temporal Dynamics and Longitudinal Linkability
Search logs are not static snapshots; they are longitudinal records capturing user behavior as it unfolds over time. A user’s search history represents a sequence of actions, reflecting evolving interests, ongoing tasks, changes in location, and shifts in life circumstances. This temporal dimension adds significant identifying power beyond that of individual, isolated queries.
Even if session-specific identifiers like cookies are removed or periodically changed, the continuity of a user’s behavior can allow for linking queries across different sessions or time periods. Consistent patterns (e.g., regularly searching for specific technical terms related to one’s profession), evolving interests (e.g., searches related to pregnancy progressing over months), or recurring needs (e.g., checking commute times) can serve as anchors to connect seemingly disparate query records back to the same individual. The sequence itself becomes a quasi-identifier. This poses a significant challenge for de-identification. Techniques applied cross-sectionally—treating each query or session independently—may fail to protect against longitudinal linkage attacks that exploit these behavioral trails. Effective de-identification of longitudinal data requires considering the entire user history, or at least sufficiently long windows of activity, to assess and mitigate the risk of temporal linkage. This inherently increases the complexity of the de-identification process and potentially necessitates even greater data perturbation or suppression to break these temporal links, further impacting utility. Anonymization techniques that completely sever links between records over time would prevent valuable longitudinal analysis altogether.
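As a rough illustration of longitudinal linkage, the sketch below (with fabricated sessions) scores how similar the query vocabulary of new sessions is to earlier sessions; recurring professional, health, and travel interests re-link rotated session identifiers to the same underlying users.

```python
# Fabricated sessions; a simple vocabulary-overlap score re-links rotated session IDs.
def tokens(queries) -> set:
    return {word for q in queries for word in q.lower().split()}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

# Week 1 sessions, keyed by throwaway session IDs.
week1 = {
    "s101": tokens(["python asyncio tutorial", "raised garden beds", "lisinopril dosage"]),
    "s102": tokens(["champions league schedule", "flights to lisbon", "hotel in alfama"]),
}
# Week 2 sessions, new session IDs assigned.
week2 = {
    "s201": tokens(["python asyncio gather example", "lisinopril side effects"]),
    "s202": tokens(["lisbon restaurants", "champions league final tickets"]),
}

for new_id, new_terms in week2.items():
    best_id, best_terms = max(week1.items(), key=lambda kv: jaccard(kv[1], new_terms))
    print(f"{new_id} most resembles {best_id} (Jaccard {jaccard(best_terms, new_terms):.2f})")
# Consistent interests act as behavioral anchors across sessions, defeating ID rotation.
```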
The Uniqueness and Re-identifiability Potential of Search Histories
The combined effect of high dimensionality, sparsity, embedded quasi-identifiers, and temporal dynamics results in search histories that are often highly unique to individual users. Research has repeatedly shown that even limited sets of behavioral data points can uniquely identify individuals within large populations. Latanya Sweeney’s seminal work demonstrated that 87% of the US population could be uniquely identified using just three quasi-identifiers: 5-digit ZIP code, gender, and full date of birth21. Search histories contain far more dimensions and potentially identifying attributes than this minimal set.
Studies on analogous high-dimensional behavioral datasets confirm this potential for uniqueness and re-identification. The successful de-anonymization of Netflix users based on a small number of movie ratings linked to public IMDb profiles is a prime example. Similarly, research has shown high re-identification rates for mobile phone location data and credit card transactions, purely based on the patterns of activity. Su and colleagues showed that de-identified web browsing histories can be linked to social media profiles using only publicly available data22. Given that search histories encapsulate a similarly rich and diverse set of user actions and interests over time, it is highly probable that many users possess unique or near-unique search “fingerprints” even after standard de-identification techniques (like removing IP addresses and user IDs) are applied. This inherent uniqueness makes search logs exceptionally vulnerable to re-identification, particularly through linkage attacks that correlate the de-identified search patterns with other available data sources. The simple assumption that removing direct identifiers is sufficient to protect privacy is demonstrably false for this type of rich, behavioral data. The very detail that makes search logs valuable for understanding behavior also makes them inherently difficult to anonymize effectively.
4. The Re-identification Threat: Theory and Practice
The potential for re-identification is not merely theoretical; it is a practical threat demonstrated through various attack methodologies and real-world incidents. Understanding these mechanisms is crucial for appreciating the limitations of de-identification for search query data.
Mechanisms of Re-identification: Linkage, Inference, and Reconstruction Attacks
Re-identification attacks exploit residual information in de-identified data or leverage external knowledge to uncover identities or sensitive attributes. Key mechanisms include:
Linkage Attacks: This is arguably the most common and well-understood re-identification method. It works by combining the target de-identified dataset with one or more external (auxiliary) datasets that share common attributes (quasi-identifiers). If an individual can be uniquely matched across datasets based on these shared QIs, then identifying information from one dataset (e.g., name from a voter registry) can be linked to sensitive information in the other (e.g., health conditions or search queries from the de-identified dataset). The success of linkage attacks depends heavily on the uniqueness of individuals based on the available QIs and the availability of suitable auxiliary datasets. Examples include linking de-identified hospital discharge data to public voter registration lists using ZIP code, date of birth, and gender; linking anonymized Netflix movie ratings to public IMDb profiles using shared movie ratings and dates; and linking browsing histories to social media accounts based on clicked links.
Inference Attacks: These attacks aim to deduce new information about individuals, which may include their identity or sensitive attributes, often by exploiting statistical patterns or weaknesses in the de-identification method itself, sometimes without requiring explicit linkage to a named identity. Common types include:
Membership Inference: An attacker attempts to determine whether a specific, known individual’s data was included in the original dataset used to generate the de-identified data or train a model. This can be harmful if membership itself reveals sensitive information (e.g., inclusion in a dataset of individuals with a specific disease). Outliers in the data are often more vulnerable to this type of attack. Synthetic data generated by models that overfit the training data can be particularly susceptible.
Attribute Inference: An attacker tries to infer the value of a hidden or sensitive attribute for an individual based on their other known attributes in the de-identified data or based on the output of a model trained on the data. For example, inferring a likely medical condition based on a pattern of related searches.
Property Inference: An attacker seeks to learn aggregate properties or statistics about the original sensitive dataset that were not intended to be revealed.
Reconstruction Attacks: These attacks aim to reconstruct, partially or fully, the original sensitive data records from the released de-identified data, aggregate statistics, or machine learning models. This might involve combining information from multiple anonymized datasets or cleverly querying an anonymized database multiple times to piece together individual records. The increasing sophistication of AI and machine learning models provides new avenues for reconstruction attacks, for instance, by training models to reverse anonymization processes or reconstruct text from embeddings.
Other Mechanisms: Re-identification can also occur due to simpler failures:
Insufficient De-identification: Direct or obvious quasi-identifiers are simply missed during the scrubbing process, particularly in unstructured data like free text or notes.
Pseudonym Reversal: If the method used to generate pseudonyms is weak, predictable, or the key/algorithm is compromised, the original identifiers can be recovered. The NYC Taxi data incident, where medallion numbers were hashed using a known, reversible method, exemplifies this.
The threat landscape for re-identification is diverse and evolving. While linkage attacks relying on external data remain a primary concern, inference and reconstruction attacks, potentially powered by advanced AI/ML techniques, pose growing risks even to datasets processed with sophisticated methods. This necessitates robust privacy protections that anticipate a wide range of potential attack vectors.
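As a concrete illustration of the linkage mechanism, the following minimal sketch (all records fabricated) joins a “de-identified” table to a public registry on shared quasi-identifiers; where the combination is unique in both tables, the sensitive attribute is re-attached to a name.

```python
# Fabricated records illustrating a linkage attack via shared quasi-identifiers.
deidentified_records = [
    {"zip": "02138", "dob": "1945-07-31", "sex": "M", "diagnosis": "condition A"},
    {"zip": "02139", "dob": "1982-03-02", "sex": "F", "diagnosis": "condition B"},
]

public_registry = [
    {"name": "A. Resident", "zip": "02138", "dob": "1945-07-31", "sex": "M"},
    {"name": "B. Neighbor", "zip": "02139", "dob": "1982-03-02", "sex": "F"},
]

QIS = ("zip", "dob", "sex")

# Index the identified registry by its quasi-identifier combination.
index = {tuple(person[q] for q in QIS): person["name"] for person in public_registry}

for record in deidentified_records:
    key = tuple(record[q] for q in QIS)
    if key in index:
        print(f"{index[key]} -> {record['diagnosis']}")
# If the QI combination is unique in both tables, the "anonymous" record is re-identified.
```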
Landmark Case Study: The AOL Search Log Release (2006)
In August 2006, AOL publicly released a dataset containing approximately 20 million search queries made by over 650,000 users during a three-month period. The data was intended for research purposes and was presented as “anonymized.” The primary anonymization step involved replacing the actual user identifiers with arbitrary numerical IDs. However, the dataset retained the raw query text, query timestamps, and information about clicked results (rank and domain URL). Later statements suggest IP address and cookie information were also altered, though potentially insufficiently.
The attempt at anonymization failed dramatically and rapidly. Within days, reporters Michael Barbaro and Tom Zeller Jr. of The New York Times were able to re-identify one specific user, designated “AOL user No. 4417749,” as Thelma Arnold, a 62-year-old widow living in Lilburn, Georgia23. They achieved this by analyzing the sequence of queries associated with her user number. The queries contained a potent mix of quasi-identifiers, including searches for “landscapers in Lilburn, Ga,” searches for individuals with the surname “Arnold,” and searches for “homes sold in shadow lake subdivision gwinnett county georgia,” alongside other personally revealing (though not directly identifying) queries like “numb fingers,” “60 single men,” and “dog that urinates on everything.” The combination of these queries created a unique pattern easily traceable to Ms. Arnold through publicly available information.
The AOL incident became a watershed moment in data privacy. It starkly demonstrated several critical points relevant to search data de-identification:
Removing explicit user IDs is fundamentally insufficient when the underlying data itself contains rich identifying information.
Search queries, even seemingly innocuous ones, are laden with Personally Identifiable Information (PII) and powerful quasi-identifiers embedded in the text.
The temporal sequence of queries provides crucial context and significantly increases identifiability.
Linkage attacks using query content combined with publicly available information are feasible and effective.
Simple anonymization techniques fail to account for the identifying power of combined attributes and behavioral patterns.
The incident led to significant public backlash, the resignation of AOL’s CTO, and a class-action lawsuit. It remains a canonical example of the pitfalls of naive de-identification and the unique sensitivity of search query data.
Landmark Case Study: The Netflix Prize De-anonymization (2007-2008)
In 2006, Netflix launched a public competition, the “Netflix Prize,” offering $1 million to researchers who could significantly improve the accuracy of its movie recommendation system. To facilitate this, Netflix released a large dataset containing approximately 100 million movie ratings (1-5 stars, plus date) from nearly 500,000 anonymous subscribers, collected between 1998 and 2005. User identifiers were replaced with random numbers, and any other explicit PII was removed.
In 2007, researchers Arvind Narayanan and Vitaly Shmatikov published a groundbreaking paper demonstrating how this supposedly anonymized dataset could be effectively de-anonymized24. Their attack relied on linking the Netflix data with a publicly available auxiliary dataset: movie ratings posted by users on the Internet Movie Database (IMDb).
They developed statistical algorithms that could match users across the two datasets based on shared movie ratings and the approximate dates of those ratings. Their key insight was that while many users might rate popular movies similarly, the combination of ratings for less common movies, along with the timing, created unique signatures. They showed that an adversary knowing only a small subset (as few as 2, but more reliably 6-8) of a target individual’s movie ratings and approximate dates could, with high probability, uniquely identify that individual’s complete record within the massive Netflix dataset. Their algorithm was robust to noise, meaning the adversary’s knowledge didn’t need to be perfectly accurate (e.g., dates could be off by weeks, ratings could be slightly different).
Narayanan and Shmatikov successfully identified the Netflix records corresponding to several non-anonymous IMDb users, thereby revealing their potentially private Netflix viewing histories, including ratings for sensitive or politically charged films that were not part of their public IMDb profiles.
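The matching logic can be sketched in a few lines. The toy example below is loosely inspired by that scoring idea (the real algorithm weights rare titles more heavily and applies a formal “eccentricity” test before accepting a match); the records and tolerances are fabricated.

```python
# Fabricated toy data, loosely inspired by the published scoring approach.
from datetime import date

def score(known, record, day_tol=14, rating_tol=1):
    """Count how many of the adversary's known ratings are consistent with a record."""
    s = 0
    for title, rating, when in known:
        for r_title, r_rating, r_when in record:
            if (title == r_title
                    and abs(rating - r_rating) <= rating_tol
                    and abs((when - r_when).days) <= day_tol):
                s += 1
                break
    return s

# Pseudonymous "released" records: user_id -> list of (title, rating, date).
released = {
    1001: [("Obscure Film A", 5, date(2005, 3, 1)),
           ("Obscure Film B", 2, date(2005, 3, 9)),
           ("Popular Film",   4, date(2005, 4, 2))],
    1002: [("Popular Film",   4, date(2005, 4, 3)),
           ("Obscure Film C", 1, date(2005, 5, 5))],
}

# What the adversary gleaned from a public profile (ratings and dates only approximate).
known = [("Obscure Film A", 5, date(2005, 3, 4)),
         ("Obscure Film B", 3, date(2005, 3, 12))]

ranked = sorted(released, key=lambda uid: score(known, released[uid]), reverse=True)
best, runner_up = ranked[0], ranked[1]
margin = score(known, released[best]) - score(known, released[runner_up])
print(f"Best match: user {best}, score margin over runner-up: {margin}")
# A cautious adversary accepts the match only when this margin is decisive.
```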
The Netflix Prize de-anonymization study had significant implications:
It demonstrated the vulnerability of high-dimensional, sparse datasets (characteristic of much behavioral data, including search logs) to linkage attacks.
It proved that even seemingly non-sensitive data (movie ratings) can become identifying when combined with auxiliary information.
It highlighted the inadequacy of simply removing direct identifiers and replacing them with pseudonyms when dealing with rich datasets.
It underscored the power of publicly available auxiliary data in undermining anonymization efforts.
The research led to a class-action lawsuit against Netflix alleging privacy violations and the subsequent cancellation of a planned second Netflix Prize competition due to privacy concerns raised by the Federal Trade Commission (FTC). It remains a pivotal case study illustrating the fragility of anonymization for behavioral data.
Other Demonstrations of Re-identification Across Data Types
The AOL and Netflix incidents are not isolated cases. Numerous studies and breaches have demonstrated the feasibility of re-identifying individuals from various types of supposedly de-identified data, reinforcing the systemic nature of the challenge, especially for rich, individual-level records.
Health Data: The re-identification of Massachusetts Governor William Weld’s health records in the 1990s by Latanya Sweeney, using public voter registration data (ZIP code, date of birth, gender) linked to de-identified hospital discharge summaries, was an early warning. More recently, researchers re-identified patients in a publicly released dataset of Australian medical billing (MBS/PBS) information, despite assurances of anonymity, again using linkage techniques. Genomic data also poses significant risks; individuals have been re-identified from aggregate genomic data shared through research beacons via repeated querying or linkage to genealogical databases. Clinical notes containing narrative descriptions of events, like motor vehicle accidents, have also been used to re-identify patients by linking details to external reports. These incidents raise questions about the adequacy of standards like HIPAA’s Safe Harbor method for de-identification25.
Location and Mobility Data: The release of New York City taxi trip data in 2014 led to re-identification of drivers and exposure of their earnings and movements because the supposedly anonymized taxi medallion numbers were hashed using a weak, easily reversible method. Studies analyzing mobile phone location data (cell tower or GPS traces) have shown that just a few spatio-temporal points are often sufficient to uniquely identify an individual due to the distinctiveness of human movement patterns26.
Financial Data: Research by de Montjoye et al. demonstrated that even with coarse location and time information, just four points were often enough to uniquely identify individuals within a dataset of 1.1 million people’s credit card transactions over three months27.
Social Media and Browsing Data: Su et al. showed web browsing histories could be linked to social media profiles28. Other studies have explored re-identification risks in social network graphs based on connection patterns.
The following table summarizes some of these key incidents:
Table 2: Summary of Notable Re-identification Incidents
| Incident Name/Year | Data Type | “Anonymization” Method Used | Re-identification Method | Auxiliary Data Used | Key Finding/Significance |
| --- | --- | --- | --- | --- | --- |
| MA Governor Weld (1990s) | Hospital Discharge Data | Removal of direct identifiers (name, address, SSN) | Linkage Attack | Public Voter Registration List (ZIP, DoB, Gender) | Early demonstration that QIs in supposedly de-identified data allow linkage to identified data. |
| AOL Search Logs (2006) | Search Queries | User ID replaced with number; Query text, timestamps retained | Linkage/Inference from Query Content | Public knowledge, location directories | Search queries themselves contain rich PII/QIs enabling re-identification. Simple ID removal is insufficient. |
| Netflix Prize (2007-8) | Movie Ratings (user, movie, rating, date) | User ID replaced with number | Linkage Attack | Public IMDb User Ratings | High-dimensional, sparse behavioral data is vulnerable. Small amounts of auxiliary data can enable re-id. |
| NYC Taxis (2014) | Taxi Trip Records (incl. hashed medallion/license) | Weak (MD5) hashing of identifiers | Pseudonym Reversal (Hash cracking) | Knowledge of hashing algorithm | Poorly chosen pseudonymization (weak hashing) is easily reversible. |
| Australian Health Records (MBS/PBS) (2016) | Medical Billing Data | Claimed de-identification (details unclear) | Linkage Attack | Publicly available information (e.g., birth year, surgery dates) | Government-released health data, claimed anonymous, was re-identifiable. |
| Browsing History / Social Media | Web Browsing History | Assumed de-identified (focus on linking) | Linkage Attack | Social Media Feeds (e.g., Twitter) | Unique patterns of link clicking in browsing history mirror unique social feeds, enabling linkage. |
| Credit Card Transactions (de Montjoye et al.) | Credit Card Transaction Records | — | Uniqueness Analysis | — | Sparse transaction data is highly unique; few points needed for re-identification. |
| Location Data (Various studies) | Mobile Phone Location Traces | Various (often simple ID removal or aggregation) | Uniqueness Analysis / Linkage Attack | Maps, Points of Interest, Public Records | Human mobility patterns are highly unique; location data is easily re-identifiable. |
These examples collectively illustrate that re-identification is not a niche problem confined to specific data types but a systemic risk inherent in sharing or releasing granular data about individuals, especially when that data captures complex behaviors over time or across multiple dimensions. Search query logs share many characteristics with these vulnerable datasets (high dimensionality, sparsity, behavioral patterns, embedded QIs, longitudinal nature), strongly suggesting they face similar, if not greater, re-identification risks.
The Critical Role of Auxiliary Information
A recurring theme across nearly all successful re-identification demonstrations is the crucial role played by auxiliary information. This refers to any external data source or background knowledge an attacker possesses or can obtain about individuals, which can then be used to bridge the gap between a de-identified record and a real-world identity.
The sources of auxiliary information are vast and continuously expanding in the era of Big Data:
Public Records: Voter registration lists, property ownership records, professional license databases, court records, census data summaries, etc.
Social Media and Online Profiles: Publicly visible information on platforms like Facebook, Twitter/X, LinkedIn, IMDb, personal blogs, forums, etc., containing names, locations, interests, connections, activities, and opinions.
Commercial Data Brokers: Companies that aggregate and sell detailed profiles on individuals, compiled from diverse sources including purchasing history, online behavior, demographics, financial information, etc.
Other Breached or Leaked Data: Datasets exposed through security breaches can become auxiliary information for attacking other datasets.
Academic or Research Data: Publicly released datasets from previous research studies.
Personal Knowledge: Information an attacker knows about a specific target individual (e.g., their approximate age, place of work, recent activities, known associates).
The critical implication is that the privacy risk associated with a de-identified dataset cannot be assessed in isolation. Its vulnerability depends heavily on the external data ecosystem and what information might be available for linkage. De-identification performed today might be broken tomorrow as new auxiliary data sets become available or linkage techniques improve. This makes robust anonymization a moving target. Any assessment of re-identification risk must therefore be contextual, considering the specific data being released, the intended recipients or release environment, and the types of auxiliary information reasonably available to potential adversaries. Relying solely on removing identifiers without considering this broader context creates a fragile and likely inadequate privacy protection strategy.
5. Limitations of De-identification Techniques on Search Data
Given the unique characteristics of search query data and the demonstrated power of re-identification attacks, it is essential to critically evaluate the limitations of specific de-identification techniques when applied to this context.
The Fragility of k-Anonymity in High-Dimensional, Sparse Data
As established in Section 2, k-anonymity aims to protect privacy by ensuring that any individual record in a dataset is indistinguishable from at least k-1 other records based on their quasi-identifier (QI) values. This is typically achieved through generalization (making QI values less specific) and suppression (removing records or values).
However, k-anonymity proves fundamentally ill-suited for high-dimensional and sparse datasets like search logs. The core problem lies in the “curse of dimensionality”:
Uniqueness: In datasets with many attributes (dimensions), individual records tend to be unique or nearly unique across the combination of those attributes. Finding k search users who have matching patterns across numerous QIs (specific query terms, timestamps, locations, click behavior, etc.) is highly improbable.
Utility Destruction: To force records into equivalence classes of size k, massive amounts of generalization or suppression are required. Generalizing query terms might mean reducing specific searches like “side effects of lisinopril” to a broad category like “health query,” destroying the semantic richness crucial for analysis. Suppressing unique or hard-to-group records could eliminate vast portions of the dataset. This results in an unacceptable level of information loss, potentially rendering the data useless for its intended purpose.
Vulnerability to Attacks: Even if k-anonymity is technically achieved, it remains vulnerable. The homogeneity attack occurs if all k records in a group share the same sensitive attribute (e.g., all searched for the same sensitive topic), revealing that attribute for anyone linked to the group. Background knowledge attacks can allow adversaries to further narrow down possibilities within a group.
Refinements like l-diversity and t-closeness attempt to address attribute disclosure vulnerabilities by requiring diversity or specific distributional properties for sensitive attributes within each group. However, they inherit the fundamental problems of k-anonymity regarding high dimensionality and utility loss, while adding implementation complexity. Furthermore, k-anonymity lacks robust compositionality; combining multiple k-anonymous releases does not guarantee privacy. Therefore, k-anonymity and its derivatives face challenges when used for de-identifying massive, complex search logs. They force difficult choices between retaining minimal utility or providing inadequate privacy protection against linkage and inference attacks.
Differential Privacy: The Utility-Privacy Trade-off and Implementation Hurdles
Differential Privacy (DP) offers a fundamentally different approach, providing mathematically rigorous, provable privacy guarantees29. Instead of modifying data records directly to achieve indistinguishability, DP focuses on the output of computations (queries, analyses, models) performed on the data. It ensures that the result of any computation is statistically similar whether or not any single individual’s data is included in the input dataset. This is typically achieved by adding carefully calibrated random noise to the computation’s output.
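A minimal sketch of the central-model Laplace mechanism for a single counting query illustrates the idea; the log, epsilon values, and query are hypothetical, and a production system would also need sensitivity analysis and privacy-budget accounting across all queries.

```python
# Hypothetical data and epsilon values; central-model Laplace mechanism for one count.
import random

def laplace_noise(scale: float) -> float:
    # The difference of two independent exponential variables is Laplace-distributed.
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def dp_count(records, predicate, epsilon: float) -> float:
    """Noisy count: a counting query has sensitivity 1, so the noise scale is 1/epsilon."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1 / epsilon)

# Hypothetical log where each record is reduced to a topic label.
log = ["health"] * 1200 + ["travel"] * 800 + ["finance"] * 500

random.seed(7)
for eps in (0.1, 1.0):
    noisy = dp_count(log, lambda topic: topic == "health", eps)
    print(f"epsilon={eps}: noisy count of 'health' queries ~ {noisy:.0f} (true value 1200)")
# A smaller epsilon means more noise: stronger privacy, less accurate answers.
```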
DP’s strengths are significant: its guarantees hold regardless of an attacker’s auxiliary knowledge, and privacy loss (quantified by ε and δ) composes predictably across multiple analyses. However, applying DP effectively to massive search logs presents substantial challenges:
Applicability to Complex Queries and Data Types: DP is well-understood for basic aggregate queries (counts, sums, averages, histograms) on numerical or categorical data. Applying it effectively to the complex structures and query types relevant to search logs—such as analyzing free-text query semantics, mining sequential patterns in user sessions, building complex machine learning models (e.g., for ranking or recommendations), or analyzing graph structures (e.g., click graphs)—is more challenging and an active area of research. Standard DP mechanisms might require excessive noise or simplification for such tasks. Techniques like DP-SGD (Differentially Private Stochastic Gradient Descent) exist for training models, but again involve utility trade-offs30.
The Utility-Privacy Trade-off31: This is the most fundamental challenge. The strength of the privacy guarantee depends on the amount of noise added: a lower ε (stronger privacy) requires more noise, and more noise reduces the accuracy and utility of the results. For the complex, granular analyses often desired from search logs (e.g., understanding rare query patterns, analyzing specific user journeys, training accurate prediction models), the amount of noise required to achieve a meaningful level of privacy (a small ε) might overwhelm the signal, rendering the results unusable. While DP performs better on larger datasets where individual contributions are smaller, the sensitivity of queries on sparse, high-dimensional data can still necessitate significant noise. Finding an acceptable balance between privacy and utility for diverse use cases remains a major hurdle.
Implementation Complexity and Correctness: Implementing DP correctly requires significant expertise in both the theory and the practical nuances of noise calibration, sensitivity analysis (bounding how much one individual can affect the output), and privacy budget management. Errors in implementation, such as underestimating sensitivity or mismanaging the privacy budget across multiple queries (due to composition rules), can silently undermine the promised privacy guarantees. Defining the “privacy unit” (e.g., user, query, session) appropriately is critical; misclassification can lead to unintended disclosures. Auditing DP implementations for correctness is also non-trivial.
Local vs. Central Models: DP can be implemented in two main models. In the central model, a trusted curator collects raw data and then applies DP before releasing results. This generally allows for higher accuracy (less noise for a given ε) but requires users to trust the curator with their raw data. In the local model (LDP), noise is added on the user’s device before data is sent to the collector. This offers stronger privacy guarantees as the collector never sees raw data, but typically requires significantly more noise to achieve the same level of privacy, often leading to much lower utility. The choice of model impacts both trust assumptions and achievable utility.
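Classic randomized response is the simplest way to see the local model in action. In the hedged sketch below (all values are illustrative), each user reports one noisy yes/no bit; no individual report is trustworthy on its own, yet the population rate can be recovered by inverting the known noise process.

```python
# Illustrative local-DP sketch: randomized response for one yes/no signal
# (e.g., "searched for a sensitive topic this week").
import random

def randomized_response(truth: bool, p_truth: float = 0.75) -> bool:
    """Report the true bit with probability p_truth, otherwise a uniformly random bit."""
    if random.random() < p_truth:
        return truth
    return random.random() < 0.5

def estimate_rate(reports, p_truth: float = 0.75) -> float:
    """Invert the known noise process to recover an unbiased population estimate."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth) / 2) / p_truth

random.seed(1)
true_rate = 0.10  # 10% of simulated users have the sensitive attribute
population = [random.random() < true_rate for _ in range(200_000)]
reports = [randomized_response(x) for x in population]

print(f"Observed report rate: {sum(reports) / len(reports):.3f}")
print(f"De-biased estimate:   {estimate_rate(reports):.3f} (true rate {true_rate})")
# Per-user noise is why the local model generally costs more utility than the central model.
```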
In essence, while DP provides the gold standard in theoretical privacy guarantees, its practical application to the scale and complexity of search logs involves significant compromises in data utility and faces non-trivial implementation hurdles. It is not a simple “plug-and-play” solution for making granular search data both private and fully useful.
Inadequacies of Aggregation, Masking, and Generalization for Search Logs
Simpler, traditional de-identification techniques prove largely insufficient for protecting privacy in search logs while preserving meaningful utility:
Aggregation: Releasing only aggregate statistics (e.g., total searches for “flu symptoms” per state per week) hides individual query details but destroys the granular, user-level information needed for many types of analysis, such as understanding user behavior sequences, personalization, or detailed linguistic analysis. Furthermore, aggregation alone is not immune to privacy breaches. Comparing aggregate results across slightly different populations or time periods (differencing attacks) can potentially reveal information about individuals or small groups; a minimal sketch of such an attack follows this list. Releasing too many different aggregate statistics on the same underlying data also increases leakage risk through reconstruction attacks.
Masking/Suppression: As the AOL case vividly illustrates, simply masking or suppressing direct identifiers like user IDs or IP addresses is inadequate when the content itself (the queries) is identifying. Attempting to mask or suppress all potential quasi-identifiers within the free-text queries is practically infeasible due to the unstructured nature of the data and the sheer volume of potential identifiers (see Section 3.2). Suppressing entire queries or user records deemed risky would lead to massive data loss and biased results.
Generalization: Applying generalization to search query text would require replacing specific, meaningful terms with broad, vague categories (e.g., replacing “best Italian restaurant near Eiffel Tower” with “food query” or “location query”). This level of abstraction would obliterate the semantic nuances and specific intent captured in search queries, rendering the data useless for most research and operational purposes. The utility loss associated with generalization needed to achieve even weak privacy guarantees like k-anonymity in such high-dimensional data is prohibitive.
These foundational techniques, while potentially useful as components within a more sophisticated strategy (e.g., aggregation combined with differential privacy), are individually incapable of addressing the complex privacy challenges posed by massive search query datasets without sacrificing the data’s core value. As we discuss further, even combined they fall short.
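The differencing risk mentioned for aggregation is easy to see in miniature. In the fabricated example below, two aggregate counts that each look harmless differ by exactly one person, exposing that person's sensitive attribute.

```python
# Fabricated records illustrating a differencing attack on "safe" aggregate statistics.
records = [
    {"name": "user_01", "dept": "engineering", "searched_topic_x": True},
    {"name": "user_02", "dept": "engineering", "searched_topic_x": False},
    {"name": "user_03", "dept": "engineering", "searched_topic_x": False},
    {"name": "user_04", "dept": "marketing",   "searched_topic_x": True},
]

def count_topic_x(rows):
    return sum(r["searched_topic_x"] for r in rows)

# Two aggregate releases that seem harmless on their own:
all_staff   = count_topic_x(records)
engineering = count_topic_x([r for r in records if r["dept"] == "engineering"])

# Marketing contains exactly one person, so the difference reveals their value.
print("user_04 searched topic X:", bool(all_staff - engineering))
```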
Challenges with Synthetic Data Generation for Complex Behavioral Data
Generating synthetic data—artificial data designed to mirror the statistical properties of real data without containing actual individual records—has emerged as a promising privacy-enhancing technology. It offers the potential to share data insights without sharing real user information. However, creating high-quality, privacy-preserving synthetic search logs faces significant hurdles32:
Utility Preservation: Search logs capture complex patterns: semantic relationships between query terms, sequential dependencies in user sessions, temporal trends, correlations between queries and clicks, and vast individual variability. Training a generative model (e.g., a statistical model or a deep learning model like an LLM) to accurately capture all these nuances without access to the original data is extremely challenging. If the synthetic data fails to replicate these properties faithfully, it will have limited utility for downstream tasks like training accurate machine learning models or conducting reliable behavioral research. Generating realistic sequences of queries that maintain semantic coherence and plausible user intent is particularly difficult.
Privacy Risks (Memorization and Inference): Generative models, especially large and complex ones like LLMs, run the risk of “memorizing” or “overfitting” to their training data. If this happens, the model might generate synthetic examples that are identical or very close to actual records from the sensitive training dataset, thereby leaking private information. This risk is often higher for unique or rare records (outliers) in the original data. Even if exact records aren’t replicated, the synthetic data might still be vulnerable to membership inference attacks, where an attacker tries to determine if a specific person’s data was used to train the generative model. Ensuring the generation process itself is privacy-preserving, for example by using DP during model training, is crucial but adds complexity and can impact the fidelity (utility) of the generated data. Evaluating the actual privacy level achieved by synthetic data is also a complex task.
Bias Amplification: Generative models learn patterns from the data they are trained on. If the original search log data contains societal biases (e.g., stereotypical associations, skewed representation of demographic groups), the synthetic data generated is likely to replicate, and potentially even amplify, these biases. This can lead to unfair or discriminatory outcomes if the synthetic data is used for training downstream applications.
Therefore, while synthetic data holds promise, generating truly useful and private synthetic search logs is a frontier research problem. The very complexity that makes search data valuable also makes it incredibly difficult to synthesize accurately without inadvertently leaking information or perpetuating biases. It requires sophisticated modeling techniques combined with robust privacy-preserving methods like DP integrated directly into the generation workflow.
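A deliberately naive sketch shows the memorization concern. The toy bigram generator below (fabricated training queries, no privacy protections) can emit rare training queries verbatim, which is precisely what DP-based training is meant to prevent.

```python
# Naive synthetic-query generator: a bigram model with no privacy protections.
# Training data is fabricated; the point is the memorization risk for rare queries.
import random
from collections import defaultdict

training_queries = [
    "cheap flights to lisbon",
    "cheap flights to rome",
    "side effects of lisinopril",  # a rare, sensitive query
    "best pizza near me",
    "best pizza in brooklyn",
]

# Fit bigram transitions with start/end markers.
transitions = defaultdict(list)
for q in training_queries:
    words = ["<s>"] + q.split() + ["</s>"]
    for a, b in zip(words, words[1:]):
        transitions[a].append(b)

def sample_query(max_len: int = 8) -> str:
    word, out = "<s>", []
    while len(out) < max_len:
        word = random.choice(transitions[word])
        if word == "</s>":
            break
        out.append(word)
    return " ".join(out)

random.seed(3)
for _ in range(5):
    q = sample_query()
    flag = "  <- verbatim copy of a training query" if q in training_queries else ""
    print(q + flag)
# Rare queries with no alternative continuations are reproduced word for word whenever
# their first token is sampled, leaking the original record into the "synthetic" output.
```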
6. Harms, Ethics, and Societal Implications
The challenges of de-identifying search query data are not merely technical or legal; they extend into architectural and organizational domains that fundamentally shape privacy outcomes. How data is released—through what mechanisms, under what controls, and with what oversight—represents an architectural problem bound by organizational principles and norms. The key architectural building block lies in the design of APIs (Application Programming Interfaces), which can act as critical shields between raw data and external access. Re-identification attempts can be partially mitigated at the API level through strict query limits, access controls, auditing mechanisms, and purpose restrictions—complementing the privacy-enhancing technologies discussed throughout this paper. These architectural choices embed ethical values and reflect organizational commitments to privacy that go beyond technical implementation, and they carry real-world consequences if privacy is compromised. Such controls may be manageable at the level of a single organization, with extensive oversight and an enforceable data protection legal regime in place, but they are difficult to envision for ongoing, large-scale access by multiple unrelated parties. Once data has been released, it is beyond the API’s control; cutting off future access after multiple releases have already created re-identification risk may not be feasible, and the data holder often cannot tell whether separate API users are collaborating or combining data.
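The kind of API-level gatekeeping described above can be sketched as follows; this is an illustrative stub (hypothetical endpoints, budgets, and clients, not any provider's actual interface), not a substitute for the technical and legal controls discussed in this paper.

```python
# Illustrative gatekeeper stub: hypothetical endpoints, budgets, and clients.
import time
from collections import defaultdict

ALLOWED_ENDPOINTS = {"aggregate_counts", "topic_trends"}  # no record-level endpoints exposed
DAILY_QUERY_BUDGET = 100                                  # blunt limit on differencing/reconstruction

query_counts = defaultdict(int)
audit_log = []

def handle_request(client_id, endpoint, stated_purpose):
    audit_log.append((time.time(), client_id, endpoint, stated_purpose))  # auditing
    if endpoint not in ALLOWED_ENDPOINTS:
        return {"error": "record-level access is not exposed"}            # access control
    if query_counts[client_id] >= DAILY_QUERY_BUDGET:
        return {"error": "daily query budget exhausted"}                  # query limits
    query_counts[client_id] += 1
    return {"status": "ok", "endpoint": endpoint}

print(handle_request("research_team_a", "aggregate_counts", "flu trend study"))
print(handle_request("research_team_a", "raw_queries", "flu trend study"))
# As noted above, such controls cannot reveal whether unrelated clients pool their
# results after the fact; they reduce, but do not eliminate, re-identification risk.
```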
Potential Harms from Re-identified Search Data: From Embarrassment to Discrimination
If supposedly de-identified search query data is successfully re-linked to individuals, the consequences can range from personal discomfort to severe, tangible harms. Search histories can reveal extremely sensitive aspects of a person’s life, including:
Health conditions and concerns (searches for symptoms, diseases, treatments, doctors).
Financial status (searches for loans, debt consolidation, specific products, income levels).
Sexual orientation or gender identity (searches related to LGBTQ+ topics, dating sites, transitioning).
Political or religious beliefs (searches for specific groups, ideologies, places of worship).
Location and movement patterns (searches for addresses, directions, local services).
Personal interests, relationships, and vulnerabilities.
The exposure of such information through re-identification can lead to a spectrum of harms:
Embarrassment, Shame, and Reputational Damage: Public revelation of private searches or interests can cause significant personal distress and social stigma. The experience of Thelma Arnold, whose personal life was laid bare through her AOL search queries, or the potential exposure of sensitive movie preferences in the Netflix case, illustrates this risk. Reputational harm can affect personal relationships and professional standing.
Discrimination: Re-identified data revealing health status, ethnicity, religion, sexual orientation, financial vulnerability, or other characteristics could be used to discriminate against individuals in critical areas like employment, insurance (health, life, long-term care), credit, housing, or access to other opportunities. Profiling based on inferred characteristics from search data can lead to biased decision-making and exclusion.
Stigmatization: Disclosure of sensitive information, such as an HIV diagnosis inferred from searches, mental health struggles, or affiliation with marginalized groups, can lead to social isolation and prejudice.
Financial Harm: Re-identified data can facilitate identity theft, financial fraud, or targeted scams. It could also enable discriminatory pricing practices based on inferred user characteristics or willingness to pay.
Physical Harm and Safety Risks: Information about an individual’s location, routines, or vulnerabilities derived from search history could be exploited for stalking, harassment, physical intimidation, or other forms of violence.
Psychological Harm: The mere knowledge or fear of being surveilled, profiled, or having one’s private thoughts exposed can cause significant anxiety, stress, and a feeling of powerlessness or loss of control. Data breaches involving sensitive information are known to cause emotional distress.
These potential harms underscore the high stakes involved in handling search query data. The impact extends beyond individual privacy violations to potential societal harms, such as reinforcing existing inequalities through discriminatory profiling or undermining trust in digital services. Critically, legal systems often struggle to recognize and provide remedies for many of these harms, particularly those that are non-financial, cumulative, or relate to future risks.
7. Conclusion: Synthesizing the Challenges and Risks
The de-identification of massive search query datasets presents a complex and formidable challenge, sitting at the intersection of immense data value and profound privacy risk. While the potential benefits of analyzing search behavior for societal good, service improvement, and innovation are undeniable, the inherent nature of this data makes achieving meaningful privacy protection through de-identification exceptionally difficult.
The Core Privacy Paradox of Search Data De-identification
The fundamental paradox lies in the richness of the data itself. Search logs capture a high-dimensional, sparse, and longitudinal record of human intent and behavior. This richness, containing myriad explicit and implicit identifiers and quasi-identifiers embedded within unstructured query text and temporal patterns, creates unique individual fingerprints. Consequently, techniques designed to obscure identity often face a stark trade-off: either they fail to adequately protect against re-identification attacks (especially linkage attacks leveraging the vast ecosystem of auxiliary data), or they must apply such aggressive generalization, suppression, or noise addition that the data’s analytical utility is severely compromised.
Traditional methods like k-anonymity are fundamentally crippled by the “curse of dimensionality” inherent in this data type. More advanced techniques like differential privacy offer stronger theoretical guarantees but introduce significant practical challenges related to the privacy-utility balance, implementation complexity, and applicability to the diverse analyses required for search data. Synthetic data generation, while promising, faces similar difficulties in capturing complex behavioral nuances without leaking information or amplifying bias.
Summary of Key Risks and Vulnerabilities
The analysis presented in this report highlights several critical risks associated with attempts to de-identify search query data:
High Re-identification Risk: Due to the data’s uniqueness and the power of linkage attacks using auxiliary information, the risk of re-identifying individuals from processed search logs remains substantial. Landmark failures like the AOL and Netflix incidents serve as potent warnings.
Inadequacy of Simple Techniques: Basic methods like removing direct identifiers, masking, simple aggregation, or naive generalization are insufficient to protect against sophisticated attacks on this type of data.
Limitations of Advanced Techniques: Even state-of-the-art methods like differential privacy and synthetic data generation face significant hurdles in balancing provable privacy with practical utility for complex, granular search data analysis.
Evolving Threat Landscape: The continuous growth of available data and the increasing sophistication of analytical techniques, including AI/ML-driven attacks, mean that re-identification risks are dynamic and likely increasing over time.
Potential for Serious Harm: Re-identification can lead to tangible harms, including discrimination, financial loss, reputational damage, psychological distress, and chilling effects on free expression and inquiry.
The Ongoing Debate
The challenges outlined fuel an ongoing debate about the viability and appropriate role of de-identification in the context of large-scale behavioral data. While organizations invest in Privacy Enhancing Technologies (PETs) and implement policies aimed at protecting user privacy, the demonstrable risks and technical limitations suggest that achieving true, robust anonymity for granular search query data, while maintaining high utility, remains an elusive goal.
During the preparation of this work the author used ChatGPT to reword and rephrase text and for a first draft of the two charts in the document. After using this tool/service, the author reviewed and edited the content as needed and takes full responsibility for the content of the publication.
Polonetsky, Tene and Finch: https://digitalcommons.law.scu.edu/cgi/viewcontent.cgi?article=2827&context=lawreview ↩︎
We note the European Court of Justice Breyer decision and subsequent EU court decisions that may open up a legal argument that it may be possible to consider a party that does not reasonably have potential access to the additional data to be in possession of non-personal data. https://curia.europa.eu/juris/document/document.jsf?docid=184668&doclang=EN ↩︎
Aggarwal, Charu C. (2005). “On k-Anonymity and the Curse of Dimensionality”. VLDB ’05 – Proceedings of the 31st International Conference on Very Large Data Bases. Trondheim, Norway. CiteSeerX 10.1.1.60.3155 ↩︎
Marcus Olsson: https://marcusolsson.dev/k-anonymity-and-l-diversity/ ↩︎
Ninghui Li, Tiancheng Li, and Suresh Venkatasubramanian, “t-Closeness: Privacy Beyond k-Anonymity and ℓ-Diversity,” Proceedings of the 23rd IEEE International Conference on Data Engineering (2007). ↩︎
Dwork, C. (2006). Differential Privacy. In: Bugliesi, M., Preneel, B., Sassone, V., Wegener, I. (eds) Automata, Languages and Programming. ICALP 2006. Lecture Notes in Computer Science, vol 4052. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11787006_1 ↩︎
Cynthia Dwork, “Differential Privacy,” in Automata, Languages and Programming, 33rd International Colloquium, ICALP 2006, Proceedings, Part II, ed. Michele Bugliesi et al., Lecture Notes in Computer Science 4052 (Berlin: Springer, 2006) ↩︎
Guidelines for Evaluating Differential Privacy Guarantees – NIST Technical Series Publications, https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-226.pdf ↩︎
Privacy Tech-Know blog: When what is old is new again – The reality of synthetic data, https://www.priv.gc.ca/en/blog/20221012/ ↩︎
FPF Launches Major Initiative to Study Economic and Policy Implications of AgeTech
FPF and University of Arizona Eller College of Management Awarded Grant by Alfred P. Sloan Foundation to Address Privacy Implications and Data Uses of Technologies Aimed at Aging at Home
The Future of Privacy Forum (FPF), a global non-profit focused on data protection, AI, and emerging technologies, has been awarded a grant from the Alfred P. Sloan Foundation to lead a two-year research project entitled Aging at Home: Caregiving, Privacy, and Technology, in partnership with the University of Arizona Eller College of Management. The project, which launched on April 1, will explore the complex intersection of privacy, economics, and the use of emerging technologies designed to support aging populations (“AgeTech”). AgeTech includes a wide range of applications and technologies, from fall detection devices and health monitoring apps to artificial intelligence (AI)-powered assistants.
The number of seniors eighty-five and older is expected to nearly double by 2035 and nearly triple by 2060. This rapidly aging population presents complex challenges and opportunities, particularly around the increased demand for resources necessary for senior care and the use of AgeTech to promote autonomy and independence.
FPF will lead rigorous, independent research into these issues, with a particular focus on the privacy expectations of seniors and caregivers, cost barriers to adoption, and the policy gaps surrounding AgeTech. The research will include experimental surveys, roundtables with industry and policy leaders, and a systematic review of economic and privacy challenges facing AgeTech solutions.
The project will be led by co-principals Jules Polonetsky, CEO of FPF, and Dr. Laura Brandimarte, Associate Professor of Management Information Systems at the University of Arizona Eller College of Management. Polonetsky is an internationally recognized privacy expert and co-editor of the Cambridge Handbook on Consumer Privacy. Brandimarte’s work focuses on the ethics of technology, with an emphasis on privacy and security, and uses quantitative methods including survey and experimental design and econometric data analysis.
Jordan Wrigley, a data and policy analyst who leads FPF’s health data research, will play a lead role for FPF along with members of FPF’s U.S., Global, and AI Policy teams. Jordan is an award-winning health meta-analytic methodologist and researcher whose work has informed medical care guidelines and AI data practices.
“The privacy aspects of AgeTech, such as consent and authorization, data sensitivity, and cost, need to be studied and considered holistically to create sustainable policies and build trust with seniors and caregivers as the future of aging becomes the present,” said Wrigley. “This research will seek to do just that.”
“At FPF, we believe that technology and data can benefit society and improve lives when the right laws, policies, and safeguards are in place,” added Polonetsky. “The goal of AgeTech – to assist seniors in living independently while reducing healthcare costs and caregiving burdens – impacts us all. As this field grows, it’s essential that we have the right rules in place to protect privacy and preserve dignity.”
“Technology has the potential to increase the autonomy and overall wellbeing of an ageing population, but for that to happen there has to be trust on the part of users – both that the technology will effectively be of assistance and that it will not constitute another source of data privacy and security intrusions,” added Brandimarte. “We currently know very little about the level of trust the elderly place in AgingTech and the specific needs of this at-risk population when they interact with it, including data accessibility by family members or caregivers.”
Dr. Daniel Goroff, Vice President and Program Director for Sloan, agrees, “As AgeTech evolves, it brings enormous promise—along with pressing questions about equity, access, and privacy. This initiative will provide insights about how innovations can ethically and responsibly enhance the autonomy and dignity of older adults. We’re excited to see FPF and the University of Arizona leading the way on this timely research.”
Key project outputs will include:
A public taxonomy of AgeTech tools and best practices
Policy reports and recommendations for industry leaders and policymakers
Clear, actionable guidance tailored to address specific challenges identified in the research
Scholarly publications presenting new findings on AgeTech
Resources developed to increase awareness among seniors, caregivers, and policymakers
Events to disseminate findings and share educational materials directly to stakeholder groups, including policymakers, industry leaders, and advocacy groups.
Sign up for our mailing list to stay informed about future progress, and reach out to Jordan Wrigley ([email protected]) if you are interested in learning more about the project.
Aging at Home: Caregiving, Privacy, and Technology is supported by the Alfred P. Sloan Foundation under Grant No. G-2025-25191.
About The Alfred P. Sloan Foundation
The ALFRED P. SLOAN FOUNDATION is a not-for-profit, mission-driven grantmaking institution dedicated to improving the welfare of all through the advancement of scientific knowledge. Established in 1934 by Alfred Pritchard Sloan Jr., then-President and Chief Executive Officer of the General Motors Corporation, the Foundation makes grants in four broad areas: direct support of research in science, technology, engineering, mathematics, and economics; initiatives to increase the quality, equity, diversity, and inclusiveness of scientific institutions and the science workforce; projects to develop or leverage technology to empower research; and efforts to enhance and deepen public engagement with science and scientists. sloan.org | @SloanFoundation
About Future of Privacy Forum (FPF)
FPF is a global non-profit organization that brings together academics, civil society, government officials, and industry to evaluate the societal, policy, and legal implications of data use, identify the risks, and develop appropriate protections. FPF believes technology and data can benefit society and improve lives if the right laws, policies, and rules are in place. FPF has offices in Washington D.C., Brussels, Singapore, and Tel Aviv. Follow FPF on X and LinkedIn.
About the University of Arizona Eller College of Management
The Eller College of Management at The University of Arizona offers highly ranked undergraduate (BSBA and BSPA), MBA, MPA, master’s, and doctoral (Ph.D.) degrees in accounting, economics, entrepreneurship, finance, marketing, management and organizations, management information systems (MIS), and public administration and policy in Tucson and Phoenix, Arizona.
FPF and OneTrust publish the Updated Guide on Conformity Assessments under the EU AI Act
The Future of Privacy Forum (FPF) and OneTrust have published an updated version of their Conformity Assessments under the EU AI Act: A Step-by-Step Guide, along with an accompanying Infographic. This updated Guide reflects the text of the EU Artificial Intelligence Act (EU AIA), adopted in 2024.
Conformity Assessments (CAs) play a significant role in the EU AIA’s accountability and compliance framework for high-risk AI systems. The updated Guide and Infographic provide a step-by-step roadmap for organizations seeking to understand whether they must conduct a CA. Both resources are designed to support organizations as they navigate their obligations under the AIA and build internal processes that reflect the Act’s overarching accountability framework. However, they do not constitute legal advice for any specific compliance situation.
Key highlights from the Updated Guide and Infographic:
An overview of the EU AIA and its implementation and compliance timeline. The AIA is a regulation that tailors obligations to the level of risk posed by an AI system and applies in phases. Some provisions of the AIA began to apply in early 2025, such as the prohibitions on certain AI practices and AI literacy requirements. By 2 August 2025, the infrastructure related to governance and the conformity assessment process must be operational. The full set of obligations for high-risk AI systems, including the requirement to conduct CAs, will apply from 2 August 2026.
Understanding when a conformity assessment is required. The Guide provides a detailed flowchart to help determine whether an AI system is subject to the CA obligations. It outlines key steps, such as determining whether the system falls under the AIA, whether it is classified as “high-risk”, and who is responsible for conducting the CA. CAs are not new in the EU context; the AIA builds on product safety legislation under the New Legislative Framework (NLF) to ensure that high-risk AI systems meet both legal and technical standards before and after being placed on the market and throughout their use.
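Purely as an illustration of that decision flow (not legal advice, and not the Guide’s own wording), the hypothetical sketch below encodes the kind of yes/no questions the flowchart walks through; the field names and the simplified logic are assumptions made for this example, not the AIA’s legal tests.

```python
# Hypothetical, highly simplified encoding of the CA-applicability questions.
# Field names and logic are illustrative assumptions, not the AIA's legal tests.
from dataclasses import dataclass

@dataclass
class AISystemProfile:
    in_scope_of_aia: bool              # e.g., placed on the EU market or put into service in the EU
    safety_component_under_nlf: bool   # safety component of a product covered by Annex I legislation
    annex_iii_use_case: bool           # falls under a listed high-risk use case
    exception_applies: bool            # e.g., performs only a narrow procedural or preparatory task

def conformity_assessment_indication(profile: AISystemProfile) -> str:
    if not profile.in_scope_of_aia:
        return "Outside the AIA: no conformity assessment under the Act."
    if profile.safety_component_under_nlf or (
        profile.annex_iii_use_case and not profile.exception_applies
    ):
        return "Likely high-risk: the provider must arrange a conformity assessment."
    return "Likely not high-risk: no CA required, though other AIA obligations may still apply."

print(conformity_assessment_indication(
    AISystemProfile(in_scope_of_aia=True, safety_component_under_nlf=False,
                    annex_iii_use_case=True, exception_applies=False)
))
```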
The CA should be understood as a framework of assessments (both technical and non-technical), requirements, and documentation obligations. The provider should assess whether the AI system poses a high risk and identify both known and potential risks as part of their risk management system. The provider should also ensure that certain requirements are built into the high-risk AI system, such as automatic event recording, human oversight capacity, and transparent operation of the AI system. Additionally, the provider should verify that documentation obligations, including technical documentation, are met.
The Guide highlights ongoing standardization efforts and the role of harmonized standards in streamlining the CA process. Systems developed in the context of regulatory sandboxes or certified under cybersecurity schemes may benefit from a presumption of conformity with certain AIA requirements.
The CA is not a one-off exercise: compliance must be maintained throughout the high-risk AI system’s lifecycle, and providers must establish a monitoring system that enables them to verify that the essential requirements continue to be met.
You can also view the previous version of the Conformity Assessment Guide here.