Brazil’s ANPD Preliminary Study on Generative AI highlights the dual nature of data protection law: balancing rights with technological innovation
Brazil’s Autoridade Nacional de Proteção de Dados (“ANPD”) Technology and Research Unit (“CGTP”) released the preliminary study Inteligência Artificial Generativa (“Preliminary Study on Gen-AI,” in Portuguese) as part of its Technological Radar series on November 29, 2024. The agency also released a short English version of the study in December 2024.1 The analysis aims to provide developers, processing agents, and data subjects with information on the potential benefits and challenges of generative AI in relation to the processing of personal information under existing data protection rules. The increasing worldwide use and popularity of generative AI systems have led global regulators to specifically examine these technologies’ impact on privacy and data protection.2
This study does not offer official guidance for compliance purposes; instead, it sets out the ANPD’s preliminary and general approach to generative AI by exploring essential concepts related to these systems, identifying new and existing privacy and data protection risks, and establishing their relationship with the data protection principles of the Lei Geral de Proteção de Dados (“LGPD”), Brazil’s national data protection law. This blog spotlights the main findings of the study regarding the development and use of generative artificial intelligence systems within the Brazilian context.
Balancing Rights with Technological Innovation: An LGPD Commitment
The study acknowledges the relevance of balancing rights with technological innovation under the Brazilian framework. Article 1 of the LGPD identifies the objective of the law as ensuring the processing of personal data protects the fundamental rights of freedom, privacy, and the free development of personality.3 At the same time, Article 2 of the LGPD recognizes that data protection is “grounded” in economic and technological development and innovation.
The study recognizes that advances in machine learning enable generative AI systems beneficial to key fields, including healthcare, banking, and commerce, and it highlights three use cases likely to produce valuable benefits for Brazilian society. For instance, the Federal Court of Accounts is implementing “ChatTCU,” a generative model to help the Court’s legal team produce, translate, and examine legal texts more efficiently. Munai, a local health tech enterprise, is developing a virtual assistant that will automate the evaluation, interpretation, and application of hospital protocols and support decision-making in the healthcare sector. Finally, Banco do Brasil is developing a Large Language Model (LLM) to help employees provide better customer service experiences. The study also highlights the increasing popularity among Brazilian users of commercially available generative AI systems such as OpenAI’s ChatGPT and Google’s Gemini.
In this context, the study emphasizes that while generative AI systems can produce multiple benefits, it is necessary to assess their potential for creating new privacy risks and exacerbating existing ones. For the ANPD, “the generative approach is distinct from other artificial intelligence as it possesses the ability to generate content (data) […] which allows the system to learn how to make decisions according to the data uses.”4 Against this backdrop, the CGTP identifies three fundamental characteristics of generative AI systems that are relevant in the context of personal data processing:
- 1. The need for large volumes of personal and non-personal data for system training purposes;
- 2. The capability of inference that allows the generation of new data similar to the training data; and
- 3. The adoption of a diverse set of computational techniques, such as the transformer architecture for natural language processing systems.5
For instance, the study mentions LLMs as examples of models trained on large volumes of data. LLMs capture semantic and syntactic relationships and are effective at understanding and generating text across different domains. However, they can also generate misleading answers and invent inaccurate content, known as “hallucinations.” Foundational models are another example: trained on diverse datasets, they can perform tasks in multiple domains, often including some for which the model was not explicitly trained.
The document underscores that the technical characteristics and possibilities of generative AI significantly impact the collection, storage, processing, sharing, and deletion of personal data. Therefore, the study holds, LGPD principles and obligations are relevant for data subjects and processing agents using generative AI systems.
Legality of web scraping turns on the fact that the LGPD covers publicly accessible personal data
The study notes that generative AI systems are typically trained with data collected through web scraping. Data scraped from publicly available sources may include identifiable information such as names, addresses, videos, opinions, user preferences, images, or other personal identifiers. Additionally, the absence of thoughtful pre-processing practices in the collection phase (i.e., anonymizing data or collecting only what is necessary) increases the likelihood that personal data, including sensitive data and children’s data, will end up in training datasets.
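To make the pre-processing idea concrete, the sketch below shows one way a collection pipeline could filter and redact likely identifiers before scraped text enters a training corpus. This is a minimal illustration, not a technique described in the ANPD study; the regex patterns (emails and Brazilian CPF numbers) and the document-level drop rule are simplified assumptions for demonstration purposes.

```python
import re

# Illustrative patterns only; real pipelines rely on vetted PII-detection
# tools with far broader coverage (names, addresses, phone numbers, etc.).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
CPF = re.compile(r"\b\d{3}\.\d{3}\.\d{3}-\d{2}\b")  # Brazilian taxpayer ID format

def redact_pii(text: str) -> str:
    """Replace matched identifiers with placeholder tokens before the
    text enters a training corpus."""
    text = EMAIL.sub("[EMAIL]", text)
    text = CPF.sub("[CPF]", text)
    return text

def preprocess(scraped_docs: list[str]) -> list[str]:
    # Data minimization: drop documents dense with identifiers, then
    # redact any residual identifiers in the documents that remain.
    kept = [d for d in scraped_docs
            if len(EMAIL.findall(d)) + len(CPF.findall(d)) < 3]
    return [redact_pii(d) for d in kept]

print(preprocess(["Contact maria@example.com or CPF 123.456.789-09 for details."]))
# -> ['Contact [EMAIL] or CPF [CPF] for details.']
```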
The document emphasizes that the LGPD covers publicly accessible personal data; consequently, processing agents and AI developers must ensure compliance with the law’s principles and obligations. Scraping operations that capture personal data must be based on one of the LGPD’s lawful bases for processing (Articles 7 and 11) and comply with the data protection principles of good faith, purpose limitation, adequacy, and necessity (Article 7, par. 3).
Moreover, the study warns that web scraping reduces data subjects’ control over their personal information. According to the CGTP, users generally remain unaware of web scraping involving their information and how developers may use their data to train generative AI systems. In some cases, scraping can result in a data subject’s loss of control over personal information after the user deletes or requests deletion of their data from a website, as prior scraping and data aggregation may have captured the data and made it available in open repositories.
Allocation of responsibility depends on patterns of data sharing and hallucinations
The ANPD also takes note of the processing of personal data during several stages of the life cycle of generative AI systems, from development to refinement of models. The study explains that generative AI’s ability to generate synthetic content extends beyond basic processing and encompasses continuous learning and modeling based on the ingested training data. Although the training data may be hidden through mathematical processes during training, the CGTP warns that vulnerabilities in the system, such as model inversion or membership inference attacks, could expose the personal data of individuals included in training datasets.
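The intuition behind one of these attacks can be shown in a few lines. The sketch below implements the classic loss-threshold form of membership inference, in which an attacker flags a candidate text as a likely training-set member when the model assigns it unusually low loss; the per-token log-probabilities and threshold are invented values for illustration, not drawn from the study.

```python
def sequence_loss(token_log_probs: list[float]) -> float:
    """Average negative log-likelihood the model assigns to a candidate
    text; lower loss means the model finds the text less 'surprising'."""
    return -sum(token_log_probs) / len(token_log_probs)

def likely_training_member(token_log_probs: list[float], threshold: float) -> bool:
    """Loss-threshold membership inference: flag the candidate as a likely
    training-set member when the model's loss on it is unusually low."""
    return sequence_loss(token_log_probs) < threshold

# Hypothetical per-token log-probabilities returned by a model under attack.
memorized_text = [-0.05, -0.1, -0.2, -0.1]  # model is suspiciously confident
unseen_text = [-2.3, -3.1, -2.8, -2.5]      # model is far less confident

THRESHOLD = 1.0  # calibrated on texts known to be outside the training set
print(likely_training_member(memorized_text, THRESHOLD))  # True
print(likely_training_member(unseen_text, THRESHOLD))     # False
```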
Furthermore, generative AI systems allow users to interact with models using natural language. Depending on the prompt, context, and information provided by the user, these interactions may generate outputs containing personal data about the user or other individuals. A notable challenge, according to the study, is to allocate responsibility in scenarios where (i) personal data is generated and shared with third parties, even if a model was not specifically trained for that purpose; and (ii) a model creates a hallucination: false, harmful, or erroneous statements about a person’s life, dignity, or reputation, harming the subject’s right to free development of personality.
The study identifies three example scenarios in which personal data sharing can occur in the context of generative AI systems:
- 1. Users sharing personal data through prompts
This type of sharing occurs when users input prompts, which may carry information in diverse formats such as text, audio, and images, all of which may contain personal, confidential, and sensitive data. In some instances, users may not be aware of the risks involved in sharing personal information or, if aware, they might choose to “trust the system” to get the answers and assistance they need. In this scenario, the CGTP points out that safeguards should be developed to create privacy-friendly systems. One way to achieve this is to provide users with clear and easily accessible information about the use of prompts and the processing of personal data by generative AI tools (a minimal sketch of one such safeguard follows this list).
The study highlights that users who share the personal data of other individuals through prompts may be considered processing agents under the LGPD and consequently be subject to its obligations and sanctioning regime. Nonetheless, the CGTP cautions that shifting responsibility exclusively to users is not sufficient to safeguard personal data or privacy in the context of generative AI.
- 2. Sharing AI-generated outputs containing personal data with third parties
Under this scenario, output or AI-generated content contains personal data, which could be shared with third parties. The CGTP notes this presents the risk of the personal data being used for secondary purposes that are unknown to the initial user and that the AI developer is unlikely to control. Similar to the previous scenario and data processing activities in general, the study notes the relevance of establishing a “chain of responsibility” among the different agents involved to ensure compliance with the LGPD.
- 3. Sharing pre-trained models containing personal data
A third scenario is sharing a pre-trained model itself, and consequently, the personal data present in the model. According to the CGTP, “since pre-trained models can be considered a reflection of the database used for training, the popularization of the creation of APIs (Application Programming Interfaces) that adopt foundational models such as pre-trained LLMs, brings a new challenge. Sharing models tends to involve the data that is mathematically present in them”6 (translated from the Portuguese study). Pre-trained models, which contain a reflection of the training data, make it possible to adjust the foundational model for a specific use or domain.
The CGTP cautions that the possibility of refining a model via the results obtained through prompt interaction may allow for a “continuous cycle of processing” of personal data.7 According to the technical unit, “the sharing of foundational models that have been trained with personal data, as well as the use of this data for refinement, may involve risks related to data protection depending on the purpose.”8
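Returning to the safeguard mentioned in the first scenario, the sketch below illustrates one way a prompt could be screened for likely identifiers, with the user notified, before the text reaches a model. It is a hypothetical illustration: the `generate` function is a stand-in for any generative AI API, and the regex patterns are simplified assumptions rather than production-grade PII detection.

```python
import re

# Simplified patterns; production systems use dedicated PII-detection tools.
PATTERNS = {
    "[EMAIL]": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "[PHONE]": re.compile(r"\+?\d{2}[\s-]?\(?\d{2}\)?[\s-]?\d{4,5}-?\d{4}"),
}

def redact_prompt(prompt: str) -> tuple[str, bool]:
    """Strip likely identifiers from a prompt before it reaches the model;
    return the cleaned prompt and whether anything was removed."""
    redacted = False
    for token, pattern in PATTERNS.items():
        prompt, n = pattern.subn(token, prompt)
        redacted = redacted or n > 0
    return prompt, redacted

def generate(prompt: str) -> str:
    # Stand-in for a call to any generative AI API.
    return f"(model output for: {prompt})"

def guarded_generate(prompt: str) -> str:
    clean, redacted = redact_prompt(prompt)
    if redacted:
        # The transparency the study calls for: tell users what happened.
        print("Notice: personal identifiers were removed before processing.")
    return generate(clean)

print(guarded_generate(
    "Summarize the complaint filed by joao@example.com, tel. +55 (11) 91234-5678."))
```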
Relatedly, the document highlights the relevance of the right to delete personal data in the context of generative AI systems. The study emphasizes that the processing of personal data can occur throughout diverse stages of the AI’s lifecycle, including the generation of synthetic content, prompt interaction (which allows new data to be shared), and the continuous refinement of the model. In this context, the study points out that this continuous processing of personal data presents significant challenges in (i) delimiting the end of the processing period; (ii) determining whether the purpose of the intended processing was achieved; and (iii) assessing the implications of revoking consent, if the processing relied on that basis.
Transparency and Necessity Principles: Essential for Responsible Gen-AI under the LGPD
Some LGPD principles have special relevance for the development and use of generative AI systems. The report takes the view that these systems typically lack detailed technical and non-technical information about the processing of personal data. The CGTP warns that this absence of transparency begins in the pre-training phase and extends to the training and refinement of models. The study suggests developers may fail to inform users about how their personal information could be shared under the three scenarios identified above (prompt use, outputs, or foundational models). As a result, individuals are usually unaware their information is used for generative AI training purposes and are not provided with adequate, clear, and accessible information about other processing operations such as sharing their personal information with third parties.
In this context, the ANPD emphasizes that the transparency principle is especially relevant in the context of the responsible use and development of AI systems. Under the LGPD, this principle requires clear, precise, and easily accessible information about the data processing. The CGTP proposes that the existence and availability of detailed documentation can be a starting point for compliance and can help monitor the development and improvement of generative AI systems.
Similarly, the necessity principle limits data processing to what is strictly required for developing generative AI systems. Under the LGPD, this principle requires the processing to be the minimum required for the accomplishment of its purposes, encompassing relevant, proportional, and non-excessive data. According to the ANPD, AI developers should be thoughtful about the data to be included in their training datasets and make reasonable efforts to limit the amount and type of information to what is necessary for the purposes to be achieved by the system. Determining how to apply this principle to the creation of multipurpose or general-purpose foundational models is an ongoing challenge in the broader data protection space.
Looking Into the Future
The study concludes that generative AI must be developed from an “ethical, legal, and socio-technical” perspective if society is to effectively harness its benefits while limiting the risks it poses. The CGTP acknowledges that generative AI may offer solutions in multiple fields and applications; however, society and regulators must be aware that it may also entail new risks or exacerbate existing ones concerning privacy, data protection, and other freedoms. The CGTP highlights that this first report includes preliminary analysis and that further studies in the field are necessary to guarantee adequate protection of personal data, as well as the trustworthiness of the outputs generated by this technology.
- The ANPD’s “Technological Radar” series addresses “emerging technologies that will impact or are already impacting the national and international scenario of personal data protection,” with an emphasis on the Brazilian context. “The purpose of the series is to aggregate relevant information to the debate on data protection in the country, with educational texts accessible to the general public.” ↩︎
- See, for example, Infocomm Media Development Authority, “Model AI Governance Framework for Generative AI” (May 2024); European Data Protection Supervisor, “First EDPS Orientations for ensuring data protection compliance when using Generative AI systems” (June 2024); Commission nationale de l’informatique et des libertés (CNIL), “AI how-to sheets” (June 2024); UK’s Information Commissioner’s Office, “Information Commissioner’s Office response to the consultation series on generative AI” (December 2024); European Data Protection Board, “Opinion 28/2024 on certain data protection aspects related to the processing of personal data in the context of AI models” (December 2024). ↩︎
- LGPD Article 1, available at http://www.planalto.gov.br/ccivil_03/_ato2015-2018/2018/lei/L13709compilado.htm. ↩︎
- ANPD, Technological Radar, “Generative Artificial Intelligence”, 2024, p. 7. ↩︎
- ANPD, Radar Tecnológico, “Inteligência Artificial Generativa”, 2024, pp. 16-17. ↩︎
- ANPD, Radar Tecnológico, “Inteligência Artificial Generativa”, 2024, pp. 24-25. ↩︎
- Id. ↩︎
- Id. ↩︎