The PrivaSeer Project in 2023: Access to 1.4 million privacy policies in one searchable body of documents
In the summer of 2021, FPF announced our participation in a collaborative project with researchers from the Pennsylvania State University and the University of Michigan to develop and build a searchable database of privacy policies and other privacy-related documents, with the support of the National Science Foundation. This project, PrivaSeer, has since become an evolving, publicly available search engine of more than 1.4 million privacy policies.
PrivaSeer is designed to make privacy policies transparent, discoverable, and searchable, for use by researchers in the privacy field as well as privacy practitioners in the marketplace. PrivaSeer supports searches of a corpus of privacy policies collected from the web at distinct points in time – currently four time stamps. Search results can be filtered by a wide variety of parameters, including the date of the crawl, the publisher’s industry, use of particular tracking technologies, inclusion of relevant regulations, assessment on Flesch-Kincaid Reading Level, and more. The high level of customizable searchability is made possible via NLP techniques designed and implemented by researchers at the Pennsylvania State University and the University of Michigan. The project will continue to add new tranches of policies to the existing corpus on a periodic basis.
Two Project-Related Publications Received “Best Student Paper” Awards This Year
In addition to building the eponymous online tool, the PrivaSeer project grant has supported the publication of a number of papers by researchers involved in the privacy field. First, an effort to systematically identify and discuss issues within the privacy research community titled “Researchers’ Experiences in Analyzing Privacy Policies: Challenges and Opportunities” was presented at the 2023 Privacy Enhancing Technologies Symposium held in Lausanne, Switzerland by lead author Abraham Mhaidli, one of PrivaSeer’s graduate researchers from the University of Michigan. The paper was selected as one of the winners of the Symposium’s Andreas Pfitzmann Best Student Paper Award.
The paper was based on semi-structured interviews conducted with 26 researchers from a variety of academic disciplines working in the privacy space, and investigated what common research practices and pitfalls might exist in the privacy research space. The co-authors identified a lack of consistent, re-usable, well-maintained tools as one of the major obstacles to ongoing privacy research, resulting in significant duplication of effort among the research community, and noted the difficulty in fostering interdisciplinary collaboration.
A second paper, “Privacy Now or Never: Large-Scale Extraction and Analysis of Dates in Privacy Policy Text,” was accepted at the 23rd Symposium on Document Engineering (DocEng), hosted in Limerick, Ireland. This paper was presented by PrivaSeer graduate researcher and lead author Mukund Srinath from the Pennsylvania State University, and investigated the degree to which online privacy disclosures comply with annual update requirements across a set of large-scale web crawls containing several million distinct policies. Using a newly developed method for extracting dates from plain-text documents, the researchers discovered that under 40% of public privacy notices contain readable dates, and further, that updates correlated heavily with major changes in the data protection legal landscape, with a significant percentage likely dating to 2018 without subsequent change. The paper’s conclusions point to the significant compliance problem of ensuring that privacy notices are actually kept up-to-date, and suggest that for many data controllers this is not the case, although more recent updates were associated with URLs that saw greater amounts of online traffic.
A third paper, “Privacy Lost and Found: An Investigation at Scale of Web Privacy Policy Availability,” was also accepted at DocEng, and was further selected as the winner of the Best Student Paper Award. This paper presented a large-scale investigation of the availability of privacy policies, seeking to identify and analyze potential reasons for policy unavailability such as dead links, documents with empty content, documents that consist solely of placeholder text, and documents unavailable in the specific languages offered by their respective websites. The paper was also able to offer critical analysis and conclusions regarding privacy notices generally, based on a number of statistical methodologies. Overall, the researchers found that privacy policy URLs were available on only 34% of websites examined, and were able to estimate population parameters for both the total number of English-language privacy documents on the web and for their likely distribution across different commercial sectors. The study was able to further the privacy research community’s understanding of the overall status of English-language privacy policies worldwide, and provide valuable information about the rate and likelihood of users encountering various difficulties in accessing them.
2023 Stakeholders Workshop Provided Valuable Input Into Refining the PrivaSeer Search Engine and Tools
In addition to the publications associated with the PrivaSeer project, on July 25, 2023, the Future of Privacy Forum hosted an interdisciplinary workshop with key stakeholders to present the project to members of the privacy research community in industry and civil society.
July’s workshop featured presentations from FPF’s Vice President for Global Privacy Dr. Gabriela Zanfir-Fortuna, as well as project co-leads Dr. Shomir Wilson, Assistant Professor in the College of Information Sciences and Technology at the Pennsylvania State University, and Dr. Florian Schaub, Associate Professor of Information and of Electrical Engineering and Computer Science at the University of Michigan. Dr. Zanfir-Fortuna provided a practical demonstration of the PrivaSeer tool in action, while Professors Wilson and Schaub provided an overview of PrivaSeer’s development and current functionality.
Presentations by the project’s co-leads were followed by a discussion with various key FPF stakeholders of how the tool may be used and improved as a future resource for researchers and industry professionals. Discussants raised the prospect of using PrivaSeer to research the emergence of specific terms relating to the use of AI/ML technologies in privacy notices, conduct comparative studies of privacy policies presented in multiple languages, and examine how required disclosures related to cross-border data transfers may be changing over time. Participants also discussed how the tool might be useful in assessing privacy-adjacent disclosures such as cookie notices and terms of service, and provided the research team with a wide array of useful feedback as the project progresses into its third year.
PrivaSeer is now a functional, public-facing tool available to the privacy community, both for researchers and for privacy professionals working in public or private-sector compliance. FPF will continue to support the development of new functionality in the tool, and our team looks forward to contributing however we can to the scholarship in this area.