About


The Usable Privacy Policy Project

Natural language privacy policies have become a de facto standard to address expectations of “notice and choice” on the Web. Yet there is ample evidence that users generally do not read these policies, and that those who occasionally do read them struggle to understand what they read. The Usable Privacy Policy Project builds on recent advances in machine learning, natural language processing, privacy preference modeling, crowdsourcing, formal methods, and privacy interfaces to overcome this situation.

You can learn more about the Usable Privacy Policy Project, including our approach, affiliated organizations, publications, and recent news, at www.usableprivacy.org.

Machine-Annotated Privacy Policies

In addition to the human-annotated OPP-115 Privacy Policy Corpus, you can explore more than 7,000 policies annotated automatically by our supervised machine learning technique. We used the human-annotated policies to train and test our classifiers in order to evaluate the extent to which automatic policy analysis can be scaled with minimal human intervention. Beyond classification, many other parts of the policy analysis pipeline are automated, most notably finding privacy policy links on webpages, downloading the policies, and dividing each policy into smaller segments that can be labeled with various data practices.
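To make the automated steps around classification more concrete, here is a minimal sketch of two of them: finding candidate privacy policy links on a page and dividing a policy into segments. It is an illustration only, not the project's actual code. The names `PolicyLinkFinder`, `find_policy_links`, and `split_into_segments` are hypothetical, the "mentions privacy" heuristic and the blank-line segmentation rule are assumptions, and the download step is left out.

```python
import re
from html.parser import HTMLParser


class PolicyLinkFinder(HTMLParser):
    """Collects hrefs of anchors whose link text or URL mentions 'privacy'.

    Assumed heuristic for illustration; the real system may work differently.
    """

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the anchor currently open, if any
        self._text = []    # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            label = "".join(self._text).lower()
            if "privacy" in label or "privacy" in self._href.lower():
                self.links.append(self._href)
            self._href = None


def find_policy_links(html):
    """Return candidate privacy policy links found in an HTML string."""
    finder = PolicyLinkFinder()
    finder.feed(html)
    return finder.links


def split_into_segments(policy_text):
    """Split policy text on blank lines, roughly one paragraph per segment."""
    parts = re.split(r"\n\s*\n", policy_text)
    return [p.strip() for p in parts if p.strip()]
```

Each segment returned by `split_into_segments` would then be passed to a classifier that assigns data practice labels.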

Please note that the automatic link detection, policy download, and machine learning analysis sometimes produce erroneous annotations. This remains an experimental system, and we are keen to collect feedback and further improve our techniques. You may notice that our system omits some labels rather than risk assigning an inaccurate one. In other words, some statements made in a policy may not be reflected in the labels generated by our machine learning techniques. We are aware of this, and are more interested in feedback about labels that correspond to statements not present anywhere in the highlighted text, or about similar errors made by our machine learning techniques (e.g., mislabeling first party collection as third party collection). If you see such errors (we know there are some), please take the time to notify us by email or by using the "Feedback" button.

Finally, we realize that the text downloaded by our system is occasionally not a privacy policy (e.g., our system might mistakenly download Terms of Service). If you find text that is not a privacy policy, we would appreciate it if you could report it with the "Feedback" button as well. Thank you for your assistance!
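One common way to get the "omit rather than risk an inaccurate label" behavior described above is to keep a label only when the classifier's confidence clears a cutoff. The sketch below shows that idea; it is an assumption about how such behavior could be obtained, not a description of our actual implementation. The function name `select_labels`, the 0.8 threshold, and the scores in the usage example are all hypothetical.

```python
def select_labels(scores, threshold=0.8):
    """Keep only labels whose confidence score meets the threshold.

    `scores` maps a label name to a confidence in [0, 1]. Labels below
    the (assumed) threshold are omitted, trading missed labels for
    fewer incorrect ones.
    """
    return sorted(label for label, p in scores.items() if p >= threshold)


# Hypothetical scores for one policy segment.
segment_scores = {
    "First Party Collection/Use": 0.93,
    "Third Party Sharing/Collection": 0.55,
    "Data Retention": 0.81,
}
kept = select_labels(segment_scores)
```

With these made-up scores, the third party label is dropped because its confidence is below the cutoff. Raising the threshold omits more labels, and lowering it admits more labels at a higher risk of error.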

People interested in additional details on the performance of our machine learning techniques are invited to read the following two articles:

  • Towards Automatic Classification of Privacy Policy Text. Frederick Liu, Shomir Wilson, Peter Story, Sebastian Zimmeck, and Norman Sadeh. Carnegie Mellon University Technical Report CMU‐ISR‐17‐118 and CMU‐LTI‐17‐010, Institute for Software Research and Language Technologies Institute, School of Computer Science, Dec 2017.
  • Identifying the Provision of Choices in Privacy Policy Text. Kanthashree Mysore Sathyendra, Shomir Wilson, Florian Schaub, Sebastian Zimmeck, and Norman Sadeh. Conference on Empirical Methods in Natural Language Processing (EMNLP), Copenhagen, Denmark, Sep 2017.

The techniques used to generate the annotations you see are slightly improved versions of the techniques described in these two papers. We will continue to improve our techniques over time and will try to incorporate the feedback you provide.

Contact Us

If you have questions concerning our project or are interested in collaborating with us, please contact the project’s lead principal investigator, Prof. Norman Sadeh. To receive news and updates about the project, please subscribe to our public mailing list.

Sponsor

This project is a Frontier project funded by the National Science Foundation under its Secure and Trustworthy Computing initiative (CNS-1330596).
