TPDP 2024 Talk Notes
This page provides notes for my TPDP 2024 talk, "Data and Privacy in Data Privacy".
Training Data Attribution
Short Summary
Differential privacy, membership inference, and training data attribution all reason about "counterfactual worlds" in which certain examples are included in or excluded from the training data. Right now there is little overlap between these areas, in either the people working on them or the techniques they use, and my claim here is that there should be more.
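To make the shared "counterfactual worlds" idea concrete, here is a minimal sketch that trains the same model with and without one example and compares the two outcomes. Everything in it is an illustrative assumption rather than something from the talk: ridge regression stands in for the training algorithm, the data is synthetic, and the function names (`fit_ridge`, `leave_one_out_effect`) are hypothetical.

```python
import numpy as np

def fit_ridge(X, y, reg=1e-2):
    """Closed-form ridge regression; a stand-in for 'the training algorithm'."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ y)

def squared_loss(w, x, y):
    return float((x @ w - y) ** 2)

def leave_one_out_effect(X, y, i, x_eval, y_eval, reg=1e-2):
    """Loss on (x_eval, y_eval) in the world without example i,
    minus the loss in the world with example i."""
    w_in = fit_ridge(X, y, reg)               # world where example i is included
    mask = np.arange(len(y)) != i
    w_out = fit_ridge(X[mask], y[mask], reg)  # world where example i is excluded
    return squared_loss(w_out, x_eval, y_eval) - squared_loss(w_in, x_eval, y_eval)

# Synthetic data (assumption: any small regression task would do here).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=50)

# Attribution-style question: how much does training example 0 change
# the loss on a separate evaluation point?
x_eval = rng.normal(size=5)
y_eval = float(x_eval @ w_true)
attribution_score = leave_one_out_effect(X, y, 0, x_eval, y_eval)

# Membership-inference-style question: how much does including example 0
# lower the loss on example 0 itself?
self_effect = leave_one_out_effect(X, y, 0, X[0], y[0])
```

The same pair of worlds serves both questions: the attribution view evaluates on a different point, while the membership view evaluates on the held-out example itself. Differential privacy asks that the outputs of a randomized training algorithm in the two worlds be hard to tell apart, which this deterministic sketch does not attempt to provide.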
Reading List
Open Questions/Directions
Data Curation
Short Summary
Data curation has become very important for training state-of-the-art models. However, its implications and opportunities for privacy have received limited investigation.
Reading List
Open Questions/Directions
- Does curating private data help differentially private training? Or does having more data always help?
- What is the best way to do pretraining data selection for a private downstream task?
- Are there other data curation algorithms with negative privacy/security implications?
Privacy Semantics
Short Summary
The ML privacy literature has begun to consider different "privacy semantics". These often rely more on access-control-type approaches than on differential privacy, but the DP community's experience thinking about privacy may be helpful here.
Reading List
- Contextual integrity in LLMs: Can LLMs Keep a Secret?, Contextual Integrity in Privacy-Conscious Assistants
- Contextual integrity + DP
- Retrieval-augmented LLMs (with applications to machine unlearning): SILO Language Models. The relevant research direction here is called retrieval-augmented generation (RAG). The backbone retrieval-augmentation technique for SILO is kNN-LM. Some other RAG flavors include RETRO and in-context RAG (e.g. a, b, c). Most discussion of RAG these days refers to in-context RAG techniques of some form. Language models for search, such as Google Search Generative Experience, the Bing Chatbot, or Perplexity, can be seen as a form of in-context RAG, where the "retriever" is the search engine itself!
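As a rough illustration of the kNN-LM idea mentioned above, the sketch below interpolates a parametric LM's next-token distribution with a distribution built from the nearest neighbors in a datastore of (context embedding, next token) pairs. It is a toy under stated assumptions, not the SILO or kNN-LM implementation: the embeddings, vocabulary size, and the function name `knn_lm_next_token_probs` are made up for the example, and the parameter values for `k`, `lam`, and `temperature` are arbitrary.

```python
import numpy as np

def knn_lm_next_token_probs(query, keys, values, p_lm, k=4, lam=0.25, temperature=1.0):
    """Return lam * p_kNN + (1 - lam) * p_LM for one context.

    query:  (d,) embedding of the current context
    keys:   (n, d) embeddings of stored contexts
    values: (n,) integer token ids that followed each stored context
    p_lm:   (vocab,) next-token distribution from the parametric LM
    """
    if len(keys) == 0:
        return p_lm.copy()                       # empty datastore: fall back to the LM
    dists = np.sum((keys - query) ** 2, axis=1)  # squared L2 distance to every key
    nearest = np.argsort(dists)[: min(k, len(keys))]
    logits = -dists[nearest] / temperature
    weights = np.exp(logits - logits.max())      # softmax over negative distances
    weights /= weights.sum()
    p_knn = np.zeros_like(p_lm)
    np.add.at(p_knn, values[nearest], weights)   # sum weights of neighbors sharing a token
    return lam * p_knn + (1.0 - lam) * p_lm

# Toy setup: 5-token vocabulary, uniform parametric LM, 3 datastore entries.
p_lm = np.full(5, 0.2)
keys = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
values = np.array([2, 3, 4])
query = np.array([0.0, 0.0])

p_before = knn_lm_next_token_probs(query, keys, values, p_lm, k=2)

# "Unlearning" by deletion: drop datastore entry 0 and query again.
p_after = knn_lm_next_token_probs(query, keys[1:], values[1:], p_lm, k=2)
```

In this toy, deleting the datastore entry removes its contribution to the kNN term, so the probability of its token falls back to what the parametric LM alone provides (scaled by `1 - lam`). This only covers data held in the datastore; anything the parametric model was trained on is unaffected by the deletion.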
Open Questions/Directions
- Are there applications/threat models where combining some of these different privacy semantics (including DP) makes sense?
- Are there interesting attacks on these systems?