NYC Privacy Day Fall 2024 Talk Notes

This page provides notes for my NYC Privacy Day Fall 2024 talk called "Is Memorization Membership?".

Memorization and Membership Inference

Membership inference is a privacy attack that asks, of a given data example, “was this example used in training?” The concrete risk of membership inference is often motivated with medical use cases: if I learn that someone’s data appeared in a drug trial for a specific disease, that membership reveals that they have the disease. See Homer et al. 2008, Shokri et al. 2016, Yeom et al. 2017.
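As a concrete (and deliberately simple) illustration, here is a sketch of a loss-threshold membership test in the spirit of Yeom et al. 2017; `model_loss_fn` is a hypothetical wrapper around the target model, and the threshold is calibrated on data known to be non-members.

```python
import numpy as np

def calibrate_threshold(model_loss_fn, known_non_members, fpr=0.05):
    """Pick a loss threshold that flags roughly `fpr` of known non-members."""
    losses = np.array([model_loss_fn(x) for x in known_non_members])
    return np.quantile(losses, fpr)

def loss_threshold_mia(model_loss_fn, example, threshold):
    """Predict 'member' if the model's loss on the example is below the threshold.

    `model_loss_fn` is a hypothetical callable returning the target model's loss
    on one example; low loss is (weak) evidence the example was trained on.
    """
    return model_loss_fn(example) < threshold
```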

Memorization is an overloaded word, which I use here to refer to a generative model’s “vulnerability to training data extraction attacks”, attacks which recover individual training records from a model. There are several papers that propose definitions capturing the “success” of such an attack, and this talk is agnostic to the specific definition (see Carlini et al. 2020, Carlini et al. 2023, Nasr et al. 2023, Schwarzschild et al. 2024, but not Zhang et al. 2021).
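To make “extraction” concrete, here is a minimal sketch of a prefix-prompted extraction check, loosely following the setup of Carlini et al. 2020; `generate_fn` is a hypothetical wrapper around the target model, and the training corpus is used here only to score the attack (in a real attack it is, of course, unknown).

```python
def extraction_attack(generate_fn, prefixes, training_corpus, min_match_chars=50):
    """Prompt the model with prefixes and flag continuations that reproduce
    a long verbatim substring of the training corpus."""
    extracted = []
    for prefix in prefixes:
        continuation = generate_fn(prefix)  # hypothetical model wrapper
        for start in range(max(1, len(continuation) - min_match_chars + 1)):
            chunk = continuation[start:start + min_match_chars]
            if len(chunk) == min_match_chars and chunk in training_corpus:
                extracted.append((prefix, chunk))
                break
    return extracted
```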

Intuitively, these are similar threats: both involve leakage of training data from a model, and, on a technical level, membership inference is a common subroutine in training data extraction (e.g. Carlini et al. 2020, Carlini et al. 2023). However, the goal of this talk is to “separate” membership inference and training data extraction attacks. There are properties of datasets and model training that impact the two attacks differently, and understanding these differences is important for how we should measure and mitigate the privacy risks of training.

When memorization and membership inference don’t agree

The three main properties I’ve seen separate extraction and membership inference are 1) duplication, 2) poisoning, and 3) training data order. Both duplication and poisoning are examples of membership inference being more effective on “outlier” data, while extraction is more effective on “inlier” data.

Duplication

Duplication is when different examples in the training set are nearly identical. This is intentionally imprecise - there are many different ways of deciding when two examples are “nearly identical”. Note: in my talk, this is the only property I had time for!
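For concreteness, one simple (and arbitrary) near-duplicate test is Jaccard similarity over character n-grams; both the n-gram length and the threshold below are assumptions I’m making for the sketch, which is exactly the imprecision noted above.

```python
def char_ngrams(text, n=8):
    """Set of character n-grams of a string."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def near_duplicate(a, b, n=8, threshold=0.8):
    """Call two examples 'nearly identical' if their character n-gram sets
    have Jaccard similarity above a (fairly arbitrary) threshold."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return a == b
    return len(grams_a & grams_b) / len(grams_a | grams_b) >= threshold
```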

Duplication and training data extraction. It is by now well known that training data extraction is easier for duplicated data. See Lee et al. 2021, Kandpal et al. 2022, Carlini et al. 2023.

Duplication and membership inference. Duplication has the opposite impact on membership inference. If there exist two examples which are near duplicates of each other, they likely have very similar impacts on training, and so it is hard to tell a model trained on one from a model trained on the other. See Carlini et al. 2022, Duan et al. 2024, Zhang et al. 2024.

Data Poisoning

Data poisoning is when an adversary adds carefully chosen data to a training set in order to modify the trained model’s behavior. The observations here mainly come from Tramer et al. 2022, Chen et al. 2022, and Carlini et al. 2022.

Poisoning and membership inference. Poisoning attacks can amplify membership inference: these attacks generally force an example to “become an outlier” by adding similar but mislabeled examples.
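A toy sketch of this idea, loosely in the spirit of Tramer et al. 2022 (all names and parameters here are hypothetical; real attacks are more careful about how the poisons are constructed):

```python
import random

def craft_mia_poisons(target_example, target_label, num_labels, copies=8):
    """Return copies of the target input paired with deliberately wrong labels.

    Once added to the training set, these poisons make the target's true label
    look like an outlier relative to its mislabeled duplicates, which makes a
    simple loss-based membership test on the target much more reliable.
    """
    wrong_labels = [l for l in range(num_labels) if l != target_label]
    return [(target_example, random.choice(wrong_labels)) for _ in range(copies)]
```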

Poisoning and training data extraction. Adding mislabeled examples substantially degrades the model’s ability to correctly predict or generate the targeted example, so the improvement in membership inference comes at the cost of a worse ability to extract information.

Training Data Order

Large models now train for a small number of epochs, meaning some examples may be seen only at the start of training, or only at the end. This may change how vulnerable those examples are to privacy attacks. This line of investigation derives from work in differential privacy (Feldman et al. 2018).

Training order and membership inference. Examples seen early in training are often less vulnerable to membership inference than more recently seen examples. This holds for specific differentially private training algorithms (Feldman et al. 2018) as well as empirically for standard model training (Jagielski et al. 2022).
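One way to see this empirically, sketched under the assumption that we log, for each training example, its final loss and the step at which it was last seen (both arrays are hypothetical inputs):

```python
import numpy as np

def mean_loss_by_training_position(final_losses, last_seen_step, num_bins=10):
    """Bin member examples by when they were last seen during training and
    report the mean final loss per bin. Under a simple loss-based membership
    test, lower mean loss roughly means more vulnerable; the expectation from
    Jagielski et al. 2022 is that earlier-seen bins look less vulnerable."""
    final_losses = np.asarray(final_losses, dtype=float)
    last_seen_step = np.asarray(last_seen_step, dtype=float)
    edges = np.linspace(last_seen_step.min(), last_seen_step.max(), num_bins + 1)
    bin_ids = np.clip(np.digitize(last_seen_step, edges) - 1, 0, num_bins - 1)
    return {b: final_losses[bin_ids == b].mean()
            for b in range(num_bins) if np.any(bin_ids == b)}
```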

Training order and extraction. Interestingly, training data order doesn’t seem to impact extraction as significantly as membership inference (see Tirumala et al. 2022 and Biderman et al. 2023).

So what?

When I look at these separations, what I see are two important takeaways:

Neither Attack Directly Measures “Privacy”

If we look at the impact of duplication on our attacks, training data extraction starts to look too expansive: because of duplication on the internet, it often catches things which are not interesting cases of privacy leakage (e.g. code licenses, the Bible, lists of numbers). Membership inference, on the other hand, seems to cast too narrow a net, ignoring the impact of duplication.

For measuring privacy risk, we’d like something closer to understanding how much a model learns about what is *unique* about a *person*, catching things like Ann Graham Lotz or personal information about an individual. In general, though, this is a hard problem. Attempts have been made, such as by identifying PII (see Lukas et al. 2023 or recent memorization evaluations for the Gemma models), or by inserting canary data (see Carlini et al. 2018 and more recently Zhang et al. 2024). Both of these are nice (of course they have drawbacks), but one big omission is unstructured private information, which is generally harder to annotate (I’m optimistic, though, since datasets are out there: TOFU, MedQA/WildChat).
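As a reminder of how canaries quantify leakage, here is the exposure metric from Carlini et al. 2018 in miniature (the numbers in the usage example are made up):

```python
import math

def exposure(canary_rank, candidate_space_size):
    """Secret Sharer-style exposure: log2(|candidate space|) - log2(rank),
    where rank is the canary's position (1 = most likely) when all candidate
    secrets are sorted by the model's perplexity."""
    return math.log2(candidate_space_size) - math.log2(canary_rank)

# A canary drawn from 10**9 possible secrets that the model ranks 3rd most
# likely has exposure of about 28.3 bits: strong evidence of memorization.
print(exposure(canary_rank=3, candidate_space_size=10**9))
```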

Preventing Either Attack Doesn’t Necessarily Improve “Privacy”

Similar to measurement, mitigation is challenging. Differential privacy is the go-to for preventing privacy attacks, but of course it 1) struggles with utility, and 2) degrades with duplication. Deduplication is also common, but will not generally capture the diverse ways of presenting the same sensitive information. I’d be really curious to see anyone tackle the problem of plugging the holes in either approach (perhaps through better annotation of privacy units).
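To make the “degrades with duplication” point concrete, here is a back-of-envelope sketch using a standard group-privacy bound (the exact constants depend on which bound you use, so treat this as illustrative rather than tight):

```python
import math

def group_privacy(eps, delta, k):
    """If a mechanism is (eps, delta)-DP per record and one person's
    information appears in k (near-)duplicate records, a standard group
    privacy bound only guarantees roughly (k*eps, k*exp((k-1)*eps)*delta)
    for that person."""
    return k * eps, k * math.exp((k - 1) * eps) * delta

# A comfortable per-record eps=1 becomes eps=10 for a person whose data is
# duplicated 10 times.
print(group_privacy(eps=1.0, delta=1e-6, k=10))
```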