Differential Privacy

Differential privacy is a precise, mathematical definition of what it means to protect an individual’s privacy when publishing statistics or training a model on their data. Its promise is this: the result of an analysis should look essentially the same whether or not any single person’s record is included. If no one can tell from the output whether you were in the dataset, then the output cannot have revealed much about you specifically. The definition was introduced in the 2006 paper “Calibrating Noise to Sensitivity in Private Data Analysis” by Cynthia Dwork and colleagues, and consolidated in Dwork and Roth’s 2014 monograph “The Algorithmic Foundations of Differential Privacy.”

The guarantee is achieved by deliberately adding a calibrated amount of random noise to results. How much noise depends on the query’s sensitivity, how much one person’s record could change the answer, and on a parameter usually written as epsilon, called the privacy budget. A small epsilon means more noise and stronger privacy; a larger epsilon means less noise and more accuracy. Every query against the data spends some of this budget, and the budget bounds how much can ever be learned about any individual across all the analyses performed. Two standard tools implement this: the Laplace mechanism and the Gaussian mechanism, which add noise of specific shapes.

What sets differential privacy apart from older approaches like removing names or k-anonymity is that it makes no assumptions about what an attacker already knows. Earlier methods were repeatedly defeated by re-identification, where outside information was used to re-link “anonymous” records to real people. Differential privacy’s guarantee holds regardless of the attacker’s side information, which is why it became the gold standard.

For a business reader, differential privacy is the only privacy technique that comes with a provable, quantifiable guarantee you can show a regulator or customer rather than merely assert. Its honest cost is accuracy: the protective noise makes results less precise, and stronger guarantees cost more accuracy, so deploying it is always a deliberate trade between privacy and utility. Apple, Google, and the U.S. Census Bureau all use it in production.

Sources

Related