The 'anonymous' Netflix dataset was not anonymous

When Netflix released its prize dataset in 2006, it stripped out names and other obvious identifiers, presenting roughly 100 million ratings as anonymous. Researchers Arvind Narayanan and Vitaly Shmatikov set out to test that assumption. In their paper “Robust De-anonymization of Large Sparse Datasets,” they present “a new class of statistical de-anonymization attacks against high-dimensional micro-data, such as individual preferences, recommendations, transaction records,” and apply the method “to the Netflix Prize dataset, which contains anonymous movie ratings of 500,000 subscribers.”

The attack worked by exploiting how unique a person’s pattern of preferences really is. Knowing just a few of someone’s movie ratings and roughly when they made them, the researchers showed they could pick that individual out of the anonymized data with high confidence. They demonstrated this by cross-referencing Netflix records with publicly visible ratings on the Internet Movie Database, linking supposedly anonymous Netflix users to named IMDb accounts, and from there to potentially sensitive inferences about their tastes.
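The idea in the paragraph above can be illustrated with a small Python sketch. It is a simplified illustration, not the algorithm from the paper: the function names, the tolerances (one star, 14 days), the `1 / log(1 + support)` rarity weight, and the `min_gap` threshold are all assumptions chosen for this example, and the toy records are invented. Each candidate record is scored by how many of the attacker's known ratings it approximately matches, with rarely rated movies counting for more. A match is reported only when the best score stands well clear of the runner-up.

```python
from datetime import date
from math import log
from statistics import pstdev

def score(aux, record, support, rating_tol=1, day_tol=14):
    """Sum rarity weights over known ratings that approximately match a record."""
    total = 0.0
    for movie, (rating, when) in aux.items():
        if movie not in record:
            continue
        rec_rating, rec_when = record[movie]
        close_rating = abs(rating - rec_rating) <= rating_tol
        close_date = abs((when - rec_when).days) <= day_tol
        if close_rating and close_date:
            # Assumed weighting: movies rated by few people identify more.
            total += 1.0 / log(1 + support[movie])
    return total

def best_match(aux, dataset, min_gap=1.5):
    """Return the stand-out record id, or None if the match is ambiguous."""
    # Count how many records rated each movie.
    support = {}
    for record in dataset.values():
        for movie in record:
            support[movie] = support.get(movie, 0) + 1

    scores = {uid: score(aux, rec, support) for uid, rec in dataset.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    if len(ranked) < 2:
        return None

    # Require the top score to beat the runner-up by a margin measured
    # in standard deviations of all scores (hypothetical threshold).
    spread = pstdev(scores.values())
    gap = scores[ranked[0]] - scores[ranked[1]]
    if spread == 0 or gap / spread < min_gap:
        return None
    return ranked[0]

# Invented toy data: record id -> {movie: (stars, rating date)}
dataset = {
    "u1": {"Blockbuster": (4, date(2005, 3, 1)),
           "Obscure Documentary": (5, date(2005, 3, 3)),
           "Cult Film": (2, date(2005, 4, 10))},
    "u2": {"Blockbuster": (4, date(2005, 3, 2)),
           "Romcom": (3, date(2005, 5, 1))},
    "u3": {"Blockbuster": (5, date(2005, 6, 1)),
           "Romcom": (4, date(2005, 6, 2))},
}

# What an attacker might learn from a public profile: approximate and partly wrong.
aux = {"Blockbuster": (4, date(2005, 3, 5)),
       "Obscure Documentary": (5, date(2005, 3, 1)),
       "Cult Film": (3, date(2005, 4, 12))}

print(best_match(aux, dataset))  # u1
```

In this toy data, a single rating of the popular title matches two records equally well, so `best_match` returns `None`. It is the handful of less common titles that turns a vague overlap into a confident link.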

The fallout was real. A lawsuit was filed against Netflix, and the Federal Trade Commission raised privacy questions about a planned second contest, which would have included demographic data. In March 2010, Netflix announced that it had settled the lawsuit and would not run the sequel it had previously promised.

The lesson, repeated many times since, is that anonymized data often is not actually anonymous. Removing names is not enough when the remaining data is rich and unique enough to act as a fingerprint. For any business sitting on detailed behavioral data, the Netflix case is the canonical warning that “anonymized” is a claim that must be tested, not assumed.
