Researchers from the Department of Computer Sciences at the University of Texas at Austin say they can reverse the anonymization of Netflix’s data (which was released to the public as part of a contest to see if someone could design a better rating system) by comparing it to only a few ratings on IMDb. The result? Specific users can be identified and linked to their (ostensibly) private ratings.
“Releasing the data and just removing the names does nothing for privacy,” Shmatikov told SecurityFocus. “If you know their name and a few records, then you can identify that person in the other (private) database.”
While Netflix’s dataset did not include names, instead using an anonymous identifier for each user, the collection of movie ratings — combined with a public database of ratings — is enough to identify the people, the researchers argued in a paper published soon after Netflix released the data, but which only recently came to light. Narayanan and Shmatikov demonstrated the danger by using public reviews published by a “few dozen” people in the Internet Movie Database (IMDb) to identify movie ratings of two of the users in Netflix’s data.
Exposing movie ratings that the reviewer thought were private could expose significant details about the person. For example, the researchers found that one of the people had strong — ostensibly private — opinions about some liberal and gay-themed films and also had ratings for some religious films.
More generally, the research demonstrated that information that a person believes to be benign could be used to identify them in other private databases.
Scary, scary, scary, scary, scary.
From the research paper:
Does privacy of Netflix ratings matter? The privacy question is not “Does the average Netflix subscriber care about the privacy of his movie viewing history?,” but “Are there any Netflix subscribers whose privacy can be compromised by analyzing the Netflix Prize dataset?” The answer to the latter question is, undoubtedly, yes. As shown by our experiments with cross-correlating non-anonymous records from the Internet Movie Database with anonymized Netflix records (see below), it is possible to learn sensitive non-public information about a person’s political or even sexual preferences. We assert that even if the vast majority of Netflix subscribers did not care about the privacy of their movie ratings (which is not obvious by any means), our analysis would still indicate serious privacy issues with the Netflix Prize dataset.
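To make the cross-correlation idea concrete, here is a toy sketch in Python. It is not the scoring method from the paper; it is a simplified illustration that counts near-matching ratings. The function names, thresholds, user IDs, and film titles are all made up for the example.

```python
from typing import Dict, Optional

# Hypothetical record type: movie title -> star rating (1 to 5).
Ratings = Dict[str, int]


def match_score(public: Ratings, anonymous: Ratings, tolerance: int = 1) -> int:
    """Count movies present in both records whose ratings differ by at most `tolerance`."""
    return sum(
        1
        for movie, stars in public.items()
        if movie in anonymous and abs(anonymous[movie] - stars) <= tolerance
    )


def link_record(
    public: Ratings,
    dataset: Dict[str, Ratings],
    min_score: int = 3,   # assumed threshold: minimum number of near-matching ratings
    min_margin: int = 2,  # assumed threshold: required lead over the runner-up
) -> Optional[str]:
    """Return the anonymous ID that best matches `public`, or None if the match is weak or ambiguous."""
    scored = sorted(
        ((match_score(public, ratings), user_id) for user_id, ratings in dataset.items()),
        reverse=True,
    )
    if not scored:
        return None
    best_score, best_id = scored[0]
    runner_up = scored[1][0] if len(scored) > 1 else 0
    if best_score >= min_score and best_score - runner_up >= min_margin:
        return best_id
    return None


# Invented "anonymized" dataset: names removed, opaque IDs kept.
anonymized = {
    "user_1041": {"Film A": 5, "Film B": 2, "Film C": 4, "Film D": 1, "Film E": 5},
    "user_2277": {"Film A": 3, "Film C": 1, "Film F": 4},
    "user_3150": {"Film B": 5, "Film D": 5, "Film G": 2},
}

# Invented public profile: a handful of ratings posted under a real name.
public_profile = {"Film A": 5, "Film B": 2, "Film C": 4, "Film D": 1}

print(link_record(public_profile, anonymized))  # prints "user_1041"
```

Once the public profile is tied to `user_1041`, every other rating in that record (here, "Film E") is attached to a name as well. That is the leak the researchers are describing: a few ratings someone chose to publish are enough to expose the ones they did not.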