One privacy-preserving mechanism used in data analytics is to anonymize, or de-identify, the data. The intuitive idea is that you can preserve the privacy of the individuals whose data is being used if you remove the information that allows those individuals to be identified. In practice, there is no single widely used method for anonymizing data. Different regulations set different requirements for a data set to be considered de-identified, so if you are to comply with the appropriate laws, it is important to know which regulation governs your work and what it requires for de-identification.
These subjects are discussed in greater detail in our course, but here is a quick primer on different kinds of anonymization and de-identification, organized around three different regulations.

Marina Luisa, the team leader of helloQt, 2023. Courtesy of helloQt.

Directory Information, Health Data, and HIPAA

An early attempt to protect the privacy of research subjects can be found in the Health Insurance Portability and Accountability Act (HIPAA). That regulation states that a data set will be considered anonymized if the directory information about each of the subjects has been removed. Directory information is data such as name, social security number, or address. There are 18 fields that are considered to be directory information; if you remove those from your data set, it can be shared without violating the privacy regulation.
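As a rough illustration of that removal step, the sketch below drops a few directly identifying fields from a record. The field names, the `DIRECT_IDENTIFIERS` set, and the sample record are all hypothetical stand-ins chosen for this example; they are not the regulation's actual list.

```python
# Hypothetical subset of directly identifying fields. The real list should be
# taken from the regulation itself; this set is an assumption for illustration.
DIRECT_IDENTIFIERS = {"name", "ssn", "address"}


def strip_direct_identifiers(record):
    """Return a copy of `record` without the directly identifying fields."""
    return {key: value for key, value in record.items()
            if key not in DIRECT_IDENTIFIERS}


# Invented sample record, for illustration only.
patient = {
    "name": "Alex Example",
    "ssn": "000-00-0000",
    "address": "1 Sample Street",
    "zip": "02138",
    "birthdate": "1950-01-01",
    "gender": "F",
    "diagnosis": "asthma",
}

released = strip_direct_identifiers(patient)
print(released)
```

Note that fields such as ZIP code, birthdate, and gender survive this step, because they are not on the list of fields to remove.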
While removing directory information will keep you within the rules of the regulation, it will not necessarily protect the privacy of the individuals in the data set. This was shown in 1997 when Latanya Sweeney re-identified medical records from which such identifying fields had been removed, including the record of the then-governor of Massachusetts.
Sweeney’s work led to the notion of a quasi-identifier: a piece of information about you (e.g., gender or birthdate) that cannot directly identify you on its own, but that can be combined with other quasi-identifiers so that the combination can be matched in another data set. A problem occurs when this second data set (e.g., a voter registration database) also contains some of your directory information. By linking quasi-identifiers across data sets, an adversary can re-identify a record in a “de-identified” data set and discover whatever personal information (e.g., medical treatments) was meant to be kept anonymous.
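The linkage described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reconstruction of Sweeney's actual method: the choice of quasi-identifiers, the field names, and both data sets are invented for the example.

```python
# Quasi-identifiers assumed to appear in both data sets (hypothetical choice).
QUASI_IDENTIFIERS = ("zip", "birthdate", "gender")

# Invented "de-identified" medical records: direct identifiers already removed.
medical_records = [
    {"zip": "02138", "birthdate": "1950-01-01", "gender": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birthdate": "1962-07-15", "gender": "M", "diagnosis": "diabetes"},
]

# Invented public records (e.g., a voter list) that still carry names.
voter_records = [
    {"name": "Alex Example", "zip": "02138", "birthdate": "1950-01-01", "gender": "F"},
    {"name": "Sam Sample", "zip": "02139", "birthdate": "1962-07-15", "gender": "M"},
    {"name": "Pat Placeholder", "zip": "02139", "birthdate": "1980-03-02", "gender": "F"},
]


def quasi_key(record):
    """Combine a record's quasi-identifier values into one lookup key."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link_records(deidentified, public):
    """Return (name, diagnosis) pairs for records that match exactly one person."""
    # Index the public records by their quasi-identifier combination.
    names_by_key = {}
    for person in public:
        names_by_key.setdefault(quasi_key(person), []).append(person["name"])

    matches = []
    for record in deidentified:
        candidates = names_by_key.get(quasi_key(record), [])
        # Only a unique match re-identifies the record with confidence.
        if len(candidates) == 1:
            matches.append((candidates[0], record["diagnosis"]))
    return matches


reidentified = link_records(medical_records, voter_records)
for name, diagnosis in reidentified:
    print(name, "->", diagnosis)
```

No single field in the medical records names anyone, yet the combination of fields is enough to attach a name to each diagnosis whenever that combination is unique in the public data set.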

K-Anonymity, Educational Data, and FERPA


Differential Privacy, GDPR, and the U.S. Census


Interested in learning more about trending topics in data privacy from helloQt? Visit the Safe&Sound page for our take on ChatGPT, TikTok, and password security, or apply to join the next session of our course, Data Privacy and Technology.

Content is provided for informational purposes only and does not constitute legal advice.