Lecture_04_ReID_Laws.pdf

Full Transcript

Re-identification and laws
ISEC411: Privacy & Anonymity
Course Instructor: Dr. Hanane LAMAAZI

Outline
- Why share data?
- Some definitions
- How to anonymize data?
- Examples of de-identifications that went wrong
With input from: https://www.cs.purdue.edu/homes/ninghui/courses/

Overall picture
Database D containing personal information:
  Name1, DOB, …, v1, v2, …
  Name2, DOB, …, v1, v2, …
  ...
Concern 1: Orchestrate database access (Who can access this database? How much information should they access? Who cannot access the database?)
Concern 2: Sharing data for a vital secondary purpose (research)

Why do we need to share data in general?

The need for sharing data
- For research purposes, e.g., social, medical, technological, etc. Data sharing encourages more connection and collaboration between researchers, which can result in important new findings within the field. It also encourages better transparency and enables reproducibility of results.
- Mandated by laws and regulations, e.g., census (counting the population)
- For business decision making
- Transparency
- Trust
- …
However, publishing data may result in privacy violations.

Example: sharing health data
- Health data sharing has high value to society. Today, the health data that you share is driving the most important advances in medicine.
- Research on health data can provide important information about disease trends and risk factors, outcomes of treatment or public health interventions, and health care costs and use.
- However, we need to preserve the privacy of the individuals contributing the data.

Sharing data
According to your understanding of data privacy, when can we share data?
- If subjects consent
- If data is anonymous

Context
Microdata is information at the level of individual respondents. Example: a national census collects age, address, employment, …. The entries are recorded separately for every person who responds.
Macrodata is used mainly to describe aggregated data (statistical summaries). Examples of aggregate data include summaries of the properties of individuals, unemployment statistics, demographics, GDP, etc.
We will focus on microdata.

Types of Attackers
- Journalist risk (most common, relies on available public databases): the attacker does not have any background knowledge about the subjects that generated the data.
- Prosecutor risk (uses background information to re-identify subjects): the attacker has some background information about the data subjects (example: Japanese have a very low rate of heart attacks; my neighbor went to the hospital on a given day and time).
For this lecture, we focus on journalist risk.

De-identification and re-identification
- What is de-identification? Removing identity from data (in this course, we assume a similar meaning to anonymization).
- What about re-identification? Reassigning the identity to a de-identified record.

How to share data and preserve privacy?
- Anonymize the data (what is anonymity?)
- How? People thought: remove the identity from the data (or de-identify the data), but how?
- First thought: remove “direct identifiers”, i.e., attributes that uniquely (and directly) identify a person. Examples: name, Social Security number, phone number, email, address…
- Is this enough?

Context
- Direct identifiers are removed.
- Assume journalist risk (the attacker has no background knowledge).
- Can a journalist attacker violate the privacy of the subjects in the dataset? In other words, can she re-identify what was de-identified?
Next we will look at examples of de-identifications that removed PIIs, but first we will look at how re-identification is done.

Re-identification Example

AOL Data Release [NYTimes 2006]
- In August 2006, AOL (America Online) released search keywords of 650,000 users over a 3-month period.
- User IDs were replaced by random numbers.
Lee et al.

AOL Data Release [NYTimes 2006]
User No. 4417749 conducted hundreds of searches over a three-month period on topics ranging from “numb fingers” and “60 single men” to “dog that urinates on everything,” along with “landscapers in Lilburn, Ga,” several people with the last name Arnold, and “homes sold in shadow lake subdivision Gwinnett county Georgia.”
It did not take much investigating to follow that data trail to Thelma Arnold, a 62-year-old widow who lives in Lilburn, Ga.
Lee et al.

AOL Data Release [NYTimes 2006]
- In August 2006, AOL released search keywords of 650,000 users over a 3-month period.
- User IDs were replaced by random numbers.
- 3 days later, AOL pulled the data from public access.
AOL searcher #4417749:
  “landscapers in Lilburn, GA”
  queries on last name “Arnold”
  “homes sold in shadow lake subdivision Gwinnett County, GA”
  “numb fingers”
  “60 single men”
  “dog that urinates on everything”
Identified by the NYT as: Thelma Arnold, a 62-year-old widow who lives in Lilburn, GA, has three dogs, and frequently searches her friends’ medical ailments.

AOL Data Release [NYTimes 2006]
“Ms. Arnold, who agreed to discuss her searches with a reporter, said she was shocked to hear that AOL had saved and published three months’ worth of them” (NYT 2006).
“My goodness, it’s my whole personal life,” she said. “I had no idea somebody was looking over my shoulder.”

AOL Data Release [NYTimes 2006] - Outcome
- The CTO of AOL resigned.
- The researcher who released this data and his boss were fired.
- Big embarrassment for AOL.

Question
How did re-identification happen in this case?
- Some direct identifiers were missed (last name): insufficient or wrong de-identification.
- If all identifiers were removed, would we be able to re-identify?
How can re-identification be done?
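
The AOL-style release described above can be sketched in a few lines of Python. This is a toy illustration, not AOL's actual procedure. The queries for `user_a` are the ones quoted in the lecture, while `user_b`, its queries, the `CLUES` list, and the function names `pseudonymize` and `profiles` are all made up for the example. The sketch shows that swapping the user ID for a random number still leaves every query of one person linked together, so the content of the queries can do the identifying.

```python
import random
from collections import defaultdict

# Toy search log: (account name, query). The queries of "user_a" are quoted in
# the lecture; "user_b" and its queries are hypothetical filler.
RAW_LOG = [
    ("user_a", "numb fingers"),
    ("user_a", "60 single men"),
    ("user_b", "cheap flights"),
    ("user_a", "landscapers in Lilburn, Ga"),
    ("user_a", "homes sold in shadow lake subdivision Gwinnett county Georgia"),
    ("user_b", "weather tomorrow"),
    ("user_a", "dog that urinates on everything"),
]


def pseudonymize(log, seed=0):
    """Replace each account name with a random number, consistently per user."""
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    pseudonyms = {}
    released = []
    for user, query in log:
        if user not in pseudonyms:
            # Draw until unused so two users never share a number.
            candidate = rng.randrange(1_000_000, 10_000_000)
            while candidate in pseudonyms.values():
                candidate = rng.randrange(1_000_000, 10_000_000)
            pseudonyms[user] = candidate
        released.append((pseudonyms[user], query))
    return released


def profiles(released):
    """Group released queries by pseudonym, as any reader of the data could."""
    grouped = defaultdict(list)
    for pseudonym, query in released:
        grouped[pseudonym].append(query)
    return dict(grouped)


released = pseudonymize(RAW_LOG)
by_user = profiles(released)

# Hypothetical place-name clues an investigator might search for.
CLUES = ("lilburn", "gwinnett")
for pseudonym, queries in by_user.items():
    hits = [q for q in queries if any(c in q.lower() for c in CLUES)]
    print(pseudonym, "-", len(queries), "queries,", len(hits), "with location clues")
```

The direct identifier (the account name) is gone from `released`, yet the per-pseudonym profile still narrows one user down to a town and a county. This is the same kind of trail the reporters followed in the lecture's example.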
