Introduction to Privacy Preservation
Summary
This presentation introduces the concept of privacy, covering its personal, territorial, and informational dimensions. It highlights basic privacy principles, discusses privacy-enhancing technologies (PETs), and surveys anonymization methods applied to large datasets, including modifying data values and replacing data points with group averages, among other anonymization techniques.
Full Transcript
Introduction to Privacy Preservation

Outline
- Need for privacy
- Threats to privacy
- Privacy controls
- Technical privacy controls - Privacy-Enhancing Technologies (PETs):
  a) Protecting user identities
  b) Protecting usee identities
  c) Protecting confidentiality & integrity of personal data
- Privacy in pervasive computing
- Using trust for privacy protection
- Privacy metrics: anonymity set size metrics, entropy-based metrics
- Anonymization using XLSTAT in Excel
- Anonymization using the ARX tool

Introduction
Definition of privacy: the claim of individuals, groups and institutions to determine for themselves when, how and to what extent information about them is communicated to others.
The three dimensions of privacy:
1) Personal privacy: protecting a person against undue interference (such as physical searches) and against information that violates his/her moral sense.
2) Territorial privacy: protecting a physical area surrounding a person that may not be violated without the person's acquiescence. Safeguards: laws referring to trespassers, search warrants.
3) Informational privacy: deals with the gathering, compilation and selective dissemination of information.

Basic privacy principles
- Lawfulness and fairness
- Necessity of data collection and processing
- Purpose specification and purpose binding; there are no "non-sensitive" data
- Transparency
- The data subject's right to information and to correction, erasure or blocking of incorrect or illegally stored data
- Supervision (control by an independent data protection authority) and sanctions
- Adequate organizational and technical safeguards
Privacy protection can be undertaken by:
- Privacy and data protection laws promoted by government
- Self-regulation for fair information practices through codes of conduct promoted by businesses
- Privacy-enhancing technologies (PETs) adopted by individuals
- Privacy education of consumers and IT professionals

Need for Privacy
- By individuals: 99% are unwilling to reveal their SSN, while only 18% are unwilling to reveal their... favorite TV show.
- By businesses: online consumers worried about revealing personal data held back $15 billion in online revenue.

Threats to Privacy
1) Threats to privacy at the application level:
Threats to the collection and transmission of large quantities of personal data, including projects for new applications on the Information Highway: health networks and public administration networks, research networks, electronic commerce, teleworking, distance learning, and private use. Example: an information infrastructure for better healthcare.

2) Threats to privacy at the communication level:
- Threats to the anonymity of the sender, forwarder, or receiver
- Threats to the anonymity of the service provider
- Threats to the privacy of communication
- Extraction of user profiles and their long-term storage
3) Threats to privacy at the system level
4) Threats to privacy in audit trails

Identity theft is the most serious crime against privacy.
Threats to privacy - another view:
- Aggregation and data mining
- Poor system security
- Government threats
- The Internet as a privacy threat: unencrypted e-mail, web surfing, attacks
- Corporate rights and private business
- Privacy for sale - many traps: "free" is not free; e.g., accepting frequent-buyer cards reduces your privacy

Privacy Controls
Technical privacy controls - Privacy-Enhancing Technologies (PETs):
a) Protecting user identities
b) Protecting usee identities
c) Protecting confidentiality & integrity of personal data

Technical Privacy Controls
a) Protecting user identities via, e.g.:
- Anonymity: a user may use a resource or service without disclosing his identity
- Pseudonymity: a user acting under a pseudonym may use a resource or service without disclosing his identity
- Unobservability: a user may use a resource or service without others being able to observe that the resource or service is being used
- Unlinkability: sender and recipient cannot be identified as communicating with each other

Taxonomies of pseudonyms
Taxonomy of pseudonyms w.r.t. their function:
i) Personal pseudonyms: public, non-public, and private personal pseudonyms
ii) Role pseudonyms: business pseudonyms and transaction pseudonyms
Taxonomy of pseudonyms w.r.t. their generation:
i) Self-generated pseudonyms
ii) Reference pseudonyms
iii) Cryptographic pseudonyms
iv) One-way pseudonyms
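The slides name cryptographic and one-way pseudonyms without showing a construction. As a minimal illustrative sketch (not from the slides), a one-way pseudonym can be derived as a keyed one-way hash of an identity, so that only the key holder can recompute the mapping; the function and variable names below are hypothetical.

import hmac
import hashlib

def one_way_pseudonym(identity: str, secret_key: bytes) -> str:
    """Derive a one-way (cryptographic) pseudonym from an identity.

    Without the secret key the mapping cannot be reversed; the same
    identity always yields the same pseudonym, so records remain
    linkable under the pseudonym without revealing the identity.
    """
    digest = hmac.new(secret_key, identity.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # truncated for readability

# Example with made-up values:
key = b"issuer-held secret"
print(one_way_pseudonym("alice@example.org", key))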
b) Protecting usee identities
Depersonalization (anonymization) of data subjects:
- Perfect depersonalization: data are rendered anonymous in such a way that the data subject is no longer identifiable.
- Practical depersonalization: the modification of personal data so that information concerning personal or material circumstances can no longer, or only with a disproportionate amount of time, expense and labor, be attributed to an identified or identifiable individual.
Controls for depersonalization include inference controls for statistical databases and privacy-preserving methods for data mining.

The risk of re-identification (a threat to anonymity)
Types of data in statistical records:
- Identity data, e.g., name, address, personal number
- Demographic data, e.g., sex, age, nationality
- Analysis data, e.g., diseases, habits
The degree of anonymity of statistical data depends on the database size and on the entropy of the demographic data attributes that can serve as supplementary knowledge for an attacker. The entropy of the demographic data attributes in turn depends on:
- The number of attributes
- The number of possible values of each attribute
- The frequency distribution of the values
- Dependencies between attributes

c) Protecting confidentiality and integrity of personal data, via e.g. privacy-enhanced identity management, limiting access control, enterprise privacy policies, steganography, and specific tools.

Privacy in Pervasive Computing
In pervasive computing environments, socially based paradigms such as trust will play a big role. People are surrounded by zillions of computing devices of all kinds, sizes, and aptitudes; most have limited or rudimentary capabilities, many are quite small (e.g., RFID tags, smart dust), and most are embedded in artifacts for everyday use, or even in human bodies. There is a danger of malevolent opportunistic sensor networks: pervasive devices self-organizing into huge spy networks, able to spy anywhere, anytime, on everybody and everything. Means of detection and neutralization are needed.

Using Trust for Privacy Protection
Privacy is an entity's ability to control the availability and exposure of information about itself. Privacy and trust are closely related: privacy builds trust, and trust yields disclosure. Trust is a socially based paradigm. Privacy-trust tradeoff: an entity can trade privacy for a corresponding gain in its partners' trust in it; the scope of an entity's privacy disclosure should be proportional to the benefits expected from the interaction. E.g., a customer applying for a mortgage must reveal much more personal data than someone buying a book.

Privacy Metrics
A. Anonymity set size metrics
B. Entropy-based metrics

Requirements for Privacy Metrics
Privacy metrics should account for:
- Dynamics of legitimate users: how do users interact with the system? E.g., repeated patterns of accessing the same data can leak information to a violator.
- Dynamics of violators: how much information does a violator gain by watching the system for a period of time?
- Associated costs: storage, injected traffic, consumed CPU cycles, delay.

Anonymity Set Size Metrics
The larger the set of indistinguishable entities, the lower the probability of identifying any one of them. The metric can be used to "anonymize" a selected private attribute value within the domain of all its possible values: "hiding in a crowd", where a subject in a set of four is "less" anonymous (probability 1/4) than a subject in a set of n ("more" anonymous, probability 1/n).

Anonymity Set
An anonymity set A = {(s1, p1), (s2, p2), ..., (sn, pn)}, where
- si is subject i who might access the private data, or the i-th possible value of a private data attribute, and
- pi is the probability that si accessed the private data, or the probability that the attribute assumes the i-th possible value.

Effective Anonymity Set Size
The effective anonymity set size is
L = |A| \sum_{i=1}^{|A|} \min(p_i, 1/|A|).
The maximum value of L is |A|, reached iff all p_i are equal to 1/|A|; L falls below the maximum when the distribution is skewed, i.e., when the p_i take different values. Deficiency: L does not consider the violator's learning behavior.

Entropy-Based Metrics
Entropy measures the randomness, or uncertainty, in private data: when a violator gains more information, the entropy decreases. Metric: compare the current entropy value with its maximum value; the difference shows how much information has been leaked.

Dynamics of Entropy
Decrease of system entropy with attribute disclosures (capturing dynamics).
[Figure: entropy level versus disclosed attributes, with the maximum entropy H* and stages (a)-(d) marked.]
When entropy reaches a threshold (b), data evaporation can be invoked to increase entropy through controlled data distortions. When entropy drops to a very low level (c), apoptosis can be triggered to destroy the private data. Entropy increases (d) if the set of attributes grows or the disclosed attributes become less valuable, e.g., because they are obsolete or because more data is now available.

Quantifying Privacy Loss
The privacy loss D(A, t) at time t, when a subset of attribute values A might have been disclosed, is
D(A, t) = H^*(A) - H(A, t),
where H^*(A) is the maximum entropy, computed when the probability distribution of the p_i is uniform, and H(A, t) is the entropy at time t:
H(A, t) = \sum_{j=1}^{|A|} w_j \left( -\sum_i p_i \log_2 p_i \right),
with weights w_j capturing the relative privacy "value" of the attributes.

Using Entropy in Data Dissemination
Specify two thresholds for D: one for triggering evaporation and one for triggering apoptosis. When private data are exchanged, the entropy is recomputed and compared to the thresholds, and evaporation or apoptosis may be invoked to enforce privacy.

Entropy: Example
Consider a private phone number (a1 a2 a3) a4 a5 a6 - a7 a8 a9 a10, where each digit is stored as the value of a separate attribute. Assume the range of values for each attribute is [0-9] and all attributes are equally important, i.e., w_j = 1. The maximum entropy arises when the violator has no information about the value of any attribute and therefore assigns a uniform probability distribution to the values of each attribute, e.g., a1 = i with probability 0.10 for each i in [0-9]:
H^*(A) = \sum_{j=1}^{10} w_j \left( -\sum_{i=0}^{9} 0.1 \log_2 0.1 \right) \approx 33.2.

Entropy: Example (cont.)
Suppose that after time t the violator can figure out the state of the phone number, which allows him to learn the three leftmost digits. The entropy at time t is then
H(A, t) = 0 + \sum_{j=4}^{10} w_j \left( -\sum_{i=0}^{9} 0.1 \log_2 0.1 \right) \approx 23.3,
where attributes a1, a2, a3 contribute 0 to the entropy value because the violator knows their correct values. The information loss at time t is
D(A, t) = H^*(A) - H(A, t) \approx 10.0.
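As a check on the formulas above, here is a minimal Python sketch (not part of the slides; function names are illustrative) that computes the effective anonymity set size and the weighted entropy for the phone-number example.

import math

def effective_anonymity_set_size(probabilities):
    # L = |A| * sum_i min(p_i, 1/|A|); equals |A| only for a uniform distribution
    n = len(probabilities)
    return n * sum(min(p, 1.0 / n) for p in probabilities)

def weighted_entropy(attribute_distributions, weights=None):
    # H(A, t) = sum_j w_j * ( -sum_i p_i log2 p_i ), summed over the attributes j
    if weights is None:
        weights = [1.0] * len(attribute_distributions)
    total = 0.0
    for w_j, dist in zip(weights, attribute_distributions):
        total += w_j * -sum(p * math.log2(p) for p in dist if p > 0)
    return total

# "Hiding in a crowd": a uniform belief over 4 subjects vs. a skewed one
print(effective_anonymity_set_size([0.25, 0.25, 0.25, 0.25]))   # 4.0
print(effective_anonymity_set_size([0.70, 0.10, 0.10, 0.10]))   # ~2.2

# Phone-number example: 10 digit attributes, each uniform over 0..9
h_max = weighted_entropy([[0.1] * 10 for _ in range(10)])             # ~33.2 bits
# After the three leftmost digits are learned, they contribute 0 bits
h_t = weighted_entropy([[1.0]] * 3 + [[0.1] * 10 for _ in range(7)])  # ~23.3 bits
print(round(h_max, 1), round(h_t, 1), round(h_max - h_t, 1))          # 33.2 23.3 10.0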
Anonymization
Anonymization is the process of removing or modifying the identifying variables contained in a dataset. Identifying variables are those that describe characteristics of a person:
- Direct identifiers: variables such as names, addresses, or identity card numbers.
- Indirect identifiers: variables shared by several respondents whose combination could lead to the re-identification of one of them.
Anonymizing the data consists in determining which variables are potential identifiers, and then modifying the level of precision of these variables to reduce the risk of re-identification to an acceptable level.

Statistical disclosure limitation
Statistical disclosure limitation techniques fall mainly into two categories:
- Data reduction: increasing the number of individuals in the sample who share the same or similar identifying characteristics.
- Data perturbation: the data are modified, so that re-identification becomes harder and, even if achieved, uncertain.
Synthetic data offer an alternative approach to data protection and are produced by using data simulation algorithms.

Data reduction
- Removing variables: removal of direct identifiers from the dataset (race, religion, HIV status, etc.).
- Removing records: adopted as an extreme measure of data protection when a unit remains identifiable in spite of the application of other protection techniques.
- Global recoding: aggregating the values observed in a variable into pre-defined classes (for example, recoding age into five-year age groups, or the number of employees into three size classes: small, medium and large).
- Top and bottom coding: a special case of global recoding that can be applied to numerical or ordinal categorical variables such as "salary" and "age"; the highest values of these variables are usually very rare and therefore identifiable.
- Local suppression: replacing the observed value of one or more variables in a certain record with a missing value.

Data perturbation
- Micro-aggregation: replace an observed value with the average computed on a small group of units, after sorting the units according to their similarity. Methods include individual ranking and multivariate micro-aggregation.
- Data swapping: a perturbation technique for categorical microdata, aimed at protecting tabulations stemming from the perturbed microdata file; a proportion of the records in a file is altered by swapping the values of a subset of variables between selected pairs of records (swap pairs).
- Post-randomization (PRAM): induces uncertainty in the values of some variables by exchanging them according to a probabilistic mechanism; a randomized version of data swapping.
- Adding noise: adding a random value ε, with zero mean and predefined variance σ², to all values of the variable to be protected. Methods based on adding noise are generally not considered very effective in terms of data protection.
- Resampling: a protection method for numerical microdata that consists in drawing, with replacement, t samples of n values from the original data, sorting each sample and averaging the sampled values. The data protection level guaranteed by this procedure is generally considered quite low.

Synthetic data
Synthetic data are an alternative approach to data protection, produced by using data simulation algorithms. They do not pose problems with regard to statistical disclosure control because they do not contain real data, while still preserving certain statistical properties. Users cannot be fully confident of the results of their statistical analyses, but this approach can also help in producing test microdata sets.
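The data-reduction and perturbation techniques listed above are easy to prototype. The Python sketch below is illustrative only; the four toy records and all thresholds are made up rather than taken from the slides. It shows global recoding, top coding, local suppression, micro-aggregation and noise addition.

import random
import statistics

# Toy microdata: age and salary are quasi-identifiers, disease is the sensitive value
records = [
    {"age": 23, "salary": 41_000, "disease": "flu"},
    {"age": 27, "salary": 45_000, "disease": "asthma"},
    {"age": 61, "salary": 52_000, "disease": "diabetes"},
    {"age": 64, "salary": 250_000, "disease": "flu"},
]

def global_recode_age(age, width=5):
    # Global recoding: replace the exact age with a five-year band, e.g. 23 -> "20-24"
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def top_code(value, cap):
    # Top coding: collapse rare, very high values into an open-ended class
    return f">={cap}" if value >= cap else value

def local_suppress(record, key):
    # Local suppression: replace one observed value in a record with a missing value
    return {**record, key: None}

def micro_aggregate(values, group_size=2):
    # Micro-aggregation: sort, group similar units, replace each value by its group mean
    ordered = sorted(values)
    replacement = {}
    for i in range(0, len(ordered), group_size):
        group = ordered[i:i + group_size]
        for v in group:
            replacement[v] = statistics.mean(group)
    return [replacement[v] for v in values]

def add_noise(value, sigma):
    # Noise addition: add a zero-mean random value with predefined variance sigma^2
    return value + random.gauss(0, sigma)

print([global_recode_age(r["age"]) for r in records])       # ['20-24', '25-29', '60-64', '60-64']
print([top_code(r["salary"], 100_000) for r in records])    # [41000, 45000, 52000, '>=100000']
print(micro_aggregate([r["age"] for r in records]))         # [25, 25, 62.5, 62.5]
print(local_suppress(records[0], "disease"))                # disease replaced by None
print(round(add_noise(records[0]["salary"], sigma=1_000)))  # e.g. 41627 (random)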
Working with XLSTAT in MS Excel
Go to https://archive.ics.uci.edu/datasets (the UCI Machine Learning Repository), pick the Iris dataset and download the data file.
The remaining slides walk through the two tools step by step:
- Dataset characteristics
- Data anonymization: importing the Iris data
- General and optional settings
- Missing data & output settings
- Data anonymization: sequential
- Data anonymization: random
- Working with the ARX Anonymization Tool
- Importing CSV data
- Analyze risk
- Anonymization tool
- Classification using XLSTAT
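The "Analyze risk" step in ARX reports re-identification risk based on equivalence classes over the quasi-identifiers. As a rough, hedged illustration of what such a check computes (this does not use the ARX API; the records and field names are hypothetical), the sketch below derives the k-anonymity level and prosecutor-style re-identification risks from equivalence-class sizes.

from collections import Counter

def reidentification_risk(records, quasi_identifiers):
    # Group records into equivalence classes over the quasi-identifiers.
    # A record's prosecutor re-identification risk is 1 / (its class size);
    # the smallest class size is the dataset's k-anonymity level.
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    k = min(classes.values())
    highest_risk = 1.0 / k
    average_risk = sum(1.0 for _ in classes) / len(records)  # = #classes / #records
    return k, highest_risk, average_risk

# Hypothetical, already-recoded records (not the Iris data)
data = [
    {"age": "20-24", "zip": "47*"}, {"age": "20-24", "zip": "47*"},
    {"age": "60-64", "zip": "47*"}, {"age": "60-64", "zip": "47*"},
    {"age": "60-64", "zip": "46*"},
]
print(reidentification_risk(data, ["age", "zip"]))  # (1, 1.0, 0.6) -> a unique record exists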