US Private Sector Privacy Chapter 03p3.pdf

MGT 6727 (Spring Semester 2024) at Georgia Tech Chapter 3 – as of 01/15/2024 © IAPP

anonymous data is “in a form that does not identify individuals and where identification through its combination with other data is not likely to take place.” 62 By contrast, pseudonymization is “the process of distinguishing individuals in a dataset by using a unique identifier which does not reveal their ‘real world’ identity.” 63

To understand pseudonymization, consider the medical records of an individual patient, where the patient is described as Patient 13579. Some organization, such as the hospital that treated the patient, assigned that patient the particular number. That hospital may be able to link the patient number to the actual identity easily. On the other hand, medical researchers elsewhere only see Patient 13579, and the identity of the patient is thus masked when the records are used in a research study. (As discussed below, it may be possible for recipients of the records to re-identify the data, but the pseudonyms rather than the names are provided to the outside medical researchers.) We note that European Union law treats pseudonymized information as personal data covered by the GDPR, in contrast to anonymized data, which falls outside the GDPR. 64

3.5.1.1 Strong versus weak identifiers and linkability 65

It is far from simple, however, to determine when data is truly de-identified. Some information is considered clearly identifying, such as a Social Security number or passport number. These are called strong identifiers. Names can be strong identifiers, although common names may not be uniquely identifying. Identifiers that are used in combination with other information to determine identity are weak identifiers. A related concept is quasi-identifiers, or data that can be combined with external knowledge to link data to an individual.
66 Date of birth is an example of a quasi-identifier that can often assist in identification – there are 366 possible birthdays in a year (counting February 29), and the population is spread out over more than 80 years, so that each unique date of birth is one of over 25,000 cells (366 times 80, or 29,280) in a spreadsheet. Even in a large population, only a small portion (perhaps 1 in 25,000) shares any given date of birth. If a data set provides an individual’s date of birth, and you know the individual’s date of birth, then it becomes relatively easy to learn the other information in the data set about that individual.

A related topic is linked data (or identified data) versus linkable data (or identifiable data). An example of linked data is having both a bank account number and the person’s name. If someone other than the bank only has the bank account number, however, is that truly identifying? Having the bank account number alone is an example of linkable or identifiable data – the bank account number gives us a strong clue to a unique account owner, but we may not have a method to learn the name of the account owner.

In a 2016 speech, FTC Chair Edith Ramirez addressed the definition of “personally identifiable.” FTC Chair Ramirez stated, “We now regard data as personally identifiable when it can be reasonably linked to a particular person, computer, or device. In many cases, persistent identifiers, such as device identifiers, MAC addresses, static IP addresses, and retail loyalty card numbers meet this test.” 67 Online search engines often enable linkability, such as when a quasi-identifier can be linked to an individual using publicly available information.

NOT FOR DISSEMINATION The materials in this course are provided only for the personal use of students in this class in association with this class.
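The date-of-birth arithmetic in 3.5.1.1 can be checked with a short simulation. The sketch below is only an illustration under loud assumptions: the population is synthetic, birthdays are drawn uniformly (real births are not uniform), and the 50 "ZIP areas", the population size of 100,000, and the helper names `synthetic_population` and `share_unique` are all invented for this example.

```python
import random
from collections import Counter

DAYS = 366            # possible birthdays, counting February 29
YEARS = 80            # span of birth years used in the text
cells = DAYS * YEARS  # distinct date-of-birth values

def synthetic_population(n, seed=0):
    """Hypothetical records: (day_of_year, birth_year_index, gender, zip_area)."""
    rng = random.Random(seed)
    return [
        (rng.randrange(DAYS), rng.randrange(YEARS),
         rng.choice("FM"), rng.randrange(50))
        for _ in range(n)
    ]

def share_unique(records, key):
    """Fraction of records whose key value appears exactly once."""
    counts = Counter(key(r) for r in records)
    return sum(1 for r in records if counts[key(r)] == 1) / len(records)

people = synthetic_population(100_000)

# Quasi-identifier 1: date of birth alone.
dob_only = share_unique(people, lambda r: (r[0], r[1]))
# Quasi-identifier 2: date of birth combined with gender and ZIP area.
dob_gender_zip = share_unique(people, lambda r: r)

print(f"date-of-birth cells: {cells}")
print(f"unique on date of birth alone: {dob_only:.1%}")
print(f"unique on date of birth + gender + ZIP area: {dob_gender_zip:.1%}")
```

Adding attributes can only split groups further, so the share of unique records never falls as more quasi-identifiers are combined. This is why a weak identifier such as date of birth becomes far more identifying once it is joined with gender and location.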
3.5.1.2 Approaches to de-identification

Many approaches to de-identification rely on one or more of three techniques to hide identity. 68 The simplest approach is suppression, by removing identifying values from a record. One example is where part of the organization is doing machine learning or statistical analysis of a customer data set. Those performing the calculations are looking for statistical patterns and generally have no need to see the customer names or other strongly identifying information such as phone number.

Some types of data are amenable to generalization, where a detailed data element is replaced by a more general data element. For example, using year of birth rather than date of birth greatly reduces identifiability – the population then has roughly 80 categories (if persons over 80 years of age are placed together), rather than the 25,000 categories for date of birth. A common example of generalization is to provide less granularity for location data, such as by revealing only the municipality rather than a precise GPS coordinate, which might be an individual’s residence or place of work.

A third approach is noise addition. In this approach, actual data values are replaced with other values from the same class of data. Often, the noise addition seeks to preserve statistical properties of the data, such as the average value, while disrupting the ability of outsiders to spot the data associated with a specific individual. For instance, suppose there is a survey or other report that contains detailed information about each participant, including their precise annual salary. An employer might know the annual salary from payroll records, and then seek other information from the survey about the individual. With noise addition, that precise annual salary information no longer appears in the report, blocking such efforts.
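The three techniques above can be sketched in a few lines of Python. This is a minimal illustration, not a recommended de-identification pipeline: the records, field names, and the functions `suppress`, `generalize`, and `add_noise` are all hypothetical, and the noise step is just one simple reading of "noise addition" (perturb each salary, then recentre the noise so the average is preserved).

```python
import random
import statistics

# Hypothetical survey rows; every name and value here is invented.
records = [
    {"name": "A. Jones", "phone": "555-0100", "birth_date": "1984-03-09", "salary": 61_250},
    {"name": "B. Smith", "phone": "555-0101", "birth_date": "1991-11-23", "salary": 74_900},
    {"name": "C. Patel", "phone": "555-0102", "birth_date": "1976-07-01", "salary": 58_400},
    {"name": "D. Nguyen", "phone": "555-0103", "birth_date": "1984-12-30", "salary": 83_100},
]

def suppress(record, fields=("name", "phone")):
    """Suppression: drop strongly identifying values from a record."""
    return {key: value for key, value in record.items() if key not in fields}

def generalize(record):
    """Generalization: replace the full date of birth with the year only."""
    out = dict(record)
    out["birth_year"] = int(out.pop("birth_date")[:4])
    return out

def add_noise(rows, field="salary", spread=2_000, seed=0):
    """Noise addition: perturb each value, keeping the column average
    (almost) unchanged by recentring the noise around zero."""
    rng = random.Random(seed)
    noise = [rng.uniform(-spread, spread) for _ in rows]
    offset = statistics.fmean(noise)  # subtract so the noise sums to zero
    return [
        {**row, field: round(row[field] + n - offset)}
        for row, n in zip(rows, noise)
    ]

released = add_noise([generalize(suppress(r)) for r in records])
for row in released:
    print(row)
```

An employer who knows a precise payroll salary can no longer search the released rows for that exact figure, while the average salary differs from the original only by rounding. Simple perturbation of this kind carries no formal privacy guarantee.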
3.5.1.3 The risk of re-identification and differential privacy

Computer scientists have had surprising success at re-identifying data that appeared to be de-identified. 69 In an early study, Latanya Sweeney was able to uniquely identify the governor of Massachusetts from public voter files that contained only gender, ZIP code, and date of birth. (The study illustrates the extent to which precise date of birth is highly identifying.) 70 Since then, an entire academic field has developed on techniques for re-identifying supposedly anonymized information. 71 New privacy professionals should thus be alert to the possibility, for a data set that is supposedly de-identified, that there may in fact be technical means to re-identify at least some of the data.

In part in response to these re-identification attacks, researchers have developed differential privacy, which is a mathematical definition of privacy in the context of statistical and machine learning analysis. 72 Here is the definition of the mathematical guarantee from differential privacy: “anyone seeing the result of a differentially private analysis will essentially make the same inference about any individual’s private information, whether or not that individual’s private information is included in the input to the analysis.” 73 Differential privacy defines the necessary amount of statistical noise that must be added to a data set, to meet the desired level of privacy for a specific set of queries.

Researchers continue to explore the set of circumstances where differential privacy can apply. The U.S. Census used differential privacy for certain data sets as part of the 2020 Census.
74 For more complex data sets, however, the Census decided in 2022 that application of differential privacy would not yield sufficiently useful data to be worthwhile. 75 Overall, there will likely be an increasing range of applications for differential privacy in the coming years.

3.5.1.4 Conclusion on de-identification

Privacy professionals will continue to face issues of de-identification and re-identification. The entire effort to protect privacy depends on defining what is “in scope” – covered by privacy rules – rather than de-identified/anonymized and thus “out of scope.” Most global privacy regulations are not applicable for data that cannot reasonably be traced back to an individual. Internationally, the terms used to describe the boundaries of personal data rights vary, as do the approaches to the appropriate risk assessment and to which re-identification threats need to be taken into account. Definitions and interpretations of the relevant legal terms vary tremendously in different jurisdictions.

The longest-standing de-identification rules in the U.S. are under HIPAA, concerning what counts as Protected Health Information. There are two methods for de-identification under HIPAA. Under the safe harbor, covered entities must eliminate 18 specific types of potentially identifying information. For instance, postal codes can be no more specific than the first three digits of a five-digit ZIP Code. Alternatively, de-identification can be achieved via the “expert method,” under which an expert determines and documents that “the risk is very small that the information could be used, alone or in combination with other reasonably available information, by an anticipated recipient to identify an individual.” 76 Many new re-identification attacks have developed, however, since the promulgation of the HIPAA rule over twenty years ago.
Outside of the HIPAA context, privacy professionals often look to non-binding guidance from the FTC, which has stated that data is not “reasonably linkable” to the extent that a company: “(1) takes reasonable measures to ensure that the data is de-identified; (2) publicly commits not to try to re-identify the data; and (3) contractually prohibits downstream recipients from trying to re-identify the data.” 77

In conclusion, privacy professionals should be alert to the possibility that others in an organization may wish to use data without privacy protections, while asserting incorrectly that the data is actually anonymized. Such data use creates risk for the organization. In 2022, a senior FTC official stated about location, health and other sensitive information that “claims that data is ‘anonymous’ or ‘has been anonymized’ are often deceptive.” The official concluded that “companies that make false claims about anonymization can expect to hear from the FTC.” 78

3.5.2 Encryption and Other Shielding of Data

Anonymization and other altering of the original data leaves data available for processing, but does so in a way that makes the data anonymous or less identifiable. By contrast, encryption shields data, to prevent unauthorized persons from accessing it, but enables the data to be read in its original form if an authorized person uses a key.

Shielding of data can occur at different stages. Encryption in transit means that data is encrypted as it travels between the sender (conventionally named “Alice”) and the recipient (conventionally named “Bob”). Encryption in transit is commonly used on the Internet, to prevent “man in the middle” attacks where a malicious actor could read the communication as it moves from Alice to Bob.
Encryption at rest means that data is encrypted while it is stored in a system and is not being used. For instance, a well-encrypted hard drive means that the data remains secure even if a malicious actor steals the hard drive; under most data breach laws, such encryption creates an exception from the requirement to report the breach. Chip manufacturers have sought to develop trusted execution environments, also called secure enclaves, within the processors themselves, to provide stronger privacy guarantees. 79

In some instances, there is also encryption in use, although many types of processing need to see the actual data – the plaintext – in order to do the processing. Research continues on expanding various techniques for operating with encrypted data. These topics include secure multi-party computation 80 and fully homomorphic encryption, which would enable processing of data while the data remains in encrypted form. 81 Details on these topics, and important tools such as zero-knowledge proofs, are beyond the scope of this chapter. 82

3.5.2.1 Encryption

Encryption has become a pervasive privacy-enhancing technology. 83 The use of encryption has grown enormously since the spread of the Internet and society’s reliance on software and computing for so many tasks of everyday life and commerce. The discussion here can only provide a brief introduction to the topic.

Encryption is a reversible process that converts the original plaintext into ciphertext (data that is scrambled and cannot be read). Decryption converts the ciphertext back to the original plaintext. A cryptographic algorithm takes the original plaintext and performs mathematical operations on it to create the ciphertext. To achieve decryption, a user must have (or guess) a key. The key is a string of characters (the longer and more complex the key, the stronger the security). When the cryptographic algorithm applies the key to the ciphertext, the plaintext becomes available.
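The plaintext, key, and ciphertext relationship described above can be made concrete with a toy cipher. The sketch below uses a one-time pad (XOR with a random key as long as the message), chosen only because it fits in a few lines of standard-library Python. It is an assumption of this example, not the algorithm real systems use; production code would rely on a vetted cryptographic library. The function name `xor_bytes` and the sample message are invented.

```python
import secrets

def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR each data byte with the matching key byte.
    Running it twice with the same key returns the original data."""
    if len(key) != len(data):
        raise ValueError("this toy cipher needs a key exactly as long as the data")
    return bytes(d ^ k for d, k in zip(data, key))

plaintext = b"Patient 13579: low-salt diet"
key = secrets.token_bytes(len(plaintext))   # random key, to be used only once

ciphertext = xor_bytes(plaintext, key)      # encryption
recovered = xor_bytes(ciphertext, key)      # decryption with the same key

# Decrypting with a different key yields scrambled bytes, not the plaintext.
wrong_key = bytes(b ^ 0xFF for b in key)
garbled = xor_bytes(ciphertext, wrong_key)

print(ciphertext.hex())
print(recovered)
```

Because the same key both scrambles and unscrambles the message, anyone who should be able to read the data must somehow be given that key.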
Symmetric key cryptography (also called private-key cryptography) uses the same key both to encrypt and decrypt data. Symmetric keys are relatively short, and they are often faster and less resource-intensive than other approaches. The big disadvantage, however, is that Alice has to share the symmetric key with Bob in order to enable Bob to read the message. That means Alice must trust Bob with the key, and find a secure way to provide the key to Bob, even if he is far away.

With the rise of the Internet, asymmetric cryptography (also called public-key cryptography) has become widespread. In this approach, a pair of keys is generated for each user. One of the keys is public – the whole world sees it, like a telephone number in a traditional phone book. The other key is private – only the user knows it, and the user never needs to share the private key with others. When Alice wants to send a message to Bob, she encrypts it using his public key; Bob then uses his private key to convert back to plaintext. Public-key cryptography is scalable – Alice’s computer can look up and use the public key of recipients all over the world, such as to send an encrypted message or log into a website.

Asymmetric encryption also enables digital certificates, widely used in the authentication process. Alice can encrypt a message using her private key. Then, only Alice’s public key can reveal the underlying plaintext. When the plaintext becomes readable, Bob can verify that the communication came from Alice’s private key. A certificate authority (CA) is a trusted third party that validates a person’s identity (here, Alice’s) and either generates a public/private key pair on Alice’s behalf or associates an existing public key provided by Alice with her.
84 Once a CA validates someone’s identity, it issues a digital certificate that is digitally signed by the CA. The digital certificate can then be used to verify a person associated with a public key. Public key infrastructure (PKI) refers to the policies, standards, people, and systems that support the distribution of public keys and the identity validation of individuals or entities with digital certificates and a certificate authority.

3.5.2.2 Hashing

Hashing is a cryptographic function that transforms an input into a new output of alphanumeric characters. The goal is to have a one-way function – the hash will work the same way each time the algorithm operates on the original file or other input; however, seeing the output should not reveal anything about the original input. To get the idea, consider how a potato can be converted into hashed brown potatoes. That is a one-way process, however – you can’t convert the hash back into a potato.

Hashing, similar to encryption, transforms data into an unintelligible format. Hashing in some instances is used to protect privacy, notably by creating a pseudonym. For instance, hashing might be applied to a patient’s name and date of birth, and all the records associated with that patient would be accessed by using the hash, with no visibility for the name and birth date.

More simply, some organizations have used a hash on a relatively short string such as a person’s Social Security number. This approach to de-identification, however, may not work in practice. For instance, an attacker might create a table that shows the hashed outcome for each possible Social Security number. 85 Then, the attacker may be able to look up the Social Security number matched with each hash. There are strategies to make this look-up strategy more difficult. Notably, the organization performing the hash can add salt to the hash, which approximates the use of a key for encryption.
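The look-up attack and its countermeasure can be sketched as follows. Several things here are assumptions made for illustration: the identifier space is shrunk to 4-digit toy IDs so the table builds instantly (the same loop over nine-digit numbers is slower but feasible), the helper names are hypothetical, and the "salt" is modeled as a secret key fed to HMAC-SHA-256, which is one common way to get the key-like behavior the text describes. A conventional salt stored next to the hash would not, by itself, stop an attacker who can read it.

```python
import hashlib
import hmac
import secrets

def plain_hash(identifier: str) -> str:
    """Unsalted SHA-256 of the identifier, as a hex string."""
    return hashlib.sha256(identifier.encode()).hexdigest()

def keyed_hash(identifier: str, secret: bytes) -> str:
    """HMAC-SHA-256: cannot be recomputed without knowing the secret."""
    return hmac.new(secret, identifier.encode(), hashlib.sha256).hexdigest()

# Toy identifier space: 4-digit IDs stand in for Social Security numbers.
ALL_IDS = [f"{n:04d}" for n in range(10_000)]

# The attacker precomputes the hash of every possible identifier.
lookup = {plain_hash(i): i for i in ALL_IDS}

secret = secrets.token_bytes(32)   # held only by the organization
pseudonym_plain = plain_hash("4821")
pseudonym_keyed = keyed_hash("4821", secret)

print(lookup.get(pseudonym_plain))   # the table reverses the unsalted hash
print(lookup.get(pseudonym_keyed))   # no match without the secret
```

The first look-up recovers `4821`; the second returns `None`, because the attacker's table was built without the secret. The protection lasts only as long as that secret stays out of the attacker's hands.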
Privacy professionals should be aware both of the possibility of using hashing to shield data, and the risk that attackers may be able to circumvent that shield.

Hashes are often used to show the integrity of the communication – that the content of the communication has remained the same between Alice and Bob, because the hashed value is the same before and after communication. For a digital signature, Alice uses her private key to create a string of characters. Bob then applies Alice’s public key to the communication. If the public key works – if it creates a readable plaintext – then Bob learns that the message received is the same as the message sent.

3.5.3 Hardware and other shielding of data

In addition to the use of encryption and hashing, there are other technologies that can shield data. Some technologies are simple and physical, such as a physical cover over a webcam or a privacy screen that makes it difficult for a stranger to observe your laptop over your shoulder or from a nearby seat. Other technologies prevent data from being accessed except under specified conditions. For instance, a biometric reader on a laptop or other device blocks access except by the authorized user. More generally, the device may only operate when certain hardware or software conditions are met, such as anti-interdiction mechanisms that detect tampering and prevent operation of the device when tampering is detected. 86

3.5.4 Conclusion on privacy enhancing technologies

The discussion here on some key categories of PETs has focused on ways of altering data (e.g., de-identification) and ways for shielding data (e.g., encryption, hashing).
Privacy professionals should be alert to the development of a variety of other PETs in the future, as researchers and regulators seek to expand technical measures to protect privacy. 87

Reduction of privacy risk also can come from effective systems and processes for data activities, many of which are discussed in Chapter 4 on privacy management. The privacy professional should keep in mind that weak organizational measures can undermine the quality of technical measures. In practice, most encryption is cracked due to a mistake in implementation rather than a flaw in the encryption algorithm itself. 88 Overall reduction of privacy risk comes from a combination of technical measures, such as encryption and hashing, as well as organizational measures, such as limiting which employees gain access to data.

3.6 Cybersecurity

Security is an essential component of providing privacy protection – the best privacy policy will not protect data if it is easy for malicious attackers to grab the data. That is why all of the versions of Fair Information Practices in Chapter 1 include a security principle providing that organizations should use reasonable administrative, technical and physical safeguards to protect personal information against unauthorized access, use, disclosure, modification and destruction.

In practice, privacy professionals often coordinate closely with security professionals in their organization. This part of the chapter cannot possibly teach all aspects of cybersecurity; instead, the goal is to familiarize the reader with some key concepts, enabling more effective communication with cybersecurity professionals where appropriate.

3.6.1 Confidentiality, integrity, and availability (“CIA”)
In considering privacy and security together, a useful first approximation is that privacy means deciding which uses of personal data are authorized, while security means preventing unauthorized access to data. 89 More broadly, computer security traditionally addresses “CIA” – confidentiality, integrity, and availability. In this setting, confidentiality means protecting information from unauthorized access. Integrity means data are trustworthy, complete, and have not been accidentally altered or modified by an unauthorized user. Availability means data are accessible when you need them. 90 Computer security, often called cybersecurity or information security, thus adds protection of integrity and availability to the privacy/confidentiality protections against unauthorized access.

3.6.2 The NIST Cybersecurity Framework

Since its first publication in 2014, one of the most important cybersecurity documents has been the NIST Cybersecurity Framework, often abbreviated CSF. 91 (NIST is the U.S. federal National Institute of Standards and Technology.) Although the CSF is guidance rather than a set of legal requirements, it provides “a set of industry standards and best practices to help organizations manage cybersecurity risks.” 92

The CSF popularized five Framework Core Functions designed to assist an organization to address ever-changing cybersecurity risks. NIST emphasizes that all five functions should operate “concurrently and continuously to form an operational culture that addresses the dynamic cybersecurity risk.”

1. Identify. Organizations should develop the understanding to manage cybersecurity risk to systems, assets, data, and capabilities. Organizations need to manage assets, understand their business environment, and assess risks.

2. Protect. Organizations should develop and implement the appropriate safeguards to ensure confidentiality, integrity, and availability.

3. Detect.
Organizations should develop and implement appropriate activities to identify the occurrence of a cybersecurity event. For instance, organizations should identify anomalous activities, and follow up on anomalies in order to determine whether they indicate a compromise.

4. Respond. Organizations should develop and implement appropriate activities to take action regarding a detected cybersecurity event. These activities are often called incident response, where organizations may be required to provide notice to government agencies and affected individuals, as discussed in Chapter X on data breaches.

5. Recover. Organizations should develop and implement the appropriate activities to maintain plans for resilience and to restore any capabilities or services that were impaired due to a cybersecurity event.

In short, the CSF provides a systematic, well-known, and relatively concise source of information for privacy professionals seeking to learn more about cybersecurity or about how to integrate the privacy and cybersecurity activities in an organization.

3.6.3 The adversarial mind-set

An organizing principle in learning cybersecurity is to adopt the adversarial mindset. 93 Many of us go through our daily lives without feeling like we are constantly under direct attack by malicious actors. In cybersecurity, however, we all live in a “bad neighborhood” 94 – the attackers, from all over the globe and at any second, can unleash a potentially devastating attack on our system. To address this unceasing risk of attack, organizations must perform threat modeling, to identify the most salient risks for the particular organization.
95 Along with the widely used MITRE ATT&CK Framework, 96 a helpful mnemonic is the STRIDE framework for modeling computer security threats. 97 That stands for: (1) “spoofing,” which is an attempt to undermine authentication; (2) “tampering,” or changes to the desired hardware and software specifications; (3) “repudiation,” which can occur when an application or system does not adopt controls to accurately track users’ actions; 98 (4) “information disclosure,” or the loss of private information; (5) “denial of service,” such as when a website becomes inoperable due to a bombardment of attempts to log onto the site, in what is called a “distributed denial of service attack”; and (6) “elevation of privilege,” which can occur when an attacker gets inside an organization’s firewall and tries to gain additional privileges to manipulate the computer system. If the attacker gains the greatest authority of any user of the system, that is often called “root access” to the system.

The process of cybersecurity threat modeling can differ in practice from other aspects of compliance. Many compliance programs have a check-list of required actions, or do a gap analysis of where the current state differs from the organization’s goals. Such compliance lists and gap analyses occur in cybersecurity as well, of course. With that said, the adversarial mindset highlights the possibility that a single failure can lead to catastrophic consequences – the attacker gaining control of the entire computer system.

Over time, given the possible consequences of a single failure, cybersecurity defense has placed more emphasis on the concept of resilience – the ability of an organization to bounce back from an intrusion or other malicious action. 99 In 2022, the U.S.
government announced the goal of adopting the “zero trust” approach: “The foundational tenet of the Zero Trust Model is that no actor, system, network, or service operating outside or within the security perimeter is trusted. Instead, we must verify anything and everything attempting to establish access.” 100 The goal is that “all traffic must be encrypted and authenticated as soon as practicable,” including internal traffic. With the zero trust approach, the idea is that a single intrusion will be able to cause only limited damage – even if the adversary is in the system, there will be a limited scope of harm.

The adversarial mindset also is consistent with other foundational aspects of cybersecurity. Because defenders must be on the lookout for serious attacks, each user should only receive “least privilege” – the most limited scope of action on the system that can get the user’s job done. To implement the least privilege principle, organizations strive to build role-based access controls – a doctor or nurse might need the detailed medical record, while the cafeteria only needs to know about a patient’s low-salt diet. Legal requirements related to privacy, such as the Health Insurance Portability and Accountability Act (HIPAA) Security Rule (discussed in Chapter 8), often require these role-based access controls. 101

In addition, because attackers may get past the first line of defense, organizations try to build “defense in depth,” so that an initial intrusion still faces multiple obstacles before harm occurs. Because attackers may try to gain advantage at the time new hardware and software is installed, it is important to have “security by default,” such as strong passwords even upon initial use.
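The least-privilege example above (doctor, nurse, cafeteria) can be sketched as a small role-based access check. Everything named here is hypothetical: the roles, the record fields, the `ROLE_PERMISSIONS` table, and the `view_record` helper are invented for illustration, and a real system would enforce such rules in its access-control layer rather than in application code like this.

```python
from dataclasses import dataclass

# Hypothetical roles and the record fields each role may read.
ROLE_PERMISSIONS = {
    "doctor": {"name", "room", "diagnosis", "medications", "diet"},
    "nurse": {"name", "room", "diagnosis", "medications", "diet"},
    "cafeteria": {"room", "diet"},
}

@dataclass(frozen=True)
class User:
    username: str
    role: str

def view_record(user: User, record: dict) -> dict:
    """Return only the fields the user's role is permitted to read.
    Roles missing from the table see nothing (deny by default)."""
    allowed = ROLE_PERMISSIONS.get(user.role, set())
    return {field: value for field, value in record.items() if field in allowed}

patient = {
    "name": "Patient 13579",
    "room": "4B",
    "diagnosis": "hypertension",
    "medications": ["lisinopril"],
    "diet": "low-salt",
}

print(view_record(User("dr_lee", "doctor"), patient))
print(view_record(User("kitchen1", "cafeteria"), patient))
print(view_record(User("guest", "visitor"), patient))
```

Denying by default means that a role nobody thought to configure receives no data at all, which is the same spirit as “security by default.”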
3.6.4 Conclusion on cybersecurity

Some aspects of cybersecurity closely parallel important privacy principles. The idea of “security by default” is similar to “privacy by default.” For new systems, there should be thorough consideration of risks, through both the security development life-cycle and the privacy development life-cycle. Implementing role-based access controls is a standard part of both security and privacy programs.

One difference is the nature of the threat. In cybersecurity, the most common perspective is that of the system owner – keep the adversary out of the system. In privacy, however, it is far more common to adopt the perspective of the data subject – the individual user. The company, agency, or other organization may itself be seen as a source of risk, such as if a company acts unfairly or deceptively toward the individual. As discussed above, the simplest distinction may be that security means preventing unauthorized access to data, while privacy often means deciding which uses of personal data should be authorized.

3.7 Conclusion

This chapter seeks to introduce the reader to key terms and concepts related to technology and privacy. For the non-technical reader, learning this terminology can serve multiple uses, not least of which is easing the fear of sitting in a meeting when others are talking about things that you do not understand. When you have at least the basic vocabulary for these technical issues, it becomes far easier to follow up and learn more when that is necessary.
