Full Transcript

about the data science life cycle. There were four components that we looked at: problem formulation or definition, data collection, and data analysis, which could be statistical or machine-learning based. What was also important is the context in which this data science analysis is done: what the actual environment is, what the population is, what kind of problem we're trying to solve, and so on. This is actually a circular path through the phases; sometimes you go through them many times, because you realize when you collect the data that there is actually no data for the problem you're trying to solve, so you should rethink the problem, and vice versa. And finally this comes down to using the insight for making decisions. We saw that at each step in this process there is potential for impact from privacy considerations and also from other ethical concerns such as fairness and bias. So at every step these factors come into play: you might be collecting data in a way that is not private, or the data you're collecting might be imbalanced. Every phase has that potential, so it's not something you think about only at the end, when you've already built your model; it's there every step of the way, from deciding what problem you want to solve all the way to making decisions with that machine learning model.

Then we started the privacy module. Here I started talking about anonymization, particularly in terms of data sharing. This is the situation where someone wants to release or share a data set, and we saw that even authorized data sharing can leak information. The examples were the Netflix/IMDb linkage attack and the Massachusetts governor's health records, which were released by the state as anonymized data. Essentially, these attacks showed that just anonymizing the PII, the personally identifiable information, is not sufficient. We looked at the reasons why anonymization does not work. Do you remember those reasons? Yes: from the quasi-identifiers it is easy to learn information about people in the data set, even if their personal information is kept hidden. Also, data about the same people exists in different sources. And yes, you have high-dimensional data, and because the dimensionality is large relative to the size of the data set, you may have many combinations that lead to identification. There are other things that matter too, like the range of the feature values, which is a similar concept to high dimensionality: you get many more unique combinations. If a feature is binary, you're basically hiding within half the population, but if there are 100 possible values for that same feature, it's much harder to hide among those 100 than in the binary case. So it's the same concept: dimensionality and range allow for unique combinations. Did you have a question? Would it be easier to find you if there was a larger range of values? Yes, exactly, because more unique combinations become possible. Okay.
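A small sketch of that effect (not from the lecture; the columns, ranges, and population size are invented for illustration), showing how quickly quasi-identifier combinations become unique as more features, or wider-ranged features, are added:

```python
# Illustrative sketch: fraction of records that are one-of-a-kind as more
# quasi-identifier columns are considered. All values here are synthetic.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000
people = pd.DataFrame({
    "gender":   rng.integers(0, 2,    n),   # binary: hide among roughly half the rows
    "age":      rng.integers(18, 90,  n),   # ~70 possible values
    "zip3":     rng.integers(0, 1000, n),   # first three digits of a postal code
    "children": rng.integers(0, 6,    n),
})

cols = []
for col in people.columns:
    cols.append(col)
    sizes = people.groupby(cols).size()      # size of each quasi-identifier combination
    unique_frac = (sizes == 1).sum() / n     # records whose combination occurs only once
    print(f"quasi-identifiers {cols}: {unique_frac:.1%} of records are unique")
```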
One of the ways people tried to address this is by releasing tables that are k-anonymous; there's a concept of k-anonymity that was developed, where the idea is that you release a k-anonymous table or data set. And what does it mean to be k-anonymous? It means that every individual in that data set is in a group of size at least k whose quasi-identifiers are identical, so they are indistinguishable, in terms of the quasi-identifiers, within that group of k. And how do you achieve k-anonymity, given a particular data set? This is where you truncate the last three digits of the postal code, or you use a range instead of a value; basically you put values into bins so that you have fewer distinct values. The values or ranges you release are essentially generalizations.

However, there were still issues with k-anonymity, several reasons we don't really like it. Homogeneity, yes: we looked at the homogeneity attack, where in a given equivalence group the protected or sensitive feature may have the same value for everyone, in which case you're not really hiding within that group of k. And was there another type of attack? Background knowledge? Yes, there was the background knowledge attack, where given some background knowledge you might be able to identify individuals within that group of k. And beyond the attacks, the other limitation was really that this is a kind of ad hoc method. So the next step after k-anonymity was to address these limitations, the homogeneity attack specifically. What was the point of l-diversity; how did it address the homogeneity limitation of k-anonymity? You need at least l values in that column? Yes, you want enough diversity in that target column, the protected or sensitive column. And that diversity was measured in two ways: either distinct, meaning you need l different values, or in some probabilistic way over the distribution. But if the feature values are semantically similar, then we have the similarity attack. Anything else? The skewness attack, where you may have diversity, but the values are all skewed towards one part of the distribution. And again, it still has the same issue as k-anonymity, which is that it's an ad hoc method: you just move things around and play with it; there's no robust algorithm that leads you to the l-diverse table, you just try things out. Are they called similarity attacks or semantic attacks, or does it matter? I think "semantic" means that you're using the semantic meaning of those feature values to establish, or assume, similarity, and therefore you're learning something.
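Going back to how k-anonymity is achieved, here is a minimal sketch of generalization by binning (the table, bin widths, and column names are made up for illustration), showing how coarsening the quasi-identifiers increases the size of the smallest equivalence group:

```python
# Sketch: generalize quasi-identifiers by binning, then check the smallest
# equivalence-group size (the k in k-anonymity). All rows here are invented.
import pandas as pd

table = pd.DataFrame({
    "postal_code": ["4805", "4809", "4811", "4817", "5320", "5329"],
    "age":         [34,     37,     52,     58,     41,     45],
    "disease":     ["flu",  "ulcer", "flu", "asthma", "flu", "ulcer"],
})

def smallest_group(df, quasi_identifiers):
    return df.groupby(quasi_identifiers).size().min()

# Raw quasi-identifiers: every combination is unique, so k is only 1.
print("raw k =", smallest_group(table, ["postal_code", "age"]))

# Generalize: keep only the first 3 digits of the postal code, bin age by decade.
generalized = table.assign(
    postal_code=table["postal_code"].str[:3],
    age=(table["age"] // 10) * 10,
)
print("generalized k =", smallest_group(generalized, ["postal_code", "age"]))
```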
Okay, and then to address these issues, mostly the skewness and, to some extent, the semantics, the next idea that was proposed is t-closeness. And what was the notion behind t-closeness? You can't use a homogeneity attack, but you can still make assumptions about a person because the sensitive values in a group are concentrated on a few things. Yes, and that's essentially what the semantic and skewness attacks were. So what does t-closeness mean? The distribution within each group, versus the distribution over the whole data set, should not be that different, to avoid that kind of leakage. Yes: the distribution of the sensitive values in that equivalence group should be close, close in distribution, where that closeness is defined by the total variation distance that we saw, or the Kullback-Leibler distance. So closeness is the key word: the distribution in each group is close to the distribution of the whole table. And the same issues still apply: it's ad hoc and not very portable.

So rather than sharing the data itself, we then talked about what happens if you share only aggregate statistics: you keep the data private but you share aggregate statistics of that table. You allow queries for aggregate statistics of a certain kind: what's the average age of some group, what's the average salary of women, what's the most common disease in the group of Caucasian females, various queries of that sort, which are essentially statistical queries. You're not asking about a specific person or a specific row; you're asking about statistical measures like averages and counts, possibly over certain groups, but only statistical queries are allowed. And there was still an issue with sharing aggregate statistics. What was that? Reconstruction, exactly. You could still have reconstruction attacks. These are the kind of attacks where you take your aggregate statistics, the answers to these statistical queries, and if you have a sufficient number of them, you can write down a set of linear equations and reconstruct a data set that satisfies them: you have the statistics that came from the data set, you write them down as constraints, and then you try to reconstruct a data set that satisfies those constraints. You can do it by hand for small tables, but there are very powerful SAT solvers that allow you to do it for large data sets, so it is possible for someone to take aggregate statistics from a large data set and attempt this.

Then we suggested going a step further: instead of sharing aggregate statistics, train a model and share only the model, which means sharing only the parameters of that model. If your machine learning model is learning a predictor function, you're just sharing the parameters of that function: it's a linear function with two parameters, and they're equal to such-and-such. That's all you share; you share nothing about the data itself. But there were attacks possible on this as well, and those were called, I don't remember the name, but it's when you can get the machine learning model to reveal the training data it was given. Yes, essentially you're trying to infer the membership of a data point in that training set, so that's a membership inference attack. A membership inference attack is one where you create an attack model, a shadow model that emulates the released machine learning model, and use it to infer membership. And that was seen as a privacy breach, because once you know that a data point was in the training set, then by looking at the outputs of the machine learning model you may be able to infer certain sensitive values about that person.
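Here is a toy version of that reconstruction idea, a minimal sketch with an invented five-person table and invented count queries (not an example from the lecture); with enough consistent constraints, only one secret column survives:

```python
# Toy reconstruction attack: aggregate count queries over subsets of rows act as
# constraints; brute force keeps only the secret columns consistent with all of
# them. A real attack would hand the same constraints to a SAT/ILP solver.
from itertools import product

names = ["Ann", "Bob", "Cat", "Dan", "Eve"]        # hypothetical individuals
# Each query is (subset of row indices, reported count of secret == 1).
queries = [
    ({0, 1, 2, 3, 4}, 3),   # "how many people in total have the condition?"
    ({0, 1, 2},       2),   # "... among the first three rows?"
    ({1, 3},          1),
    ({2, 4},          2),
    ({0, 4},          1),
]

consistent = [
    bits for bits in product([0, 1], repeat=len(names))
    if all(sum(bits[i] for i in subset) == answer for subset, answer in queries)
]
print(consistent)   # here exactly one assignment remains: the column is reconstructed
```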
Okay, so then we looked at ways of sharing data, or enabling these kinds of analyses, in a privacy-preserving way, where the designer of the model, or the surveyor, is still allowed to compute whatever they want to compute, but individuals still have privacy protection. The very simple and early example of that was randomized response. This was for binary values, yes/no answers, zero/one values, where you use a coin, really some randomization, to randomize your true answer in a prescribed way; there's a set protocol that you follow to randomize your answer. And that gives you what's called plausible deniability, which means that, because your answer has some randomization in it, you are not really incriminating yourself with the answer you give: no one knows whether it is your true answer or an answer produced by the randomization. However, the collectors, the analysts, are still able to carry out their analysis, because when they collect the randomized answers from a large population, they can form an estimate of what the true aggregate answers should be. That was for answering binary questions. But we still want to do the other types of data analysis, like aggregate statistical queries, machine learning models and so on.

So starting from that idea of plausible deniability, we came to a definition of privacy that researchers agreed on, called differential privacy, and the reason they agreed on it is that they needed something concrete, something that was provable, mathematically or probabilistically. And what is the essence of this definition of differential privacy? The participation of an individual in a data set should not change the outcome. Exactly. If the participation of an individual in the data set does not change the outcome, then any analysis done on that data set, or on that model, does not harm that individual, because you have no idea whether that individual was in the data set; their presence does not change anything, and that is one way of hiding that person. And the definition was essentially this: you have a randomized algorithm, which we call M. It could be a machine learning model taking a data set D as input, or a statistical query, where M is the query and D is the data set it's run on; it's some randomized algorithm run on data set D. S is any output in the range of outputs of this algorithm, and D-prime is a neighboring data set, such that D and D-prime differ in one row. There's a more formal way of writing this: the script D here is essentially the universe of all possible data sets over those features, D and D-prime are members of that universe, and they are neighboring if the norm of their difference is bounded by one. That's just a formal way of saying that there is one row that is different between the two data sets. Okay.
And so for any such pair of neighboring data sets, the definition of differential privacy is that the probability of the output from D being S, and the probability of the output from D-prime being S, are within a factor of exponential of epsilon of each other. And this is sometimes an easier way to look at it: the ratio of those two probabilities is bounded by exponential of epsilon. And essentially this bound is over the entire distribution of outputs, the outputs from D and from D-prime; it must be satisfied over the entire support of that distribution, for every output. Someone asked about the notation for neighboring data sets, whether the symbol above the absolute value of D minus D-prime is a double prime or a sup. Sorry, that's a sup: it's just a formal way of writing that the neighboring data sets are defined so that, for any pair taken from this universe, their difference is bounded by one, and the sup is there because the condition ranges over every possible pair of neighboring data sets. It's a formal way of writing what I said informally, which is that two data sets are neighboring if they differ in at most one row. So that's the definition of differential privacy.

And I think this is where we ended last time: we mostly just listed the three desirable properties of differential privacy that are useful to us, because they allow us to build machine learning models and pipelines that are still private. What was one of the properties? We just looked at this on Thursday; what kind of property would you want a differentially private mechanism to have? Yes, the invariance to post-processing, which means that once you've built your mechanism and it's differentially private with a certain parameter, you can do any kind of post-processing on its output: you can feed it into a different function that isn't private, whatever you like, and your privacy level does not degrade. You still have epsilon-differential privacy, even though the output has gone through a second or third function; it's invariant to post-processing, and the privacy level will not change after processing with something else. And another one? Group privacy? Yes, group privacy follows easily from this definition of individual privacy: instead of the neighboring data sets differing in at most one row, if the group privacy you want to ensure is over a group of k individuals, then the neighboring data sets would differ in at most k rows, and that is the condition you would test. And then there was a third one, which was also important: composition. If you have multiple of these private mechanisms, they compose? Yes, they compose gracefully. Graceful composition means that if you have multiple differentially private mechanisms, you can combine them, and their combination composes gracefully, meaning that their privacy parameters add up.
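Written out in the notation used above (a compact restatement, with the output written as a single value S, as in the description just given), the definition and the three properties are:

```latex
% epsilon-differential privacy for a randomized algorithm M
\[
  \frac{\Pr[M(D) = S]}{\Pr[M(D') = S]} \;\le\; e^{\varepsilon}
  \qquad \text{for every output } S \text{ and all neighboring } D, D'
  \text{ with } \lVert D - D' \rVert_1 \le 1 .
\]
% Properties, informally:
%   post-processing: if M is eps-DP, then f(M(D)) is still eps-DP for any function f;
%   group privacy:   if D and D' differ in at most k rows, the bound becomes e^{k eps};
%   composition:     an eps1-DP mechanism followed by an eps2-DP mechanism is (eps1 + eps2)-DP.
```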
And these are important properties, especially when we talk about machine learning, because you don't want to train your machine learning model, put it out there, and then worry about what happens next: if someone does something else with it, it shouldn't matter, and your privacy should stay at least at the level that was guaranteed when you trained it. And you may have more complicated machine learning models, like neural networks, where you're accessing the training data multiple times, through gradient descent, backpropagation, or the other procedures that happen in training more complex models. You want to be able to compose all of those different things you're doing to the data within this model and still quantify what your privacy is at the end. So these are all important properties.

OK, so that's where we ended last Thursday, and that is what will be on the test. I still have time, so I'll start the next lecture; any questions before I move on? About the formula sheet: I think it only has the formulas of the distance measures for probability distributions, the total variation distance and the KL distance. There might be one other formula, but so far that's it. Okay, so for these sample questions, what I'm looking for is basically what you need to know as an answer. So: what is l-diversity? Does somebody want to answer that? The sensitive data distribution is l-diverse? What do you mean by the sensitive data distribution? The data that we have to protect has to have a distribution that is l-diverse, so that it provides enough diversity that, even with a k-anonymous release, we won't be able to recover anyone's exact or approximate information, or make any sort of inference from it. OK, so first of all, l-diversity is a version of, or an improvement on, k-anonymity for anonymizing data. That's the first thing, the basic thing. The second thing is that l-diversity means that for every equivalence group in the k-anonymous table you're releasing, which is of size at least k, you want the values of your sensitive or protected feature to be diverse with this parameter l, and the basic kind of diversity is that you have l distinct values in there. So necessarily l will be at most k, right? Your answer was mostly right; I'm just organizing it so that it's obvious what you're saying: in a k-anonymous table or data set, for every equivalence group, the sensitive feature must be diverse with parameter l, for example l distinct values. Even if you had just said that the sensitive feature in each equivalence group must have l distinct values, that's good. So what I'm really looking for is that you know that this is on a k-anonymous table, that it is a method for anonymizing data, and that the values of the sensitive feature must be l-diverse. Okay, those are the things I'm looking for.

OK, what is a linkage attack? When you use a public data set and compare its rows to the rows of the anonymized data set, to infer information about individuals which you might not know just by looking at the anonymized data set. Yes, and it's more than just inference: it's to re-identify people. So a linkage attack re-identifies, or de-anonymizes, the individuals in an anonymized data set, and it is carried out by comparing rows of this data set with a public data set, in order to link data points or rows.
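As an illustration, here is a minimal sketch of that comparison on invented data (names, columns and values are all made up): a join on the quasi-identifiers is enough to re-attach names to the sensitive column.

```python
# Toy linkage attack: the "anonymized" table has no names, but its
# quasi-identifiers can be joined against a public table that does.
import pandas as pd

anonymized = pd.DataFrame({
    "zip3":       [481, 481, 532],
    "birth_year": [1961, 1984, 1975],
    "gender":     ["F", "M", "F"],
    "diagnosis":  ["hypertension", "asthma", "diabetes"],   # sensitive column
})
public = pd.DataFrame({                     # e.g. a voter roll or public profile list
    "name":       ["Alice", "Bob", "Carol"],
    "zip3":       [481, 481, 532],
    "birth_year": [1961, 1984, 1975],
    "gender":     ["F", "M", "F"],
})

# Joining on the quasi-identifiers re-identifies the individuals.
linked = anonymized.merge(public, on=["zip3", "birth_year", "gender"])
print(linked[["name", "diagnosis"]])
```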
So what's important here is that the result of a linkage attack is that people are de-anonymized or re-identified from an anonymized data set, and that it happens by comparing with a public data set. You can also give an example if you like: if you're not sure that you've demonstrated what you know in the way you've written it, you can give an example, say describe the Netflix case or the Massachusetts case; describing one of those examples is worth something as well. OK, any questions about that? OK. The next question is a true/false question: anonymizing data by removing features that describe a person, like name, date of birth, postal code, age, gender, race, is sufficient to guarantee privacy. Is this statement true or false? Yes, false. And we know that because of what we just talked about, the linkage attack. Okay, that was good.

Okay, this one is about l-diversity again. There's a k-anonymous table that has been released by some employer; it has gender, the department people work in, and the pay grade. Department and gender are defined as the quasi-identifiers in this table, and the sensitive column is the pay grade. And I'm asking: for what value of l is the following table l-diverse? So what would you need to do first to determine that value of l? Split it up into classes. Yes, first you need to find the k for which this is k-anonymous, so split it up into equivalence classes. So let's do that. First of all, you have two human resources rows here, but one is male and one is female, so obviously they're not part of the same equivalence class: I'll call this class one and this class two. Then I've got a completely different department here, research and development, and the first one is female, so that's another class, class three; the next one is male, so that's class four; I've got another female in R&D, which I've already seen, so that goes into class three; a male again, class four. Then I've got three in a row from legal and they're all male, so these are all in class five. I've got a research and development female, class three again. I've got two human resources rows again, one female and one male, and I've seen those before, so class one and class two. And I've got two legal rows that are female, and they're both in the same class, which I'll call class six. So class one has two people, class two has two, class three has three, class four has two, class five has three, and class six has two. Right? So what is the k value for this k-anonymous table? Six? Why six? The number of classes? Okay, well, in k-anonymity the k doesn't refer to the number of classes; it refers to the size of a class. K is the number of people with whom any individual in the data set can be confused, with whom they are indistinguishable. So k is the smallest equivalence class size, which is 2. So I have k equal to 2, and l must be either 1 or 2.
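The same bookkeeping is easy to do in code. This is only a sketch on a made-up table of the same shape (department and gender as quasi-identifiers, pay grade as the sensitive column); the rows are not the ones from the slide.

```python
# Sketch: compute k (smallest equivalence class) and l (fewest distinct sensitive
# values in any class, i.e. distinct l-diversity) for an invented toy table.
import pandas as pd

table = pd.DataFrame({
    "department": ["HR", "HR", "HR", "HR", "R&D", "R&D", "R&D", "R&D", "Legal", "Legal", "Legal"],
    "gender":     ["F",  "F",  "M",  "M",  "F",   "F",   "M",   "M",   "M",     "M",     "M"],
    "pay_grade":  [4,    2,    1,    3,    1,     5,     2,     5,     4,       5,       5],
})
quasi_identifiers, sensitive = ["department", "gender"], "pay_grade"

groups = table.groupby(quasi_identifiers)[sensitive]
k = groups.size().min()          # size of the smallest equivalence class
l = groups.nunique().min()       # fewest distinct sensitive values in any class
print(f"k-anonymity: k = {k}, distinct l-diversity: l = {l}")
```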
So how do I determine what l is for this table? You need to go through each equivalence class and check how many distinct sensitive values it has. Yes, so let's do that; I'll put another column here for l. For the first equivalence class, the females in human resources, one has a pay grade of 4 and the other has a pay grade of 2, so l is 2 there. For the second, the two males in human resources are 1 and 3, so two distinct values. For the third, the females in research and development, I have three distinct values. The fourth group, the males in research and development, has two distinct values. The legal department males have two distinct values, and the legal females also have two distinct values. So the l value for this table is 2. It's just bookkeeping, but it's easy, right?

Okay, the last question here is about randomized response. Recall that we looked at the randomized response method for answering binary queries, where the true answer can be either one or zero, and I've written the mechanism out here for you: you flip a fair coin, and if it's tails, you respond truthfully; if it's heads, you flip another one, and if that one is tails you respond zero, and if it's heads you respond one. Right? OK, so now I'm saying: let's assume that the coin has a bias of 0.6, so that you get heads with probability 0.6 and tails with probability 0.4; it's no longer a fair coin. And both the respondents and the survey makers know this bias value. Is this still considered to be a randomized response method? Yes, because you can't guarantee a specific response. Yeah, that's exactly it. It is still a randomized response, because just by looking at the answer you can't know whether it was the truth; you're just changing the probabilities on the branches of that tree of possibilities that we had. Is that one sentence enough to get the five marks for this question? Good question. What I'm looking for is, first of all, the answer yes, and then an explanation of the yes. What did you say, sorry? A specific outcome can't be confirmed; a deterministic outcome isn't guaranteed, it's still random. Yes, that's acceptable. But you could also say that you still have plausible deniability, meaning that the answer is not incriminating, because it could be the true answer or it could not be, and we don't have a way of determining which it is. Yeah? Would a 0.99 bias still be considered a randomized response? What do you think? Maybe it's still considered randomized, but that level of privacy is a lot less, right? Because you would suddenly have a lot of answers skewed towards one of the two outcomes, zero or one, so you could probably guess that something is off. Actually, more than the privacy, your accuracy is completely shot, because you're going to have a lot of noise in your answers, and the aggregate estimates are going to be very far from the true answers.
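Here is a minimal simulation of that biased-coin protocol and of the estimate the survey maker can form from the randomized answers; only the 0.6 bias comes from the question, while the true rate and sample size below are invented for illustration.

```python
# Sketch of the randomized response protocol described above, with coin bias b = P(heads).
import numpy as np

rng = np.random.default_rng(1)
b, true_rate, n = 0.6, 0.30, 100_000   # bias from the question; rest is illustrative

truth = rng.random(n) < true_rate            # each person's true 0/1 answer
first_heads = rng.random(n) < b              # heads -> answer randomly via a second flip
second_heads = rng.random(n) < b             # second flip: heads -> report 1, tails -> report 0
responses = np.where(first_heads, second_heads, truth)

# P(response = 1) = (1 - b) * true_rate + b * b, so a surveyor who knows b
# can invert this to estimate the true rate from the observed responses.
estimate = (responses.mean() - b * b) / (1 - b)
print(f"observed 1-rate: {responses.mean():.3f}, estimated true rate: {estimate:.3f}")
```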
Is the fact that both the respondents and the survey makers know about this bias important? Well, it's important for the survey maker, because the survey maker then uses the responses they collected to estimate the true value, so they need to know with what probabilities the protocol was carried out. For this particular answer, whether it is still a randomized response, maybe the fact that they both know it is not what matters, because it is still randomized either way. Actually, let me correct that: that information is needed, because if the survey maker didn't know what the probabilities in the protocol were, then the survey maker could not reconstruct what the real answer should be, in which case it's not a useful randomized response method, because I'm not getting responses I can make any use of.

Okay, well, that was fast. Any questions? What material is covered for the test? Yes, I think on Tuesday I said that everything up to last Thursday is covered. Okay, and these were just sample questions; the real test will have more questions, roughly twice as many as this set. It's hard to say exactly, because some of the two-point questions are very quick and some take you time, so I can't give you an exact number, but roughly half of the real test is here. What's the time limit for the test? Fifty minutes. Okay. All right, so I finished that quickly. Should I continue with the lecture, or do you want an extra 20 minutes off? I have one final question, about the formula sheet: we have the formulas for the distances, but we don't know, given a data set, how to find the specific values to plug into them; how would I convert a data set into a number? Yeah, so let's look at this one, the l-diversity example. How would you do that? The first thing you need for a distribution is the support of the distribution, right? What are the possible values? That's what you need to know first, and here it's the pay grade. I didn't put any information here about all possible pay grades of this company, so I don't know that, but I know what values appear, so let's say the pay grades here come from the discrete set one, two, three, four, five; I have up to five pay grades. So if Q is the distribution over the whole table, I can construct it, because I have the values here. The number of ones I have is two, out of how many people in this table? Fourteen. So that's 2 out of 14. Then for pay grade 2 I have three people, so 3 out of 14; for pay grade 3, three, so 3 out of 14; for pay grade 4, two, so 2 out of 14; and for pay grade 5, four, so 4 out of 14. So first of all, did I get this right, does this add up to one? Two plus three is five, plus three is eight, plus two is ten, plus four is fourteen. Yes, this adds up to 1. So this is the distribution of the pay grade over the entire table, where fourteen is the number of people in the table. It's a probability given by a frequency, the frequency of that pay grade in this table: there are two people out of 14 who have a pay grade of one, and so on. OK, so that's Q, the distribution over the whole table. Now what do you want to do? You want to look at P: you take a particular equivalence class and compute P on that one.
Let's say I'm looking at equivalence group 5, so I'm going to calculate the P for equivalence group 5, and that's this group here. I've got nobody with a pay grade of 1, nobody with a pay grade of 2 (I don't need the 14 here, I guess), nobody with a pay grade of 3, one person with a pay grade of 4 and two with a pay grade of 5. So that's the distribution over that particular equivalence group. And now you have P and Q, and you can put them into that distance formula. Does that make sense? Yeah. For those formulas, when you were covering this in class, didn't you say you were just talking about them to give us background? I said that for the Earth Mover distance measure; I didn't say it for the two that are on the formula sheet. If you look at the annotated slides, the part where it says "not on the test" is one particular slide, and that slide talks about the Earth Mover distance. I'm not sure if this is the right term, but P for group 5 does not add up to 1, so do I normalize? Oh, good point, you're right; it doesn't add up to one, I made a mistake. It has to be a distribution, so you checked whether it was a distribution and it wasn't, correct. Okay, so for a pay grade of one there are zero people, for two there are zero, for three there are zero, for a pay grade of four I've got one out of three, because there are three people in this equivalence group and the distribution is only over this group, and then for pay grade five I have two out of three. Yes, correct.

So if I were to ask you a question like this, here's the distribution for this equivalence class and the distribution for the whole table, what would be the purpose of that question? Why would I ask you that? What are you going to do with the P of this equivalence class and the Q of the whole table? Can you maybe use it to find the probability that someone in that group has a certain pay grade? Yes, the distribution does give you that information, that's true. T-closeness? Yes, you could use it to calculate the t-closeness, because you're looking at the distribution of the feature values in the class and the distribution over the table, and at how close they are. And what do you want: do you want them to be close or do you want them to be far? What's the objective? You want them to be close, because the point of t-closeness was that the distribution could be skewed in some way, so you want them to be as close as possible. And that's why we looked at those distance measures, because they tell you how close these two distributions are. So let's say you're going to use the total variation distance, which is pretty much half the sum of the absolute differences between the two values. So then what do we have here? We have the absolute differences zero minus 2/14, zero minus 3/14, zero minus 3/14, then one third minus 2/14, and two thirds minus 4/14. Right? Those are the total variation terms.
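Carrying out that arithmetic (the counts are the ones worked out above; the 1/2 factor is the total variation convention, and using the table distribution as the KL reference is just one choice):

```python
# Sketch: total variation distance (and, for comparison, KL divergence) between
# P, the pay-grade distribution of equivalence group 5, and Q, the distribution
# over the whole table.
import numpy as np

Q = np.array([2, 3, 3, 2, 4]) / 14          # whole table, pay grades 1..5
P = np.array([0, 0, 0, 1, 2]) / 3           # equivalence group 5

tvd = 0.5 * np.abs(P - Q).sum()             # = 4/7, roughly 0.571
print(f"total variation distance: {tvd:.3f}")

# KL(P || Q) = sum over values of P * log(P / Q), with the 0 * log 0 = 0 convention.
nonzero = P > 0
kl = np.sum(P[nonzero] * np.log(P[nonzero] / Q[nonzero]))   # = ln(7/3), roughly 0.847
print(f"KL divergence KL(P || Q): {kl:.3f}")
```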
So you calculate that and it gives you the t-closeness value; I'm not going to ask you things on the test where you have to calculate something and you need a calculator. Yes? This gives us the t-closeness with respect to equivalence class 5; do I need to repeat it for the others, and is the t-closeness value for the table the minimum of the...? Well, I think what you just asked me was how you would calculate the P and the Q; the question was, we have these distance measures on the formula sheet, what am I going to use them for? And you use them to determine the t-closeness, for which you need the distribution P over the equivalence class and the distribution Q over the table. Okay. Are there any more questions? Yes: what's the difference between the two distance measures? They're both measuring the distance between two probability distributions, or, to put it more carefully, how close two probability distributions are. The first one, the total variation distance, is a simple one: it just looks at the individual probabilities of each possible value and their differences. It's a simple measure, but when you talk about probability distributions, it may not capture the same notion of closeness as other measures do. The other measure, the Kullback-Leibler divergence, takes one of the distributions as a reference and then measures how different the other one is from it: you're still looking at the difference for each output value, but now you're scaling it by the value of one of the distributions, so it gives you an appropriately scaled measure. The first one is just the difference, that's it; the other one is scaled according to one of the distributions, and in some cases you might be interested in that, because you're not interested in just the overall difference, you're interested in how far the other distribution is from this particular reference distribution. So depending on the context, they capture different things.

OK, so I just wanted to quickly go back over what I covered in the review, so that you're reminded of what's on the test. We started with the data science life cycle and all the material that went into discussing what each component of the life cycle is. Then we jumped into privacy. We started with anonymization and the types of attacks that are possible on a simply anonymized data set; then k-anonymity, which tried to address some of those issues but had other limitations and other types of attacks that are possible, and how k-anonymity is achieved. To improve on that, we had l-diversity, which tried to make the values of the protected feature diverse enough to avoid a homogeneity attack, but other attacks were still possible, like the semantic and skewness attacks: the values can be diverse, but if they're skewed to one side of the distribution, that still leaks some information. So t-closeness looked at how the distribution of the sensitive feature values is spread out, since skewness was the factor here, and the way to do that is to compare the distribution of the equivalence class with the distribution over the whole table.
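For reference, the two formula-sheet measures used for that comparison, written out for distributions P and Q over the same discrete set of values:

```latex
% total variation distance and Kullback-Leibler divergence
\[
  \mathrm{TV}(P, Q) \;=\; \tfrac{1}{2} \sum_{x} \bigl| P(x) - Q(x) \bigr|,
  \qquad
  D_{\mathrm{KL}}(P \,\|\, Q) \;=\; \sum_{x} P(x)\, \log \frac{P(x)}{Q(x)} .
\]
```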
Can you explain the semantic attack? Yeah, the semantic one was basically this: sometimes your feature values are not numbers but categories, and a categorical feature likely has some meaning behind each value. The example we looked at was disease: there's a meaning behind gastric ulcer and the other abdominal conditions, so the values might be semantically close, and then you learn, oh, okay, this person has abdominal problems. It's also a version of the skewness issue. Okay, and that's where we use this P and Q that somebody over there asked about, to measure how close the distributions are. Then we looked at: okay, let's not share the whole data set, let's just share aggregate statistics. But there were problems with reconstruction attacks, because you could write down a set of equations based on the query responses you're getting and then find a table that satisfies those conditions. And if you just shared the parameters of a machine learning model trained on that data set, you still had the risk of a membership inference attack. So just plainly sharing this information is not sufficient; we need some kind of randomization or obfuscation of the true values, so that the true values are not recoverable but you still get aggregate utility out of the whole system. For binary values we looked at randomized response, flipping a coin and obfuscating your answer in that way, and we saw that this led to the notion of plausible deniability, which is a notion of privacy that people liked. And that led to the definition of differential privacy, which essentially says that you can't determine whether somebody was in that data set or not, so there's that plausible deniability: maybe 90% of these people cheated on their taxes, but you don't know whether I was in that data set, so you don't know with what probability I cheated on my taxes. That's the idea captured by this notion of differential privacy, which had a very specific definition, and three desirable properties that this definition does satisfy. Okay, so that's it. Yes, we meet on Friday. So, yes, that's it. See you Friday.
