Prof. Dr. rer. nat. Anne Lauscher
Ethics and Modern AI
Lecture 3: Machine Learning I

Trigger warning: this presentation contains content which might be offensive to some listeners.

Organizational Notes: Behavior in Case of Fire

Fire prevention: what should you NOT do in the (university) building?
• Do not smoke in the building!
• Do not light any (camp)fires!
• Dispose of flammable materials properly!

How do I behave in case of fire? Most important rule: STAY CALM!!!

Reporting a fire:
1) If possible, report the fire via a manual call point.
2) Call the fire brigade at the emergency number 112. But what do I say?
   1. Where is the fire? (street, house number, district, building, floor, room number)
   2. What is burning? (type of fire, cause of fire)
   3. How much is burning? (extent of the fire)
   4. Which hazards? (further details, e.g., hazardous materials)
   5. Wait for questions! The emergency call center ends the call!
3) Inform the service personnel!

Prevent the spread of fire and smoke, if possible! One rule always applies: DO NOT PUT YOURSELF IN DANGER!

Escape and rescue routes:
1. Look for the nearest escape route and keep it clear!
2. Do not use the elevators!
3. After leaving the building, wait for the fire brigade at an assembly point!
4. Warn other people you encounter!

Attempting to extinguish a fire: use a fire extinguisher, use a fire hose, use other firefighting equipment (e.g., extinguishing sand). One rule always applies: don't play the hero! Safety first!

Recap

Chris asks Apple's Siri. Issues?
• Fairness and Trust
• Privacy and Data Protection
• Dual Use and Misuse
• Transparency
• Environmental Aspects

(Normative) Ethics
A philosophical discipline that studies morality (the right and wrong)
• Normative: not exclusively about what is the case, but about what should or ought to be. In other words: prescriptive.
• Reasons and normative explanations: tries to systematically explain why you ought or ought not to do something
• Deals with human functions or properties: actions, intentions, beliefs
• Aim: develop a theory of what is right or wrong

Areas
• Meta-ethics: What is the nature of ethical theories and moral judgements? Deals with issues about ethics.
• Normative ethics: How can we determine a moral course of action? Ethical theories, e.g., virtue ethics.
• Applied ethics: What should we do in a specific situation? E.g., machine ethics.
• (Descriptive ethics): What are individuals' moral beliefs? Empirical study.

Moral Relativism vs. Moral Universalism
• Moral relativism: cultures and times are extremely diverse, so there are no universal moral standards. "Torturing innocent persons is wrong." True? Wrong? That's relative! (On the individual level and on the cultural level.)
• Moral universalism: cultures and times are diverse; still, there are universal moral standards. "Torturing innocent persons is wrong." Either true or false.
But if morals are relative … how can we do normative ethics?
Possible solution: there are degrees of universality.
[Image: El sacrificio de Isaac, Domenichino]

Ethical Theories
"Frameworks" for ethical reasoning:
• Divine Command Theory
• Natural Law Theory
• Virtue Ethics
• Consequentialism: Simple Egoism, Utilitarianism
• Deontological Ethics: Kantianism

Ethics of AI
• Moral behavior of humans as they design, make, use, and treat artificially intelligent systems
• Concern with the behavior of machines (machine ethics)
• Deals with the issue of a possible singularity due to superintelligent AI
Isaac Asimov's 1942 short story "Runaround": the three laws of robotics.

AI Ethics Guidelines
In a review of 84 ethics guidelines for AI, 11 clusters of principles were found: transparency, justice and fairness, non-maleficence, responsibility, privacy, beneficence, freedom and autonomy, trust, sustainability, dignity, solidarity.
Jobin, Anna; Ienca, Marcello; Vayena, Effy (2019). "The global landscape of AI ethics guidelines". Nature Machine Intelligence 1(9): 389–399. arXiv:1906.11668. doi:10.1038/s42256-019-0088-2.

Ethics Washing?
The practice of feigning ethical consideration to improve how a person or organization is perceived.

Questions?

Learning goals. After this session, you will …
• Know what Machine Learning (ML) is and which different types of ML exist
• Understand the principles behind supervised learning, in particular, classification
• Be able to apply an example classification algorithm, k-NN
• Understand how to evaluate classification algorithms
• Understand how to obtain data for training and evaluating classification

Currently, MOST AI systems are based on Machine Learning!
Modern AI is mainly based on Machine Learning: data (examples) are fed into algorithms and methods for discovering patterns, dependencies, hidden structures, and relationships, and the result is a model.

Types of Machine Learning
− Supervised: learn a function (e.g., a spam filter) from manually labeled training examples (e.g., spam and non-spam emails)
− Unsupervised: discover patterns in sets of unlabeled data (e.g., documents with similar topics)
− Self-supervised learning: supervision is provided by the method itself
− Reinforcement learning: uses cumulative reward/punishment as a form of indirect supervision

Supervised Machine Learning: Classification
Training data (labeled examples) are turned into a representation (feature vectors), here ⟨color, edges⟩:
(⟨yellow, 0⟩, A), (⟨red, 4⟩, A), (⟨green, 4⟩, A), (⟨violet, 3⟩, B), (⟨blue, 4⟩, B), (⟨red, 5⟩, B)
From these, a classifier is learned (on the slide, a small decision diagram testing whether the number of edges is even or odd and whether the color is blue or not blue) and applied to test data (unlabeled examples): the test instance (⟨violet, 6⟩, ?) receives the label (⟨violet, 6⟩, A).

Supervised Machine Learning: Classification
Given:
• A training data set D of labeled instances (or examples), with each labeled instance ⟨xi, ci⟩ ∈ D ⊆ X×C, where
  • Each x ∈ X is an example or instance
  • X is the instance language or instance space
  • Each instance is represented as an n-ary feature vector
  • Each c ∈ C is a target label from a set of classes C = {c1, …, cj}
Goal:
• Learn from the training data a classifier f: X → C that maps an instance x to a class f(x) ∈ C
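To make the toy example concrete, here is a minimal Python sketch. The instances and labels are taken from the slide; the exact decision rules (first test whether the number of edges is even, then whether the color is blue) are one plausible reading of the classifier depicted there, so treat them as an assumption.

```python
# Toy classification example from the slide, as a pure-Python sketch.
# Features: <color, number of edges>; classes: A and B.
training_data = [
    (("yellow", 0), "A"), (("red", 4), "A"), (("green", 4), "A"),
    (("violet", 3), "B"), (("blue", 4), "B"), (("red", 5), "B"),
]

def classify(color, edges):
    """Hand-written rules reconstructed from the slide's classifier."""
    if edges % 2 == 0:                      # even number of edges
        return "B" if color == "blue" else "A"
    return "B"                              # odd number of edges

# The rules reproduce every training label ...
assert all(classify(*features) == label for features, label in training_data)
# ... and label the test instance <violet, 6> as class A, as on the slide.
print(classify("violet", 6))  # -> A
```

Note that this hand-written rule is itself a tiny decision tree (a model class that returns later in this lecture): a test on the number of edges at the root, with a test on the color below the "even" branch.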
Supervised Machine Learning: Regression
Given:
• A training data set D of labeled instances (or examples), with each labeled instance ⟨xi, yi⟩ ∈ D ⊆ X×ℝ, where
  • Each x ∈ X is an example or instance
  • X is the instance language or instance space
  • Each instance is represented as an n-ary feature vector
  • Each y ∈ ℝ is a continuous target label
Goal:
• Learn from the training data a regressor f: X → ℝ that maps an instance x to a continuous value f(x) ∈ ℝ

Text Categorization
Which language?! English? German? Spanish?
Which topic?! Politics? Sports?

Spam Filtering
From: "" <[email protected]>
Subject: real estate is the only way... gem oalvgkay
"Anyone can buy real estate with no money down. Stop paying rent TODAY! There is no need to spend hundreds or even thousands for similar courses. I am 22 years old and I have already purchased 6 properties using the methods outlined in this truly INCREDIBLE ebook. Change your life NOW! Click below to order: http://www.wholesaledaily.com/sales/nmd.htm"

k-Nearest Neighbors (kNN) Algorithm
Given an instance x and the training data X:
− Identify the set Xk of the k nearest neighbors of x, i.e., the k training instances that are closest to x
− For each class c ∈ C:
  • Compute N(Xk, c), the number of Xk members that belong to class c
  • Estimate the probability P(c|x) as N(Xk, c) / k
− Classify x to the majority class of the Xk members
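The kNN procedure above translates almost line by line into code. The following is a minimal sketch, assuming numeric feature vectors and Euclidean distance; the data points are made up for illustration, and the text-classification setting discussed next would instead use cosine distance over tf-idf vectors.

```python
import math
from collections import Counter

def knn_classify(x, training_data, k):
    """training_data: list of (feature_vector, class_label) pairs."""
    # 1) Identify X_k, the k training instances closest to x.
    x_k = sorted(training_data, key=lambda pair: math.dist(x, pair[0]))[:k]
    # 2) For each class c, count N(X_k, c) and estimate P(c|x) = N(X_k, c) / k.
    counts = Counter(label for _, label in x_k)
    probabilities = {c: n / k for c, n in counts.items()}
    # 3) Classify x to the majority class of the X_k members.
    return counts.most_common(1)[0][0], probabilities

data = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((5.0, 5.0), "B"),
        ((4.8, 5.2), "B"), ((5.1, 4.9), "B")]
print(knn_classify((4.0, 4.5), data, k=3))  # -> ('B', {'B': 1.0})
```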
kNN – Example: 5NN (k = 5)
c(new point) = ? With the classes Government, Science, and Arts, this is an n-ary classification problem: the new point is assigned the majority class among its 5 nearest neighbors.

kNN for Text Classification
− We can use kNN to classify documents: this is an example of vector space classification
− Vector Space Model (VSM): each document is a vector, with one component for each term (= word)
− High-dimensional vector space:
  • Terms are axes
  • Documents are vectors in this space

Binary term-document incidence matrix: each document is represented by a binary vector ∈ {0,1}^|V|.
Term-document count matrices: each document is represented by a count vector ∈ ℕ^|V|.

tf-idf weighting
• The tf-idf weight of a term is the product of its tf weight and its idf weight; it is the best-known weighting scheme in information retrieval
• tf increases with the number of occurrences of the term within a document; idf increases with the rarity of the term in the collection
• Note: the "-" in tf-idf is a hyphen, not a minus sign! Alternative names: tf.idf, tf x idf

Binary → count → weight matrix: each document is now represented by a real-valued vector of tf-idf weights ∈ ℝ^|V|.

Classification Using Vector Spaces
− The training data consist of a set of documents, each labeled with its class (e.g., topic)
− In vector space classification, this set corresponds to a labeled set of points (or, equivalently, vectors) in the vector space
− Premise 1: documents in the same class form a contiguous region of space (contiguity hypothesis)
− Premise 2: documents from different classes don't overlap (much)
− The goal of the learning algorithm is to define surfaces (decision boundaries) that delineate the classes in the space

Documents in a vector space (classes: Government, Science, Arts). A test document of what class? Its neighborhood suggests: test document = Government. Q: How can we use the labeled data to find good separators?

kNN partitions the instance space into cells (Voronoi cells) enclosing the points labelled as belonging to the same class.

Nearest-neighbor learning algorithm
− Training algorithm:
  • For each training example ⟨xi, ci⟩, add the example to the list
− Classification algorithm:
  • Given a test instance x to be classified
  • Let Xk = {⟨x1,c1⟩, …, ⟨xk,ck⟩} be the k instances which are nearest to x
  • Assign x the majority class: f(x) = argmax_{c ∈ C} Σ_{i=1..k} δ(c, ci), where δ(a,b) = 1 if a = b, else δ(a,b) = 0 (Kronecker function)

Nearest-neighbor algorithm: pros and cons
− Learning/training is simply storing the representations of the training examples (also called memory-based or lazy learning). Q1: Why is this good?
− Testing requires computing the similarity between an instance and all examples in the dataset. Q2: Why is this bad?
− Q3: How can we compute similarity?

Similarity metrics
− The standard similarity for document vectors is the cosine similarity of tf-idf weighted vectors: sim(d1, d2) = V(d1)·V(d2) / (|V(d1)| |V(d2)|)
− We can then define distance on the basis of similarity
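As a sketch of the representation just described, the following computes tf-idf vectors and their cosine similarity in pure Python. The slides do not fix a particular tf-idf variant, so the raw term frequency and idf = log10(N/df) used here are assumptions, and the three mini "documents" are invented.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each whitespace-tokenized document to a tf-idf weight vector."""
    tokenized = [doc.split() for doc in docs]
    n = len(tokenized)
    vocab = sorted({term for doc in tokenized for term in doc})
    df = {t: sum(t in doc for doc in tokenized) for t in vocab}  # document frequency
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        # Weight grows with in-document frequency (tf) and term rarity (idf).
        vectors.append([tf[t] * math.log10(n / df[t]) for t in vocab])
    return vocab, vectors

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

docs = ["government budget vote",
        "science lab experiment",
        "government election vote vote"]
vocab, vecs = tfidf_vectors(docs)
print(round(cosine(vecs[0], vecs[2]), 3))  # shared terms -> similarity > 0
print(cosine(vecs[0], vecs[1]))            # no shared terms -> 0.0
```

Distance can then be defined as, e.g., 1 minus the cosine similarity, which is what the kNN sketch earlier would use in this text setting.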
Illustration of 3 Nearest Neighbors for a text vector space.

Choosing the k
− Pick k not too large, but not too small (depends on the data)
− Evaluate for k ∈ {1, 3, 5, …} on a development set
− Find k such that performance is maximized (or the error rate is minimal)
− k is kept odd so as to avoid ties

kNN: Discussion
Key ideas:
• Similar examples have similar labels
• Classify new examples like similar training examples ⟹ no training necessary
Open questions:
• How to determine similarity? ⟹ The similarity measure must "match" the target function
• How many similar training examples to consider?
  ⟹ Increasing k makes kNN less sensitive to noise
  ⟹ Decreasing k allows capturing finer structure of the space

Decision Trees
What is a decision tree? A predictive model based on a hierarchy of Boolean tests:
• Each internal node tests an attribute
• Each branch corresponds to an attribute value
• Each leaf assigns a label

Classifying unseen instances: "Play Tennis?"
• Outlook = sunny → test Humidity: high → No; normal → Yes
• Outlook = overcast → Yes
• Outlook = rain → test Wind: strong → No; weak → Yes
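The "Play Tennis?" tree above can be encoded directly as nested (attribute, branches) pairs, and classification is just a walk from the root to a leaf. A minimal sketch; the tree structure itself is taken from the slide.

```python
# Each internal node is a pair (attribute, branches); each branch maps an
# attribute value to a subtree; each leaf is a plain string label.
decision_tree = ("Outlook", {
    "sunny":    ("Humidity", {"high": "No", "normal": "Yes"}),
    "overcast": "Yes",
    "rain":     ("Wind", {"strong": "No", "weak": "Yes"}),
})

def classify(tree, instance):
    """Follow the branch matching the instance's value until a leaf is reached."""
    while not isinstance(tree, str):
        attribute, branches = tree
        tree = branches[instance[attribute]]
    return tree

unseen = {"Outlook": "sunny", "Humidity": "normal", "Wind": "weak"}
print(classify(decision_tree, unseen))  # -> Yes
```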
Break: 5 minutes

Evaluating Classification

Classifier testing
We are ultimately interested in how the classifier works on new, previously unseen instances/examples; in other words, the classifier must be able to generalize. To test how well the classifier generalizes, we split the labeled data set into a training set and a test set (held-out data). The classifier does NOT use the test set for training. Once trained, we run and evaluate the classifier on the test set; the error on the test set is called the generalization error. Generally, the performance on the test set will be worse than on the training set, but this is a much more realistic performance estimate.

Classifier evaluation
To measure how well a classifier will work on unseen data, we have to evaluate it on the test set. Standard evaluation measures:
– Accuracy
– Precision, Recall, F-score
The classifier is often compared against a baseline (a simple method that can easily be implemented). Typical baselines:
– Majority class classifier (MCC)
– Random classifier
– A very simple rule-based classifier
– A very stripped-down version of the real classifier
To prove that one classifier is better than another or than the baseline, we have to perform statistical significance tests.

Confusion Matrix

Actual \ Predicted | Yes             | No
Yes                | True Positives  | False Negatives
No                 | False Positives | True Negatives

Given a target class, each instance is either positive or negative (the algorithm classifies the data point as belonging to the class or not), and each decision is either "true" (correct) or "false" (incorrect).

Evaluation measures for classification: summary
Accuracy
– Q: Accuracy is not to be used if the class distribution is skewed (too many negative examples). Why?
Precision (P), Recall (R), and F1-score
– Used if the classification of positive instances (y = 1) is what we are interested in

Evaluation measures for multi-class classification
If we have K > 2 classes, we obtain a K×K confusion matrix (e.g., for K = 3; rows = predicted labels, columns = actual labels). We can derive a 2-way confusion matrix for each class. Two options for aggregation: micro and macro measures.

Macro measures
Compute Acc, P, R, and F1 for each class separately, then average the measures across all classes:

P_macro = (1/|C|) Σ_{i=1}^{|C|} TP_i / (TP_i + FP_i)
R_macro = (1/|C|) Σ_{i=1}^{|C|} TP_i / (TP_i + FN_i)

If the classifier performs very poorly on one of the classes, this can have a big effect on the average score.

Micro measures
Obtain global TP, FP, FN, and TN counts by summing the 2-way confusion matrices of all classes; from the aggregated 2-way confusion matrix we can compute the measures in the usual way:

P_micro = Σ_{i=1}^{|C|} TP_i / Σ_{i=1}^{|C|} (TP_i + FP_i)
R_micro = Σ_{i=1}^{|C|} TP_i / Σ_{i=1}^{|C|} (TP_i + FN_i)

Note:
– In this aggregation, always FP = FN, thus micro P, R, and F1 are the same
– Micro and macro accuracy are equal
– Commonly micro F1 > macro F1, because classifiers tend to fail on classes with fewer instances, and such classes impact the micro average less
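Here is a sketch of these measures on an invented six-instance, three-class example: per-class TP/FP/FN counts are derived from the gold and predicted labels, macro measures average the per-class scores, and micro measures aggregate the counts first.

```python
gold = ["A", "A", "A", "B", "B", "C"]   # invented toy labels
pred = ["A", "B", "A", "B", "B", "B"]

classes = sorted(set(gold))
tp = {c: sum(g == p == c for g, p in zip(gold, pred)) for c in classes}
fp = {c: sum(p == c and g != c for g, p in zip(gold, pred)) for c in classes}
fn = {c: sum(g == c and p != c for g, p in zip(gold, pred)) for c in classes}

def precision_recall_f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)

# Macro: compute the measures per class, then average across classes.
per_class = [precision_recall_f1(tp[c], fp[c], fn[c]) for c in classes]
macro_p = sum(p for p, _, _ in per_class) / len(classes)
macro_r = sum(r for _, r, _ in per_class) / len(classes)

# Micro: sum the per-class counts, then compute the measures once.
micro_p, micro_r, micro_f1 = precision_recall_f1(
    sum(tp.values()), sum(fp.values()), sum(fn.values()))

print(f"accuracy={accuracy:.3f}")
print(f"macro  P={macro_p:.3f}  R={macro_r:.3f}")
print(f"micro  P={micro_p:.3f}  R={micro_r:.3f}  F1={micro_f1:.3f}")
# As noted above, sum(fp) == sum(fn), so micro P = R = F1 (= accuracy here).
```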
Data Annotation & Agreement

Dataset annotation
We often need to manually label data for model training and evaluation; we call this process data annotation (e.g., labeling emails as SPAM, or rating how similar two words are). The crucial question is: are the annotations correct, i.e., do they make sense? Often the task is subjective and there is no "ground truth". Instead of measuring correctness, we therefore measure annotation reliability: do humans consistently make the same decisions? (Assumption: high reliability implies validity.) Reliability is measured via inter-annotator agreement (IAA).

Inter-annotator agreement (IAA)
Have two or more annotators annotate the same data, independently but according to the same guidelines. Measure the IAA via:
– Agreement %
– Cohen's Kappa
– Fleiss' Kappa
– P, R, F1 (with the labels of one annotator taken as the true labels)

Example 1: labeling relevant documents for a relevance dataset. Agreement is 92.5%. Is this good (enough)?

Number of docs | Judge 1     | Judge 2
300            | Relevant    | Relevant
70             | Nonrelevant | Nonrelevant
20             | Relevant    | Nonrelevant
10             | Nonrelevant | Relevant

Example 2: word relatedness judgements. Here, agreement is 70% instead. Again, is this good (enough)? Note that this time the coders agree on 6 "No" judgements and only 1 "Yes" judgement.

Agreement
Suppose annotators are notoriously uninterested and label instances at random. Q: How much agreement can we expect in this case?
– Both choose "Yes": 0.5 · 0.5 = 0.25
– Both choose "No": 0.5 · 0.5 = 0.25
⟹ 50% chance agreement!
Q: Suppose both annotators label 95% of instances as "No". What is the chance agreement now?
– Both choose "Yes": 0.05 · 0.05 = 0.0025
– Both choose "No": 0.95 · 0.95 = 0.9025
⟹ 90.5% chance agreement!
Obviously, IAA measures must be corrected for chance agreement.

Cohen's Kappa
Cohen's Kappa computes the chance-corrected IAA:

κ = (Ao − Ae) / (1 − Ae)

Ao is the observed agreement; Ae is the agreement expected by chance. Both are computed from the confusion matrix of the two annotators:
Observed agreement: Ao = p11 + p22
Expected agreement: Ae = p1· p·1 + p2· p·2
Kappa is in the range [-1, 1] (0 is chance agreement, 1 is perfect agreement).

Cohen's kappa: labeling relevant docs for the IR dataset

A1 \ A2      | Relevant | Not relevant
Relevant     | 300      | 20
Not relevant | 10       | 70

Ao = p11 + p22 = 300/400 + 70/400 = 370/400 = 0.925
Ae = p1· p·1 + p2· p·2 = (320/400)·(310/400) + (80/400)·(90/400) = 0.665
κ = (Ao − Ae) / (1 − Ae) = (0.925 − 0.665) / (1 − 0.665) ≈ 0.776

Cohen's kappa can be computed analogously for the word relatedness judgements.

Cohen's kappa: interpretation
There is no agreed-upon interpretation.
Krippendorff (1980):
– κ < 0.67 – discard
– κ ≥ 0.67 – tentative agreement
– κ ≥ 0.8 – good agreement
Landis and Koch (1977):
– κ < 0.2 – slight
– κ ≥ 0.2 – fair
– κ ≥ 0.4 – moderate
– κ ≥ 0.6 – substantial
– κ ≥ 0.8 – perfect

Cohen's kappa: extensions
Multiple classes:
– Use a multi-way confusion matrix
Weighted kappa:
– If not all errors are equally severe, we can assign weights to the different types of errors
– E.g., annotating word relatedness with yes, maybe, and no: w(yes, no) = 1, w(yes, maybe) = 0.5, w(maybe, no) = 0.5
Multiple annotators:
– Option 1: compute pairwise Cohen's kappas and then average
– Option 2: use Fleiss' kappa
On-line kappa calculator: http://vassarstats.net/kappa.html

Causes of low IAA
1. Random errors (slips)
2. Misinterpretation of the annotation guidelines
3. Different intuitions
4. The task is subjective
   – Objective tasks: POS tagging, parsing, NER, etc.
   – Subjective tasks: semantic tasks, sentiment analysis, etc.
Ideally, we wish to eliminate the first three causes. The remaining disagreement is due to subjectivity and indicates the intrinsic difficulty of the task.

IAA and ML
IAA defines the upper bound (topline) of an ML method; the baseline defines the lower bound. A model trained on low-quality annotations will not work well (the "garbage in, garbage out" effect): the ML model will learn to make the same mistakes, or will simply be confused by inconsistent labels (so-called "teacher noise"). If the IAA is too low, you should revise/aggregate the annotations:
– Enforce consensus or resolve disagreements via a third party (good for objective tasks)
– Average over annotations (good for subjective tasks)
The revised dataset is called the gold (or "ground truth") set.

Typical data annotation workflow
1. Prepare the dataset and split it into a calibration set and a production set
2. Define annotation guidelines
3. Calibration (iterate until the IAA is satisfactory):
   a) Annotators independently annotate the calibration set
   b) Compute the IAA
   c) Discuss the disagreements and revise the guidelines (if necessary)
4. Production:
   a) Annotators independently annotate the production set (different portions)
   b) If portions overlap, compute the IAA (this is the IAA to report)
   c) Obtain the gold standard by aggregation, resolution, or consensus

Questions?

Learning goals. Now, you …
• Know what Machine Learning (ML) is and which different types of ML exist
• Understand the principles behind supervised learning, in particular, classification
• Are able to apply an example classification algorithm, k-NN
• Understand how to evaluate classification algorithms
• Understand how to obtain data for training and evaluating classification

Next: Machine Learning II
