Trust & Safety Guest Lecture Notes
Summary
These notes from a guest lecture cover trust & safety, focusing on how people misuse the internet and technology companies' efforts to prevent harm. The lecture notes discuss corporate responsibility, crisis sensitivity, regulation, and technological standards.
Full Transcript
Trust & Safety
ENGR 182 F23 - 10/18/2023
Alessia Zornetta, S.J.D. Candidate – UCLA Law

What is Trust & Safety?
"Trust and safety is the study of how people abuse the internet to cause real human harm, often using [online] products the way they are designed to work."
- For example, someone sending a threatening message through a messaging app is using the product as intended, yet still causing harm.

Trust and Safety is also a practice and a field within technology companies concerned with the reduction, prevention, and mitigation of online harms. Per the Trust & Safety Professional Association: "As internet communities, online services, and the use of digital technologies to mediate our daily lives and interactions have continued to grow, technology companies have needed to determine the kinds of content and behaviors that are appropriate and those that are not. The teams that handle this responsibility often fall under the general term 'trust and safety.'"

There are four main factors that drive trust & safety:
1) Corporate responsibility
2) Crisis sensitivity: many T&S departments were born in response to something bad happening. Companies start to notice problems, such as crypto scammers contacting their users or fake reviews directing users off the platform to hand over their credentials, and realize they need to take measures to keep users safe and satisfied with their product.
3) Regulation and regulatory pressure (e.g., EU DSA, UK Online Safety Act, Australia's Safety by Design framework): more and more laws around the world require online platforms of a certain size to establish trust and safety functions.
4) Technological standards applied through the stack (e.g., Apple's app rules)

Taxonomy
Companies have a wide variety of policies, and nearly as many different ways of classifying and grouping them. Despite this surface-level diversity, common themes appear across the industry, since the underlying human misbehaviors transcend any single company or product.
Violent & Criminal Behavior
- Dangerous Organizations: presence of criminal/dangerous groups and their members, and content/behavior that supports such groups
- Violence: directly threatening, supporting, or enabling acts of physical violence
- Child Abuse & Nudity: depicting or engaging in the abuse or exploitation of children
- Sexual Exploitation: depicting, threatening, or enabling sexual violence or exploitation
- Human Exploitation: engaging in or enabling the sale, enslavement, or coercion of people into laborious, dangerous, or illegal actions

Regulated Goods & Services
- Regulated Goods: sale/trade of goods which are generally regulated or banned
- Regulated Services: sale/trade of services which are generally regulated or banned
- Commercial Sexual Activity: depicting, soliciting, or offering sex acts or adult nudity in exchange for money

Offensive & Objectionable Content
- Hateful Content: expressing or encouraging hatred, contempt, discrimination, or violence against specific groups based on "protected classes," which include factors such as race, religion, gender, or sexual orientation
- Graphic & Violent Content: depictions of deaths, injuries, and other violent acts that are likely to shock, upset, or offend
- Nudity & Sexual Activity: depicting, soliciting, or offering non-commercial sex acts or adult nudity

User Safety
- Suicide and Self-Harm: content in which a person is harming or threatening to harm themselves, or enabling/encouraging others to do so
- Harassment and Bullying: intimidating, degrading, or humiliating a specific individual or group of people
- Dangerous Misinformation and Endangerment: content/behavior that might cause a user to unintentionally harm themselves or others
- Hateful conduct & slurs directed at a particular person

Scaled Abuse
- Spam: unsolicited and unwanted content/messaging, particularly commercial advertising, generated at scale
- Malware: links to malicious software that can negatively affect users
- Inauthentic Behavior: using fake accounts to deceive or manipulate users

Deceptive & Fraudulent Behavior
- Fraud: attempting to wrongfully or criminally deceive for financial benefit, or encouraging/supporting others to do so
- Impersonation: taking over the identity of another user or group
- Cybersecurity: attempting to compromise accounts or other sensitive information
- Intellectual Property: using trademarks or copyright-protected content without permission
- Defamation: damaging the good reputation of others

Platform-Specific Rules
- Format: rules on the form of content, often for clarity, organization, or technical limitations
  ○ Word limits, restrictions on links or shared files, insufficient detail
- Content Limitations: rules on where and how certain topics can be discussed
  ○ Off-topic content
  ○ Restrictions on selling/advertising
  ○ Spoilers
  ○ Trigger warnings
(A hypothetical sketch of how these macro-categories might be encoded in code appears after the overview below.)

A Brief Overview of T&S
- 1999: eBay was one of the first companies to use the term "trust and safety," in a press release.
- Soon after, in 2002, eBay formed the "Rules, Trust and Safety" team, aimed at identifying fraudulent activity before transactions between buyers and sellers were completed. Around the same time, academics picked up the term in a conference article.
- Over the years, companies started addressing trust and safety matters within existing departments such as operations, legal, and information security/cybersecurity.
- Today, T&S teams across the tech industry have different scopes, missions, and organizational structures.
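As a rough, hypothetical Python sketch of how a moderation system might encode the macro-categories above so that user reports and enforcement actions are tagged consistently: the category names come from the taxonomy, while the classes and fields are invented for illustration.

from dataclasses import dataclass
from enum import Enum

class AbuseArea(Enum):
    # Macro-categories taken from the taxonomy above
    VIOLENT_AND_CRIMINAL_BEHAVIOR = "Violent & Criminal Behavior"
    REGULATED_GOODS_AND_SERVICES = "Regulated Goods & Services"
    OFFENSIVE_AND_OBJECTIONABLE_CONTENT = "Offensive & Objectionable Content"
    USER_SAFETY = "User Safety"
    SCALED_ABUSE = "Scaled Abuse"
    DECEPTIVE_AND_FRAUDULENT_BEHAVIOR = "Deceptive & Fraudulent Behavior"
    PLATFORM_SPECIFIC_RULES = "Platform-Specific Rules"

@dataclass
class UserReport:
    content_id: str
    area: AbuseArea   # macro-category from the taxonomy
    policy: str       # specific policy, e.g. "Harassment and Bullying"
    note: str = ""

# Example: a report filed against a post for bullying
report = UserReport("post_12345", AbuseArea.USER_SAFETY, "Harassment and Bullying",
                    "Repeated insults aimed at one user")
print(f"{report.area.value} -> {report.policy}")

One practical upside of tagging reports this way is that enforcement statistics can later be aggregated by policy area.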
- As an academic topic, T&S shares borders with internet governance, internet policy, platform governance, disinformation, etc.

From Twitter to X

Content Moderation
From Free Speech Absolutism…
- Website administrators could not filter incoming messages
- Challenging to remove posted messages
- Belief in unfettered expression

CompuServe, Prodigy, and the moderators' dilemma?
In 1995, a New York state court in Stratton Oakmont, Inc. v. Prodigy Services Co. found the popular online service Prodigy liable for defamatory material posted to its "Money Talk" bulletin board. In the interest of maintaining a "family-friendly" service, Prodigy regularly engaged in content moderation, attempting to screen and remove offensive content. But because Prodigy exercised editorial control, like its print and broadcast counterparts, it was liable as a publisher of the defamatory content. The Prodigy decision came several years after a New York federal district court in Cubby, Inc. v. CompuServe Inc. dismissed a similar defamation suit against CompuServe, another popular, competing online service of the 1990s. Like Prodigy, CompuServe was sued over defamatory content published in a third-party newsletter, "Rumorville." Unlike Prodigy, however, CompuServe employees did not engage in any moderation practices, such as pre-screening. The district court rewarded CompuServe's hands-off approach, holding that CompuServe could not be liable because it was a mere content distributor. This left online services with two choices: avoid legal liability but give up control over the quality of what users posted, or attempt to clean up with the understanding that the service would be liable for anything that slipped through the cracks. This "moderator's dilemma" is what Section 230 was enacted to resolve.

…to Moderation
- Rise of hate speech, pornography, threats to user safety
- Website owners were forced to adopt a minimum degree of moderation in order to retain users
- Now:
  ○ Calls for greater moderation around illegal content
  ○ Mixed opinions about legal but harmful content

Two Main Approaches to T&S
How platforms and interactions are designed greatly impacts how users will interact with one another. Implementing safety-by-design principles helps meet regulatory compliance requirements. In addition, trust & safety should work directly with product teams when thinking about the types of interactions users will have on a given platform. The way we design spaces has a major effect on whether experiences end up positive or negative. It is much easier to show people what to do than to tell them what not to do. As such, prosocial tools are critical in enabling users to engage in a healthy way on a platform.
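To make the prosocial-tools point concrete, here is a small hypothetical Python sketch of a "reconsider before you post" nudge. The looks_hurtful check is a stand-in for whatever classifier a real platform would use, and the phrases are placeholders.

# Hypothetical "think before you post" nudge. A real platform would call a trained
# classifier here; this placeholder only flags a few obviously aggressive phrases.
AGGRESSIVE_PHRASES = ("nobody likes you", "you are worthless", "everyone hates you")

def looks_hurtful(text: str) -> bool:
    lowered = text.lower()
    return any(phrase in lowered for phrase in AGGRESSIVE_PHRASES)

def submit_post(text: str, user_confirmed: bool = False) -> str:
    # Nudge the user to reconsider instead of blocking the post outright
    if looks_hurtful(text) and not user_confirmed:
        return "NUDGE: this may come across as hurtful - edit, or post anyway?"
    return "POSTED"

print(submit_post("nobody likes you"))         # first attempt: the user sees the nudge
print(submit_post("nobody likes you", True))   # user confirms: the post goes through

The design choice here is to guide rather than block: the user keeps control, but the added friction gives them a moment to reconsider.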
Reactive
- Responding to user reports
- Flagging content

Proactive
- Content is uploaded by the user → content is made visible, content is removed, or the decision is escalated to human moderators

AI-based Content Moderation: Technologies used in T&S
"These tools can be deployed across a range of categories of content and media formats, as well as at different stages of the content lifecycle, to identify, sort, and remove content."
- Digital hash
- Image recognition
- Metadata filtering
- Natural language processing (NLP) classifiers
https://www.newamerica.org/oti/reports/everything-moderation-analysis-how-internet-platforms-are-using-artificial-intelligence-moderate-user-generated-content/how-automated-tools-are-used-in-the-content-moderation-process/

Shortcomings:
- Circumvention techniques: e.g., for audio and video hashes (by altering the length or encoding format of the file)
- Biases in training data: e.g., as a result of skewed representation of certain languages and geographic regions
- Lack of transparency in how databases are populated: but there are downsides to being too transparent…
- NLP classifiers struggle with nuance and can be under- or over-inclusive in their coverage
- Advances in large language models (LLMs) and generative AI technology: new and enhanced risks, but also potential for LLMs to assist in content moderation?

Models of CoMo
In "Content or Context Moderation?", Caplan proposes three distinct categories for classifying platforms' approaches to content moderation. In the years since the article was published, the landscape has evolved, and most major companies now incorporate variants of each of these three processes.
- Artisanal: in-house, on a case-by-case basis
- Community: decisions made by networks or committees of volunteer users
- Industrial: decisions made by specialized enforcement teams, with automated tools and contractors

Human Moderators → we will learn more about this during our technology & labor class, but in general it is enough to say that:
- Moderators' jobs have a tremendous mental health impact
- Moderators are asked to make challenging and complex decisions within seconds
- There is always a risk of bias
- In most cases, moderators receive inadequate support both during and after taking up the job
GAME: Moderator Mayhem

Human Moderators vs. AI
Automated review is generally much better at detecting problematic content in categories that are clearly defined and where there are many prior examples to match against. Categories that are difficult to define or for humans to agree on, and categories that are rare (with few examples to work from), are often challenging for automated approaches. (A minimal sketch of an automated-plus-human review pipeline appears at the end of this block, after the definitions below.)

Problematic Content
There are multiple types of problematic content; today we will quickly review these four macro categories.
- Dis-, misinformation & propaganda
- Harassment & hate speech
- Terrorism, radicalization, and extremism
- Child sexual abuse & exploitation

Dis-, Misinformation & Propaganda
- Disinformation: harmful or destructive content intended to influence an outcome
- Misinformation: information that contradicts or distorts common understandings of verifiable facts; the intent is inadvertent or unintentional
- Propaganda: information that can be true but is used to "disparage opposing viewpoints"; the systematic dissemination of information, especially biased or misleading information, to promote a political cause or point of view or to exert control
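Tying the pieces above together (proactive checks at upload, automated tools, escalation to human moderators), here is a minimal, hypothetical Python sketch of such a pipeline. The hash set, the classifier stub, and the thresholds are all placeholder assumptions, not any platform's actual system.

import hashlib

# Placeholder hash list; in practice populated from an internal or industry hash database
KNOWN_VIOLATING_HASHES = set()

def toxicity_score(text: str) -> float:
    # Stand-in for an NLP classifier that returns a confidence score in [0, 1]
    return 0.9 if "i will hurt you" in text.lower() else 0.05

def moderate_upload(data: bytes, caption: str) -> str:
    digest = hashlib.sha256(data).hexdigest()
    if digest in KNOWN_VIOLATING_HASHES:
        return "REMOVE: matches previously actioned content"
    score = toxicity_score(caption)
    if score >= 0.95:
        return "REMOVE: high-confidence classifier hit"
    if score >= 0.60:
        return "ESCALATE: send to a human moderator for review"
    return "PUBLISH: make the content visible"

print(moderate_upload(b"image bytes", "i will hurt you tomorrow"))  # -> ESCALATE
print(moderate_upload(b"image bytes", "look at my cat"))            # -> PUBLISH

The band between the two thresholds is where the human-vs-AI trade-off described above shows up: clear-cut cases are handled automatically, ambiguous ones go to people.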
Infamous Russian Troll Farm Appears to Be Source of Anti-Ukraine Propaganda (ProPublica)

Disinformation & genAI
- An image of Pope Francis dressed in a puffy, bright white coat went viral, as did images of Donald Trump being arrested and thrown in jail
- The images were not real; they were produced with genAI
- March 2023: a new era for mis- and disinformation → technological advances usher in new dynamics in the production and spread of false and misleading content

Issues with LLMs
- Hallucinations
- Scale: LLMs are good at producing text on demand and in ways that favor its dissemination among different audiences
- Direct access: users get a personalized, human-like, conversational way of being presented with content
- Users access LLMs without mediation
- It is possible to foresee a future in which generated results are indistinguishable from real pictures to the untrained (or even expert) user

What are possible mitigation strategies?
Individual level
- Improve individuals' ability to identify misinformation
  ○ E.g., pre-bunking/inoculation: Bad News Game (https://www.goviralgame.com/en)
  ○ Debunking
The goal of the game is to expose the tactics and manipulation techniques that are used to mislead people and build up a following. Bad News works as a psychological "vaccine" against disinformation: playing it builds cognitive resistance against common forms of manipulation that you may encounter online.
Systemic level
- Algorithms
- Business models: ad tech + support for reliable media
- Legislation

Terrorism, Radicalization & Extremism
- Terrorism: there are multiple ways to define terrorism and no commonly agreed-upon definition. Overall, in this context, it is enough to say that terrorism is often aimed at generating fear through violent acts.
- Radicalization: aimed at "change in beliefs, feelings, and behaviors in directions that increasingly justify intergroup violence and demand sacrifice in defense of the group" (McCauley and Moskalenko)
- Extremism: a belief system held together by an unwavering hostility towards a specific "out-group"

How do platforms play a role? Two main approaches:
1. Reactive
   - Content removal
   - Account suspensions
   - Counter-speech / counter-activism
2. Proactive
   - Counter-messaging
   - Awareness-raising
   - Education

What about AI?
- Issues: increased speed and ease of generating extremist content/disinformation; deepfakes
- Solutions: train AI models to spot manipulated video; hashing databases (e.g., GIFCT)

Child Sexual Abuse & Exploitation
U.S. federal law requires that U.S.-based ESPs report instances of apparent child pornography that they become aware of on their systems to NCMEC's CyberTipline. NCMEC works closely with ESPs on voluntary initiatives that many companies choose to engage in to deter and prevent the proliferation of online child sexual exploitation images. To date, over 1,400 companies are registered to make reports to NCMEC's CyberTipline and, in addition to making reports, these companies also receive notices from NCMEC about suspected CSAM on their servers.

What happens once a company identifies CSAM content?
Source: https://www.apple.com/child-safety/pdf/Expanded_Protections_for_Children_Technology_Summary.pdf

Tech Solutions 1) Apple
Apple's method of detecting known CSAM is designed with user privacy in mind. Instead of scanning images in the cloud, the system performs on-device matching using a database of known CSAM image hashes provided by NCMEC and other child safety organizations. Apple further transforms this database into an unreadable set of hashes that is securely stored on users' devices.
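Apple's actual system relies on a proprietary perceptual hash plus cryptographic protections, so the following is only a generic illustration of the perceptual-hashing idea behind matching uploads against a database of known image hashes: an average-hash sketch in Python that assumes the Pillow imaging library, not Apple's method.

from PIL import Image

def average_hash(path: str, grid: int = 8) -> int:
    # Shrink to a small grayscale grid, then set one bit per cell:
    # 1 if the cell is brighter than the grid's mean, 0 otherwise.
    img = Image.open(path).convert("L").resize((grid, grid))
    pixels = list(img.getdata())
    mean = sum(pixels) / len(pixels)
    bits = 0
    for value in pixels:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits

def hamming_distance(a: int, b: int) -> int:
    # Number of differing bits; a small distance suggests visually similar images
    return bin(a ^ b).count("1")

# Usage sketch: compare an upload against the hash of known content; a distance of a
# few bits usually means the same image after resizing or re-encoding.
# known = average_hash("known_image.jpg")
# upload = average_hash("uploaded_image.jpg")
# print(hamming_distance(known, upload))

Unlike a cryptographic hash, which changes completely if a single byte changes, a perceptual hash like this degrades gradually, which is what lets hash databases catch re-encoded copies and what the circumvention techniques mentioned earlier try to defeat.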
Tech Solutions 2) Digital Hash & Image Recognition
https://www.newamerica.org/oti/reports/everything-moderation-analysis-how-internet-platforms-are-using-artificial-intelligence-moderate-user-generated-content/how-automated-tools-are-used-in-the-content-moderation-process/
Digital hash technology works by converting images and videos from an existing database into a grayscale format. It then overlays them onto a grid and assigns each square a numerical value. The designation of a numerical value converts the square into a hash, or digital signature, which remains tied to the image or video and can be used to identify other iterations of the content, either during ex-ante moderation or ex-post proactive moderation.
Shortcomings:
- Circumvention techniques: e.g., for audio and video hashes (by altering the length or encoding format of the file)

What about unknown CSAM?
- New CSAM is a growing issue
- A portion of new content is identified through user reports
  ○ Or by identifying a cache or network dedicated to CSAM
- Other new CSAM can be identified through automated systems, though these are not nearly as widespread as hash matching
- Difficulty in creating new classifiers:
  ○ Legal restrictions regarding possession of CSAM: a data set of labeled images/videos is needed to train an ML model and to test it for accuracy, and data labelers would then review images in successive iterations of the model to assess and improve its performance → special arrangements are needed to allow developers access to the imagery, and there is still a huge risk of trauma for labelers
  ○ Some of the hesitancy in creating CSAM classifiers for trust & safety applications may also lie in the mismatch between the high error rates of classifiers trained in less-than-ideal processes (compared to hash matching) and the severity of the consequences for a user who is identified as uploading CSAM

Harassment & Hate Speech

Harassment & Bullying
- "Interpersonal aggression or offensive behavior(s) that is communicated over the internet or through other electronic media." (Slaughter & Newsman 2022)
- "Bullying and harassment happen in many places and come in many different forms, from making threats and releasing personally identifiable information to sending threatening messages and making unwanted malicious contact." (Meta)
- "Bullying can be defined in many ways, but it typically involves targeted, repeated behavior that intends to cause physical, social and/or psychological harm. The behavior can be carried out online or offline by an individual or a group who misuse their power, or perceived power, over another person or group of people who feel unable to stop it from happening." (TikTok)

Hate Speech (Online)
"A kind of speech act that contextually elicits certain detrimental social effects that are typically focused upon subordinated groups in a status hierarchy" (Demaske, 2021, p. 1014; United Nations, 2019, p. 2)
- Explicit: directly identifies the target group and uses explicit attacks against it
- Implicit: may not directly identify the target group (e.g., "terrorist sympathizers" instead of calling out "Muslims" in Islamophobic content) and uses implicit language rather than an explicit slur or attack (e.g., "cut down the tall trees" has a specific meaning in Rwanda). Sometimes implicit hate speech is context specific (e.g., "Hispanics live in sewage" is an implicit form of dehumanization).
- Categories: religious, racist, gender & sexuality
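To see why the explicit/implicit distinction above is hard for automated tools, here is a tiny hypothetical Python sketch: a keyword filter catches explicit terms from a curated blocklist but lets coded, context-dependent phrases pass straight through, which is one reason classifiers and human review are still needed. The blocklist entry is a placeholder, and the implicit examples echo the ones quoted above.

# Hypothetical keyword filter: catches explicit terms from a blocklist but misses
# implicit, coded, or context-dependent hate speech entirely.
EXPLICIT_TERMS = {"<explicit slur>"}  # placeholder; real lists are curated per language and market

def keyword_filter(text: str) -> bool:
    lowered = text.lower()
    return any(term in lowered for term in EXPLICIT_TERMS)

examples = [
    "direct attack using <explicit slur>",   # explicit: caught by the blocklist
    "cut down the tall trees",               # implicit, context-specific: missed
    "they all live in sewage",               # implicit dehumanization: missed
]
for text in examples:
    print(keyword_filter(text), "-", text)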
There are a few factors that make hate speech online particularly challenging:
1. Anonymity of the speaker: even when the speaker is not anonymous, there is less risk to the speaker than when speaking directly in public, especially in the speaker's own community.
2. Mobility and reach: speech can be produced in one part of the world and copied and disseminated easily.
3. Durability: sometimes speech can disappear, and other times it can live forever, as it can be difficult to track down across platforms.
4. Size of audience: online hate speech can reach a much wider, or a more niche, audience much more easily than offline speech.
5. Ease of access: there is no need to join a group or physically opt to be in a space accepting of hate speech.

Hate speech is not illegal everywhere. For example, in the US, racist, sexist, and other hateful speech online is not a crime. HOWEVER, it can matter in criminal settings for categorizing another crime as a hate crime. The EU and its Member States, on the other hand, have much more stringent laws that make certain speech a crime in itself.

Example: ITALY
- Art. 416 c.p. (criminal conspiracy) → was used to sanction online hate speech propaganda by a neo-Nazi group
- Law 645/1952 (criminalizes apology of fascism) → is used to counter fascist movements and content spread online
- Law 115/2013 (criminalizes gender-based online violence, i.e., stalking) → introduced prosecution of persecutory conduct carried out through IT and/or electronic means

Regulation

United States: initial leader
Initially, §230 and other American laws facilitated the development of the Internet as we know it:
- Intermediary liability
- Intellectual property
- Privacy
- Corporate First Amendment

§230(c) Protection for "Good Samaritan" blocking and screening of offensive material
(1) Treatment of publisher or speaker: No provider or user of an interactive computer service shall be treated as the publisher or speaker of any information provided by another information content provider.
(2) Civil liability: No provider or user of an interactive computer service shall be held liable on account of any action voluntarily taken in good faith to restrict access to or availability of material that the provider or user considers to be obscene, lewd, lascivious, filthy, excessively violent, harassing, or otherwise objectionable, whether or not such material is constitutionally protected; or any action taken to enable or make available to information content providers or others the technical means to restrict access to material described in paragraph (1).

🇪🇺 European Union: new trend setter
Digital Services Act objectives:
- Create a safer online environment
- Define responsibilities for platforms (marketplaces & social media)
- Deal with digital challenges: illegal products, hate speech & disinformation
- Transparent data reporting & oversight
The DSA does NOT mandate that platforms take down legal content.

Brussels Effect
Companies have an economic incentive to follow EU law:
- Avoid fines (up to 10% of global turnover)
- Lucrative market (400 million population)
By being the first economic superpower to pass a tech law reform, the EU gained an incredible advantage: companies do not want to comply with a patchwork of requirements, and it is easier to have one rule for all users (see cookie warnings).