Datafication PDF
University of Fribourg
2024
Philippe Cudré-Mauroux
Summary
This presentation explores datafication, a technological trend transforming various aspects of life into data. The presentation covers the history of big data, its applications to business and models, and its use in various sectors like research and transportation. It also examines the challenges and opportunities of this data revolution and analyses the 3Vs (Volume, Velocity and Variety).
Full Transcript
Datafication
Philippe Cudré-Mauroux
eXascale Infolab, University of Fribourg, Switzerland
UNIFR, October 1, 2024

eXascale Infolab (XI)
New lab @ U. of Fribourg–Switzerland
Data Infrastructures for social / scientific / AI applications
https://exascale.info/

Big Data & Me
My lab @ unifr: eXascale Infolab (http://exascale.info/)
– Previously: M.I.T. (Stonebraker’s lab), EPFL (Best PhD Award), U.C. Berkeley
– Industry also (IBM Watson Research, HP, Microsoft Research Asia, Microsoft CISL, Scigility, Dashcom)
– Teach Big Data at Swiss Joint MSc in CS, Royal Institute of Tech. (Sweden), IIMT, EPFL, HEC Lausanne, HEG Fribourg
– National Research Council, responsible for Applied Computer Science
→ How to store and manage Big Data
2M € ERC, SNF, Haslerstiftung, H2020
Verisign, SAP, Microsoft, Amazon, Google, ArmaSuisse

On the Menu Today
Datafication!
“Datafication is a technological trend turning many aspects of our life into data which is subsequently transferred into information realised as a new form of value.” [wikipedia 09/24]
– A brief history of Big Data
– Datafication (business)
– Datafication (models)
– Some thoughts on data + AI in Switzerland (… and abroad)

Exascale Data Deluge
New data formats
New machines
Peta & exa-scale datasets
Obsolescence of traditional information infrastructures
Web companies
– Google
– Ebay
– Yahoo
Science
– Biology
– Astronomy
– Remote Sensing
Financial services, retail companies, governments, etc.
© Wired 2009

Tools?
I have a database / Python / Excel
Michael Stonebraker: Technical perspective - One size fits all: an idea whose time has come and gone. Commun. ACM 51(12): 76 (2008)

Big Data Infrastructures

Big Data as a New Class of Asset
The Age of Big Data (NYTimes Feb. 11, 2012)
http://www.nytimes.com/2012/02/12/sunday-review/big-datas-impact-in-the-world.html
“Welcome to the Age of Big Data.
The new megarich of Silicon Valley, first at Google and now Facebook, are masters at harnessing the data of the Web – online searches, posts and messages – with Internet advertising. At the World Economic Forum last month in Davos, Switzerland, Big Data was a marquee topic. A report by the forum, “Big Data, Big Impact,” declared data a new class of economic asset, like currency or gold.”

Data is the New Oil
Data + Algorithms → Actionable Insight → $$
Diagram labels: Big Data / Data Science | Machine Learning / “vertical” A.I. | Model (Regression / Classification) | Optimized Services

Data vs. Traditional Assets

From data to products and services
Raw Data | Stored Data | Curated Data | Information Blocks | Data Models | Business Use Cases
© Scigility AG

Use Case or Data Driven
Use Case Driven
Data Driven
Raw Data | Stored Data | Curated Data | Information Blocks | Data Models | Business Use Cases
© Scigility AG

Is Data the New Oil?
Well, it’s a tad more complex …
– Yes, data is the necessary fuel powering current models
– Like oil, data needs to be refined to become useful
– Unlike oil, data is not fungible (pieces of data are typically not mutually interchangeable! Cf.
data markets…)

Datafication
Data transforms business:
Uber: Taxi service is a data problem
AirBnB: Hotel service is a data problem
Spotify: Music service is a data problem
Netflix: TV service is a data problem
CERN: Research is a data problem

Typical Big Data Success Story
Modeling users through Big Data
– Online ads sale / placement [e.g., Facebook]
– Personalized Coupons [e.g., Target]
– Product Placement [e.g., Walmart]
– Content Generation [e.g., Netflix]
– Personalized learning [e.g., Duolingo]
– HR Recruiting [e.g., Gild]

The 3-Vs of Big Data
Volume – amount of data
Velocity – speed of data in and out
Variety – range of data types and sources
[Gartner 2012] "Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization"

Data “Science” Lifecycle
© Brad Severtson

Data Science Infrastructure (circa 2024)
Diagram labels: Source, Ingestion Layer, Data Lake, Data Processing, Models, Deployment, Interaction, Visualization, Custom Applications
© P. Cudre-Mauroux https://exascale.info

Datafication of the Models
Today, models are dominated by Deep Learning techniques that are pushed by leading American and Chinese IT companies
©nvidia

4 Main Types of Learning
Supervised learning: the learning system is presented with example inputs and outputs (labels); the goal is to learn a general rule that maps inputs to outputs.
Unsupervised learning: no labels are given to the learning system, leaving it on its own to find structure in its input.
Reinforcement learning: the learning system interacts with a dynamic environment to achieve a certain goal (e.g., driving, playing a game); as it navigates the problem space, the system is given feedback (reward), which it tries to maximize.
Self-supervised learning: leverages very large unlabeled datasets to learn useful representations from a pretext task; these representations then help solve downstream tasks.
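
The first two learning types listed above can be contrasted with a minimal Python sketch. It is only an illustration under stated assumptions: the toy data, the function names fit_line and two_means, and the choice of least-squares regression and 1-D k-means are all hypothetical and are not taken from the slides.

```python
# Toy contrast between supervised and unsupervised learning.
# All data and function names here are hypothetical illustrations.

def fit_line(xs, ys):
    """Supervised: learn y = a*x + b from labelled (input, output) pairs
    using ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b


def two_means(points, iterations=10):
    """Unsupervised: 1-D k-means with k=2. No labels are used; the
    algorithm only looks for structure (two groups) in the inputs.
    Assumes the points contain at least two distinct values."""
    c0, c1 = min(points), max(points)  # initial centroids
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        g0 = [p for p in points if abs(p - c0) <= abs(p - c1)]
        g1 = [p for p in points if abs(p - c0) > abs(p - c1)]
        # Move each centroid to the mean of its group.
        c0 = sum(g0) / len(g0)
        c1 = sum(g1) / len(g1)
    return c0, c1


# Supervised: the labels ys were generated by the rule y = 2x + 1.
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])

# Unsupervised: the same kind of input, but without any labels.
centroids = two_means([1.0, 1.2, 0.8, 9.0, 9.5, 8.5])
```

In the supervised case the general rule (slope and intercept) is recovered from the example outputs; in the unsupervised case the two centroids emerge from the inputs alone.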
Deep Learning Feeds on Big Data
Requires gigantic amounts of annotated data (examples)
©G. Seif
Requires enormous computation (GPUs, TPUs)

AI Craves Data…

Focus on Data, not Models?
“If 80 percent of our work is data preparation, then ensuring data quality is the important work of a machine learning team.” [Andrew Ng, 2021]
© Andrew Ng

Deep Learning can solve many tasks today

Deep Learning can beat humans (1/2)

Deep Learning can beat humans (2/2)

The Next Data Revolutions (1/2)
smarter cities

The Next Data Revolutions (2/2)
precision medicine

AI Biggest Threat
Artificial General Intelligence (AGI, strong AI), surely?

We Don’t Know When AGI Will Be a Reality
Zero consensus from experts today
– AI keeps surprising us year after year!
None of the techniques we developed so far exhibits “intelligence”
My guess? AGI will come… eventually
– Yann LeCun: not coming anytime soon
– Yoshua Bengio: nobody knows

AI Biggest Threat is Already Here
Bad, stupid, biased models!
“People worry that computers will get too smart and take over the world, but the real problem is that they’re too stupid and they’ve already taken over the world” Prof. P. Domingos

Current Risks
Technological & Social risks:
Technically, current models are…
– Narrow (over-specialized)
– Statistical black-boxes (no intelligence)
– Easy to attack
– Biased
– Power and data-hungry

AI Spectacular Failures (1/4)
Hundreds of models were developed to detect COVID
– E.g., based on x-rays or CT scans
None (i.e., exactly zero) was useful in practice [MIT Tech. Review 2021, Nature Machine Intelligence 2021]
➡ Overfitting

AI Spectacular Failures (2/4)
Amazon’s recruitment tool
– Automated CV classification [Guardian 2018]
➡ Biased training data

AI Spectacular Failures (3/4)
Tesla autopilot crashing into parked police cars [CNN 2021, ArsTechnica 2021]
➡ Out-Of-Distribution

AI Spectacular Failures (4/4)
What is Valais known for?
– “Valais is a region of Switzerland known for its spectacular nature […]. The region is also known for its cultural and sporting events, such as the Fête de la Reine des Neiges (Snow Queen Festival) and the Course de l'Escalade.” [ChatGPT 01/23, translated from French]
➡ Hallucinations
The Snow Queen Festival in Valais [©Andrei Kucharavy + Midjourney]

Social Risks (1/2)
Misuse of powerful models

Social Risks (1/2)
“In the beginning was the word. Language is the operating system of human culture. From language emerges myth and law, gods and money, art and science, friendships and nations and computer code. A.I.’s new mastery of language means it can now hack and manipulate the operating system of civilization. By gaining mastery of language, A.I. is seizing the master key to civilization, from bank vaults to holy sepulchers.” Yuval Harari, March 2023.
=> Role of global legislation (?) and ethical guidelines (?)

Social Risks (2/2)
Self-determination?
Job losses?

Data Today in
Big Data is not a new technology: it's a fact
– Deal with it
→ POCs and productized in most banks, insurance companies, etc.
Largely behind US (and Asia)
Leader in EU landscape
– Research ☑
– Large Companies ☑
– SMEs ☒
– Administrations ☒

That Being Said, Opportunities Are…
… endless
Future is bright for data+ML
– Automate any domain where repeated evaluation is possible and cheap (A. Karpathy)
– ML is too good not to use but too dangerous to use (C. Curino)

Thanks a lot for your attention!
https://exascale.info