FADS_all_slide (1) PDF - Fundamentals of Agro-Environment Data Science 2022/2023

Document Details

Uploaded by CleanerGyrolite6237

2023

Tags

data science agro-environment data analysis natural resources

Summary

This document is an introductory lesson on data science, specifically focused on agricultural and environmental data science. It covers the course's overview, assessment details, including projects, assignments and participation, and details about the key concepts and skills related to data science.

Full Transcript

Fundamentals of Agro-Environment Data Science
Lesson 01 - Welcome! What is data science?

Welcome!
Overview of the course:
- What is data science?
- Data Science Methodology
- Tools for Data Science
- Resources of Data Science
- Data Science culture

Fundamentals of Agro-Environment Data Science - 2022/2023 - Lesson 01 - 2

Assessment
Components:
- Assignments (40%) or final exam: two short assignments focused on operational knowledge
- Project (40%), group work: report on the identification and definition of a problem to be solved through data science
- Participation (20%): weekly practice exercises

Project - group work
Goal: create and design a data science project on natural resources, food or environment
Components:
- identify and justify an unanswered question in your field of interest
- identify the skills profile to include in the project team, and their responsibilities
- identify data sources, challenges and strategies to pre-process data
- identify modelling approaches
- identify the implementation path to deliver the solution
- the storytelling
Deliverable:
- written report
- a presentation
- if the project includes a product (a dashboard, a web application), the mockup

Welcome kit
https://isa-ulisboa.github.io/greends-welcome-kit/
Components:
1. Python
2. Jupyter Notebooks
3. VPN
4. Google for Education account
5. Google Colaboratory
6. Git
7. GitHub
8. Text editor
9. MariaDB
10. DBeaver
11. Discord

What is data science?
Activity - in your own (few) words:
- What is Data Science?
- What do Data Scientists do?
- What tools are used by Data Scientists?
- What is particular about Data Science applied to Natural Resources and Environment?
Jamboard activity: https://tinyurl.com/greends-fads-01

What is data science? - definition
Concept and definition under construction…
- Data Science refers to an emerging area of work concerned with the collection, preparation, analysis, visualization, management, and preservation of large collections of information
  https://uk.sagepub.com/en-gb/eur/an-introduction-to-data-science/book256486
- “This coupling of scientific discovery and practice involves the collection, management, processing, analysis, visualization, and interpretation of vast amounts of heterogeneous data associated with a diverse array of scientific, translational, and inter-disciplinary applications.”
  https://doi.org/10.1080/10618600.2017.1384734

What is data science? - definition
Concept and definition under construction…
Definition 2.1 (Data Science). A high-level statement is: “data science is the science of data” or “data science is the study of data.”
Definition 2.2 (Data Science). From the disciplinary perspective, data science is a new interdisciplinary field that synthesizes and builds on statistics, informatics, computing, communication, management, and sociology to study data and its environments (including domains and other contextual aspects, such as organizational and social aspects) in order to transform data to insights and decisions by following a data-to-knowledge-to-wisdom thinking and methodology.
A discipline-based data science formula is given as follows:
data science = statistics + informatics + computing + communication + sociology + management | data + environment + thinking
where “|” means “conditional on.”
The outputs of data science are data products.
Definition 2.3 (Data Products). A data product is a deliverable from data, or is enabled or driven by data, and can be a discovery, prediction, service, recommendation, decision-making insight, thinking, model, mode, paradigm, tool, or system. The ultimate data products of value are knowledge, intelligence, wisdom, and decision.
https://doi.org/10.1145/3076253

What is data science? - definition
Data science is a process (diagram keywords):
- data analysis of vast amounts of data
- computer power
- reveal new knowledge uncovered in organisations
- specific problem - question - curiosity
- data need - structured and unstructured
- explore patterns - model
- new approach
- communicate - data visualization and storytelling
- change

What is data science? - definition
Using data as a production factor - consequences for companies:
- Basis for decision-making: Data allows well-informed decisions, making it vital for all organizational functions.
- Quality level: Data can be available from different sources; information quality depends on the availability, correctness, and completeness of the data.
- Need for investments: Data gathering, storage, and processing cause work and expenses.
- Degree of integration: Fields and holders of duties within any organization are connected by informational relations, meaning that the fulfillment of said duties largely depends on the degree of data integration.

What is data science? - skills
The students of Green Data Science:
- are curious
- know how to characterize a problem
- have a taste for technologies
- like the interface between Data Science and application areas
- like teamwork
- have a taste for math and statistics applications
- know how to tell a story
- know how to program and reuse code
- know how to use the cloud and collaborative tools
- know how to analyze data
- know how to manage data
- enjoy lifelong learning

What is data science? - skills
- Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.
  David Donoho, https://doi.org/10.1080/10618600.2017.1384734
- Data scientist: someone who finds solutions to problems by analyzing Big or small data using appropriate tools and then tells stories to communicate her findings to the relevant stakeholders.
  Murtaza Haider, Getting Started with Data Science: Making Sense of Data with Analytics. IBM Press, 2015.
Data science Venn diagram: Drew Conway, http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram, accessed on 2022-09-05

What is data science? - skills
TECHNICAL SKILLS
- Programming (Python, SQL)
- Statistics and multivariate statistics
- Mathematics
- Data management systems
- Data Extraction, Transformation and Loading
- Data wrangling
- Machine Learning and Deep Learning
- Processing large datasets
- Reporting tools, data visualization
- Model deployment
- Cloud computing
SOFT SKILLS
- Expert knowledge
- Data intuition
- Communication
- Teamwork
Drew Conway, http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram, accessed on 2022-09-05

What is data science? - application examples
Pest detection and counting in granaries
- Problem: stored-grain pests cause serious economic losses and safety hazards
- Question: Can detection accuracy increase (species and number of pests)?
- Proposed solution: Automatic system of pest detection and counting using CNN
- Result: mean average precision of 97.55% (Red Flour Beetle, Rice Weevil)
Chen, C., Liang, Y., Zhou, L., Tang, X., & Dai, M. (2022). An automatic inspection system for pest detection in granaries using YOLOv4. Computers and Electronics in Agriculture, 201, 107302. https://doi.org/10.1016/j.compag.2022.107302

What is data science? - application examples
Identifying veraison process of colored wine grapes
- Problem: An automated analysis of veraison processes is necessary and valuable for viticulturists
- Question: Can accurate identification of the veraison be automated?
- Proposed solution: Deep learning and image analysis to identify colored wine grape veraison in field environments
- Result: test accuracy for three grapes > 91%
Shen, L., Chen, S., Mi, Z., Su, J., Huang, R., Song, Y., Fang, Y., & Su, B. (2022). Identifying veraison process of colored wine grapes in field conditions combining deep learning and image analysis. Computers and Electronics in Agriculture, 200, 107268. https://doi.org/10.1016/j.compag.2022.107268

What is data science? - environment
Data Science Environment
TOOLS:
- Operating system: Linux, MacOS, Windows
- Programming language: Python, SQL
- Code editors: Visual Studio Code, Notepad++
- Data stores: MariaDB, MySQL
- Version control: Git, GitHub
- Cloud and deployment: Google Cloud, IBM Cloud
Data Science Methodology (OSEMN): Obtain, Scrub, Explore, Model, Interpret

Additional reading
Cao, L. (2017). Data Science: A Comprehensive Overview. ACM Computing Surveys, 50(3), 43:1-43:42. https://doi.org/10.1145/3076253
Donoho, D. (2017). 50 Years of Data Science. Journal of Computational and Graphical Statistics, 26(4), 745–766.
https://doi.org/10.1080/10618600.2017.1384734

Fundamentals of Agro-Environment Data Science
Lesson 02 - Data Science Environment

What is data science? - wrap up
A discipline-based data science formula is given as follows:
data science = statistics + informatics + computing + communication + sociology + management | data + environment + thinking

Data science - environment
Components of your data science workflow:
- manage files and directories: bash
- develop/adapt Python code: Jupyter Notebook/Colab
- manage versioning and share code: Git and GitHub
- structure and document your project

Data science project management best practices
- Exercise 1 - Manipulate files and directories from the command line - https://tinyurl.com/greends-fads-ex01
- Exercise 2 - Use Git and GitHub to manage and share your code versions - https://tinyurl.com/greends-fads-ex02
- Exercise 3 - Data Science Projects management best practices - https://tinyurl.com/greends-fads-ex03

Data science culture - an example - Global Biodiversity Information Facility
GBIF, the Global Biodiversity Information Facility, is an international network and data infrastructure funded by the world's governments and aimed at providing anyone, anywhere, open access to data about all types of life on Earth.
Example record: Palinurus elephas (Fabricius, 1787), observed in Portugal by Ana Santos (licensed under http://creativecommons.org/licenses/by-nc/4.0/)

Data Science Culture! An example
DIKW pyramid (top to bottom): Wisdom, Knowledge, Information, Data
http://www.biodiversityinformatics.org
Hobern D et al. (2012) Global Biodiversity Informatics Outlook: Delivering biodiversity knowledge in the information age. Copenhagen: Global Biodiversity Information Facility. Available at: https://doi.org/10.15468/6jxa-yb44.

Propose a framework
Exercise: create a new framework for Data Science in your area of interest:
- agriculture, environment, forestry, food, etc.
- which blocks would you propose for each of the layers: data, evidence, understanding?

Fundamentals of Agro-Environment Data Science
Lesson 03 - Data Science Methodology

Data Science Methodology
Content:
- Examples of Data Science methodology
- Data Management Plan
- FAIR data principles

Data Science Methodology
Examples: KDD, CRISP-DM, SEMMA, OSEMN (Obtain, Scrub, Explore, Model, Interpret)

Data Science Methodology
Knowledge Discovery in Databases (KDD)
Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From Data Mining to Knowledge Discovery in Databases. AI Magazine, 17(3), 3.
https://doi.org/10.1609/aimag.v17i3.1230

Data Science Methodology
Cross-industry standard process for Data Mining (CRISP-DM)
Shearer C., The CRISP-DM model: the new blueprint for data mining, J Data Warehousing (2000); 5:13-22

Data Science Methodology
Pete Chapman (1999); The CRISP-DM User Guide.

Data Science Methodology
Foundational Methodology for Data Science
Rollins, J. (2015). Foundational Methodology for Data Science. A 10-stage data science methodology that spans technologies and approaches. IBM Analytics. https://tdwi.org/~/media/64511A895D86457E964174EDC5C4C7B1.PDF

Data Science Methodology
OSEMN: Obtain, Scrub, Explore, Model, Interpret
- Obtain data: identify available datasets
- Scrub / clean data: data integration, format/structure normalization, fix missing values, remove duplicates, data transformation, anomaly detection/removal
- Explore: find patterns, identify influential variables (uni/multivariate analysis)
- Model: train predictive models/algorithms, evaluate and refine models
- Interpret: make insights, simple and priority driven, tell a story

Data to Insight to Decision
Cao, L. (2017). Data Science: A Comprehensive Overview. ACM Computing Surveys, 50(3), 43:1-43:42. https://doi.org/10.1145/3076253

Explicit/Implicit
Cao, L. (2017). Data Science: A Comprehensive Overview.
ACM Computing Surveys, 50(3), 43:1-43:42. https://doi.org/10.1145/3076253

Data Management Plan - DMP
A Data Management Plan is a formal document that describes how data will be managed within the scope of an activity (project, initiative, institution), including the period after its conclusion.
It must answer the following questions:
- What kind of data will be produced?
- What data formats or standards will be used for the data and metadata?
- How will privacy, security, confidentiality, intellectual property and other rights be ensured?
- Who can access the data, how will it be accessed and shared?
- How will data be archived and preserved?

Data Management Plan - DMP
Horizon Europe Data Management Plan template:
- Data summary
- FAIR data
- Other research outputs
- Allocation of resources
- Data security
- Ethics
- Other issues
https://enspire.science/wp-content/uploads/2021/09/Horizon-Europe-Data-Management-Plan-Template.pdf

Data Management Plan - DMP
Horizon Europe Data Management Plan template
Data summary:
- Will you re-use any existing data and what will you re-use it for? State the reasons if re-use of any existing data has been considered but discarded.
- What types and formats of data will the project generate or re-use?
- What is the purpose of the data generation or re-use and its relation to the objectives of the project?
- What is the expected size of the data that you intend to generate or re-use?
- What is the origin/provenance of the data, either generated or re-used?
- To whom might your data be useful ('data utility'), outside your project?
https://enspire.science/wp-content/uploads/2021/09/Horizon-Europe-Data-Management-Plan-Template.pdf

FAIR data principles
A community-based set of principles to optimize and facilitate the reuse of data.
Wilkinson, M. D. et al. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3, sdata201618. https://doi.org/10.1038/sdata.2016.18
FAIR data and open data are distinct notions. Open data: data available to anyone, with no restrictions on use or sharing.

FAIR data principles
To be Findable:
- F1. (meta)data are assigned a globally unique and persistent identifier
- F2. data are described with rich metadata (defined by R1 below)
- F3. metadata clearly and explicitly include the identifier of the data it describes
- F4. (meta)data are registered or indexed in a searchable resource
Wilkinson, M. D. et al. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3, sdata201618. https://doi.org/10.1038/sdata.2016.18

Persistent Identifiers
What is a Persistent Identifier (PID)?
A long-lasting (persistent) identifier that provides a unique reference to “something” (a document, a file, a digital object, a record in a database) so that it can be unambiguously referenced.
Ex: your ID card number, a DOI, ORCID IDs, ISBN, GUID
Why are PIDs useful?
They are unambiguous. In the context of the internet they are also actionable, i.e. they can resolve to a link to the item they reference.
Ex: https://orcid.org/0000-0002-8351-4028

Persistent Identifiers
What do PID values mean?
Ideally, nothing. The value should be opaque, without meaning. This way, there will be less temptation to change it when the context changes.
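As an illustration of an opaque identifier such as the GUID mentioned above, the following is a minimal sketch using Python's standard `uuid` module. The resolver address `https://example.org/id/` and the function names `mint_pid` and `to_actionable` are hypothetical and only serve the example; a real PID system (DOI, ORCID) relies on a registration agency rather than on local generation.

```python
import uuid

# Hypothetical resolver base URL, used only for illustration.
RESOLVER_BASE = "https://example.org/id/"


def mint_pid() -> str:
    """Return a new opaque identifier (a random version-4 UUID)."""
    return str(uuid.uuid4())


def to_actionable(pid: str) -> str:
    """Turn an identifier into a link that a resolver could dereference."""
    return RESOLVER_BASE + pid


pid_a = mint_pid()
pid_b = mint_pid()
print(pid_a)                 # the value says nothing about the object it names
print(to_actionable(pid_a))  # link form of the same identifier
print(pid_a != pid_b)        # independently generated values differ
```

Uniqueness here is probabilistic, and persistence is not a property of the string itself: it depends on an organisation keeping the resolver and its records alive.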
There is no need for it to be human-readable.
What properties should PIDs have?
- must be globally unique
- must exist indefinitely (persistent)
- should be opaque
- independent generation
- should be actionable (link to the object it references)

FAIR data principles
To be Accessible:
- A1. (meta)data are retrievable by their identifier using a standardized communications protocol
- A1.1. the protocol is open, free, and universally implementable
- A1.2. the protocol allows for an authentication and authorization procedure, where necessary
- A2. metadata are accessible, even when the data are no longer available
https://doi.org/10.1038/sdata.2016.18

FAIR data principles
To be Interoperable:
- I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation
- I2. (meta)data use vocabularies that follow FAIR principles
- I3. (meta)data include qualified references to other (meta)data
https://doi.org/10.1038/sdata.2016.18

FAIR data principles
To be Reusable:
- R1. (meta)data are richly described with a plurality of accurate and relevant attributes
- R1.1. (meta)data are released with a clear and accessible data usage license
- R1.2. (meta)data are associated with detailed provenance
- R1.3. (meta)data meet domain-relevant community standards
https://doi.org/10.1038/sdata.2016.18

Additional reading
Cao, L. (2017). Data Science: A Comprehensive Overview. ACM Computing Surveys, 50(3), 43:1-43:42.
https://doi.org/10.1145/3076253

Fundamentals of Agro-Environment Data Science
Lesson 04 - Data Science Examples

Data Science Applications and Tools
Content:
- Examples of Data Science applications
- Tools for Data Science

Traceability of food supply chain
Problem: according to the FAO, the following shares are lost before distribution:
- 25% of roots and tubers
- 20% of fruits and vegetables
- 8% of grains and pulses
- 13% of animals
Cause:
- inadequate monitoring
- poor handling
Possible solution - low cost data pipeline:
- traceability with low-power IoT
- information
Ranganathan, V., Kumar, P., Kaur, U., Li, S. H. Q., Chakraborty, T., & Chandra, R. (2022). Re-Inventing the Food Supply Chain with IoT: A Data-Driven Solution to Reduce Food Loss. IEEE Internet of Things Magazine, 5(1), 41–47. https://doi.org/10.1109/IOTM.003.2200025

https://www.ifoodds.com/

Traceability
Advantages of traceability: determine the carbon footprint
- pre-harvest water use
- fertilizer use
- planting method
- harvesting method
- transport
- storage
- retail
Challenges:
- multiple countries
- multiple actors (farmers, processors, transporters, distributors, retailers)
Technologies shown: Blockchain, IoT-Powered Smart Edge, IoT (RFID, Bluetooth, Zigbee, Z-Wave, and NFC)

Connectivity for Traceability
IoT-Powered Smart Edge IoT:
- Local connectivity - ultra-low-power solutions (Bluetooth, Zigbee, RFID, etc.)
- Global connectivity (low-power long-range IoT)

Application of DS methodology
Business understanding:
- Food lost before distribution
- lack of monitoring
- poor handling
- Possible solution: trace and monitor in all of the food chain
Data requirements:
- Status, location, age of each item in the food chain
- IoT sensors
- traceability
- data sources at each step, across the whole chain, but with PIDs
Data transformation:
- Data preparation and transformation to feed information and decision systems
- interoperability
- standards
Modelling and evaluation:
- Systems to support decision
- prediction modelling
- optimization
Implementation:
- Decision support
- optimization
- information to/from user

Tools for Data Science
Tools are for specific purposes:
- programming - open-source, commercial (visual programming)
- data - manage, extract, web scraping, transform, analyze and visualize
- cloud computing

Languages - python
- Most popular language for data science
- Clear, readable syntax; can do many of the things other languages do
- High-level, general-purpose
- Tools for many tasks: databases, automation, web scraping, text processing, image processing, machine learning, and data analytics
- Libraries for data science: Pandas, NumPy, SciPy, and Matplotlib
- Libraries for artificial intelligence: TensorFlow, PyTorch, Keras, and Scikit-learn
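
To give a feel for the readable syntax mentioned above, here is a minimal sketch that applies two OSEMN steps from Lesson 03 (scrub, explore) to a handful of records, using only the Python standard library. The crate identifiers, the temperature values and the `scrub` function are invented for illustration and do not come from the course material.

```python
from statistics import mean

# Hypothetical sensor readings: (item id, storage temperature in degrees C).
# None marks a missing value; one record is duplicated on purpose.
raw = [
    ("crate-01", 4.0),
    ("crate-02", None),
    ("crate-03", 6.5),
    ("crate-01", 4.0),
    ("crate-04", 5.5),
]


def scrub(records):
    """Drop duplicates and records with missing values, keeping first-seen order."""
    seen = set()
    clean = []
    for record in records:
        _, value = record
        if value is None or record in seen:
            continue
        seen.add(record)
        clean.append(record)
    return clean


# Scrub: remove the missing value and the duplicate.
clean = scrub(raw)

# Explore: simple summary statistics over the cleaned records.
average = mean(value for _, value in clean)
warmest = max(clean, key=lambda record: record[1])

print(f"{len(clean)} clean records, mean {average:.2f} deg C, warmest: {warmest[0]}")
```

In practice the same steps would usually be written with Pandas, one of the libraries listed above; the standard library is used here only to keep the sketch dependency-free.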

Languages - SQL
- Non-procedural language, for managing and querying data
- Simple and powerful
- Created to manage data in relational databases
- Interfaces for many NoSQL and big data repositories
- Provides direct access to data
- Standard, can be used with many databases: MySQL, IBM Db2, PostgreSQL, Apache OpenOffice Base, SQLite, Oracle, MariaDB, Microsoft SQL Server

Tools for Data Science
Categories
DATA:
- Data management - retrieve and persist
- ETL (Extract, Transform, Load) - refining and cleaning
- Data visualization
MODEL:
- Modelling - application of an appropriate algorithm, evaluation
USE:
- Model implementation - deliver model to final users, in production
CODE DEVELOPMENT:
- IDE (Integrated Development Environment) - code editor, build automation, debugger
- Versioning and collaboration

Tools for Data Science - Code environments
Code management - the most used tools are:
- IDE
- version control

Fundamentals of Agro-Environment Data Science
Lesson 05 - Data Science tools

Data Science Applications and Tools
Content:
- Tools for Data Science
  - languages, libraries
  - Jupyter

Languages - python
Highlights:
- Most popular language for data science
- Clear, readable syntax; can do many of the things other languages do
- High-level, general-purpose
Tools:
- databases
- automation
- web scraping
- text processing
- image processing
- machine learning
- data analytics
Libraries for data science:
- Pandas
- NumPy
- SciPy
- Matplotlib
Libraries for Artificial Intelligence:
- TensorFlow
- Keras
- PyTorch
- Scikit-learn

Tools for Data Science - Jupyter Notebook
Development environment: Jupyter
- Interactive programming
- supports many languages (kernels)
- combines text (markdown), code and outputs in a notebook, to conduct data science experiments
- exports to PDF or HTML
- cloud: Google Colab
Current versions:
- Notebook
- Lab (next generation - open several files at once, adds terminals, text editors)

Demo - Jupyter Notebook
Jupyter Notebook demo:
- dashboard
- create
- kernel
- insert cells and set cell type
- run cells / run all cells / run and move to next cell
- delete cells / move cells
- restart kernel / shutdown

Python - Libraries
Libraries - A set of functions or modules to perform a wide variety of actions.
(Library logos were not captured in the transcript; only the descriptions remain.)
Scientific Computing:
- data analysis and data handling - dataframe
- numerical computing - arrays
Visualization:
- graphs, plots, highly customisable
- based on Matplotlib, with standard heat maps, more charts and plots
Machine Learning and Deep Learning:
- (Scikit Learn) ML, regression, classification, clustering
- (Keras) easy to build deep learning models (can use GPU)
Deep Learning:
- low level deep learning (for production)
- high level deep learning (for experimentation)

Python - Libraries
Libraries - A set of functions or modules to perform a wide variety of actions.
Parallel computing:
- multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters
- can be used with Python, SQL and other languages

First exercise with Jupyter
Ex 05 Jupyter Notebook: https://github.com/isa-ulisboa/greends-fads-exercises

Fundamentals of Agro-Environment Data Science
Lesson 06 - Data Science tools

Data Science Applications and Tools
Content:
- Tools for Data Science
  - Application Programming Interfaces (API)
  - data sources

Data for Data Science
What is a data set?
Structured collection of data
- data - normally structured
  - tabular data
  - hierarchical data, network data
  - raw files
- description of the data - metadata - provides context

Data for Data Science
Ownership of data
- Private data
  - confidential
  - private or personal information
  - commercially sensitive
- Open data
  - sources: scientific institutions, governments, organizations, companies
  - areas: economy, society, healthcare, transportation, environment

Data Spectrum

Motivations for Open data adoption
https://data.europa.eu/en

Open Data - wrap up
Data is open if anyone can access, use and share it.
- no limitations in any use
- free to use, modify, combine and share, including commercially
- free to use, but there can be a cost to access
- there are costs associated with creating, maintaining and publishing usable data
- fee related to reproduction cost, but it is often negligible
- free to use, reuse and distribute, even commercially
- value of open data is measured by how it can be used, not by how it is made available
- should be structured and machine readable
European Commission "What is open data?"
https://data.europa.eu/elearning/en/#/id/co-01

Licences
Creative Commons (CC)
- CC Zero (CC0)
- Attribution (BY)
- Attribution + Noncommercial uses (BY-NC)
- Attribution + No derivatives (BY-ND)
- Attribution + Share under the same terms (BY-SA)
- Attribution + Noncommercial uses + No derivatives (BY-NC-ND)
- Attribution + Noncommercial uses + Share under the same terms (BY-NC-SA)

5 Star Open Data
https://5stardata.info/en/
★ - an open licence
★★ - re-usable format (structured)
★★★ - open format (non-proprietary, e.g. CSV instead of Excel)
★★★★ - use identifiers to reference, use URLs to identify things, so that people can find it
★★★★★ - link your data to other people’s data to provide context

Open data and data quality
Open data is subject to mechanisms of
1. Transparency
2. Community feedback/verification
3. Adoption of open standards
4. Correct use and citation of use

Tools for Data Science - Data
Data sources
- United Nations - UNDATA - http://data.un.org/
  - many databases from partner organisations: http://data.un.org/Partners.aspx
- FAO (several databases) - https://www.fao.org/statistics/databases/en/
- FAOSTAT - https://www.fao.org/faostat/en/#data
- Copernicus Open Access Hub - Satellite data - https://scihub.copernicus.eu/
- European data - https://data.europa.eu/en
- US Open Data - https://data.gov/
- Dados abertos da administração pública (open data of the public administration) - Portugal - https://dados.gov.pt/pt/
- Data Portals - http://datacatalogs.org/
- Kaggle - online community - find and publish data sets - https://www.kaggle.com/
- Google Data Search - https://datasetsearch.research.google.com/

Tools for Data Science
API - Application Programming Interfaces
- Definitions and protocols that allow two software components to communicate with each other
[Diagram: Application A and Application B exchange inputs and data outputs through the API]
Application A does not need to “know” anything about Application B and vice-versa…
- Information is transferred through the internet, using the HTTP protocol. Data is transferred in a structured format (e.g. JSON)
https://api.gbif.org/v1/species/search?q=Pinus%20pinea&rank=SPECIES
https://www.gbif.org/species/search?q=Pinus%20pinea&rank=SPECIES
On the terminal
$ curl 'https://api.gbif.org/v1/species/search?q=Pinus%20pinea&rank=SPECIES'
result
nubKey: 5285165
https://api.gbif.org/v1/occurrence/search?taxonKey=5285165&country=PT

Fundamentals of Agro-Environment Data Science
Lesson 07 - Data Science tools

Data Science Applications and Tools
Content:
- Tools for Data Science
- Application Programming Interfaces (API)
- web scraping

Tools for Data Science
API - Application Programming Interfaces
- Most API services are implemented as REST (REpresentational State Transfer) services
GBIF API service - documentation - https://www.gbif.org/developer/summary
Parts of the request URL https://api.gbif.org/v1/species/search?q=Pinus%20pinea&rank=SPECIES
- Scheme: https
- Host URL: api.gbif.org
- Service and version: v1/species/search
- start of query: ?
- parameters: q=Pinus%20pinea and rank=SPECIES
Data is obtained from the API service through a request and response process.
- request: uses one of several methods: GET, POST, PUT, DELETE
- response: provides a status code, headers and a body

Tools for Data Science
Web scraping
HTML (Hyper Text Markup Language) is the standard markup language for Web pages. It describes the structure of a page and consists of a series of elements (tags).
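To make the idea of a page as "a series of elements (tags)" concrete, here is a minimal sketch that walks a small HTML page with Python's standard-library `html.parser`. The sample page is an assumption: it reuses the texts of the annotated example in these slides ("Page Title", "My First Heading", "My first paragraph."), with the tag markup filled in for illustration. `TagCollector` is a hypothetical helper name, not part of any library.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record every opening tag and the text found directly inside each tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []      # tags in the order they are opened
        self.text_by_tag = {}    # tag name -> list of text fragments
        self._stack = []         # currently open tags, innermost last

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)
        self._stack.append(tag)

    def handle_endtag(self, tag):
        # Close the innermost tag when it matches (the sample page is well formed).
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self._stack:
            self.text_by_tag.setdefault(self._stack[-1], []).append(text)


# Assumed sample page, modelled on the annotated example in the slides.
page = """<!DOCTYPE html>
<html>
<head><title>Page Title</title></head>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>"""

collector = TagCollector()
collector.feed(page)
print(collector.open_tags)
print(collector.text_by_tag)
```

Web scraping relies on the same structure: once the elements are identified, the content of interest (here the heading and the paragraph) can be picked out by tag name.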
<!DOCTYPE html> - declares it is an HTML5 document
<html> - root element of the page
<head> - start of meta information
<title>Page Title</title> - title
</head> - end of meta information
<body> - indicates the start of visible content
<h1>My First Heading</h1> - a heading
<p>My first paragraph.</p> - a paragraph
</body> - end of visible content
</html> - end of page

Tools for Data Science
Web scraping - automatically extract information from a website
In web scraping, we extract the content of the page, using the structure provided by HTML.
Example: QUALAR, information about air quality, from the national agency for the environment.
[Screenshot: QUALAR page "Qualidade do Ar" (Air Quality), with a table of columns Região (Region), Concelho (Municipality), Estação (Station) and a first row Norte, Braga, Fr Bartolomeu Mártires-S.Vitor]

Web scraping in Python
Combine two Python modules: requests + BeautifulSoup

Fundamentals of Agro-Environment Data Science
Lesson 08 - Data Science tools - Open Data - API

Data Science Applications and Tools
Content:
- Tools for Data Science
- Application Programming Interfaces (API)
- data sources
Fundamentals of Agro-Environment Data Science - 2024/2025 - Lesson 08 - 2

Tools for Data Science
Categories
DATA
- Data management - retrieve and persist
- ETL (Extract, Transform, Load) - refining and cleaning
- Data Visualization
MODEL
- Modelling - application of an appropriate algorithm, evaluation
USE
- Model Implementation - deliver model to final users, in production
CODE DEVELOPMENT
- IDE (Integrated Development Environment) - code editor, build automation, debugger
- Versioning and collaboration

Data for Data Science
What is a data set?
Structured collection of data
- data
  - normally structured: tabular data, hierarchical data, network data
  - raw files
- description of the data
  - metadata - provides context

Data for Data Science
Ownership of data
- Private data
  - confidential
  - private or personal information
  - commercially sensitive
- Open data
  - Scientific institutions, Governments, Organizations, Companies
  - economy, society, healthcare, transportation, environment

[Figure: Data Spectrum, https://theodi.org/insights/tools/the-data-spectrum/]

[Figure: Motivations for Open data adoption]

Open Data - wrap up
Data is open if anyone can access, use and share it.
- no limitations in any use
  - free to use, modify, combine and share, including commercially
- free to use, but there can be a cost to access
  - there are costs associated with creating, maintaining and publishing usable data
  - fee related to reproduction cost, but it is often negligible
- free to use, reuse and distribute, even commercially
- value of open data is measured by how it can be used, not by how it is made available
- should be structured and machine readable
European Commission "What is open data?" https://data.europa.eu/elearning/en/#/id/co-01

Licences
Creative Commons (CC)
- CC Zero (CC0)
- Attribution (BY)
- Attribution + Noncommercial uses (BY-NC)
- Attribution + No derivatives (BY-ND)
- Attribution + Share under the same terms (BY-SA)
- Attribution + Noncommercial uses + No derivatives (BY-NC-ND)
- Attribution + Noncommercial uses + Share under the same terms (BY-NC-SA)

5 Star Open Data
https://5stardata.info/en/
★ - an open licence
★★ - re-usable format (structured)
★★★ - open format (non-proprietary, e.g. CSV instead of Excel)
★★★★ - use identifiers to reference, use URLs to identify things, so that people can find it
★★★★★ - link your data to other people’s data to provide context

Open data and data quality
Open data is subject to mechanisms of
1. Transparency
2. Community feedback/verification
3. Adoption of open standards
4. Correct use and citation of use

Tools for Data Science - Data
Data sources
- United Nations - UNDATA - http://data.un.org/
  - many databases from partner organisations: http://data.un.org/Partners.aspx
- FAO (several databases) - https://www.fao.org/statistics/databases/en/
- FAOSTAT - https://www.fao.org/faostat/en/#data
- Copernicus Data Space Ecosystem - Satellite data - https://dataspace.copernicus.eu/
- European data - https://data.europa.eu/en
- US Open Data - https://data.gov/
- Dados abertos da administração pública (open data of the public administration) - Portugal - https://dados.gov.pt/pt/
- Data Portals - http://datacatalogs.org/
- Kaggle - online community - find and publish data sets - https://www.kaggle.com/
- Google Data Search - https://datasetsearch.research.google.com/

Tools for Data Science - Access
API - Application Programming Interfaces
- Definitions and protocols that allow two software components to communicate with each other
[Diagram: Application A and Application B exchange inputs and data outputs through the API]
Application A does not need to “know” anything about Application B and vice-versa…
- Information is transferred through the internet, using the HTTP protocol. Data is transferred in a structured format (e.g. JSON)
https://api.gbif.org/v1/species/search?q=Pinus%20pinea&rank=SPECIES
https://www.gbif.org/species/search?q=Pinus%20pinea&rank=SPECIES
On the terminal
$ curl -X 'GET' \
  'https://api.gbif.org/v1/species/search?q=Pinus%20pinea&rank=SPECIES' \
  -H 'accept: application/json'
result
nubKey: 5285165
https://api.gbif.org/v1/occurrence/search?taxonKey=5285165&country=PT
- Most API services are implemented as REST (REpresentational State Transfer) services
GBIF API service - documentation - https://www.gbif.org/developer/summary
Parts of the request URL https://api.gbif.org/v1/species/search?q=Pinus%20pinea&rank=SPECIES
- Scheme: https
- Host URL: api.gbif.org
- Service and version: v1/species/search
- start of query: ?
- parameters: q=Pinus%20pinea and rank=SPECIES
Data is obtained from the API service through a request and response process.
- request: uses one of several methods: GET, POST, PUT, DELETE
- response: provides a status code, headers and a body

Tools for Data Science
Web scraping
HTML (Hyper Text Markup Language) is the standard markup language for Web pages. It describes the structure of a page and consists of a series of elements (tags).
<!DOCTYPE html> - declares it is an HTML5 document
<html> - root element of the page
<head> - start of meta information
<title>Page Title</title> - title
</head> - end of meta information
<body> - indicates the start of visible content
<h1>My First Heading</h1> - a heading
<p>My first paragraph.</p> - a paragraph
</body> - end of visible content
</html> - end of page

Tools for Data Science
Web scraping - automatically extract information from a website
In web scraping, we extract the content of the page, using the structure provided by HTML.
Example: QUALAR, information about air quality, from the national agency for the environment.
https://qualar.apambiente.pt/downloads
[Screenshot: QUALAR page "Qualidade do Ar" (Air Quality), with a table of columns Região (Region), Concelho (Municipality), Estação (Station) and a first row Norte, Braga, Fr Bartolomeu Mártires-S.Vitor]

Web scraping in Python
Combine two Python modules: requests + BeautifulSoup

Markdown and first exercise with Jupyter
Ex 07 Create your data sources catalogue
Ex 08 Find and use data online
- API
- Web Scraping
https://github.com/isa-ulisboa/greends-fads-exercises

Fundamentals of Agro-Environment Data Science
Lesson 09 - Overview of modelling approaches in Data Science

Modelling in Data Science
Content:
- Machine Learning
- Modelling approaches
  - supervised
  - unsupervised
  - semi-supervised
  - reinforcement learning

Artificial Intelligence
ARTIFICIAL INTELLIGENCE - Mimicking the intelligence or behavioural pattern of humans or any other living entity.
[Diagram: Artificial Intelligence, divided into Generative AI and Predictive AI]
- Predictive AI - blends statistical analysis with machine learning algorithms to find data patterns and forecast future outcomes
- Generative AI - generates original content (audio, images, text, code) in response to a prompt

Artificial Intelligence
Differences
Data
- Generative AI: trained on large datasets containing millions of sample content
- Predictive AI: uses smaller, more targeted datasets as input data
Model
- Generative AI: models draw from the encoded patterns and relationships in their training data to understand user requests and create relevant new content that’s similar, but not identical, to the original data
- Predictive AI: models extract insights from historical data to make accurate predictions about the most likely upcoming event, result or trend
Output
- Generative AI: creates novel content
- Predictive AI: forecasts future events and outcomes
Purpose
- Generative AI: creating new data that mimics human creations
- Predictive AI: understanding and predicting based on existing data
Algorithms
- Generative AI: LLMs, Generative adversarial networks (GANs), Variational autoencoders (VAEs)
- Predictive AI: clustering, decision trees, regression models, time series
Use cases
- Generative AI: customer service, gaming, healthcare, marketing, software development
- Predictive AI: financial forecasting, fraud detection, inventory management, yield forecasting, personalized recommendations

Machine Learning
ARTIFICIAL INTELLIGENCE - Mimicking the intelligence or behavioural pattern of humans or any other living entity.
MACHINE LEARNING - A technique by which a computer can "learn" from data, without using a complex set of different rules. This approach is mainly based on training a model from datasets.
DEEP LEARNING - A technique to perform machine learning inspired by our brain's own network of neurons.

Machine Learning
Set of statistical methods based on algorithms that are trained to make classifications or predictions, and find patterns in data.
To do ML, we need:
- Training data - subset of data with examples used to fit the parameters of the model
- Test data - subset of data, independent of the training data, used to assess the performance of the model
Visual intuition: http://www.r2d3.us/visual-intro-to-machine-learning-part-1/

Machine Learning components
01 Decision process - A recipe of calculations or other steps that takes in the data and “guesses” what kind of pattern your algorithm is looking to find.
02 Error function - A method of measuring how good the guess was by comparing it to known examples (when they are available). Did the decision process get it right? If not, how do you quantify “how bad” the miss was?
03 Updating or optimization process - A method in which the algorithm looks at the miss and then updates how the decision process comes to the final decision, so next time the miss won’t be as great.
Source: datascience@berkeley, the online Master of Information and Data Science from UC Berkeley, https://ischoolonline.berkeley.edu/blog/what-is-machine-learning/

ML - overfitting and underfitting
Overfitting - the model learns the patterns of the training data but does not generalize, i.e., it does not perform well on other data
Underfitting - the model is too general and cannot fit the data well
Cross-validation: run the model on different subsets of data.
A simple approach is to create two subsets, train and test sets.

ML - measuring performance
- True positives - predicted to have the property and has the property
- True negatives - predicted not to have the property and does not have the property
- False positives - predicted to have the property but does not have the property
- False negatives - predicted not to have the property but has the property
https://en.wikipedia.org/wiki/Precision_and_recall

ML - measuring performance
- True positive rate (Sensitivity/Recall) - of all samples that should be flagged by the classifier, how many were flagged. Best value: 1
- False positive rate (1 - Specificity) - of all samples that should NOT be flagged by the classifier, how many were flagged. Best value: 0
- False negative rate - of all samples that should be flagged by the classifier, how many were NOT flagged. Best value: 0

CONFUSION MATRIX
                  Predicted Positive     Predicted Negative
Actual Positive   True Positive (TP)     False Negative (FN)
Actual Negative   False Positive (FP)    True Negative (TN)
TPR = TP/(TP+FN)
FPR = FP/(FP+TN)

Example
CONFUSION MATRIX
                  Predicted Spam   Predicted Not Spam
Actual Spam       85               15
Actual Not Spam   10               90
TPR = TP/(TP+FN) = 85/(85+15) = 0.85
FPR = FP/(FP+TN) = 10/(10+90) = 0.10

ML - measuring performance
ROC (receiver operating characteristic) curve. The AUC (area under the curve) can be used as a performance indicator.
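The rates from the confusion-matrix slides can be reproduced with a few lines of Python. This is a minimal sketch: the function name `classification_rates` is hypothetical, and the counts are the ones from the spam example (85 TP, 15 FN, 10 FP, 90 TN). The false negative rate and the specificity are added using their standard definitions, FNR = FN/(TP+FN) and specificity = TN/(FP+TN).

```python
def classification_rates(tp, fn, fp, tn):
    """Compute rates from the four cells of a binary confusion matrix."""
    return {
        "TPR": tp / (tp + fn),          # sensitivity / recall
        "FNR": fn / (tp + fn),          # missed positives, equals 1 - TPR
        "FPR": fp / (fp + tn),          # false alarms among actual negatives
        "specificity": tn / (fp + tn),  # true negative rate, equals 1 - FPR
    }


# Counts from the spam example: 85 TP, 15 FN, 10 FP, 90 TN.
spam_rates = classification_rates(tp=85, fn=15, fp=10, tn=90)
for name, value in spam_rates.items():
    print(f"{name}: {value:.2f}")
```

For these counts the sketch reports a TPR of 0.85 and an FPR of 0.10, matching the worked example, together with an FNR of 0.15 and a specificity of 0.90.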
For the AUC, best values are close to 1; the worst case is close to 0.5, which is no better than random guessing.

Machine Learning methods
SL - Supervised Learning
- Use of labeled datasets to train algorithms to classify data or predict outcomes accurately.
- Classification, Regression
UL - Unsupervised Learning
- Analyze and cluster unlabeled datasets
- Clustering, Dimension reduction
SML - Semi-supervised Learning
- During training, it uses a smaller labeled data set to guide classification and feature extraction from a larger, unlabeled data set
- Semi-supervised learning
RL - Reinforcement Learning
- The dataset uses a “rewards/punishments” system, offering feedback to the algorithm to learn from its own experiences by trial and error
- Reinforcement learning

ML - supervised learning
Labeled datasets
Regression - output variable is a real or continuous variable
○ Linear regression
○ Non-linear regression
○ Logistic regression (despite its name, used for a binary, i.e. categorical, output)
Classification - output variable is a categorical variable
○ support vector machines
○ decision trees (also regression)
○ random forest (also regression)
○ K-NN (also regression)

ML - supervised learning
Linear regression - model of the relationship between a scalar response value (dependent variable) and one or more explanatory variables (independent variables)
$f(x, \beta) = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n$
Non-linear regression - model has a non-linear form
Polynomial regression - models the relationship between a dependent and an independent variable as an nth degree polynomial. The fitted curve is non-linear in x, but the model is still linear in the coefficients β.
$f(x, \beta) = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_n x^n$
See difference explained at Stackexchange.
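The relation between linear and polynomial regression can be illustrated with a small sketch in plain Python. It assumes ordinary least squares solved through the normal equations, which is one common way to fit these models and not necessarily the method used in the course. The data are synthetic, and `fit_least_squares` and `polynomial_features` are hypothetical helper names. The same fitting routine handles both models; only the feature columns change.

```python
def fit_least_squares(rows, targets):
    """Return the coefficients beta that minimise the squared error."""
    n = len(rows[0])
    # Augmented normal equations [X^T X | X^T y].
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * t for r, t in zip(rows, targets))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def polynomial_features(x, degree):
    """Expand a single value x into [1, x, x^2, ..., x^degree]."""
    return [x ** d for d in range(degree + 1)]


# Synthetic data generated from y = 1 + 2x + 0.5x^2 (an assumption for the demo).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0 + 2.0 * x + 0.5 * x ** 2 for x in xs]

linear = fit_least_squares([polynomial_features(x, 1) for x in xs], ys)
quadratic = fit_least_squares([polynomial_features(x, 2) for x in xs], ys)
print("degree 1 coefficients:", [round(b, 3) for b in linear])
print("degree 2 coefficients:", [round(b, 3) for b in quadratic])
```

The degree 2 fit recovers the generating coefficients (approximately 1, 2 and 0.5), while the straight line can only approximate the curve.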
ML - supervised learning
Logistic regression - Statistical model for a binary dependent variable

ML - supervised learning
Decision trees
Every node puts a threshold on some variable: if you are below that threshold go to the left, otherwise go to the right (for categorical variables, like gender, it’s a yes-no question rather than a threshold).

ML - supervised learning
Random forests (RF) and XGBoost
Ensemble of several decision trees, bootstrapped and aggregated.
Difference in how trees are built:
- RF - bagging (independent trees)
- XGBoost - boosting (sequential trees)

ML - supervised learning
Support Vector Machine (SVM) - A support vector machine constructs a hyper-plane or set of hyper-planes in a high or infinite dimensional space, which can be used for classification, regression or other tasks. Intuitively, a good separation is achieved by the hyper-plane that has the largest distance to the nearest training data points of any class (the so-called functional margin).
Support Vector Machine (SVM) - application of a kernel to remap points in another space

ML - supervised learning
k-nearest neighbors (K-NN) - non-parametric method that, in a classification problem, assigns to a data point the most common class among its k nearest neighbors.
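The K-NN rule described above fits in a few lines. The sketch below is a plain-Python illustration under stated assumptions: Euclidean distance, a simple majority vote, and made-up two-dimensional points with labels "A" and "B". The function name `knn_predict` is hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Return the most common label among the k training points nearest to query."""
    # Pair every training point's distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Majority vote among the k closest points.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training data: two well separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
labels = ["A", "A", "A", "B", "B", "B"]

near_a = knn_predict(points, labels, (1.1, 1.0), k=3)
near_b = knn_predict(points, labels, (3.9, 4.1), k=3)
print(near_a, near_b)
```

There is no training step in this method: the "model" is the stored data, and all the work happens at prediction time, which is why K-NN is called non-parametric.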
Fundamentals of Agro-Environment Data Science
Lesson 11 - Unsupervised Learning

Modelling in Data Science
Content:
- Machine Learning
- Modelling approaches
  - supervised
  - unsupervised
  - semi-supervised
  - reinforcement learning

Machine Learning methods
SL - Supervised Learning
- Use of labeled datasets to train algorithms to classify data or predict outcomes accurately.
- Classification, Regression
UL - Unsupervised Learning
- Analyze and cluster unlabeled datasets
- Clustering, Dimension reduction
SML - Semi-supervised Learning
- During training, it uses a smaller labeled data set to guide classification and feature extraction from a larger, unlabeled data set
- Semi-supervised learning
RL - Reinforcement Learning
- The dataset uses a “rewards/punishments” system, offering feedback to the algorithm to learn from its own experiences by trial and error
- Reinforcement learning

ML - unsupervised learning
Non-labeled datasets - identify commonalities in the data
Clustering - grouping unlabeled data based on their similarities or differences
○ K-means clustering
○ Hierarchical clustering
Dimensionality reduction - reduces the number of features (variables) to a manageable size, preserving the data integrity
○ Principal component analysis

ML - unsupervised learning
Curse of dimensionality
Dimensionality - number of features
- in theory: + features → + performance
- in practice: too many features, worse performance (autocorrelations, more noise, non-meaningful, …)
We need to perform a dimension reduction

ML - unsupervised learning
Clustering
K-Means Algorithm - create groups from unlabeled data
Iterations:
- select a data point
- find the samples closest to that point
- find centroid (mean)
- repeat until convergence
Requires a priori definition of the number of clusters
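The K-means iterations can be sketched in plain Python as follows. This is a minimal illustration, not a production implementation: it uses the standard assign-then-update loop, Euclidean distance, and takes the first k points as initial centroids, which is an assumption made to keep the example deterministic. The function name `kmeans` and the sample points are hypothetical.

```python
import math


def kmeans(points, k, max_iter=100):
    """Group points into k clusters; return (centroids, cluster index per point)."""
    # Naive initialisation (assumption): use the first k points as centroids.
    centroids = [points[i] for i in range(k)]
    assignment = None
    for _ in range(max_iter):
        # Assign every sample to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Convergence: stop when no sample changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Move each centroid to the mean of the samples assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, assignment


# Hypothetical data: two visibly separate groups of points.
samples = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (0.8, 1.2), (5.2, 4.8), (4.8, 5.2)]
centroids, assignment = kmeans(samples, k=2)
print(assignment)
print([tuple(round(v, 2) for v in c) for c in centroids])
```

The value of k is passed in by the caller, which reflects the requirement that the number of clusters be chosen before the algorithm runs.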
