Emerging Trends in Technology
Introduction to Data Science, AI & Machine Learning
Kunal Kishore

Module 1: Data Science & Analytics

1.1. Introduction to Data Science

Learning Objectives
Define and differentiate between machine learning and generative AI, understand their key concepts, and explore their applications across various fields.

Data is everywhere nowadays. In general, data is a collection of facts, information, and statistics, and it can take many forms: numbers, text, sound, images, or any other format. Data can be generated by:
a) Humans
b) Machines
c) Human-machine combinations
Data comes from various sources, such as customer transactions, sales records, website logs, and social media interactions.

Why is data important?
a) Data enables better decision-making.
b) Data identifies the causes of underperformance, aiding in problem-solving.
c) Data allows for performance evaluation.
d) Data facilitates process improvement.
e) Data provides insights into consumers and market trends.

Types of Data

Product Recommendations: Guess the Type of Customer
Customer segment guesses based on baskets of purchased products:

P1 (iPhone, Beer, Cornflakes)
Guess: Young adult or single professional. This segment has an iPhone, indicating a preference for high-end technology. The inclusion of beer suggests a lifestyle that might involve socialising or relaxing after work. Cornflakes are often associated with a quick, convenient breakfast, which is common for busy individuals. Overall, this customer likely enjoys tech, convenience, and leisure activities.

P2 (iPhone, Diaper, Beer)
Guess: Young parent. The presence of an iPhone again indicates an affinity for technology. Diapers suggest that this customer has a baby or young child. The beer might indicate a need for relaxation or socialising, perhaps after dealing with the challenges of parenthood. This segment likely represents a young parent balancing family life with some personal relaxation.

P3 (iPhone, Diaper, Biscuits)
Guess: Parent or family-oriented individual. Like P2, the diapers indicate that this customer likely has a child. The iPhone suggests they are tech-savvy, and biscuits could be a snack for either the child or the parent. This segment appears to be focused more on family-oriented products, possibly indicating a parent or caregiver who is attentive to both the child's needs and their own.

These guesses are based on the product combinations and typical consumer behaviour patterns associated with those products.

Let's see its association with Data Science
The example involves analysing customer purchase data to infer customer profiles and understand purchasing behaviour. It:
1. Helps retailers understand product affinities, optimise product placements, design cross-selling strategies, and improve inventory management.
2. Classifies customers into different segments based on their purchase behaviour and infers their profiles (e.g., young adult, parent).
3. Enables targeted marketing, personalised recommendations, and improved customer satisfaction by addressing the specific needs and preferences of each segment.
4. Predicts what other products a customer might be interested in based on current purchasing patterns.
5. Uses insights from data analysis to inform marketing campaigns, product placement, and inventory decisions.

The example illustrates a simplified scenario of how data science is applied in real-world contexts to understand and predict customer behaviour.

Why This Is Data Science
Let's first understand what data science is.

Data Science is a multidisciplinary field that focuses on finding actionable insights from large sets of structured and unstructured data. Data science experts integrate computer science, predictive analytics, statistics, and machine learning to mine very large data sets, with the goal of discovering insights that can help the organisation move forward and of anticipating specific future events.

Who is a Data Scientist?
A data scientist is a professional who uses scientific methods, algorithms, and systems to extract insights and knowledge from structured and unstructured data. They apply a combination of statistical analysis, machine learning, data mining, and programming skills to interpret complex data, identify patterns, and make data-driven decisions.

Skills of a Data Scientist:
1. Programming: Proficiency in languages like Python, R, and SQL for data manipulation and analysis.
2. Statistics and Mathematics: A strong foundation in statistical methods, probability, and linear algebra.
3. Machine Learning: Knowledge of algorithms, model development, and evaluation techniques.
4. Data Visualization: Ability to create visualisations using tools like Matplotlib, Seaborn, Tableau, or Power BI to communicate insights.
5. Domain Knowledge: Understanding the specific industry or business context to apply data science effectively.
6. Critical Thinking: Strong analytical skills to identify patterns, solve complex problems, and make data-driven decisions.

What problems do data scientists solve? (Types of problems)

Use Case 1: Is it A or B? Will the applicant be able to repay the loan or not?
The problem of determining whether an applicant will repay a loan is a binary classification problem:
Repay the loan (Yes)
Default on the loan (No)
The objective is to predict whether a given applicant belongs to the "repay" class or the "default" class based on various input features (such as income, credit score, existing debt, etc.). The model might output a probability score that represents the likelihood of the applicant repaying the loan. For example, a score of 0.85 might indicate an 85% chance of repayment.
Implications of the decision: if the prediction is "Yes," the loan may be approved, potentially with terms that reflect the applicant's risk profile (e.g., interest rate, loan amount). This type of problem is fundamental in credit risk assessment, where data science models help financial institutions make informed lending decisions.

Use Case 2: Is it weird? (Something weird means an anomaly.) Why am I suddenly getting so many spam emails in my inbox?
Normally, your inbox receives a certain number of emails per day, with only a small percentage being spam. This is the expected or "normal" behaviour. A sudden increase in spam emails represents a deviation from that norm. In data science, this could be detected using anomaly detection techniques, such as statistical methods (e.g., z-score), machine learning algorithms (e.g., Isolation Forest, Autoencoders), or rule-based systems.
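As a rough illustration of the statistical approach just mentioned, here is a minimal sketch (not part of the original notes) that flags unusually spammy days with a z-score. The daily counts and the threshold of 3 are invented for illustration.

```python
import numpy as np

# Hypothetical daily spam counts for the last two weeks (illustrative data only)
daily_spam = np.array([4, 3, 5, 2, 4, 3, 6, 4, 5, 3, 4, 2, 31, 4])

mean = daily_spam.mean()
std = daily_spam.std()

# z-score: how many standard deviations each day is from the mean
z_scores = (daily_spam - mean) / std

# Flag days whose spam volume deviates strongly from normal behaviour
threshold = 3.0  # common rule of thumb; tune for your own data
anomalous_days = np.where(np.abs(z_scores) > threshold)[0]

print("z-scores:", np.round(z_scores, 2))
print("Anomalous day indices:", anomalous_days)  # day 12 (value 31) stands out
```

The same idea scales to real mailbox logs: model what "normal" volume looks like, then investigate days that fall far outside it.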
By applying anomaly detection techniques, the root cause can be identified and addressed, improving the efficiency of spam filters and enhancing email security.

Industry-Specific Use Cases / Problems

Data Science Process

1.2. Data Collection and Preprocessing

Learning Objectives
Learn about various data collection methods, understand data quality issues and preprocessing techniques, and explore data cleaning and transformation.

Raw Data
Raw data refers to unprocessed information that is collected and stored in its original format. It is the most fundamental form of data, captured directly from various sources such as sensors, devices, or databases. Raw data is typically characterised by its lack of structure, organisation, or meaningful interpretation. It may include text files, log files, images, audio recordings, or numeric data.

Collecting Raw Data
Raw data is acquired from different sources and stored as-is, without any transformations or modifications. It can be collected manually or automatically through various methods, such as data extraction tools, IoT devices, or data streaming technologies. Once collected, raw data can be stored in databases, data warehouses, or data lakes, where it awaits further processing and analysis. By preserving data in its original format, raw data ensures data integrity and enables retrospective analysis.

Related Terms
Data lake: a centralised repository that stores raw data in its native format, facilitating data exploration, analysis, and processing.
Data pipeline: the set of processes and tools used to extract, transform, and load (ETL) raw data into a destination system for further processing.
Data preprocessing: transforming raw data into a standardised, clean format by applying techniques such as cleaning, filtering, and normalisation.
Data warehouse: stores cleaned and processed data in a centralised system. Data warehouses use hierarchical dimensions and tables to store data, and can be used to source analytic or operational reporting and business intelligence (BI) use cases.
Databricks: an industry-leading, cloud-based data engineering tool used for processing and transforming massive quantities of data and exploring the data through machine learning models; it is available on Azure as one of the main big data tools for the Microsoft cloud.

As a data scientist, you'll require data to address the problems you're working on. Occasionally, your organisation might already have the data you need. However, if the necessary data isn't being collected, you'll need to collaborate with a data engineering team to develop a system that begins gathering the required data.

Storage Systems
The choice of storage system for raw data depends on various factors, including the volume of data, access patterns, scalability requirements, and budget.
1. Cloud storage solutions like Amazon S3 or Google Cloud Storage are popular for their scalability and integration with analytics tools.
2. Distributed file systems like HDFS are preferred for big data applications.
3. In contrast, databases and data lakes offer structured environments for specific use cases.

Who is a Data Engineer?
A data engineer designs, builds, and maintains the systems that collect, store, and process data. Their work ensures that data is collected, stored, and made accessible in a reliable and efficient manner for use by data scientists, analysts, and other stakeholders. Key responsibilities include:
1. Designing Data Systems: creating and implementing data architectures that support data storage, processing, and retrieval.
2. Database Design: designing relational and NoSQL databases to handle structured and unstructured data.
3. ETL Processes: building and managing ETL (Extract, Transform, Load) pipelines to move data from various sources into data warehouses or lakes.
4. Data Integration: combining data from different sources to create unified datasets.
5. Building Data Warehouses: designing and maintaining data warehouses that consolidate data from multiple sources for analysis and reporting.
6. Handling Large-Scale Data: using technologies like Hadoop, Spark, and Kafka to process and analyse big data.

Data Cleaning
Data cleaning is identifying, diagnosing, and correcting the inconsistencies present in the data, in line with the problem at hand and the business processes. Dirty data is data that is incomplete, incorrect, incorrectly formatted, erroneous, or irrelevant to the problem we are trying to solve.

How do we know the data is dirty, or, more precisely, how do we recognise the dirt, and what are the main types of defects?

Types of defects in data
1) Duplicate records: any record that shows up more than once.
2) Missing data: any record that is missing important fields (zero, null, or extra whitespace).
3) Data type/format errors: data with inconsistent or incorrect data types or formats, such as dates stored as strings, numbers stored as strings, or separator issues.
4) Incorrect data: any field that is prone to human intervention or error in the process; open-ended fields with no validation against a predefined pattern or type.
5) Not-updated fields: fields that have not been updated for some reason (cron-related issues, or unnotified changes in the table structure or field names).
6) Outliers/inconsistency: data containing extraordinary values or outliers that do not follow any pattern.
7) Typo errors: misspellings introduced while collecting or capturing the record in the system.

Case 1
In an e-commerce domain, every box is weighed and its dimensions measured by the pickup executive at the time of completing the pickup process. If this is a manual process (and it usually is), there is a high chance of incorrect weights and dimensions being captured in the system, and this error can propagate through the whole system and impact other operations like load planning, vehicle optimisation, and even invoicing.

Can you guess what kind of errors to expect in the weight and dimension fields of the consignment data? Some examples:
1) Weight entered wrongly, e.g. 100 kg instead of 10 kg (incorrect data).
2) A 10.8-inch length recorded as 108.0 inches (wrongly placed decimal).
3) 100 cm recorded as 100 inches, or in some other unit (unit mismatch).

Take a pause and think of some more possibilities. If you are wondering which syntax or code can help detect such anomalies, a small sketch follows below. Some defects are fixable and some are not; non-fixable ones can sometimes be corrected or analysed by data engineering teams, or by changing the flow or mechanism of collecting the data to improve data quality.
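The following is a minimal pandas sketch, not part of the original notes, showing one way such weight and dimension defects might be flagged. The column names (weight_kg, length_cm) and the plausibility limits are illustrative assumptions; real limits would come from the business.

```python
import pandas as pd

# Illustrative consignment records; column names and values are made up
df = pd.DataFrame({
    "consignment_id": ["C1", "C2", "C3", "C4", "C4"],
    "weight_kg":      [10.0, 100.0, 9.5, 0.0, 0.0],
    "length_cm":      [27.4, 108.0, 30.0, 25.0, 25.0],
})

# 1) Duplicate records: the same consignment captured twice
duplicates = df[df.duplicated(keep=False)]

# 2) Missing or zero values that should never be zero
missing_weight = df[df["weight_kg"] <= 0]

# 3) Implausible values, e.g. a parcel weight far outside the expected range
#    (the limits below are assumptions for illustration)
outliers = df[(df["weight_kg"] > 50) | (df["length_cm"] > 100)]

print(duplicates)
print(missing_weight)
print(outliers)
```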
The standard set of errors has a predefined set of techniques that you can learn, building your understanding so that you keep your eyes open for any dirt present in the data.
a) If standard errors are repeating or occurring frequently, they can be fixed at the source by data engineering teams, or through semi-automation using Python scripts.
b) Data combined from multiple sources, web sources, or APIs can be cleaned by codifying the identified anomalies, string patterns, types, and formats.

Case 2: Data Cleaning with SQL
Let's understand the technique involved by taking the scenario of duplicate records. A duplicate record is when you have two (or more) rows with the same information. In a call-centre system, automatic calls are made and assigned to agents; if for some reason a call gets triggered twice and is recorded twice in the database tables, there is a high chance duplicate records are created.

Data Preprocessing
Data preprocessing is a crucial step in the data science pipeline, involving the transformation of raw data into a format suitable for analysis and modelling. The primary goal is to enhance data quality and ensure that it is clean, consistent, and ready for subsequent analysis or machine learning tasks. The main steps are listed below; a short sketch after the list illustrates a few of them.
1. Handling Missing Values: identifying and addressing missing data through imputation, deletion, or filling with default values.
2. Removing Duplicates: detecting and eliminating duplicate records to avoid redundancy.
3. Correcting Errors: fixing inconsistencies and errors in the data, such as incorrect values or typos.
4. Normalization/Scaling: adjusting numerical values to a common scale or range, which is essential for many machine learning algorithms.
5. Encoding Categorical Variables: converting categorical data into numerical format using techniques like one-hot encoding or label encoding.
6. Feature Engineering: creating new features or modifying existing ones to improve the model's performance.
7. Merging Datasets: combining data from different sources or tables to create a unified dataset.
8. Joining Data: using keys to integrate related datasets, ensuring consistency and completeness.
9. Dimensionality Reduction: reducing the number of features or variables through methods like Principal Component Analysis (PCA) to simplify the dataset and improve model efficiency.
10. Aggregation: summarizing data by grouping and calculating statistics to reduce data volume while retaining essential information.
11. Split into Training and Test Sets: dividing the dataset into training and test subsets to evaluate the performance of machine learning models and prevent overfitting.
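A minimal sketch of steps 1, 5, 11, and 4 on a made-up dataset. The column names and the use of scikit-learn here are illustrative assumptions, not part of the original notes.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up customer records for illustration
df = pd.DataFrame({
    "age":     [25, 32, None, 41, 29, 55],
    "city":    ["Delhi", "Mumbai", "Delhi", "Pune", "Mumbai", "Pune"],
    "spend":   [1200, 3400, 800, 5600, 2100, 7300],
    "churned": [0, 0, 1, 0, 1, 0],
})

# 1. Handle missing values: impute age with the median
df["age"] = df["age"].fillna(df["age"].median())

# 5. Encode the categorical variable with one-hot encoding
df = pd.get_dummies(df, columns=["city"])

# 11. Split into training and test sets before scaling (avoids data leakage)
X = df.drop(columns=["churned"])
y = df["churned"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# 4. Scale numeric features to a common range
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuse the training-set statistics
```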
1.3. Exploratory Data Analysis

Learning Objectives
Perform exploratory data analysis (EDA), interpret data patterns, and gain insights through visualization and statistical methods.

You've just decided to take a spontaneous trip. Excitement kicks in, and you sit down to book a flight. But instead of rushing into the first deal that pops up, you become a savvy traveller. You open multiple tabs, comparing airlines, ticket prices, and perks. One airline offers free WiFi, another has complimentary meals, and yet another has glowing reviews from happy travellers. You start making mental notes, weighing your options. Should you go with the cheaper flight or the one with better service?

This decision-making journey is exactly what Exploratory Data Analysis (EDA) is all about: taking raw information, exploring it from different angles, and finding the best insights before making a choice.

Exploratory Data Analysis (EDA) is an approach to analysing data sets to summarise their main characteristics and uncover patterns, relationships, and anomalies. It is an essential step in the data analysis process that helps you understand the data before applying more complex statistical or machine learning techniques.

Case Study: Exploring Sales Data for a Retail Store
You're the data analyst for a retail store, and you want to understand the sales patterns from the past month to make informed decisions about product restocking. Here's a small sample of sales data for the last 10 days (the sample table is not reproduced in this transcript).

How do we analyse it?
1) You begin by computing basic statistics about the data:
a) Total revenue: by summing the revenue column, you find the store earned $18,100 in 10 days.
b) Top-selling product: "Shoes" sold 310 units versus 165 units for "Jackets".
2) Visualisations: you can create a simple bar chart showing the units sold per product.
a) Chart: Shoes vs. Jackets sales comparison, with Shoes at 310 units and Jackets at 165 units. This tells you that shoes are selling almost twice as well as jackets. (A small pandas sketch of this kind of summary follows below.)
3) Identifying patterns (regional insights):
a) The South and East regions show strong sales for both products.
b) The West and North regions lag slightly, especially in jacket sales.

Through this simple EDA, you've identified which products are selling better, which regions need more focus, and how revenue is trending. This will help you decide stock levels and marketing strategies for each region.
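A minimal sketch, not from the original notes, of the kind of summary described above. It uses a made-up stand-in for the sales table; the column names (product, region, units, revenue) and values are assumptions.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up stand-in for the 10-day sales sample described in the case study
sales = pd.DataFrame({
    "product": ["Shoes", "Jackets", "Shoes", "Jackets", "Shoes", "Jackets"],
    "region":  ["South", "South", "East", "East", "West", "North"],
    "units":   [120, 60, 110, 55, 80, 50],
    "revenue": [6000, 3000, 5500, 2750, 4000, 2500],
})

# Basic statistics
print("Total revenue:", sales["revenue"].sum())
print(sales.groupby("product")["units"].sum())

# Regional pattern: units sold per product in each region
print(sales.pivot_table(index="region", columns="product", values="units", aggfunc="sum"))

# Simple bar chart of units sold per product
sales.groupby("product")["units"].sum().plot(kind="bar", title="Units sold per product")
plt.tight_layout()
plt.show()
```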
Objectives of EDA
1. Communicating information about the data set: how summary tables and graphs can be used to communicate information about the data.
a. Tables can present both detailed and summary-level information about a data set.
b. Graphs visually communicate information about the variables in a data set and the relationships between them.
2. Summarising the data and relationships: statistical approaches to summarising the data and the relationships within it, as well as making statements about the data with confidence.
a. Summarising the data: statistics not only provide methods for summarising sample data sets, they also allow us to make confident statements about entire populations.
b. Characterising the data: before building a predictive model or looking for hidden trends, it is important to characterise the variables and the relationships between them, and statistics gives us many tools to accomplish this.
c. Making statements about "hidden" facts: once a group of observations within the data has been identified as interesting through data mining techniques, statistics lets us make confident statements about these groups.
3. Answering analytical questions:
a. A series of methods for grouping and organising data to answer analytical questions (including clustering, association rules, and decision trees).
b. Building predictive models: processes and methods used in building models, including simple regression, k-nearest neighbours, classification and regression trees, and neural networks.

Models & Algorithms
In data science and machine learning, models and algorithms are fundamental components for solving problems and making predictions based on data. Here is a brief explanation of each concept and their interplay.

Models: a model is a mathematical or computational representation of a real-world process or system. In machine learning, it refers to a trained algorithm that can make predictions or decisions based on input data. Different models can be used to approach a problem from various angles, often in order to find the best-performing model for a specific task.

Types of Models:
a. Regression Models: predict continuous outcomes (e.g., Linear Regression).
b. Classification Models: predict categorical outcomes (e.g., Logistic Regression, Decision Trees).
c. Clustering Models: group similar data points together (e.g., K-Means Clustering).
d. Ensemble Models: combine multiple models to improve performance (e.g., Random Forest, Gradient Boosting Machines).

Algorithm: an algorithm is a step-by-step procedure or formula used to train a model. It defines how the model learns from data and updates its parameters to make accurate predictions.

Key Algorithms:
1. Linear Regression: used for predicting a continuous outcome based on linear relationships between features.
2. Decision Trees: used for classification and regression by splitting data based on feature values.
3. Support Vector Machines (SVM): used for classification by finding the optimal hyperplane that separates different classes.
4. Neural Networks: used for complex tasks like image recognition and natural language processing by mimicking the structure of the human brain.

Performance & Optimization
Once multiple models are created, they need to be evaluated and optimised to ensure the best performance. This involves:
a. Performance Metrics: assessing models using metrics like accuracy, precision, recall, F1-score, or mean squared error (MSE) to determine how well they perform.
b. Cross-Validation: splitting data into training and testing subsets multiple times to ensure that the model generalises well to unseen data.
c. Hyperparameter Tuning: adjusting the settings of the algorithm (e.g., learning rate, number of trees) to improve model performance.
d. Feature Selection: choosing the most relevant features to include in the model to enhance accuracy and reduce overfitting.

Optimization Techniques:
Grid Search: systematically testing a range of hyperparameter values to find the best combination.
Random Search: testing a random subset of hyperparameter values to find good settings more efficiently.
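A minimal scikit-learn sketch, not part of the original notes, showing cross-validation and a small grid search on a bundled toy dataset. The model choice (a decision tree) and the parameter grid are illustrative assumptions.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

# Cross-validation: estimate how well a model generalises to unseen data
model = DecisionTreeClassifier(random_state=0)
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print("Cross-validated accuracy:", round(scores.mean(), 3))

# Grid search: systematically try hyperparameter combinations
param_grid = {"max_depth": [2, 3, 4, None], "min_samples_split": [2, 5, 10]}
search = GridSearchCV(model, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", round(search.best_score_, 3))
```

Random search works the same way (scikit-learn's RandomizedSearchCV), sampling a fixed number of random combinations instead of trying them all.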
1.4. Data Visualization & Storytelling

Learning Objectives
Learn effective data visualization techniques, understand the importance of storytelling with data, and explore various data visualization tools, e.g. Tableau, Power BI, or Alteryx.

Communicating Insights
Communication and visualization are essential for translating complex data analysis into clear, actionable insights. Effective communication ensures that findings are presented in a way that is understandable and relevant to stakeholders, while visualization represents data visually, making it easier to identify patterns and make informed decisions. Both steps are integral to making data science outcomes accessible and useful for decision-making.
a. Reports: structured documents that summarise the analysis, methodology, results, and recommendations. These can be formal reports or executive summaries.
b. Presentations: visual and verbal summaries of the analysis, often delivered to stakeholders, decision-makers, or team members. Tools like PowerPoint or Google Slides are commonly used.
c. Storytelling: crafting a narrative around the data to make the findings more relatable and understandable. This involves framing data insights in a way that resonates with the audience's needs and interests.
d. Meetings and Discussions: engaging in conversations with stakeholders to explain findings, answer questions, and discuss implications and next steps.

Components of Storytelling & Data Visualization
1. Context
2. Selecting effective visuals
3. Avoiding clutter
4. Focusing the audience's attention
5. Thinking like a designer
6. Organizing the storyline

1.5. Introduction to Big Data

Learning Objectives
Understand the characteristics of big data, explore relevant technologies, and delve into the challenges and opportunities it brings.

Facebook generates approximately 500 terabytes of data per day, airlines generate about 10 terabytes of sensor data every 30 minutes, and the NSE stock exchange generates approximately 1 terabyte of data per day. These are a few examples of big data.

Introduction to Big Data
Big data is a large, diverse set of information that can grow at ever-increasing speed and cannot be loaded on a single machine because of its size. Simply adding more memory to one machine is not an ideal option, because of the data processing and computational requirements involved.

The 5 V's of Big Data
Big data is commonly characterised by the 5 V's: Volume, Variety, Value, Velocity, and Veracity. Data comes from various sources such as social media sites, e-commerce platforms, news sites, and financial transactions, and it can be audio, video, text, emails, transactions, and much more. Although storing raw data is not difficult, converting unstructured data into a structured format and making it accessible for business use is practically complex. Smart sensors, smart metering, and RFID tags make it necessary to deal with huge data influxes in almost real time.

Sources of data in Big Data
Big data can come in structured as well as unstructured form, from many different sources. The main sources include:
a. Social media
b. Cloud platforms
c. IoT, smart sensors, RFID
d. Web pages
e. Financial transactions
f. Healthcare and medical data
g. Satellites

Big data can be categorised as structured, unstructured, and semi-structured. It is useful in areas as diverse as stock market analysis, medicine and healthcare, agriculture, gambling, and environmental protection. The scope of big data is very broad: it is not limited to handling voluminous data, it also covers organising stored data into structured formats that enable easy analysis.

Big Data Infrastructure
Hadoop is an open-source, Java-based framework that manages the storage and processing of large amounts of data for applications.
Hadoop uses distributed storage and parallel processing to handle big data and analytics jobs, breaking workloads down into smaller workloads that can be run at the same time. Four modules make up the primary Hadoop framework and work collectively to form the Hadoop ecosystem: the Hadoop Distributed File System (HDFS), Yet Another Resource Negotiator (YARN), MapReduce, and Hadoop Common.
1. HDFS (Hadoop Distributed File System), for storage: a distributed file system in which individual Hadoop nodes operate on data that resides in their local storage. This reduces network latency, providing high-throughput access to application data.
2. Yet Another Resource Negotiator (YARN): a resource-management platform responsible for managing compute resources in clusters and using them to schedule users' applications. It performs scheduling and resource allocation across the Hadoop system.
3. MapReduce, for processing: in the MapReduce model, subsets of larger datasets and the instructions for processing them are dispatched to multiple nodes, and each subset is processed in parallel with other processing jobs. After processing, the individual results are combined into a smaller, more manageable dataset.
4. Hadoop Common: the libraries and utilities used and shared by the other Hadoop modules.

Hadoop tools
Hadoop has a large ecosystem of open-source tools that can augment and extend the capabilities of the core modules. Beyond HDFS, YARN, and MapReduce, the Hadoop open-source ecosystem continues to grow and includes many tools and applications that help collect, store, process, analyse, and manage big data. Some of the main tools used with Hadoop include:
a. Apache Hive: a data warehouse that allows programmers to work with data in HDFS using a query language called HiveQL, which is similar to SQL.
b. Apache HBase: an open-source, non-relational, distributed database often paired with Hadoop.
c. Apache Pig: a tool used as an abstraction layer over MapReduce to analyse large data sets, providing functions such as filter, sort, load, and join.
d. Apache Impala: an open-source, massively parallel processing SQL query engine often used with Hadoop.
e. Apache Sqoop: a command-line application for efficiently transferring bulk data between relational databases and Hadoop.
f. Apache ZooKeeper: an open-source server that enables reliable distributed coordination in Hadoop; a service for "maintaining configuration information, naming, providing distributed synchronization, and providing group services".
g. Apache Oozie: a workflow scheduler for Hadoop jobs.

Big Data Storage Architecture

1.6. Descriptive Statistics

Learning Objectives
Learn basic statistical concepts, understand the different types of descriptive statistics, and explore how descriptive statistics are used in data analysis.

Statistics is a branch of mathematics that deals with collecting, analysing, interpreting, and presenting data. It helps in making informed decisions based on data. Statistics is broadly divided into descriptive and inferential statistics.

1) Descriptive Statistics: statistics that summarise various attributes of a variable, such as the average value, the range of values, the mean, mode, and median, and trends and distributions.
a) A survey of students' grades shows an average score (mean) of 75%.
Descriptive statistics help summarise this information without drawing conclusions about the entire population.

2) Inferential Statistics: involves making predictions or inferences about a population based on a sample. It includes methods like hypothesis testing, confidence intervals, and regression analysis.
a) After surveying 100 students, you infer that the average score of all students at the university is likely around 75%.

Methods for making statements about data with confidence: inferential statistics covers ways of making confident statements about populations using sample data.
a) Confidence intervals: allow us to make statements about the likely range within which a population parameter (such as the mean) lies.
b) Hypothesis tests: a hypothesis test determines whether the data collected supports a specific claim.
c) Chi-square: a procedure for understanding whether a relationship exists between pairs of categorical variables.
d) Analysis of variance (ANOVA): determines whether a relationship exists between three or more group means.

3) Comparative statistics allows us to understand relationships between variables.

If you want to know the average height of students at your university, you could measure the height of a sample of students (e.g., 100 students) and use inferential statistics to estimate the average height of all students (the population).

Frequency Distributions
The frequency distribution is a fundamental tool for data analysis, offering a window into the underlying patterns and trends hidden within raw data. A frequency distribution is a methodical arrangement of data that reveals how often each value in a dataset occurs.

Leveraging Python to construct and analyse frequency distributions adds efficiency and flexibility to the statistical analysis process. With libraries such as Pandas for data manipulation and Matplotlib and Seaborn for data visualization, Python transforms the way data scientists and statisticians approach frequency distributions, making it easier to manage data, perform calculations, and generate insightful visualizations.

Histograms
A histogram is a graphical representation of the distribution of numerical data. It consists of bars where the height of each bar represents the frequency of data points within a specific range, or bin.

Analyzing Frequency Distributions with Pandas
Pandas is exceptionally well suited to creating frequency distributions, especially with its DataFrame structure, which makes data manipulation intuitive.

Loading Data: first, load your dataset into a Pandas DataFrame. For example, let's use the Iris dataset available through Seaborn.

Creating Frequency Distributions: for discrete data, use the `value_counts()` method to generate a frequency distribution. For continuous data, you can categorise the data into bins using the `cut()` function and then apply `value_counts()`.
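A minimal sketch along the lines just described, added to these notes for illustration. It assumes seaborn's bundled Iris dataset is available (loading it requires an internet connection the first time).

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Iris dataset into a DataFrame
iris = sns.load_dataset("iris")

# Discrete data: frequency of each species
print(iris["species"].value_counts())

# Continuous data: bin sepal length into 5 intervals, then count per bin
bins = pd.cut(iris["sepal_length"], bins=5)
print(bins.value_counts().sort_index())

# Quick histogram of the same variable
iris["sepal_length"].plot(kind="hist", bins=5, title="Sepal length distribution")
plt.show()
```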
Sample & Population
A population includes all of the elements of a set of data, while a sample consists of one or more observations drawn from the population. A measurable characteristic of a population, such as a mean or standard deviation, is called a parameter; a measurable characteristic of a sample is called a statistic.

Why do we need samples? Sampling is done because one usually cannot gather data from the entire population. Data may be needed urgently, and including everyone in the population in your data collection may take too long. More than 500 million people voted in India in the 2024 general elections; an agency conducting an exit poll could not possibly reach out to all of the voters.

Module 2: AI and Machine Learning

2.1. Introduction to AI & Machine Learning

Learning Objectives
Define machine learning and generative AI, grasp their key concepts and distinctions, and explore their applications across different fields.

Machine Learning (ML) is a subset of artificial intelligence (AI) that involves the development of algorithms and statistical models that enable computers to perform tasks without explicit instructions. Instead of being programmed with specific rules to follow, these models learn from data, identify patterns, and make decisions or predictions based on new data. ML enables systems to improve over time and make intelligent decisions based on data.

AI, Machine Learning and Deep Learning
Deep learning is a subset of machine learning that embraces the mechanism of neural networks for more complex problem-solving. It takes inspiration from the human brain: it uses vast numbers of layered algorithms, hence the term "deep", to simulate the brain's intricate structure.

Generative AI uses machine learning models trained on large datasets to generate new content such as text, images, videos, music, and code.

Everyday Examples of Machine Learning in Our Lives
1. Navigation: apps like Google Maps use machine learning to provide real-time traffic updates, suggest optimal routes, and estimate travel times based on historical and current traffic data.
2. Recommendations: platforms like Netflix and Spotify use machine learning to analyse your viewing or listening history and recommend movies, TV shows, or music tailored to your preferences.
3. E-Commerce: websites like Amazon and Flipkart use algorithms to suggest products based on your browsing history and previous purchases.
4. Virtual Assistants: Siri, Google Assistant, and Alexa use natural language processing (NLP) and machine learning to understand and respond to your voice commands, answer questions, and perform tasks like setting reminders or controlling smart home devices.
5. Spam Detection: email services use machine learning to identify and filter out spam or phishing emails, based on patterns and characteristics learned from previous emails.
6. Social Media: platforms like Facebook, Instagram, and Twitter use machine learning to analyse your interactions and display posts, ads, and stories that match your interests.
7. Wearables: smartwatches and fitness trackers use machine learning to analyse your activity data, monitor health metrics, and provide insights into your fitness and well-being.
8. Chatbots: customer service bots on websites and apps use machine learning to understand and respond to customer inquiries, providing instant support and information.
9. Photo Organization: apps like Google Photos use machine learning to categorise and search photos based on content, such as people, places, or objects.
10. Self-Driving Cars: companies like Tesla use machine learning to enable autonomous vehicles to navigate roads, detect obstacles, and make driving decisions based on sensor data.
11. Language Learning: Duolingo is a popular language learning app that leverages machine learning to enhance its language education services.
12. Security Systems: smartphones and security cameras use deep learning for facial recognition to unlock devices or identify individuals in surveillance footage.
13. Autocorrect and Predictive Text: smartphones and email applications use NLP to suggest words or correct spelling and grammar as you type, improving typing efficiency.

What the heck is a Machine Learning Model?
A machine learning model is like a computer program that learns from data. Instead of being explicitly programmed to do a task, it learns by finding patterns in the data. Machine learning is like teaching a computer to learn from examples and make decisions or predictions based on that learning. For example, if you have pictures of cats and dogs, a machine learning model can learn to distinguish between them by analysing features like fur texture, ear shape, and so on. Once the model has learned from many examples, you can give it new, unseen data (like a picture of a cat it has never seen). The model uses what it learned to make predictions or decisions, such as predicting whether a given image is of a cat or a dog. The generalised flow can be understood as depicted in the accompanying diagram.

How Machine Learning Models Work
Say a multinational bank wants to improve its loan approval process by using machine learning. ………………..

Types of Machine Learning
Machine learning can be supervised, semi-supervised, unsupervised, or reinforcement learning.

Building Blocks of Machine Learning Algorithms
a. A loss function
b. An optimization criterion based on the loss function (a cost function, for example)
c. An optimization routine leveraging machine learning frameworks

Machine learning (ML) is the process of creating systems that can learn from data and make predictions or decisions. ML is a branch of artificial intelligence (AI) with applications in many domains, such as computer vision, natural language processing, recommender systems, and more. To develop ML models, data scientists and engineers need tools that simplify the complex algorithms and computations involved. These tools are called ML frameworks, and they provide high-level interfaces, libraries, and pre-trained models that help with data processing, model building, training, evaluation, and deployment. Most of these are Python machine learning frameworks, primarily because Python is the most popular machine learning programming language.

1. Azure ML Studio: allows Microsoft Azure users to create and train models, then turn them into APIs that can be consumed by other services. Users get up to 10 GB of storage per account for model data, although you can also connect your own Azure storage to the service for larger models. A wide range of algorithms are available, from both Microsoft and third parties. (See https://www.datacamp.com/tutorial/azure-machine-learning-guide)
2. Scikit-learn: a Python library that supports both supervised and unsupervised learning. If you're new to machine learning, Scikit-learn is a great choice; it's effective for predictive data analysis and feature engineering.
3. PyTorch: a customizable, open-source deep learning framework. If you're a Python developer searching for a framework with a shorter learning curve, PyTorch might be the one for you.
4. TensorFlow: a popular end-to-end machine learning platform that offers feature engineering and model serving. If you need a framework with robust scalability that works across a wide range of data sets, TensorFlow is a good choice.
(See https://www.linkedin.com/pulse/popular-ml-frameworks-train-your-models-what-choose-vishnuvaradhan-v-q98oc)
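To make the fit/predict pattern these frameworks share concrete, here is a minimal scikit-learn sketch added to these notes for illustration. It uses the bundled Iris flower data rather than cat/dog images, and the choice of logistic regression is an assumption.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Labelled examples: measurements (features) and the species (label)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# The model learns patterns from the training examples...
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# ...and then predicts labels for new, unseen examples
print("Predicted classes:", model.predict(X_test[:5]))
print("Test accuracy:", model.score(X_test, y_test))
```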
What is the difference between a library and a framework?
A library performs specific, well-defined operations. A framework is a skeleton where the application defines the "meat" of the operation by filling out the skeleton. The skeleton still has code to link up the parts, but the most important work is done by the application.

What is MLOps?
MLOps stands for Machine Learning Operations. MLOps is a core function of machine learning engineering, focused on streamlining the process of taking machine learning models to production, and then maintaining and monitoring them. MLOps is a collaborative function, often comprising data scientists, DevOps engineers, and IT.

Components of MLOps

MLOps Cycle

Role of MLOps and ML Engineers
An MLOps engineer automates, deploys, monitors, and maintains machine learning models in production, ensuring scalability, performance, security, and compliance while collaborating with cross-functional teams to streamline the ML lifecycle and optimise infrastructure and costs. In contrast, a machine learning engineer primarily develops, trains, and fine-tunes models, focusing on the core algorithms, data preprocessing, and feature engineering. While MLOps engineers handle the operationalisation and lifecycle management of models, ML engineers concentrate on model development and experimentation.

2.2. Supervised Learning

Learning Objectives
Understand supervised learning algorithms, including regression and classification problems, and explore evaluation metrics for these models.

Training a machine learning model where every input has a corresponding target is called supervised learning. In supervised learning, the dataset is a collection of labeled examples: feature and target variables. A feature vector is a vector in which each dimension j = 1, ..., D contains a value that describes the example. That value is called a feature and is denoted x(j). The goal of a supervised learning algorithm is to use the dataset to produce a model that takes a feature vector x as input and outputs information that allows deducing the label for this feature vector.

There are some very practical applications of supervised learning algorithms in real life (input -> output: application):
Emails -> Spam (0/1): spam detection
Audio -> Text transcript: speech recognition
English -> French: machine translation
Image and radar information -> Position of objects: self-driving cars
Car specifications (km driven, fuel type, ...) -> Price: price prediction

Regression Problems
For a used car price prediction model, labelled data refers to a dataset where each record includes both the features (attributes) of the used car and the corresponding target variable, the price.

Features (attributes): these are the input variables that the model uses to make predictions. Typical features for used car price prediction range from the car's name to the number of seats. The price at which the car is being sold, usually represented as a continuous numerical value, is called the target variable here: it is the output that the model is trying to predict for unseen feature values.

Factors affecting the price of the car
Predicting the price of a used car involves the following components:
a. Independent variables, i.e. predictor or feature variables
b. Dependent variables, i.e. the target variable or outcome
Predicting the values of a continuous dependent variable using independent explanatory variables is called regression. Linear regression is a form of regression that models a linear relationship between the dependent and independent variables.
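A minimal scikit-learn sketch of such a regression model, added for illustration. The feature names (km_driven, age_years, fuel) and the tiny dataset are invented; a real model would use many more records and features.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Invented used-car records: features plus the target (price)
cars = pd.DataFrame({
    "km_driven": [40000, 75000, 20000, 120000, 60000],
    "age_years": [3, 6, 2, 9, 5],
    "fuel":      ["petrol", "diesel", "petrol", "diesel", "petrol"],
    "price":     [550000, 420000, 700000, 250000, 480000],
})

# Independent (feature) variables and the dependent (target) variable
X = pd.get_dummies(cars[["km_driven", "age_years", "fuel"]], columns=["fuel"])
y = cars["price"]

model = LinearRegression()
model.fit(X, y)

# Predict the price of an unseen car (same columns as the training features)
new_car = pd.DataFrame([{"km_driven": 30000, "age_years": 4,
                         "fuel_diesel": False, "fuel_petrol": True}])
print("Predicted price:", round(model.predict(new_car[X.columns])[0]))
```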
Classification Problems
A classification problem in machine learning is a predictive modelling task that involves predicting a class label for a given example of input data. Some examples of classification problems:
1. Spam filtering: classifying an email as spam or not spam.
2. Handwriting recognition: identifying a handwritten character as one of the recognised characters.
3. Image classification: classifying images into predefined classes, such as cats and dogs, or oranges, apples, and pears.
The prediction task is a classification when the target variable is discrete. There are four main classification tasks in machine learning: binary, multi-class, multi-label, and imbalanced classification.

1. Binary Classification
In a binary classification task, the goal is to classify the input data into two mutually exclusive categories. The training data in such a situation is labeled in a binary format: true and false; positive and negative; 0 and 1; spam and not spam; and so on.

2. Multi-Class Classification
Multi-class classification has more than two mutually exclusive class labels, and the goal is to predict to which class a given input example belongs.

3. Multi-Label Classification
Multi-label classification is a classification problem where each instance can be assigned to one or more classes. For example, in text classification an article can be about 'Technology', 'Health', and 'Travel' simultaneously.

4. Imbalanced Classification
In imbalanced classification, the number of examples is unevenly distributed across the classes, meaning that we can have more of one class than the others in the training data. Consider a 3-class classification scenario where the training data contains 60% trucks, 25% planes, and 15% boats. Imbalanced classification problems occur in scenarios such as:
a. Fraudulent transaction detection in financial industries
b. Rare disease diagnosis
c. Customer churn analysis

2.3. Unsupervised Learning

Learning Objectives
Understand unsupervised learning algorithms, explore clustering and dimensionality reduction techniques, and examine their applications.

In unsupervised learning, the dataset is a collection of unlabeled examples. Again, x is a feature vector, and the goal of an unsupervised learning algorithm is to create a model that takes a feature vector x as input and either transforms it into another vector or into a value that can be used to solve a practical problem. Some common algorithms for unsupervised machine learning:
1. Clustering (k-means, hierarchical, DBSCAN)
2. Dimensionality reduction
3. Latent Semantic Analysis (LSA)
For example, in clustering, the model returns the id of the cluster for each feature vector in the dataset.
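A minimal k-means sketch, added for illustration, using synthetic points generated with scikit-learn; the choice of three clusters is arbitrary.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic, unlabeled feature vectors grouped around 3 centres
X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

# Fit k-means and get the cluster id assigned to each feature vector
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_ids = kmeans.fit_predict(X)

print("First 10 cluster ids:", cluster_ids[:10])
print("Cluster centres:\n", kmeans.cluster_centers_.round(2))
```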
Use cases of clustering in business include customer segmentation.

In dimensionality reduction, the output of the model is a feature vector that has fewer features than the input x. In outlier detection, the output is a real number that indicates how different x is from a typical example in the dataset.

What is the difference between supervised and unsupervised machine learning?
A supervised model has a label and predicts it on the basis of the feature (independent) variables. An unsupervised model does not have labels. For example, when grouping customers for targeting and promotion, an unsupervised model does not tell you that customer A belongs to Class 1 (no label is given). Suppose it has created three clusters, C1, C2, and C3, on the basis of the feature variables: C1 customers buy low-priced products, whereas C2 customers buy expensive products, which results in higher sales value along with lower sales quantity.

In regression we predict continuous values, whereas in classification we predict discrete or binary values. Predicting the price of a used car on the basis of independent explanatory variables (mileage, kilometres driven, model, etc.) is regression, because price is a continuous variable and can take any value. Approving a loan (yes/no) is classification.

Reinforcement Learning
Reinforcement learning is a subfield of machine learning where the machine "lives" in an environment and is capable of perceiving the state of that environment as a vector of features. A classic example is an agent learning to play the Pac-Man game.

2.4. Deep Learning

Learning Objectives
Understand the concept of deep learning, explore neural networks and their architectures, and examine various deep learning applications.

Deep learning, a subset of machine learning, embraces the mechanism of neural networks for complex problem-solving. Deep learning becomes particularly relevant for problems involving large and complex datasets, where traditional machine learning algorithms fall short, and for problems involving audio, images, and similar data where the features need to be learned automatically. Deep learning is an advanced machine learning technique that requires more computation and large amounts of training data in order to scale and generalise well on complex data representations. The concept takes inspiration from the human brain: it uses a vast number of layered algorithms, hence the term "deep", to simulate the brain's intricate structure.

Natural language processing
Natural language processing is an important part of deep learning applications that rely on interpreting text and speech. Customer service chatbots, language translators, and sentiment analysis are all examples of applications that benefit from natural language processing.

Self-driving vehicles
Autonomous vehicles use deep learning to learn how to operate and handle different situations while driving; it allows vehicles to detect traffic lights, recognise signs, and avoid pedestrians. Algorithms such as LaneNet are popular in research for extracting lane lines, and algorithms such as YOLO or SSD are very popular for object detection in this field.

Neural Networks
Neural networks are often referred to as "artificial brains"; they are a vital part of machine learning and AI technology. Inspired by the biological neural networks that make up human brains, these powerful computational models leverage interconnected layers of artificial neurons, or "nodes".
Nodes (artificial neurons) receive input data, process it with a set of predefined rules, and pass the result to the next layer in the network. A neural network is typically organised into three essential kinds of layers, and each layer contains multiple nodes which process the incoming data before passing it on. Each connection between nodes carries a numerical weight that adjusts as the network learns during the training stage, reflecting the importance of that input value.
1. Input layer: receives the raw data attributes as input. The nodes of the input layer are passive, meaning they do not change the data; they pass it on to the hidden layers.
2. Hidden layer(s): these layers do most of the computation required. They take the data from the input layer, process it, and pass it to the next layer.
3. Output layer: generates the final outputs.
Neural networks adjust the weights between nodes through backpropagation to optimise performance over time.

The Spectrum of ML and Data Science Roles
a. Data Analysis / Modeling: data analysis, feature engineering, model development and training, statistical analysis, experiment design.
b. ML Services and Infrastructure: training and inference services, scalability, model deployment, API integration.
c. Area of Specialization:
i. Generalist: works on a variety of problem spaces, employs a broad range of ML techniques, and adapts to the different requirements of the team.
ii. Specialist: deep expertise in a chosen domain (such as Natural Language Processing (NLP), Computer Vision (CV), or industry-specific areas like self-driving cars and robotics), with advanced knowledge of domain-specific tools.

2.5. LLMs (Large Language Models)

Learning Objectives
Understand the principles of large language models (LLMs), explore their architectures, and examine their applications in natural language processing and AI.

Generative AI
Generative AI is a type of AI that can create new content, such as text, images, voice, and code, from scratch based on natural language inputs or "prompts". It uses machine learning models powered by advanced algorithms and neural networks to learn from large datasets of existing content. Once a model is trained, it can generate new content similar to the content it was trained on.

What are LLMs?
Large Language Models are foundational machine learning models that use deep learning algorithms to process and understand natural language. These models are trained on massive amounts of text data to learn patterns and identify entity relationships in the language. Language models are computational systems designed to understand, generate, and manipulate human language. At their core, language models learn the intricate patterns, semantics, and contextual relationships within language through extensive training over vast volumes of text data. They are equipped with the capacity to predict the next word in a sentence based on the preceding words, with coherence and relevance.
(Source: https://www.coursera.org/learn/generative-ai-with-llms)

Examples of well-known LLMs include GPT-3, GPT-4, BLOOM, FLAN-UL2, Claude, LaMDA, GATO, ChatGLM, LLaMA, LLaMA 2, LLaMA 3, MT-NLG, Pathways Language Model (PaLM), Falcon, Stanford Alpaca, and Mistral 7B.

Text Generation before LLMs
Several traditional approaches to predicting the next word given some context were used in language models before the advent of transformers:
N-gram models, Markov models, and RNNs & LSTMs (Long Short-Term Memory networks).

In 2017, everything changed with the introduction of the transformer architecture, described in the influential paper "Attention Is All You Need" by Google and the University of Toronto. The transformer revolutionised generative AI by enabling efficient scaling on multi-core GPUs, parallel processing of input data, and the use of much larger training datasets. Its key breakthrough was the ability to learn and use attention mechanisms, allowing the model to focus on the meaning of the words being processed.

What can LLMs be used for?
An LLM is a language model capable of tasks such as text-to-text generation, text-to-image and image-to-text generation, and code generation:
a. Text classification
b. Text generation
c. Text summarization
d. Conversational AI (chatbots)
e. Question answering
f. Speech recognition
g. Speaker identification
h. Spelling correction

2.6. Prompt Engineering

Learning Objectives
Understand the principles of prompt engineering, explore techniques for crafting effective prompts, and examine their impact on model performance and output.

Prompt engineering is a relatively new discipline for developing and optimizing prompts to efficiently use language models (LMs) for a wide variety of applications and research topics. Prompt engineering skills help you better understand the capabilities and limitations of large language models (LLMs). Prompt engineering is not just about designing and developing prompts; it encompasses a wide range of skills and techniques that are useful for interacting and developing with LLMs. It is an important skill for interfacing with, building with, and understanding the capabilities of LLMs.

LLM Settings
When designing and testing prompts, you typically interact with the LLM via an API. You can configure a few parameters to get different results from your prompts. Tweaking these settings is important for improving the reliability and desirability of responses, and it takes a bit of experimentation to figure out the proper settings for your use cases. Below are the common settings you will come across when using different LLM providers:
a. Temperature
b. Top P
c. Max Length
d. Stop Sequences
e. Frequency Penalty
f. Presence Penalty

a. Temperature: In short, the lower the temperature, the more deterministic the results, in the sense that the highest-probability next token is always picked. Increasing the temperature leads to more randomness, which encourages more diverse or creative outputs; you are essentially increasing the weights of the other possible tokens. In terms of application, you might want to use a lower temperature value for tasks like fact-based QA to encourage more factual and concise responses. For poem generation or other creative tasks, it might be beneficial to increase the temperature.

b. Top P: A sampling technique used together with temperature, called nucleus sampling, with which you can control how deterministic the model is. If you are looking for exact and factual answers, keep this low; if you are looking for more diverse responses, increase it to a higher value. If you use Top P, only the tokens comprising the top_p probability mass are considered for responses, so a low top_p value selects only the most confident responses, while a high top_p value lets the model consider more possible words, including less likely ones, leading to more diverse outputs. The general recommendation is to alter temperature or Top P, but not both.
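To make temperature and top_p concrete, here is a small, self-contained sketch added to these notes. It is not tied to any particular provider's API; it simply applies temperature scaling and nucleus (top-p) filtering to a toy next-token distribution, and the token list and probabilities are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy next-token probabilities from a hypothetical language model
tokens = np.array(["cat", "dog", "car", "banana", "quasar"])
probs = np.array([0.45, 0.30, 0.15, 0.07, 0.03])

def sample(probs, temperature=1.0, top_p=1.0):
    # Temperature: low values sharpen the distribution (more deterministic),
    # high values flatten it (more random / creative)
    scaled = np.exp(np.log(probs) / temperature)
    p = scaled / scaled.sum()

    # Top-p (nucleus) filtering: keep the smallest set of most likely tokens
    # whose cumulative probability mass reaches top_p, then renormalise
    order = np.argsort(p)[::-1]
    cutoff = np.searchsorted(np.cumsum(p[order]), top_p) + 1
    keep = order[:cutoff]
    nucleus = np.zeros_like(p)
    nucleus[keep] = p[keep]
    nucleus /= nucleus.sum()
    return tokens[rng.choice(len(tokens), p=nucleus)]

for t, tp in [(0.2, 1.0), (1.5, 1.0), (1.0, 0.5)]:
    samples = [sample(probs, temperature=t, top_p=tp) for _ in range(8)]
    print(f"temperature={t}, top_p={tp}: {samples}")
```

With a low temperature the samples are almost always "cat"; with a high temperature the rarer tokens appear more often; with a low top_p only the top one or two tokens are ever considered.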
Prompting an LLM
A prompt can contain information such as the instruction or question you are passing to the model, along with other details such as context, inputs, or examples. You can use these elements to instruct the model more effectively and improve the quality of the results.
A few example prompt types:
a. Text Summarization
b. Information Extraction
c. Question Answering
d. Text Classification
e. Conversation
f. Code Generation
g. Reasoning
Text Summarization
Example:
Prompt:
Explain antibiotics
A:
Antibiotics are a type of medication used to treat bacterial infections. They work by either killing the bacteria or preventing them from reproducing, allowing the body’s immune system to fight off the infection. Antibiotics are usually taken orally in the form of pills, capsules, or liquid solutions, or sometimes administered intravenously. They are not effective against viral infections, and using them inappropriately can lead to antibiotic resistance.
Explain the above in one sentence:
Output:
Antibiotics are medications used to treat bacterial infections by either killing the bacteria or stopping them from reproducing, but they are not effective against viruses and overuse can lead to antibiotic resistance.
Text Classification
Prompt:
Classify the text into neutral, negative or positive.
Text: I think the food was okay.
Sentiment:
Output:
Neutral
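To tie the prompt and the settings together, the sketch below shows how the sentiment-classification prompt above could be sent to a chat-style LLM API. The parameter names follow the OpenAI Python SDK's chat-completions interface and the model name is an assumption; other providers expose equivalent settings under similar names.

# Illustrative sketch: sending the classification prompt above with the
# settings discussed in the previous subsection. The model name is assumed.
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

prompt = (
    "Classify the text into neutral, negative or positive.\n"
    "Text: I think the food was okay.\n"
    "Sentiment:"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",          # assumed model name
    messages=[{"role": "user", "content": prompt}],
    temperature=0,                # deterministic: pick the most probable token
    top_p=1,                      # leave nucleus sampling effectively off
    max_tokens=5,                 # the answer is a single word, so keep it short
    stop=["\n"],                  # stop at the end of the first line
    frequency_penalty=0,          # no repetition penalties needed for one word
    presence_penalty=0,
)

print(response.choices[0].message.content.strip())  # expected: "Neutral"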
Code Generation
One application where LLMs are quite effective is code generation.
Example:
Prompt:
"""
Table departments, columns = [DepartmentId, DepartmentName]
Table students, columns = [DepartmentId, StudentId, StudentName]
Create a MySQL query for all students in the Computer Science Department
"""
Output:
SELECT StudentId, StudentName
FROM students
WHERE DepartmentId IN (SELECT DepartmentId FROM departments WHERE DepartmentName = 'Computer Science');
Elements of a Prompt
A prompt can contain any of the following elements:
a. Instruction - a specific task or instruction you want the model to perform
b. Context - external information or additional context that can steer the model towards better responses
c. Input Data - the input or question that we are interested in finding a response for
d. Output Indicator - the type or format of the output
Example:
Classify the text into neutral, negative, or positive
Text: I think the food was okay.
Sentiment:
In the prompt example above, the instruction corresponds to the classification task, "Classify the text into neutral, negative, or positive". The input data corresponds to the "I think the food was okay." part, and the output indicator used is "Sentiment:".
Standard Tips for Designing Prompts
You can start with simple prompts and keep adding more elements and context as you aim for better results; iterating on your prompt along the way is vital for this reason.
a. Instruction
You can design effective prompts for various simple tasks by using commands to instruct the model on what you want to achieve, such as "Write", "Classify", "Summarize", "Translate", "Order", etc.
### Instruction ###
Translate the text below to Spanish:
Text: "hello!"
b. Specificity
Be very specific about the instruction and task you want the model to perform. When designing prompts, you should also keep in mind the length of the prompt, as there are limits on how long a prompt can be.
Example:
Extract the name of places in the following text.
Desired format:
Place:
Input: "Although these developments are encouraging to researchers, much is still a mystery. “We often have a black box between the brain and the effect we see in the periphery,” says Henrique Veiga-Fernandes, a neuroimmunologist at the Champalimaud Centre for the Unknown in Lisbon. “If we want to use it in the therapeutic context, we actually need to understand the mechanism.”"
c. Impreciseness (lack of exactness)
The more direct the prompt, the more effectively the message gets across. For example, the following prompt is imprecise about how short or descriptive the explanation should be:
Prompt:
Explain the concept of prompt engineering. Keep the explanation short, only a few sentences, and don't be too descriptive.
References & Suggested Readings
https://www.linkedin.com/pulse/popular-ml-frameworks-train-your-models-what-choose-vishnuvaradhan-v-q98oc
https://www.datacamp.com/tutorial/azure-machine-learning-guide
https://www.linkedin.com/jobs/view/3992431030/
https://www.thinkautonomous.ai/blog/deep-learning-in-self-driving-cars/
ML Role Spectrum - Karthik Singhal, Medium (Meta)
Attention Is All You Need
The Illustrated Transformer
Transformer (Google AI blog post)