W2 - Introduction to DS AI (2).pptx
University of Petra
2023
Data Science Fundamentals, Fall 2023
Dr. Sharif Makhadmeh, Assistant Professor, Department of Data Science & Artificial Intelligence, University of Petra, Amman, Jordan

What To Do With These Data?
- Aggregation and statistics: data warehousing and OLAP (online analytical processing), ...
- Indexing, searching, and querying: keyword-based search, pattern matching (RDF/XML), ...
- Knowledge discovery: data mining, statistical modeling, prediction, classification, ...
Data is therefore a very critical feature to study, since it helps business leaders make decisions based on facts, statistical numbers, and trends. Due to this growing scope of data, Data Science has emerged as a multidisciplinary field.

What is Data Science (DS)?
- DS is an area that manages, manipulates, extracts, and interprets knowledge from massive amounts of data (big data).
- A multidisciplinary field that uses scientific methods to draw insights from data: the extraction, preparation, analysis, visualization, and maintenance of information.
- Key sectors benefiting from DS: corporations, governments, academia.
- A huge field that uses methods and concepts belonging to other fields such as information science, statistics, mathematics, and computer science.
- Examples of techniques utilized in DS: machine learning, visualization, pattern recognition, probability.

Common Tools of Data Science
- R: a scripting language specifically tailored for statistical computing.
- Python: a programming language mostly used for DS and software development.
- SQL (Structured Query Language): used for managing and querying databases.
- MATLAB: MathWorks' environment for technical and engineering computing.
- Hadoop: provides distributed storage and processing for big data using a model called "MapReduce".
- Tableau: allows users to create interactive visualizations for graphical analysis.
- WEKA: used for data mining and machine learning operations.
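To illustrate the SQL role described above (managing and querying databases), here is a minimal sketch using Python's built-in sqlite3 module; the table name, columns, and rows are invented for illustration.

```python
import sqlite3

# Create an in-memory database with a toy "sales" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("North", 120.0), ("South", 80.0), ("North", 50.0)],
)

# Aggregation query: total sales per region.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('North', 170.0), ('South', 80.0)]
conn.close()
```

The same GROUP BY/SUM pattern is what OLAP-style aggregation tools build on at much larger scale.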
Many other tools, such as: Excel, Apache Spark, BigML, D3.js, SAS, Jupyter, ...

Data Scientists?
- The most attractive job of the 21st century.
- They find stories and extract knowledge; they are not reporters.

Knowledge areas required for a solid background as a data scientist:
- Mathematics and statistics
- Data engineering
- Computing and programming skills
- Domain expertise

Concentrations in Data Science
- Mathematics and applied mathematics
- Applied statistics / data analysis
- Solid programming skills (R, Python, SQL)
- Data mining and processing
- Database storage and management

Data Science Project Life Cycle
A problem in DS can be solved by following these steps:
1. Define the problem statement / business requirement
2. Data collection
3. Data cleaning
4. Data exploration and analysis
5. Data modelling
6. Deployment and optimization

Data Science Examples and Applications
- Using machine learning to optimize energy usage.
- Optimizing strategies to achieve goals in business and science.
- Identifying and predicting disease.
- Automated intrusion detection/classification systems.
- Fraud and risk detection.
- Image/speech recognition.
- Airline route planning.
- Healthcare recommendations.
- ...

Data Scientist vs. Data Engineer vs. Data Analyst
- Data Scientist job: creating models from the data and turning them into information that is useful for the business.
- Data Engineer job: the representation and movement of data so that it is consumable.
- Data Analyst job: interpreting current data and making suggestions that are relevant to the business.

Big Data (BD) and the Three V's
- Data exceeds the processing capacity of conventional database systems.
- Traditionally, the term Big Data has been used to describe the massive volumes of data analyzed by huge organizations like Google or research science projects at NASA.
- Big Data is typically defined by three "V"s: volume, variety, velocity.
- Volume: the size is too big; it starts at terabyte scale (10^12 bytes) and has no upper limit.
- Velocity: both how fast data is being produced and how fast the data must be processed (i.e., captured, stored, and analyzed) to meet the need or demand.
- Variety: heterogeneous; diversity of sources, formats, quality, and structures.

More about the Hadoop System
- BD cannot be handled using traditional RDBMS due to the three V's.
- Data engineers turn to the Hadoop data processing platform (the typical BD solution).
- Hadoop divides BD into smaller datasets that are manageable to analyze.

More precisely, Hadoop is an open-source framework written in Java that allows distributed processing of big datasets across a cluster of commodity hardware. In other words, rather than banging away at one huge block of data with a single machine, Hadoop breaks Big Data up into multiple parts so each part can be processed and analyzed at the same time.
- Open source: the source code is freely available; it may be redistributed and modified.
- Distributed processing: data is distributed across multiple nodes to be processed independently.
- Cluster: multiple nodes (machines) connected together via a LAN (local area network).
- Commodity hardware: economical, affordable machines, typically of low performance.

More about the Hadoop System: Frameworks
The Hadoop platform (version 2 or later) is composed of three main frameworks:
1. HDFS for data storage: the default storage layer in any given Hadoop cluster. HDFS is responsible for storing large datasets of structured or unstructured data across various nodes, and it maintains the metadata in the form of log files. SQL can be used to query data from HDFS via tools like Hive or Spark SQL.
2. YARN (Yet Another Resource Negotiator) for resource management: helps to manage the resources across the cluster. Implemented in the Java language.
3. MapReduce for bulk/batch data processing.
By making use of distributed and parallel algorithms, MapReduce makes it possible to carry over the processing logic and helps to write applications that transform big datasets into manageable ones. Implemented in the Java language.

More about the Hadoop System: Hadoop Nodes

More about the Hadoop System: YARN + HDFS + MapReduce
The MapReduce system first reads the input file and splits it into multiple pieces. In this example there are two splits, but in a real-life scenario the number of splits would typically be much higher. These splits are then processed by multiple map programs running in parallel on the nodes of the cluster. The role of each map program in this case is to group the data in a split by color. The MapReduce system then takes the output from each map program and merges (shuffle/sort) the results as input to the reduce program, which calculates the sum of the number of squares of each color.

More about the Hadoop System: MapReduce
- Map: tag the data by (key, value) pairs.
- Reduce: combine the pairs into smaller sets of data.

More about the Hadoop System: Putting it all together

What is Business Intelligence (BI)?
- BI is concerned with performing descriptive analysis of data, using technology and skills, to make informed/intelligent business decisions.
- The set of tools used for BI collects, governs, and transforms data.
- The main goal of BI is to derive actionable intelligence from data, such as:
  - Gaining a better understanding of the market
  - Uncovering new revenue opportunities
  - Improving business processes
  - Staying ahead of competitors

BI vs. DS
Both fields are heavily based on methods of data analytics, data wrangling, and machine learning to produce insights from raw data.

More about Data Analytics (DA)
DA is concerned with converting raw data into actionable insights.

More about Data Wrangling (DW)
DW refers to modifying and summarizing data. Examples of DW processes:
- Data extraction
- Data cleaning
- Data transformation
- Data aggregation
- Data organization
- Data sorting
- Data validation
Making Business Value from Machine Learning Methods
- Linear regression: useful for making predictions for sales forecasts, pricing optimization, marketing optimization, and financial risk assessment.
- Logistic regression: useful for predicting customer churn, predicting response versus ad spending, predicting the lifetime value of a customer, and monitoring how business decisions affect predicted churn rates.
- Naïve Bayes: useful for building a spam detector, analyzing customer sentiment, or automatically categorizing products, customers, or competitors.
- K-means clustering: useful for cost modeling and customer segmentation (for marketing optimization purposes).
- Hierarchical clustering: useful for modeling business processes, or for segmenting ...
- k-nearest neighbor classification: useful for text document classification, financial distress prediction modeling, and competitor analysis and classification.
- Principal component analysis: a dimensionality reduction method used for detecting fraud, speech recognition, and ...
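As a concrete taste of one of the methods above, here is a minimal k-nearest-neighbor classifier in plain Python; the 2-D points, labels, and the choice k=3 are invented for illustration, not taken from the course material.

```python
import math
from collections import Counter

def knn_predict(train, labels, query, k=3):
    """Return the majority label among the k training points closest to query."""
    # Sort training indices by Euclidean distance to the query point.
    nearest = sorted(range(len(train)), key=lambda i: math.dist(train[i], query))
    top_labels = [labels[i] for i in nearest[:k]]
    return Counter(top_labels).most_common(1)[0][0]

# Two toy clusters: "low" points near (1, 1), "high" points near (8, 8).
train = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["low", "low", "low", "high", "high", "high"]

print(knn_predict(train, labels, (2, 2)))  # low
print(knn_predict(train, labels, (9, 9)))  # high
```

A document-classification use, as mentioned above, would work the same way with word-count vectors in place of 2-D points.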