Q.E. Reviewer - AST01-AST15-part-8.pdf


AST09 - Data Science and Analysis

Data Science - Data science combines math and statistics, specialized programming, advanced analytics, artificial intelligence (AI), and machine learning with specific subject-matter expertise to uncover actionable insights hidden in an organization's data.

4 Pillars of Data Science:
1. Domain Knowledge - understanding the main problem.
2. Math & Statistics Skills - applying the right approach based on the problem.
3. Computer Science - understanding how computers work (Python, R, SQL).
4. Communication & Visualization - the ability to showcase insights.

Ecosystem of Data Science:
1. Business Understanding - identifying the central objectives and evaluating success.
2. Data Mining - retrieving data from different sources.
3. Data Cleaning - preparation and removal of outliers.
4. Data Exploration - understanding the patterns and biases in the data.
5. Predictive Modeling - selecting models that fit the problem.
6. Data Visualization - communicating insights in a pleasing way.

Data Analytics - Data analytics is the process of analyzing raw data to pull out useful insights. It focuses more on viewing historical data in context, and it is more specific and concentrated than data science. It is also used to examine large data sets to identify trends, develop charts, and create visual presentations.

Big Data Analytics - The use of advanced analytic techniques against very large, diverse big data sets that include structured, semi-structured, and unstructured data, from different sources, and in different sizes from terabytes to zettabytes.

5 V's of Big Data Analytics:
1. Velocity - the speed at which data is generated.
2. Value - the value that big data can provide; it relates directly to what organizations can do with the collected data.
3. Veracity - the quality and accuracy of the data. This matters because data collected from a variety of sources is vulnerable to inconsistencies and uncertainty.
4. Variety - the nature of the data, which may be structured, semi-structured, or unstructured. Variety also refers to heterogeneous sources.
5. Volume - the amount of information. Volume plays a crucial part in determining the relevance and importance of the data.

Types of Analytics:
1. Descriptive Analytics - looks at what happened in the past. Its purpose is simply to describe what has happened.
2. Diagnostic Analytics - delves deeper to understand why something happened. Its main purpose is to identify and respond to anomalies within your data.
3. Predictive Analytics - seeks to predict what is likely to happen in the future.
4. Prescriptive Analytics - looks at what has happened, why it happened, and what might happen in order to determine what should be done next. It lets us see how each combination of conditions and decisions might impact the future, and allows us to measure the impact a particular decision might have.

Data Science Process:
1. Data Gathering or Acquisition
2. Data Preparation
   - Data Cleaning - the process of removing or modifying data that is incorrect, incomplete, irrelevant, duplicated, or improperly formatted, in order to prepare it for analysis.
   - Data Transformation - the process of changing data from one format to another, usually from a source system's format to the format a destination system needs. (A minimal sketch of both steps follows after this list.)
3. Data Modeling (detailed below)
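A minimal sketch of the data cleaning and data transformation steps described above, using pandas (one of the Python data tools in this reviewer). The dataset and column names are hypothetical, made up purely for illustration:

    import pandas as pd

    # Hypothetical raw records with typical defects: a duplicated row,
    # a missing value, and numbers/dates stored as text.
    raw = pd.DataFrame({
        "customer": ["Ana", "Ben", "Ben", None],
        "revenue": ["100", "250", "250", "80"],
        "signup": ["2023-01-05", "2023-02-10", "2023-02-10", "2023-04-12"],
    })

    # Data cleaning: remove duplicated rows and rows missing key fields.
    clean = raw.drop_duplicates().dropna(subset=["customer"])

    # Data transformation: convert each column to the format the
    # destination system needs (numeric revenue, true datetimes).
    clean = clean.assign(
        revenue=pd.to_numeric(clean["revenue"]),
        signup=pd.to_datetime(clean["signup"]),
    )

    print(clean.dtypes)
    print(clean)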
Data Modeling - The act of generating a visual representation of an entire information system, or parts of it, in order to express the linkages between data points and structures.
- Its purpose is to show what types of data are used and stored in the system, as well as how the data can be categorized and arranged.

Data Model Performing Tools:
1. Python - an interpreted, high-level, general-purpose programming language. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects. Some of the libraries used in Python programming: SciPy (Scientific Python), Ramp, PyTorch, SQLAlchemy, Seaborn, Theano, Scrapy, Statsmodels, SymPy, BeautifulSoup, TensorFlow, Scikit-learn, PyGame.
2. R - a programming language and free software environment for statistical computing and graphics, supported by the R Core Team and the R Foundation for Statistical Computing. It is widely used among statisticians and data miners for developing statistical software and performing data analysis.
3. SAS - a statistical software suite developed by SAS Institute for data management, advanced analytics, multivariate analysis, business intelligence, criminal investigation, and predictive analytics.

Analytic Strategy - The process of gathering data that determines the objectives or goals you want to achieve, guided by your hypothesis or theory.

Common Advanced Data Analytics Methods:
1. Association Rule Learning Analysis - machine learning methodologies that exploit rule-based learning methods to identify relationships among variables in large datasets.
2. Classification Tree Analysis - used to identify the class or category to which new data belongs.
3. Decision Tree Algorithms - a set of data mining techniques used to identify classes and/or predict behaviors from data (a minimal sketch appears at the end of this section).
4. Regression Analysis - used to predict relationships between variables.
5. Genetic Algorithms - adaptive heuristic search algorithms inspired by Darwin's theory of evolution in nature, used to solve optimization problems in machine learning.

Visualizations (figures): 1. Association Rule Learning, 2. Classification vs. Regression Analysis, 3. Decision Tree Analysis, 4. Genetic Algorithms.

Artificial Neural Networks (ANNs) - A machine learning program, or model, that makes decisions in a manner similar to the human brain, using processes that mimic the way biological neurons work together to identify phenomena, weigh options, and arrive at conclusions. Neurons in an ANN are known as nodes.
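A minimal sketch of the decision tree analysis named above, using Scikit-learn (one of the Python libraries listed); the built-in iris dataset is a stand-in for real project data:

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Load a small labeled dataset: rows are observations, columns are
    # variables, and the target is the class to identify.
    data = load_iris()
    X, y = data.data, data.target

    # Fit a shallow decision tree that learns if/then rules for
    # assigning each observation to a class.
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    tree.fit(X, y)

    # Print the learned rules - the interpretable structure that makes
    # decision tree analysis useful for identifying classes.
    print(export_text(tree, feature_names=list(data.feature_names)))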
DATA PREPARATION, MODEL PLANNING, & MODEL BUILDING

Data Preparation
- The preparation of data involves some cleaning as well as choosing appropriate samples for training and testing. Any appropriate combining or aggregating of datasets or elements is also done during this stage.
- ETLT is performed, in which both Extraction, Transformation, and Loading (ETL) and Extract, Load, and Transform (ELT) of the data are executed by the teams to get the data into the sandbox for analysis.

Preparing the Analytical Sandbox
- Realizing the analytical sandbox is the very first subphase of data preparation.
- The data preparation process includes the existence of an analytical sandbox (workspace) in which the team can work with the data for the duration of the project and conduct analytics.

Performing ETL (Extraction, Transformation, and Loading)
- ETL consists of a series of processes and applications. It is done to combine all the data coming from various sources in order to build a data warehouse, and developing the data warehouse consumes quite a sizable amount of effort.

TOOLS FOR DATA PREPARATION
Hadoop: It allows data scientists to explore the complexities that exist in the data. It also allows data scientists to store the data where it is, which is the whole point of data discovery.
Alpine Miner: This tool includes a graphical user interface (GUI) for developing analytical workflows, including data manipulation and sequences of analytical events on PostgreSQL and other big data sources, using structured data mining techniques (e.g., first select the top 100 customers, then run descriptive statistics and clustering).
OpenRefine: A standalone open-source tool for data cleanup and transformation to other formats, an activity called data wrangling. It is mostly used for cleaning messy data, transforming data, parsing data from websites, etc.
Data Wranglers (Trifacta Wrangler): Trifacta Wrangler empowers analysts to wrangle various data sources on their desktop in preparation for use in analytical or visualization tools such as Tableau.

Data Model Planning
- The model planning phase is necessary to perform extra data exploration, data conditioning, and transformations to prepare the data for the model building phase. Some of the activities to consider in this phase are:
- Assessing the structure of the datasets: the structure of the datasets is to be studied properly.
- Ensuring that the analytical technique will meet the goals and objectives the team is trying to achieve.
- In some cases a single model does not satisfy the requirements, so a series of techniques is needed as part of a larger analytical workflow.

SUBSTEPS FOR MODEL PLANNING
Data Exploration and Variable Selection - The main objective of exploring the data is to understand the relationships among the variables.
Model Selection - The main goal of this substep is to choose an analytical technique, or a short list of candidate techniques, based on the end goal of the project or the purpose of the analysis, for example exploration or prediction.

COMMON TOOLS FOR THE MODEL PLANNING PHASE
R - A leading analytical tool in the industry, widely used to process statistics and data. It can manipulate data easily and display it in a variety of ways. R also provides tools for the automated installation of packages as the user requires, and it works conveniently with big data, too.
Tableau - Tableau Public is a free program that links to any data source, whether a corporate data warehouse, Microsoft Excel, or web-based data, and generates visualizations, charts, dashboards, etc., with web-based real-time updates. Tableau's big data capabilities also make it important: it lets one analyze and visualize data better than any other business program for data visualization.
SAS - SAS is easy to access, easy to handle, and can analyze data from any source. It can also predict, monitor, and refine social behaviors.

Data Model Building
- In the model building phase, the selected analytical technique is applied to a set of training data. This process is known as "training the model".
- A separate set of data, known as the testing data, is then used to evaluate how well the model performs. This is sometimes known as the pilot test. (A minimal sketch follows below.)

COMMON TOOLS USED FOR THE MODEL BUILDING PHASE: SAS, Apache Spark, BigML, MATLAB, Jupyter, Scikit-learn, TensorFlow, Weka
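A minimal sketch of "training the model" on training data and then evaluating it on held-out testing data (the pilot test), using Scikit-learn; the synthetic dataset and logistic regression model are stand-ins chosen for illustration:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for a prepared dataset.
    X, y = make_classification(n_samples=200, n_features=5, random_state=0)

    # Hold out a testing set (the "pilot test") before any training.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0
    )

    # "Training the model": the chosen technique is fit on training data only.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)

    # Evaluate how well the model performs on data it has never seen.
    print(f"accuracy on testing data: {model.score(X_test, y_test):.2f}")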
Big Data
- Big data is similar to small data, but bigger in size.
- Large data requires different approaches: techniques, tools, and architecture.
- It aims to solve new problems, or old problems in a better way.
- It generates value from the storage and processing of very large quantities of digital information that cannot be analyzed with traditional computing techniques.

Visualization (figure): "How Big is Big Data".

5 V's of Big Data: 1. Volume, 2. Velocity, 3. Variety, 4. Veracity, 5. Value.

SMALL DATA VS BIG DATA
- Volume: low volumes (small data) vs. into petabyte volumes (big data)
- Velocity: batch velocities vs. real-time velocities
- Variety: structured varieties vs. multi-structured varieties

SOURCES OF BIG DATA
Media: Media communication and outlets - articles, podcasts, video, audio, email, and blogs.
Social: Digital material created by social media - text, photos, videos, and tweets.
Machine: Data generated by computers and machines, generally without human intervention - business process logs, sensors, and phone calls.
Historical: Data about our environment and archived documents, forms, or records - weather, traffic, and census data.

OBJECTIVES OF BIG DATA
1. Analyzing customer behavior
2. Combining multiple data sources
3. Improving customer service
4. Generating additional revenue
5. Being more responsive to the market

DATA ANALYTICS PRACTICES
1. Machine Learning
2. Simulation
3. Time Series
4. Signal Processing
5. Natural Language Processing
6. Crowdsourcing
7. Data Fusion
8. Data Integration
9. Genetic Algorithms

Big Data Analytics
- Big data analytics is a process used to extract meaningful insights, such as hidden patterns, unknown correlations, market trends, and customer preferences.
- Big data analytics provides various advantages; it can be used for better decision making and for preventing fraudulent activities, among other things.

Advantages of Big Data Analytics:
- Examining large amounts of data
- Appropriate information about the data
- Identification of hidden patterns and unknown correlations
- Better business decisions, both strategic and operational
- Effective marketing, customer satisfaction, and increased revenue

TYPES OF TOOLS USED IN BIG DATA
Distributed servers or clouds: where the processing of big data is hosted.
Distributed storage: where the data is stored.
Distributed processing: the programming model used in big data processing (see the sketch at the end of this section).
High-performance schema-free databases: how the data is stored and indexed.
Analytics or semantic processing: the kind of operation performed on the data.

APPLICATIONS OF BIG DATA ANALYTICS / IMPACTS OF BIG DATA
Healthcare: It allows us to find new cures and to better understand and predict disease patterns.
Science: It creates new possibilities and ways to conduct research that would otherwise be impossible, helping us make new discoveries.
Security: Police forces use big data tools to predict criminal activity, conduct investigations, and ultimately catch criminals faster.
Business: It helps us improve and optimize the way we do business by making data-driven decisions.
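A minimal sketch of the distributed-processing programming model named above, in the MapReduce style of word counting. This is a single-process illustration in plain Python, not a real distributed framework; in practice a tool such as Apache Spark or Hadoop would run the map and reduce steps across many machines:

    from collections import Counter
    from functools import reduce

    documents = [
        "big data requires new tools",
        "big data requires new architecture",
    ]

    # Map step: each "worker" turns its document into per-word counts.
    mapped = [Counter(doc.split()) for doc in documents]

    # Reduce step: partial counts from all workers are merged into totals.
    totals = reduce(lambda a, b: a + b, mapped, Counter())

    print(totals.most_common(3))   # e.g. [('big', 2), ('data', 2), ('requires', 2)]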
RISKS OF BIG DATA / BENEFITS OF BIG DATA (figures)

Physical Parameters

Constants:
- Sun's density: 1.41 g/cm^3
- Sun's absolute magnitude: 4.83
- Solar radius: 0.8
- Mean G_rp: 8.65

Formulas:
- Effective Temperature
- M_star = (4/3) · π · (Solar Radius)^3 · (Density of the Sun) (a worked sketch appears at the end of this section)

Units:
- Density of the parent star (derived by Lo & Fu, 2022): s^2·kg / (d^2·m^3)
- M_star: s^2·kg / (d^2·m^3)

To access each module using Google Colaboratory, kindly click the links provided.

Google Colab Modules:
Module I: https://colab.research.google.com/drive/1SFG_WnMMSd55wqLM6_TJbqVAU8Ozhthc?usp=sharing
Module II: https://colab.research.google.com/drive/1wR3O5NIlvk_siPgM2JJ2rMGghq_4gq7S?usp=sharing
Module III: https://colab.research.google.com/drive/1t63nfrmEZCd4fOi_zPyaqZji6Na0Lwep?usp=sharing

NOTE: FOR THOSE WHO WANT TO PRACTICE THE GIVEN MODULES, KINDLY MAKE A COPY. THANK YOU!
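A worked sketch of the M_star formula above, using the constants listed. It assumes "Solar radius: 0.8" means 0.8 solar radii and converts the Sun's density to SI units; both readings are assumptions made for illustration:

    import math

    RHO_SUN = 1410.0        # Sun's density: 1.41 g/cm^3 = 1410 kg/m^3
    R_SUN = 6.957e8         # one solar radius in meters (IAU nominal value)
    r_star = 0.8 * R_SUN    # assumes "Solar radius: 0.8" is in solar radii

    # M_star = (4/3) * pi * R_star^3 * rho (mass of a uniform sphere)
    m_star = (4.0 / 3.0) * math.pi * r_star**3 * RHO_SUN

    print(f"M_star = {m_star:.3e} kg")   # ~1.02e+30 kg, about 0.51 solar masses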
