Data Analytics - Introduction
Summary
This document provides an introduction to data analytics, explaining types of data and data analysis techniques. It also outlines a grading scale and references various resources about the topic.
Full Transcript
INTRODUCTION
TYPES OF DATA
TYPES OF DATA ANALYSIS
LIFE CYCLE OF DATA ANALYSIS
ANALYTICAL TOOLS

Grading Scale
Midterm 40% | Assignment 20% | Project 40% | Final 50%
Midterm and final exams will cover the material from lectures and documents.

Resources:
- Data Analytics with Hadoop - An Introduction for Data Scientists: https://github.com/needmukesh/Hadoop-Books/blob/master/book/Data%20Analytics%20with%20Hadoop%20-%20An%20Introduction%20for%20Data%20Scientists.pdf
- https://files.eric.ed.gov/fulltext/ED536788.pdf
- Data Mining and Analysis: Fundamental Concepts and Algorithms, M. Zaki and W. Meira (the authors have kindly made an online version available): http://www.dataminingbook.info/uploads/book.pdf
- Mining of Massive Datasets, Jure Leskovec (Stanford Univ.), Anand Rajaraman (Milliway Labs), Jeffrey D. Ullman (Stanford Univ.): http://www.vistrails.org/index.php/Course:_Big_Data_Analysis

Data is raw facts and figures that need to be processed to extract meaningful information. Examples: numbers, text, images, sound, etc.

The term big data refers to data sets that are so massive, so quickly built, and so varied that they defy traditional analysis methods such as you might perform with a relational database. Big data is often described in terms of the five V's: velocity, volume, variety, veracity, and value.

Data analysis is the systematic process of inspecting, cleaning, transforming, and modeling data to discover meaningful information, draw conclusions, and support decision-making.

Data analytics is the process and method for extracting knowledge and insights from large volumes of disparate data. It is an interdisciplinary field involving probability, programming, mathematics, statistical analysis, data visualization, and more. It is what makes it possible for us to interpret information, see patterns, find meaning in large volumes of data, and use it to make decisions that drive business.

DATA ANALYSIS - DATA ANALYTICS
The concept of "data" has become very valuable for companies, institutions and individuals today. Data is considered to be extremely decisive for giant companies such as Google and Microsoft. Companies generally prioritize data while analyzing and meticulously examining their customers' requests and preferences. They analyze this data to make improvements in their companies according to their customers' demands. In fact, they create their future strategies with data: they aim to predict the future of their companies using the data of the past and present. For this reason, processing, understanding and effectively analyzing and using data is very valuable for companies.

If we explain the meanings of the words: analysis means examining the elements and structure of a subject in detail, while analytics describes systematically computed analyses of data or statistics. Data analysis aims to obtain better results by processing data so that better decisions can be made with better predictions. In this regard, data analysis forms a subset of data analytics. Data analysis allows us to understand the data by questioning it and helps us understand past information; data analytics applies certain methods to past information in order to guide future decisions.
Feature | Data Analytics | Data Analysis
Scope | Broader scope; encompasses the entire data lifecycle (collection, cleaning, organization). | Narrower scope; focuses on the in-depth examination and interpretation of prepared data.
Focus | Deriving insights for decision-making. | Understanding the data itself and uncovering patterns and relationships.
Techniques | Utilizes a broader range of tools (data collection methods, cleaning tools, machine learning algorithms). | Primarily relies on techniques like time-series analysis, association rule mining, cluster analysis, etc.
Output | Actionable recommendations, reports, and dashboards. | Descriptive and inferential statistics, visualizations, and hypotheses.
Users | Data engineers, primarily business stakeholders, managers, etc. | Data scientists, analysts, researchers, etc.
Technical Skills | Requires broader technical knowledge across data management stages. | Demands strong expertise in statistical modeling, such as linear regression, logistic regression, etc.
Predictive Power | Can leverage machine learning for future predictions. | Focuses on understanding historical data; may not involve complex machine learning techniques.

Data analysis is a core component that comes under the field of analytics, focusing specifically on the in-depth examination of data. It's like taking those cleaned and organized gems unearthed by data analytics and putting them under a magnifying glass to understand their properties and significance.

"Processed data is information, processed information is knowledge, processed knowledge is wisdom." - Ankala V. Subbarao

ARTIFICIAL INTELLIGENCE - HARDWARE TECHNOLOGIES - DATA STRUCTURE TYPES - SCHEMA FOR DIVERGENT ECOSYSTEM - DATA INFRASTRUCTURE AND TOOLS (slide figures)

WHERE DOES DATA COME FROM?
Computer generated:
- Application server logs (web sites, games, internet)
- Sensor data (weather, atmospheric science, astronomy, smart grids)
- Images/videos (traffic, security cameras, military surveillance)
Human generated:
- Blogs/reviews/emails/pictures/scientific research/medical records
- Social graphs: Facebook, contacts, Twitter
DATA SOURCES (slide figure)

DATA ANALYTICS
Data analytics is a discipline that includes the management of the complete data lifecycle, which encompasses collecting, cleansing, organizing, storing, analyzing and governing data.

DATA ANALYTICS CATEGORIES
Data analytics enables data-driven decision-making with scientific backing, so that decisions can be based on factual data and not simply on past experience or intuition alone. There are four general categories of analytics, distinguished by the results they produce: descriptive analytics, diagnostic analytics, predictive analytics, and prescriptive analytics.

1- Descriptive Analytics: Answer questions about events that have already occurred.
- What was the sales volume over the past 12 months?
- What is the number of support calls received, categorized by severity and geographic location?
- What is the monthly commission earned by each sales agent?

2- Diagnostic Analytics: Determine the cause of a phenomenon that occurred in the past, using questions that focus on the reason behind the event. Diagnostic analytics can result in data that is suitable for performing drill-down and roll-up analysis.
- Why were Q2 sales less than Q1 sales?
- Why have there been more support calls originating from the Eastern region than from the Western region?
- Why was there an increase in patient re-admission rates over the past three months?
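To make these first two categories concrete, here is a minimal pandas sketch (pandas is introduced later in this document). It rolls monthly sales up for a descriptive view, then drills down into one month by region for a diagnostic view; the tiny dataset and the column names (date, region, amount) are invented purely for illustration.

    import pandas as pd

    # Hypothetical sales records, standing in for a real transactions table.
    sales = pd.DataFrame({
        "date":   pd.to_datetime(["2024-01-15", "2024-02-03", "2024-02-20", "2024-03-11"]),
        "region": ["East", "East", "West", "East"],
        "amount": [1200, 800, 950, 700],
    })

    # Descriptive (roll-up): what was the sales volume per month?
    monthly = sales.groupby(sales["date"].dt.to_period("M"))["amount"].sum()
    print(monthly)

    # Diagnostic (drill-down): which region drove February's numbers?
    february = sales[sales["date"].dt.month == 2]
    print(february.groupby("region")["amount"].sum())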
3- Predictive Analytics: Determine the outcome of an event that might occur in the future.
- What are the chances that a customer will default on a loan if they have missed a monthly payment?
- What will be the patient survival rate if Drug B is administered instead of Drug A?
- If a customer has purchased Products A and B, what are the chances that they will also purchase Product C?

4- Prescriptive Analytics: Builds upon the results of predictive analytics by prescribing actions that should be taken. The focus is not only on which prescribed option is best to follow, but why.
- Among three drugs, which one provides the best results?
- When is the best time to trade a particular stock?
Prescriptive analytics provides more value than any other type of analytics and correspondingly requires the most advanced skillset, as well as specialized software and tools.
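Before moving on to application domains, the loan-default question from the predictive category can be sketched with scikit-learn (also introduced later in this document). This is an illustrative toy, not a method from the slides: the six customers and the two features (missed payments, loan size) are made up.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Invented training data: [missed_payments, loan_amount_in_thousands].
    X = np.array([[0, 5], [0, 20], [1, 15], [2, 30], [3, 25], [1, 40]])
    y = np.array([0, 0, 0, 1, 1, 1])  # 1 = customer defaulted

    model = LogisticRegression().fit(X, y)

    # Estimated probability of default for a customer with one missed
    # payment and a 25,000 loan.
    print(model.predict_proba([[1, 25]])[0][1])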
DATA APPLICATIONS
Education, Healthcare, Government, Entertainment and Media, Weather, Transportation, Banking

DATA IN EDUCATION
Customized and dynamic learning programs: Customized programs and schemes to benefit individual students can be created using the data collected on the basis of each student's learning history. This improves overall student results.
Reframing course material: Reframing the course material, according to data collected by real-time monitoring of the components of a course on what a student learns and to what extent, is beneficial for the students.
Grading: New advancements in grading systems have been introduced as a result of proper analysis of student data.
Career prediction: Appropriate analysis and study of every student's records helps us understand each student's progress, strengths, weaknesses, interests, and more. It can also help determine which career would be the most suitable for the student in the future.

DATA IN HEALTHCARE
- Fewer unnecessary diagnoses -> reduced treatment costs
- Epidemic prediction, helping to decide on preventative measures
- Detection of diseases at an early stage
- Medical results of past medicines -> evidence-based, more accurate prescriptions

DATA IN GOVERNMENT
Welfare schemes:
- Making faster, better-informed decisions regarding various political programs
- Identifying areas that need attention
- Staying up to date in the field of agriculture by keeping track of all existing land and livestock
- Overcoming national challenges such as unemployment, terrorism, energy resources exploration...
Cyber security:
- Deceit recognition
- Catching tax evaders

DATA IN ENTERTAINMENT AND MEDIA
- Predicting the interests of audiences
- Effective targeting of advertisements
- Optimized or on-demand scheduling of media streams in digital media distribution platforms
- Getting insights from customer reviews

DATA IN WEATHER
- Forecasting
- Studying global warming
- Understanding the patterns of natural disasters
- Making necessary preparations in the case of crises
- Predicting the availability of usable water

DATA IN TRANSPORTATION
- Route planning: estimating users' needs on different routes and on multiple modes of transportation, then using route planning to reduce their wait time
- Congestion management and traffic control: real-time estimation of congestion and traffic patterns, e.g. using Google Maps to locate the least traffic-prone routes
- Safety level of traffic: using real-time processing of big data and predictive analysis to identify accident-prone areas and help reduce accidents

DATA IN BANKING
- Misuse of credit/debit cards
- Venture credit hazard treatment
- Business clarity
- Customer statistics alteration
- Money laundering
- Risk mitigation

DATA CHALLENGES
Capturing data, storing data, searching data, sharing data, transferring data, and analyzing the previously stored data.

Storing exponentially growing huge datasets:
- The data generated in the past two years is more than all of the previous history in total.
- By 2025, total digital data will grow to approximately 44 zettabytes.
- By 2025, about 1.7 MB of new information will be created every second for every person.

Processing data faster:
- Data is growing at a much faster rate than disk read/write speeds.
- Bringing huge amounts of data to the computation unit becomes a bottleneck.

Processing data with complex structure: structured, semi-structured, unstructured.

HOW IS BIG DATA PROCESSED?
Parallel processing on interconnected systems.

ROLES IN THE DATA WORLD - MLOPS WORLD - SOFTWARE DEVELOPMENT CYCLE - DEVOPS ARCHITECTURE - MLOPS AND DEVOPS - DATA DRIVEN ORGANIZATION - WHAT PIONEERS THINK - DATA SOLUTIONS - COMPLEXITY - A REAL-LIFE DATA ANALYTICS (slide figures)

APACHE SPARK
Spark, which emerged as a solution to the performance costs caused by MapReduce's disk-based structure, can run faster than Apache Hadoop in big data applications thanks to its in-memory data processing. Apache Spark was developed as an alternative to the MapReduce feature available in Hadoop.

Basis | Hadoop | Spark
Processing speed & performance | Hadoop's MapReduce model reads and writes from a disk, thus slowing down the processing speed. | Spark reduces the number of read/write cycles to disk and stores intermediate data in memory, hence a faster processing speed.
Usage | Designed to handle batch processing efficiently. | Designed to handle real-time data efficiently.
Latency | A high-latency computing framework with no interactive mode. | A low-latency computing framework that can process data interactively.
Machine learning | Data fragments in Hadoop can be too large and can create bottlenecks; thus, it is slower than Spark. | Much faster, as it uses MLlib for computations and has in-memory processing.
Streaming | - | Can be used in real time thanks to the Spark Streaming feature.
SQL | Provides ease of processing on big data with its SQL structure. | Provides ease of processing on big data with its SQL structure.

APACHE SPARK RDD STRUCTURE
A DataFrame is an immutable distributed collection of data. Unlike an RDD, the data is organized into named rows and columns, like a table in a relational database.
A Resilient Distributed Dataset (RDD) is the underlying data unit in Apache Spark: a collection of items distributed across cluster nodes, on which parallel operations can be performed. Apache Spark RDDs support two types of operations:
- Transformation: creates a new RDD from an existing one (map, filter, union, join, groupByKey, sortByKey).
- Action: returns the final result to the driver program or writes it to external data storage (reduce, collect, count, distinct).
Transformations are designed to be lazy; no work is done until an action method is called.
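As a small sketch of the transformation/action distinction, the following PySpark snippet (assuming a local installation via pip install pyspark) builds a chain of lazy transformations and only runs them when an action is called:

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "rdd-demo")

    numbers = sc.parallelize([1, 2, 3, 4, 5, 6])

    # Transformations are lazy: no computation happens yet.
    evens = numbers.filter(lambda x: x % 2 == 0)
    squares = evens.map(lambda x: x * x)

    # Actions trigger the actual distributed computation.
    print(squares.collect())  # [4, 16, 36]
    print(squares.count())    # 3

    sc.stop()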
Dividing data into small micro-batches allows calculations to be allocated to resources precisely, meaning the bottleneck problem that occurs in traditional systems is not encountered. Since the data is divided into micro-batches, in case of a node loss the inaccessible data can be recomputed in parallel across all of the other nodes in the cluster, distributing the work evenly and recovering from failure faster than the traditional approach. DStreams consist of continuously incoming RDDs.

SPARK-MLLIB
Creating a Spark project:
1- Add the Spark Maven dependency
2- Create a SparkContext
3- Load the data into an RDD
4- Use it according to the scenario

THE GOOD OLD DAYS - EARLY 2000s
Each application (accounting software, mail-server software) ran on its own physical server, leaving most of the hardware idle. Total capacity: 4-core/8-thread CPU, 16 GB RAM, 1 TB hard disk. Used capacity: 2 cores, 4-8 GB RAM, 200 GB hard disk; about 70% idle capacity.

VIRTUALIZATION
(Slide figures: multiple applications, each with its own operating system, running on virtualization software over shared physical hardware; a physical server failure in a virtualized environment, with the mail-server and accounting software kept running by the virtualization software.)

A company, e.g. a textile company's IT department, needs: a mail server, an instant messaging server, a file server, a planning server, an accounting server, a web server, and a database server.

ON PREMISE - LOCAL DATA CENTER
- High initial investment cost
- High maintenance cost
- Not expandable
- Not flexible
- Difficult to plan

COLOCATION DATA CENTER
(Slide figure: Company A, Company B, and Company C sharing one data center.)

CLOUD COMPUTING
When a company chooses to "move to the cloud", it means that the company's IT infrastructure is operated off-site, in the data center of a cloud computing provider (e.g. Oracle). Cloud computing offers customers greater agility, scale and flexibility. Instead of wasting money and resources on legacy IT systems, customers can focus on more strategic tasks. Without a large upfront investment, they can quickly access the computing resources they need and pay only for what they use.

ADVANTAGES OF CLOUD COMPUTING
Cloud computing offers a superior alternative to traditional information technologies, including:
- Cost: eliminates capital expenses
- Speed: fast provisioning for development and testing
- Global scale: scale flexibly
- Productivity: increased collaboration, predictable performance and customer isolation
- Performance: better price/performance ratio for cloud-native workloads
- Reliability: fault-tolerant, scalable, distributed systems across all services

IAAS: INFRASTRUCTURE AS A SERVICE
Instead of purchasing the hardware and costly tools they need separately, users access infrastructure services such as servers, networks, and storage through IaaS from companies that provide cloud services. Users can access these services via an application programming interface (API). In this way, IaaS provides the service of a traditional data center as a virtual data center. In addition, IaaS is suitable for companies of all sizes because it provides full control of the infrastructure and implements a pay-as-you-go model. Amazon Web Services (AWS) EC2, DigitalOcean, Google Compute Engine (GCE) and Cisco Metacloud are examples of IaaS offerings.
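As an illustration of accessing infrastructure through an API, here is a sketch that starts one EC2 virtual server on AWS with the boto3 package. This is not an example from the slides: the AMI ID is a placeholder, and configured AWS credentials are assumed.

    import boto3

    # Connect to the EC2 service in one region (credentials come from the
    # standard AWS configuration).
    ec2 = boto3.resource("ec2", region_name="eu-west-1")

    # Provision a small pay-as-you-go virtual server.
    instances = ec2.create_instances(
        ImageId="ami-0123456789abcdef0",  # placeholder image ID
        InstanceType="t2.micro",
        MinCount=1,
        MaxCount=1,
    )
    print(instances[0].id)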
PAAS: PLATFORM AS A SERVICE
PaaS offers software services to be used for building applications, as well as features such as the network and storage found in the IaaS service. In other words, PaaS provides an environment for creating software, so developers do not have to deal with operating system and software updates, etc. Additionally, PaaS allows the development, testing and deployment of applications. Examples of PaaS services are AWS Elastic Beanstalk, Heroku, and Google App Engine.

SAAS: SOFTWARE AS A SERVICE
On top of the capabilities of PaaS and IaaS, SaaS mostly offers application services that run in a web browser. With SaaS, you can perform any operation you want without any software installation, management, etc. Examples of SaaS are Google Workspace, Dropbox, and SAP Concur.

CLOUD COMPUTING - CLOUD PROVIDERS COMPARISON - AMAZON WEB SERVICES (slide figures)

DATA ANALYTICS STREAMING - APACHE KAFKA
WHAT IS STREAMING DATA? STREAMING OVERVIEW
Batch processing: data source -> file/DB/queue structure -> analysis -> results
Real-time processing: data source -> analysis -> results

Example: premature babies. The University of Ontario Institute of Technology (UOIT) uses big data technologies for health monitoring in its newborn unit. Advantages: detection of life-threatening conditions up to 24 hours earlier; improved patient services and lower death rates.

Other examples: real-time message analysis with Spark Streaming; autonomous vehicles and real-time streaming; a shopping center and real-time analysis; IoT applications streaming sensor readings such as temperature, speed and movement into Apache Kafka.

APACHE KAFKA - TECHNICAL DETAILS
ZooKeeper is a distributed system coordinator. When brokers are removed or added, it informs the connected producers and consumers and performs the necessary assignments. More concretely: when a partition on a broker fails, ZooKeeper is responsible for making one of that partition's followers the leader.
A topic is a user-defined category name. Whatever message is published, Kafka keeps it in a topic. Producers are those who send messages to topics, and consumers are those who listen to topics. Producers can decide which message goes to which partition; they make this decision using a round-robin scheduling algorithm or by semantic partitioning.
Brokers are servers that work in groups: they form a cluster. Since the communication between them is done via the TCP protocol, it is language independent.
A partition has a leader, and the others (followers) follow this leader. The producer first writes the incoming data to Partition 0 in Broker 1, and then copies are made in Broker 2 and Broker 3 (the replication factor). Then data is written to Partition 1 and Partition 2; in other words, the data is divided across three different partitions. External read and write operations are performed only on the leader partition. The leader is also responsible for queuing all incoming write operations and propagating them to the other replicas (followers) in the same order.

APACHE ZOOKEEPER
Apache ZooKeeper coordinates resource management in distributed server architectures. Generally it is used for coordination, and it holds configuration files.
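The slides integrate Kafka with Java; as a minimal taste of the same producer and consumer roles in Python, here is a sketch with the kafka-python package, assuming a broker on localhost:9092 and an example topic name of my choosing ("sensor-data"):

    from kafka import KafkaProducer, KafkaConsumer

    # Producer: publish one message to a topic.
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("sensor-data", b"temperature=21.5")
    producer.flush()

    # Consumer: listen to the same topic and print incoming messages.
    consumer = KafkaConsumer(
        "sensor-data",
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",
    )
    for message in consumer:
        print(message.value)
        break  # stop after the first message in this demo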
APACHE KAFKA - WINDOWS INSTALLATION
To install Apache Kafka, first go to the Apache Kafka download page and download the appropriate version (kafka_2.11-1.1.0.tgz is my preference). Unzip the downloaded file and copy it to the C: directory. Then update the server.properties file in the config folder by adding the line log.dirs=C:\kafka_2.10-0.10.1.0\kafka-logs. The installation process is complete.

To run the Apache Kafka server, you must first run ZooKeeper. Open the Windows Command Prompt (cmd), type the zkserver command and press Enter; this window stays open. Then open a new command prompt, go to the directory where Kafka is located, and run:
C:\kafka_2.10-0.10.1.0\bin\windows> kafka-server-start.bat ..\..\config\server.properties

APACHE KAFKA - PRODUCER, CONSUMER AND TOPICS
(Slides: creating a topic with a given topic name, listing topics.)
Kafka and Java integration: writing a producer example; writing a consumer example.

PYTHON
Python is a very popular general-purpose programming language, used everywhere from introductory programming courses to production systems. It was created by software programmer Guido van Rossum from the Netherlands in 1990; the name comes from Flying Circus, a show by the English comedy group Monty Python. Python supports structural programming, object-oriented programming, and functional programming.

PYTHON PROGRAMMING
Many IDEs are available, or Notepad plus the Python interpreter, or Anaconda, which includes the Spyder and Jupyter Notebook software for Python programming. Two versions of Python are in use, Python 2 and Python 3; Python 3 is not backward-compatible with Python 2, and a lot of packages are available for Python 2. Check your version using the following command:
$ python --version

PYTHON FEATURES
Python programs are comparatively:
+ Quicker to write
+ Shorter
+ Easier to program
+ Faster to develop and maintain
+ Modular and object-oriented
+ Backed by a large community of users
+ Supported by a large standard and user-contributed library

PYTHON FOR DATA ANALYTICS
It is fairly easy to read, write, and process data using standard features, plus special packages for:
- Numerical and statistical manipulation: numpy
- Visualization ("plotting"): matplotlib
- Relational-database-like capabilities: pandas
- Machine learning: scikit-learn
- Network analysis: networkx
- Unstructured data: re, nltk, PIL

MORE ON PYTHON
NumPy is the main API of what is called the "scientific computing ecosystem": simple, but not limited. NumPy handles linear algebra and matrix mathematics on a very large scale, and most machine learning algorithms and neural networks operate on these n-dimensional matrices.
Apache Spark has a Python shell: you can open datasets, do transformations, and run algorithms in one easy command line. Without it you would have to package your program and then submit it to Spark using spark-submit. The disadvantage of spark-submit, as with any batch job, is that you cannot inspect variables in real time; you can only print values to a log. That is OK for text, but when you use the Python shell, that text is an object, which means you can keep working with it. It is not a static non-entity.
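To give a small taste of the NumPy matrix work described above, here is a short sketch; the array values are arbitrary:

    import numpy as np

    A = np.array([[1.0, 2.0],
                  [3.0, 4.0]])
    v = np.array([1.0, 0.5])

    print(A @ v)             # matrix-vector product
    print(A.T)               # transpose
    print(np.linalg.inv(A))  # inverse
    print(A.mean(axis=0))    # column means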
ANACONDA INSTALLATION
https://www.anaconda.com/download
You can download the Anaconda installation files suitable for your operating system from the link above. The Python course will be conducted via the Jupyter Notebook program.

PYTHON ANACONDA
Anaconda is an integrated Python distribution for developers who want to build scientific applications such as data science, analytics, and machine learning using Python. In addition to a package manager, environment managers and more than 1,500 open-source packages for scientific applications such as artificial intelligence, data science and analysis, it also includes developer interfaces such as Spyder and Jupyter Notebook.

WHY SHOULD WE USE ANACONDA?
- It is a free, open-source platform with detailed help documentation that serves a wide community.
- It eliminates package-dependency and version-control problems in projects.
- It allows us to create data science projects using various IDEs such as JupyterLab, Spyder, and RStudio.
- While creating our project files, we can easily install the Python version we want, with the features we want.
- When installing Python pip packages, it takes compatibility with other packages into account, eliminating errors that may occur from installing the wrong packages.

WHY IS PYTHON POPULAR?
When it first came out, it was not so popular in data science; it was used in scientific calculations. It became popular thanks to its libraries: scikit-learn turns machine learning algorithms into ready-to-use code, matplotlib covers plotting graphics, and pandas makes the coding part much easier and is powerful for working with text data. The number of people using Python increases day by day, which makes it easy to find solutions, and there is a lot of documentation and many tutorials.

PYTHON - R
- NumPy (Numerical Python): a Python library that allows us to work with multidimensional arrays and matrices and perform mathematical operations.
- Pandas: a software library written in Python for data processing and analysis.
- Matplotlib: the most popular library used for visualization.
- Scikit-learn: the most actively used library for machine learning algorithms.

PYTHON 2.x - PYTHON 3.x
IDEs (Integrated Development Environments) for Python 3.11: PyCharm, Spyder, Python Tools for Visual Studio, Komodo IDE, Jupyter Notebook.

PACKAGE MANAGER - VIRTUAL ENVIRONMENT
A virtual environment is used to manage packages in different projects. Using a virtual environment eliminates problems that may occur after installing packages globally.
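For example, a per-project environment can be created either with conda (Anaconda's package manager) or with Python's built-in venv module; the environment name "analytics" is just an example:

    $ conda create -n analytics python=3.11
    $ conda activate analytics
    $ conda install numpy pandas matplotlib scikit-learn

    $ python -m venv .venv
    $ .venv\Scripts\activate        (on Windows)
    $ source .venv/bin/activate     (on macOS/Linux)

Packages installed inside the environment stay isolated from the global interpreter, which is exactly the problem described above.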