WEEK 1 SLIDES .pdf
Loyalist College
Introduction to Data Analytics
Chapter 1

This Photo by Unknown author is licensed under CC BY-SA-NC.

Learning Objectives for this Lesson
- Understand big data
- Explain the key components of data analytics
- Outline the history of data analytics
- Describe information flow within a data processing chain
- Differentiate between a database and a data warehouse
- Define data mining
- Define data visualization

What is Data?
- Data: a set of values of qualitative or quantitative variables.

Our Data

Tidy Data

What is Big Data?
- Data can come from many sources and can take many forms
- Key characteristics of big data (3 V's)
  - Volume: billions of rows of data and millions of columns
  - Variety: complexity of data types and structures
    - Structured (transaction data)
    - Semi-structured (textual data files such as email)
    - Quasi-structured (textual data with erratic data formats that can be formatted with effort, tools, and time, such as web clickstream data)
    - Unstructured (no inherent structure, such as PDFs, images, and video)
  - Velocity: the speed at which data is being generated, stored, or processed
- Two other V's
  - Veracity: finding the truth in the data
  - Value: the value added or potential benefit of the data for the business

What is Driving the Big Data Revolution
- Mobile computing
- Video surveillance
- Social networking
- Gene sequencing
- Artificial Intelligence (https://www.sas.com/en_us/insights/analytics/what-is-artificial-intelligence.html)
- Internet of things (https://www.youtube.com/watch?v=uEsKZGOxNKw)

Data Analytics: Evolution and Growth (how data science became the sexiest job of the 21st century)
- The practice of data analysis has developed gradually over time, benefiting greatly from advances in computing.
- Statistics is a big part of data analytics and has a long history (operations research, resource allocation, and scheduling).
- Computing plays a crucial role in data analytics (processing power; storage, including relational databases and data warehouses; business intelligence solutions).
- Data mining
- SAS and Excel
- R technology for statisticians
- Python is a general-purpose programming language (packages for machine learning)
- Big Data
- Real-time processing (Apache framework) and Hadoop framework
- Cloud-based services (AWS)

What is Data Analytics?
- It is the process of analyzing data (small or large) using mathematics, statistics, and computer science.
- It is the process of using data to build models that aid better decision making, ultimately providing value to the decision maker.
- Providing meaning to raw data
- Collect data from various sources and use it to solve problems
- Data visualization is part of it
- Data processing
- Decision making
- Generate insights and look at patterns
- User understanding

Key Components of Data Analytics

Business Intelligence (BI)
- A set of information technology solutions
- Includes tools for gathering, analyzing, and reporting information to various users about the performance of the company and its environment
- It provides decision makers with valuable information and knowledge drawn from a variety of data sources
- Enables informed decision making
- Retail sales example
- Moneyball movie example

Pattern Recognition
- A pattern is a design or a model that appears to tell some story
- A pattern may be intuitive or purely accidental
- Patterns can be:
  - Temporal
  - Spatial
  - Functional
- A long-established pattern can be broken
- Business domain knowledge is very important in identifying patterns

Data Processing Chain

Different Sources of Data
- Operational records
- Machine logs
- Individuals telling stories
- Website data
- Government data
- Industry reports
- Paper reports
- Metadata, which is data about data
  - For a YouTube video, the time of upload, the account from which it was uploaded, and the download and view counts are all metadata about the actual data, which is the video itself

Continuous Vs Discrete
- A continuous variable can take any of the infinite number of values over a given interval.
- A continuous variable is generally measured, not counted.
  - Temperature
  - Height
  - Weight
  - Pressure
  - Force
- A discrete variable takes a countable number of values and is based on counts.
- Data can be sorted into distinct, countable, completely separate categories.
- The count value cannot be divided further on an infinite scale
  - Ratings
  - Anything that can be counted

Different Types of Data: Categorical Vs Quantitative Vs Others
- Categorical
  - Nominal data: unordered collection of values/categorical variables
  - Ordinal data: ordered values/categorical variables
- Quantitative
  - Interval data: numeric values defined in a certain range, where the difference between values is meaningful. A zero does not necessarily mean a lack of value (e.g. temperature).
  - Ratio data: numeric values with a true zero, so that ratios between values are meaningful
  - Both can be continuous or discrete
- Others
  - Binary large objects (BLOBs): audio, video, graphs

Identify the Variables (Continuous Vs Discrete)
- Travel distance
- Number of pages in a book
- Time to run a mile
- Number of phones per household
- Number of courses taken in a semester
- What about age? Can it be ordinal?

What is Datafication?
- Another term for enterprises moving towards a data-driven culture
- Almost every phenomenon is now being observed and stored
- More devices are connected to the Internet
- More people are constantly connected to their phones or the Internet
- Every click on a website and every movement of a mobile device is recorded
- Data is being stored at a more detailed resolution

Database
- A collection of data that is accessible in many ways
- A data model represents the key entities and their relationships
- The relational data model is the most popular one
- Data model for managing customer orders in a sales organization
  - Customers
  - Orders
  - Products
  - Interrelationships
- Databases have grown in size over the last few years (why?)
- Common database management systems (DBMS) include Oracle, DB2, MySQL, and Postgres.
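The customers, orders, and products data model from the Database slide can be sketched in code. The following is a minimal illustration using Python's built-in sqlite3 module, not the schema used in the course: the table names, column names, and sample rows are all assumptions made up for the example.

```python
import sqlite3

# In-memory database; every table, column, and sample row here is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # ask SQLite to enforce the relationships

conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE products (
    product_id  INTEGER PRIMARY KEY,
    title       TEXT NOT NULL,
    unit_price  REAL NOT NULL
);
-- Each order links one customer to one product (the "interrelationships").
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    product_id  INTEGER NOT NULL REFERENCES products(product_id),
    quantity    INTEGER NOT NULL
);
""")

conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Alice"), (2, "Bob")])
conn.executemany("INSERT INTO products VALUES (?, ?, ?)",
                 [(1, "Matrix", 12.0), (2, "Monty Python", 6.0)])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)",
                 [(1, 1, 1, 1), (2, 1, 2, 2), (3, 2, 2, 1)])


def spend_per_customer(connection):
    """Join the three tables and total the order value for each customer."""
    query = """
        SELECT c.name, SUM(o.quantity * p.unit_price)
        FROM orders AS o
        JOIN customers AS c ON c.customer_id = o.customer_id
        JOIN products  AS p ON p.product_id  = o.product_id
        GROUP BY c.customer_id
        ORDER BY c.name
    """
    return connection.execute(query).fetchall()


print(spend_per_customer(conn))
```

The join in `spend_per_customer` is where the relational model pays off: customer names and product prices are stored once, and orders only reference them by key.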
Data Warehouse
- An organized store of data from all over the organization
- The main purpose of a data warehouse is to help in reporting and making business decisions.
- It is a simpler version of the operational database
- Data warehouse data grows cumulatively, and the data values are not updated.
- A separate data warehouse allows analysis to run in parallel without burdening the operational databases.
- The business objective of tracking movie sales and making decisions about managing inventory can be handled by creating a simple data warehouse.

Database v/s Data warehousing

Purpose
  Database: Data stored in databases can be used for many purposes, including day-to-day operations
  Data Warehouse: Data stored in a DW is cleansed data useful for reporting and analysis

Granularity
  Database: Highly granular data including all activity and transaction details
  Data Warehouse: Lower granularity data; rolled up to certain key dimensions of interest

Complexity
  Database: Highly complex, with dozens or hundreds of data files linked through common data fields
  Data Warehouse: Typically organized around a large fact table and many lookup tables

Size
  Database: Grows with growing volumes of activity and transactions. Old completed transactions are deleted to reduce size
  Data Warehouse: Grows as data from operational databases is rolled up and appended every day. Data is retained for long-term trend analyses

Architectural choices
  Database: Relational and object-oriented databases
  Data Warehouse: Star schema or snowflake schema

Data access mechanisms
  Database: Primarily through high-level languages such as SQL. Traditional programs access the DB through Open DataBase Connectivity (ODBC) interfaces
  Data Warehouse: Accessed through SQL; SQL output is forwarded to reporting tools and data visualization tools

Data Mining
- It is the art and science of discovering useful and innovative patterns from data.
- A wide variety of patterns can be found in data
- Many different techniques can be used to find patterns in the data
- Data mining tasks take resources, as data needs to be gathered, cleaned, organized, and mined with many different techniques.
- A solid business case is needed to undertake data mining efforts.
- Data mining can be used across various fields including, but not limited to, retail, finance, supply chain, sports, and healthcare.

Data Mining
Movies Sales by Quarters – Cross-tabulation

Qtr/Product          Gone With the Wind   Matrix   Monty Python   Total Sales Amount
Q2                   $15                  0        $30            $45
Q3                   $30                  $36      $30            $96
Q4                   $15                  0        $18            $33
Total Sales Amount   $60                  $36      $78            $174

1. What is the best selling movie by revenue?
2. What is the best quarter by revenue this year?
3. Any other patterns?

Data Mining
- Data can be analyzed in many different ways
- For the previous example, if the cross-tabulation were to include customer location data, the following questions could also be answered:
  - What is the best selling geography?
  - What is the worst selling geography?
  - Any other patterns?
- The value of an insight depends upon the business problem being solved

Supervised v/s Unsupervised Learning
- General customer population (two separate questions)
  - Can we find a group of customers who are highly likely to upgrade our service in the near future?
  - Do our customers naturally fall into different groups?
- The two terms are borrowed from the field of machine learning
- When a specific purpose or target is specified for the grouping, use supervised data mining techniques
- When no specific purpose or target is specified for the grouping, use unsupervised data mining techniques

Data Mining Techniques
- Supervised Learning (works with labelled data)
  - Classification (e.g. decision trees)
  - Regression
  - Artificial Neural Networks
- Unsupervised Learning (works with unlabeled data)
  - Cluster Analysis
  - Association Rule Mining

Data Visualization
- Data visualization helps in absorbing information in real time across various levels within an organization, e.g. dashboards
- Recognize and eliminate the clutter around your information
- Aids in communicating the patterns or results to the intended audience in an effective manner
- Directs the attention of the intended audience to the most relevant parts of the information

Business Drivers for data analytics
- Data Science and big data analytics, EMC, 2008, pg 11

What Have we Learned
- Big data
- Business intelligence and the data mining cycle
- Data processing chain
- Data mining techniques
- Data visualization and why it is important
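
Worked example: the movie sales cross-tabulation from the Data Mining slides can be recomputed from individual sales records. The sketch below uses only the Python standard library; the revenue figures are the ones in the slide table, while the record layout and the helper name `totals_by` are assumptions chosen for illustration.

```python
from collections import defaultdict

# (quarter, movie, revenue in dollars), transcribed from the slide's table.
# The tuple layout is an assumption made for this sketch.
sales = [
    ("Q2", "Gone With the Wind", 15), ("Q2", "Matrix", 0), ("Q2", "Monty Python", 30),
    ("Q3", "Gone With the Wind", 30), ("Q3", "Matrix", 36), ("Q3", "Monty Python", 30),
    ("Q4", "Gone With the Wind", 15), ("Q4", "Matrix", 0), ("Q4", "Monty Python", 18),
]


def totals_by(records, key_index):
    """Sum revenue grouped by one field of each record (0 = quarter, 1 = movie)."""
    totals = defaultdict(int)
    for record in records:
        totals[record[key_index]] += record[2]
    return dict(totals)


by_quarter = totals_by(sales, 0)  # row totals of the cross-tabulation
by_movie = totals_by(sales, 1)    # column totals of the cross-tabulation

# Questions 1 and 2 from the slide: pick the key with the largest total.
best_movie = max(by_movie, key=by_movie.get)
best_quarter = max(by_quarter, key=by_quarter.get)

print("Revenue by movie:", by_movie)
print("Revenue by quarter:", by_quarter)
print("Best selling movie:", best_movie)
print("Best quarter:", best_quarter)
```

Adding a fourth field such as customer location to each record, and grouping on it the same way, would answer the geography questions raised on the following slide.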