COS10022 2024 Semester 2 Final Assessment Revision PDF
Summary
This document is a revision guide for the 2024 data science final exam (COS10022). It covers data types and data structures, the differences between data science and related concepts such as data warehousing, the characteristics of big data (the four V's), the components of the big data ecosystem, data devices and their purpose, and a general overview of data science and the big data ecosystem. Question formats include multiple-choice, multiple-answer, and matching questions.
Full Transcript
Final Online Test (30%) Date (During Week 12 Tutorial Sessions): Group 1: Tuesday (26th November) at A312, 8 am - 10 am Group 2: Thursday (28th November) at B219, 1 pm - 3 pm Group 3: Friday (29th November) at A312, 10 am - 12 pm Group 4: Wednesday (27th November) at A312, 8 am - 10 am Group 5: Friday (29th November) at B219, 2 pm - 4 pm The use of Mobile Phone and ChatGPT is prohibited during the test. You must participate in this test in person on campus. About the Assessment: This is a close-book test consisting of 50 questions (120 points). 40 questions x 2 points = 80 points 10 questions x 4 points = 40 points Format: Multiple choice questions (with ONE correct answer) Multiple choice questions (with up to TWO correct answers) Matching questions 2-point 4-point question question Overview of Data Science and The Big Data Ecosystem COS10022 Data Science Principles What are the components in the Big Data Ecosystem? COS10022 Data Science Principles What are the differences between data devices, data collectors, data aggregators, and data users and buyers? Data Devices Gather data from multiple locations. Continuously generate new data about this data. For each Gigabyte created for this data, an additional Petabyte of data is created about that data. Consider Data generated from someone playing an online video game through a PC, game console (PlayStation, Xbox, Nintendo Wii), or smartphone. COS10022 Data Science Principles What are the differences between data devices, data collectors, data aggregators, and data users and buyers? Data Collectors Entities that collect data from the device and users. Consider Cable TV provider tracks: the shows a person watches which TV channels someone will and will not pay for to watch on demand the prices someone is willing to pay for premium TV content COS10022 Data Science Principles What are the differences between data devices, data collectors, data aggregators, and data users and buyers? Data Aggregators Entities that compile and make sense of the data collected by data collectors. Transform and package the data as products to sell. Website: Falcon YouTube: Falcon Social Listening COS10022 Data Science Principles What are the differences between data devices, data collectors, data aggregators, and data users and buyers? Data Users and Buyers Direct benefactors of the data collected and aggregated by others within the data value chain. Examples Corporate customers Information brokers Analytic services Credit bureaus Media archives Catalog co-ops Advertising Checkout Falcon’s customers: https://www.falcon.io/our-customers/ COS10022 Data Science Principles What are the four V’s (characteristics) of Big Data? Big Data: What is Big Data is data whose scale, distribution, diversity, and/or timeliness require the use of new technical architectures and analytics to enable insights that unlock new sources of business value. Scale Distribution Diversity Timeliness Four V’s of Big Data 1. Volume (scale) 2. Variety (diversity, distribution) 3. Velocity (timeliness) 4. Veracity (i.e. pertaining to the accuracy of data) COS10022 Data Science Principles What are the differences between Data Science and Data Warehouse? Data Science vs Enterprise Data Warehouse Data Warehouse (DW) is a relational database that is designed for query and analysis rather than for transaction processing. Contains selective, cleaned, and transformed historical data. 
ETL (Extraction , Transformation, and Loading) OLAP (On-Line Analytical Processing) Supports enterprise decision making Image: https://upload.wikimedia.org/wikipedia/commons/4/46/Data_warehouse_overview.JPG Source: https://docs.oracle.com/cd/B10500_01/server.920/a96520/concept.htm COS10022 Data Science Principles What are the differences between Data Science and Data Warehouse? Data Science vs Enterprise Data Warehouse Limitations of the Enterprise DW analytics : 1. High-value data is hard to reach and leverage. Low priority for data science projects. 2. Data usually moves in batches from DW to local analytics tools (e.g. R, SAS, Excel). In-memory analytics; dataset size constraints. 3. Data science projects remain isolated and ad hoc, rather than centrally managed. Source: Dietrich, D. ed., 2015, page 14 Data Science& Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data. Data science initiatives not aligned with corporate strategic business goals. COS10022 Data Science Principles What are the differences between Data Science and Data Warehouse? Big Data vs Enterprise DW / Business Intel. The four V’s of Big Data will not work well with the traditional Enterprise Data Warehouse. Centralized, purpose-built space (lack of agility) Supports Business Intelligence and reporting (restrict robust analyses) Analysts must depend on IT group and DBAs for data access (lack of control) Analysts must spend significant time to aggregate and dis-aggregate data from multiples sources (reduces timeliness) To succeed, Big Data analytics require different approaches. COS10022 Data Science Principles What are the purposes of analytic sandbox in Data Science projects? Analytic Sandbox (Workspaces) Resolve the conflict between the needs of analysts and the traditional EDW or other formally managed corporate data. Data assets gathered from multiple sources and technologies for analysis. Enables flexible, high performance analysis in nonproduction environment. Reduces costs and risks associated with data replication into `shadow’ file systems. `Analyst owned’ rather than `DBA owned’. COS10022 Data Science Principles What are the differences between DS and BI? The applications of DS and BI. Data Science vs Business Intelligence Source: Dietrich, D. ed., 2015 Data Science& Big Data Analytics: Discovering, Analyzing, Visualizingand Presenting Data. COS10022 Data Science Principles What are the different data structures? Examples of difference data structures. Big Data: Data Structures Source: http://www.tsmtutorials.com/2016/06/data-and-information-basics.html COS10022 Data Science Principles What are the different data structures? Examples of difference data structures. Big Data: Data Structures Source: http://www.tsmtutorials.com/2016/06/data-and-information-basics.html COS10022 Data Science Principles What are the different data structures? Examples of difference data structures. Big Data: Data Structures Source: http://www.tsmtutorials.com/2016/06/data-and-information-basics.html COS10022 Data Science Principles What is Data Science? What is the principal goal of Data Science? Data Scientist: What is Defined by usage In Academia: “a scientist, trained in anything from social science to biology, who works with large amounts of data, and must grapple with computational problems posed by the structure, size, messiness, and the complexity and nature of the data, while simultaneously solving a real problem.” (O’Neil & Schutt 2014) COS10022 Data Science Principles What is Data Science? 
What is the principal goal of Data Science? Data Scientist: What is In Industry: “a data scientist is someone who knows how to extract meaning from and interpret data, which requires both tools and methods from statistics and machine learning, as well as being human.” (O’Neil & Schutt 2014) COS10022 Data Science Principles What are the different data types? Data Types Before storing the data, let’s take a look at what kinds of data are commonly seen in our daily life. Text Data Image Data Audio Data Streaming/Video Data 3-D Image Trajectory Data COS10022 Data Science Principles What are the different data types? Written in limited symbols. Usually coded by ASCII, Unicode, or other coding Text standards. Data Information can usually be extracted by semantic ASCII code uses 8 bits to encode the symbols analysis. while Unicode uses 8, 16, or 32 bits to include Generally much more symbols in different languages. presented in 1-D form. COS10022 Data Science Principles What are the different data types? Can be treated as time-series data. The amplitude corresponds to the Audio volume. ◦ In fact, the audio signal is composed of two parts: the Data dc component and the ac component. The ◦ The DC part serves as a bias for raising the level of frequency volume. It is usually removed before analysing the corresponds signal. to the pitch (tone). Generally presented in 1-D form. https://pudding.cool/2018/02/waveforms/ COS10022 Data Science Principles What are the different data types? Digital image is subsampled from the real-world continuous space. Pixel is the basic unit for representing a Image digital image. 𝑥𝑥𝑥𝑥 Data Generally presented in 2-D form. Some models used for storing imagesisolate the brightnessfrom the colour channelswhile some processes the colourwith 𝑦𝑦𝑦𝑦 the brightness. COS10022 Data Science Principles What are the different data types? Instead of using “Pixel” to represent a point, the 3D image uses “Voxel” to represent a point. 3-D Image Different layers can be retrieved by slicing the 3-D image along different axis COS10022 Data Science Principles What are the different data types? Digital video and streaming data are subsampled from the real- world continuous space. Each frame is an image, which is combined by the timeline. Video/ Streaming 𝑥𝑥𝑥𝑥 Data Generally presented in 3-D 𝑡𝑡𝑡𝑡 form. Again, the colour information may be processed independently from the brightness. 𝑦𝑦𝑦𝑦 COS10022 Data Science Principles What are the different data types? Trajectory data is usually collected by GPS and is more like a log file of a specific object’s movement. The information contained in a single data point include the geolocation and the time stamp. Trajectory Data Will be much more meaningful if combined with other data in the analytics. Generally presented in 3-D form. COS10022 Data Science Principles What are the different data types? Categories Text 1-D Audio Category Data Text Data 2-D Image Audio Streaming/Video Data 3-D Image Trajectory Numerical Data Streaming/Video Continuous Trajectory Data Discrete Etc. COS10022 Data Science Principles How to Store Data For Your Data Science Process? Structured or Unstructured Data? Structured Data Structured data is arranged in a specific manner in tables. It is more suitable for structured database such as MySQL, PostgreSQL, etc. 
Semi-structured Data Semi-structured data is a form of structured data that does not obey the formal structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. Unstructured Data Unstructured data is information that either does not have a predefined data model or is not organised in a pre-defined manner. Unstructured data is typically text-heavy. Unstructured database such as MongoDB is generally used to store the unstructured data. COS10022 Data Science Principles How to Store Data For Your Data Science Process? How to Store Data For Your Data Science Process? Identify Your Goals Identifying the goals is the first step in process of data storage because all following steps will depend on this. Big Data or Small Data After having a clear cut goal, you need to decide which type of data you need. It’s totally depending on your goal and the available resources. Avoid Data Fatigue Try not to over storage or measurement of data which is useless for you or doesn’t align with your data collection goals. Data Management Chose either SQL or NoSQL way to manage the stored data. COS10022 Data Science Principles https://medium.com/@saeeddev/learn-how-to-store-data-for-your-data-science-problems-even-if-you-dont-want-to-27e02dd9f781 Data Analytics Lifecycle COS10022 Data Science Principles What are the 6 phases of the data analytics lifecycle? Data Analytics Lifecycle COS10022 Data Science Principles What are the 6 phases of the data analytics lifecycle? Data Analytics Lifecycle Emphasises the following principles of data science best practices: a. Data science project is iterative. It is possible to move forward or backward between most phases in the lifecycle. A project work can occur in several phases at once. b. The best gauge of advancing to the next phase is to ask key questions to test whether the data science team has accomplished enough to move forward. c. Ensure teams do the appropriate work both up front, and at the end of the projects, in order to succeed. Too often teams focus on Phases 2 to 4, and want to jump into doing modeling work before they are ready. COS10022 Data Science Principles What are the key activities in Phase 1-Discovery? Phase 1 - Discovery Data science team learns the business domain, and assesses the resources available to support the project in terms of people, technology, time, and data. Important activities include framing the business problem as an analytics challenge that can be addressed in subsequent phases and formulating initial hypotheses (IHs) to test and begin learning the data. COS10022 Data Science Principles What are the key activities in Phase 1-Discovery? Phase 1 - Discovery Key activities: 1. Learning the business domain Determine how much business or domain knowledge that a data scientist needs in order to develop models in Phases 3 and 4. To decide the resources needed for the project team; To ensure that the team has the right balance of domain knowledge and technical expertise. COS10022 Data Science Principles What are the key activities om Phase 1-Discovery? Phase 1 - Discovery Key activities: 2. Assessing the resources available to support the project Consider the available tools and technology the team will be using, and the types of systems needed for later phases. Resources include technology, tools, systems, data, and people. 
Take inventory of the types of data available to the team for the project. Ensure the data science team has the right mix of domain experts, customers, analytic talent, and project management.

Phase 1 - Discovery. Key activities: 3. Framing the problem
Framing is the process of stating the analytics problem to be solved. Write down the problem statement and share it with the key stakeholders. Identify the main objectives of the project, identify what needs to be achieved in business terms, and identify what needs to be done to meet the needs. Establish failure criteria.

Key activities: 4. Identifying key stakeholders
Stakeholders include anyone who will benefit from the project or will be significantly impacted by the project. Articulate the 'pain points' to be addressed. Outline the type of activity and participation expected from each stakeholder.

Key activities: 5. Interviewing the Analytics Sponsor
What business problem is the team trying to solve? What is the desired outcome of the project? What data sources are available? What industry issues may impact the analysis? What timelines need to be considered? Who could provide insight into the project? Who has the final decision-making authority on the project? How will the focus and scope of the problem change if time, people, risk, resources, or the size and attributes of the data change?

Key activities: 6. Developing Initial Hypotheses (IHs)
Form ideas that can be tested with data. Start with just a few primary hypotheses/ideas, then develop several more. Compare their answers with the outcome of an experiment. Gather and assess hypotheses from stakeholders and domain experts. H0: "Stores located in city areas have higher sales because of higher demand for household products"

Key activities: 7. Identifying potential data sources
Identify data sources. Capture aggregate data sources. Review the raw data. Evaluate the data structures and tools needed. Scope the sort of data infrastructure needed for this type of problem.

Phase 1 → Phase 2
The Data Analytics Lifecycle is intended to accommodate ambiguity, which reflects most real-life situations. The data science team can move to the next phase if it has enough information to draft an analytics plan and share it for peer review. Do I have enough information to draft an analytic plan and share it for peer review?

What are the key activities in Phase 2 - Data Preparation? Phase 2 – Data Preparation
Phase 2 requires the presence of an analytics sandbox, in which the data science team works with data and performs analytics for the duration of the project. The team performs ETLT to get the data into the sandbox and familiarizes itself with the data thoroughly. ETL + ELT = ETLT (Extract, Transform, Load combined with Extract, Load, Transform)

Phase 2 – Data Preparation. Key activities: 1.
Preparing the analytic sandbox An analytic sandbox (or, workspace) allows the data science team to explore the data without interfering with live production databases. Collect all kinds of data into the sandbox, ranging from summary- level aggregated data, structured data, raw data feeds, and unstructured text data from call logs/web logs. IT group may require justification to develop an analytic sandbox. Expect the sandbox to be LARGE (5× – 10× of original datasets). COS10022 Data Science Principles What are the key activities in Phase 2-Prepation? Phase 2 – Data Preparation Key activities: 2. Performing ETLT ETLT is a combined approach where the team may choose to perform either ETL (Extract, Transform, Load) or ELT (Extract, Load, Transform) when populating the sandbox. This choice depends on the team’s specific goals. Consider how to parallelize the movement of big datasets into the sandbox (Big ETL), e.g., Hadoop, MapReduce, Twitter API. Prior to data movement, determine the kinds of transformation to be performed on the data. Perform data gap analysis. COS10022 Data Science Principles What are the key activities in Phase 2-Prepation? Phase 2 – Data Preparation Key activities: 3. Learning about the data Understand the acceptable range of values, expected output, and data entry errors. Identify additional data sources that the team can leverage but currently unavailable. It is advisable to build a dataset inventory. COS10022 Data Science Principles What are the key activities in Phase 2-Prepation? Phase 2 – Data Preparation A sample of dataset inventory Source: Dietrich, D. ed., 2015. Data Science & Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data. EMC Education Services. COS10022 Data Science Principles What are the key activities in Phase 2-Prepation? Phase 2 – Data Preparation Key activities: 4. Data conditioning The process of cleaning data, normalizing datasets, and performing transformations on the data. Join or merge different datasets to be ready for analyses. Decide which aspects of datasets will be useful to analyze in later steps, i.e. which one to keep or discard. COS10022 Data Science Principles Phase 2 → Phase 3 Do I have enough good quality data to start building the model? The data science team can move to the next phase if it has become deeply knowledgeable about the data. COS10022 Data Science Principles What are the key activities in Phase 3-Model Planning? Phase 3 – Model Planning The data science team determines the methods, techniques, and workflow it intends to follow for the subsequent model building phase. The team also explores the data to learn about the relationships between variables, and subsequently selects key variables and the most suitable models. COS10022 Data Science Principles Phase 3 → Phase 4 Do I have a good idea about the type of model to try? Can I refine the analytic plan? The data science team can move to the next phase once it has a good idea about the type of model to try and the team has gained enough knowledge to refine the analytics plan. COS10022 Data Science Principles What are the key activities in Phase 4: Model Building? Phase 4 – Model Building The data science team develops datasets for testing, training, and production purposes. The team builds and executes models based on the work done in the model planning phase. The team also considers the sufficiency of the existing tools to run the models, or whether a more robust environment for executing the models is needed (e.g. 
fast hardware, parallel processing, etc.). COS10022 Data Science Principles Phase 4 → Phase 5 The data science team can move to the next phase if the model is sufficiently robust to solve the problem, or if the team has failed. COS10022 Data Science Principles What are the key activities in Phase 5: Communicate Results? Phase 5 – Communicate Results The data science team, in collaboration with major stakeholders, determines if the results of the project are a success or a failure based on the criteria developed in Phase 1. The team should identify key findings, quantify the business value, and develop a narrative to summarize and convey findings to stakeholders. COS10022 Data Science Principles Phase 5 → Phase 6 The data science team can move to the next phase if the team has documented and reported the key findings, and major insights have been derived from the analysis. COS10022 Data Science Principles What are the key activities in Phase 6: Operationalize? Phase 6 – Operationalize The data science team delivers final reports, briefings, code, and technical documents. The team may run a pilot project to implement the models in a production environment. COS10022 Data Science Principles Data Preparation COS10022 Data Science Principles What causes noisy data? Examples of noisy data. Noisy Data Noise is a random error or variance in a measured variable. Att. Noise Class Noise Incorrect attribute values may be due to: Att1 Att2 Class faulty data collection instruments 0.25 Red Positive Class Noise: data entry problems 0.25 Red Negative Contradictory examples data transmission problems 0.99 Green Negative Mislabeled examples technology limitation 1.02 Green Positive Attribute Noise: inconsistency in naming convention 2.05 ? Negative Erroneous values Missing values ? Green Positive Other data problems which require data cleaning: duplicate records incomplete data inconsistent data 56 How to handle noisy data? Noisy Data How to Handle Noisy Data? 1. Binning This method smooth a sorted data value by consulting its “neighborhood”, that is, the values around it. The sorted values are distributed into a number of “buckets”, or “bins”. The data for price are first sorted and then partitioned into equal frequency bins of size 3 (i.e. each bin contains three values). Each original value in a bin is replaced by the mean value of the bin (i.e. the value 9). The min. and max. values in a given bin are identified. Each bin value is then replaced by the closed boundary value. How to handle missing data? Incomplete Data How to Handle Missing Data? 1. Ignore the tuple This is usually done when class label is missing (assuming the mining task involves classification). This method is not very effective, unless the tuple contains several attributes with missing values. 2. Fill in the missing values manually This method is time consuming and may not be feasible given a large dataset with many missing values. 3. Use a global constant to fill in the missing values Replace all missing attribute values using the same constant (such as “Unknown”, “N/A”). A mining program may mistakenly think that they form an interesting concept since they all have a value in common. How to handle missing data? Incomplete Data How to Handle Missing Data? 4. Use a measure of central tendency for the attribute (e.g. the mean or median) to fill in the missing values For normal data distributions, the mean can be used, while skewed data distribution should employ the median. 5. 
Use the attribute mean or median for all samples belonging to the same class as the given tuple. E.g. If classifying customers according to credit risk, replace the missing value with the average income value for customers in the same credit risk category as that of the given tuple.

How to Handle Missing Data? 6. Use the most probable value to fill in the missing value
This may be determined with regression, inference-based tools using Bayesian formalism, or decision tree induction. E.g. Using the other passenger attributes in the Titanic dataset, you may construct a decision tree to predict the missing values for sibsp.

How to detect data discrepancies? Data Discrepancies
The first step in data cleaning as a process is discrepancy detection. Discrepancies can be caused by: Poorly designed data entry forms; Human errors in data entry; Deliberate errors, e.g. respondents not wanting to divulge information about themselves; Data decay, e.g. outdated addresses; Errors in instrumentation devices that record data; System errors; Inconsistencies due to data integration, e.g. where a given attribute can have different names in different databases.

How to Detect Data Discrepancies?
1. Metadata: Use any knowledge that you may already have regarding properties of the data. E.g. What are acceptable values for each attribute? Do all values fall within the expected range? What are the data type and domain of each attribute?
2. Check uniqueness rule, consecutive rule and null rule. Unique rule: Each value of the given attribute must be different from all other values for that attribute. Consecutive rule: There can be no missing values between the lowest and highest values for the attribute, and all values must also be unique. Null rule: Specifies the use of blanks, question marks, special characters, or other strings that may indicate the null condition.
3. Use commercial tools. Data scrubbing: use simple domain knowledge (e.g. postal code, spell-check) to detect errors and make corrections. Data auditing: analyze data to discover rules and relationships and to detect violators (e.g. correlation and clustering to find outliers).

Learn to interpret correlation coefficient and covariance. Data Integration
Redundant data often occurs when integrating multiple databases. Object identification: The same attribute or object may have different names in different databases. Derivable data: One attribute may be a "derived" attribute in another table (e.g. annual revenue). Redundant attributes may be detected by correlation analysis. The analysis measures how strongly one attribute implies the other, based on the available data. For categorical data, the Χ2 (Chi-Square) test is used. For numerical data, the Correlation Coefficient and Covariance are used.

How to Detect Redundant Attributes? 1. Correlation Coefficient (r) for Numerical Data
Also called Pearson's Product Moment Coefficient.
$r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n}(a_i b_i) - n\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}$
where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, σA and σB are the respective standard deviations of A and B, and Σ(aibi) is the sum of the AB cross-product.
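To make the formula above concrete, here is a minimal NumPy sketch; the two attribute vectors are made-up values for illustration only. It computes r both from the definition and from the computational shortcut, and cross-checks the result against np.corrcoef.

```python
import numpy as np

# Two numeric attributes with made-up sample values (for illustration only).
A = np.array([2.0, 3.0, 4.0, 5.0, 6.0])
B = np.array([1.0, 2.5, 3.5, 5.5, 7.0])

n = len(A)
mean_A, mean_B = A.mean(), B.mean()
std_A, std_B = A.std(ddof=1), B.std(ddof=1)   # sample std, matching the (n - 1) term

# Pearson's r from the definition ...
r = np.sum((A - mean_A) * (B - mean_B)) / ((n - 1) * std_A * std_B)
# ... and from the computational shortcut sum(a_i * b_i) - n * mean_A * mean_B.
r_shortcut = (np.sum(A * B) - n * mean_A * mean_B) / ((n - 1) * std_A * std_B)

print(round(r, 4), round(r_shortcut, 4))      # both forms give the same value
print(round(np.corrcoef(A, B)[0, 1], 4))      # NumPy's built-in agrees
```

In practice, pandas' DataFrame.corr() returns the same coefficients for every pair of numeric attributes at once.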
Interpretation: rA,B > 0 : Positively correlated. rA,B = 0 : Independent. rA,B < 0 : Negatively correlated.

Visually evaluating correlation using scatter plots. Scatter plots showing the correlation coefficient from -1 to 1: r = 1.0 : A perfect positive relationship; r = 0.8 : A fairly strong positive relationship; r = 0.6 : A moderate positive relationship; r = 0.0 : No relationship; r = -1.0 : A perfect negative relationship.

How to Detect Redundant Attributes? 2. Covariance (Cov) for Numerical Data
Consider two numeric attributes A and B, and a set of n observations {(a1, b1), …, (an, bn)}. The mean values of A and B are also known as the expected values of A and B, that is:
$E(A) = \bar{A} = \frac{1}{n}\sum_{i=1}^{n} a_i$ and $E(B) = \bar{B} = \frac{1}{n}\sum_{i=1}^{n} b_i$
The covariance between A and B is defined as:
$Cov(A,B) = E\big((A-\bar{A})(B-\bar{B})\big) = \frac{1}{n}\sum_{i=1}^{n}(a_i-\bar{A})(b_i-\bar{B})$
It can be simplified in computation as:
$Cov(A,B) = E(A \cdot B) - \bar{A}\,\bar{B}$

Visually evaluating covariance between two variables using a scatter plot:
Cov(A, B) < 0 : A and B tend to move in opposite directions.
Cov(A, B) > 0 : A and B tend to move in the same direction.
Cov(A, B) = 0 : A and B are independent.
Note that zero covariance does not necessarily mean that the variables are independent. A non-linear relationship can exist that still results in a covariance value of zero.

What are the different data reduction methods? Data Reduction Strategies
1. Data cube aggregation: Aggregation operations are applied to the data in the construction of a data cube.
2. Attribute subset selection: Irrelevant, weakly relevant, or redundant attributes or dimensions may be detected and removed.
3. Dimensionality reduction: Encoding mechanisms are used to reduce the data set size.
4. Numerosity reduction: The data are replaced or estimated by alternative, smaller data representations.
5. Discretization and concept hierarchy generation: Raw data values for attributes are replaced by ranges or higher conceptual levels.

What are the different data transformation methods? Data Transformation
Data transformation strategies:
1. Smoothing: Remove noise from data using techniques such as binning, regression and clustering.
2. Attribute/feature construction: Construct new attributes from the given set of attributes.
3. Aggregation: Construct data cubes.
4. Normalization: Scale the attribute data to fall within a smaller, specified range such as -1.0 to 1.0, or 0.0 to 1.0.
5. Discretization: Replace raw values of a numeric attribute (e.g. age) with interval labels (e.g. 0-10, 11-20) or conceptual labels (e.g. youth, adult, senior).
6. Concept hierarchy generation for nominal data: Generalize attributes such as street to higher-level concepts such as city or country.

What are the different data normalisation methods? Normalization
Min-Max: Transforms the data into a desired range, usually [0, 1]. $v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A$, where [minA, maxA] is the initial range and [new_minA, new_maxA] is the new range. E.g. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to (73,600 − 12,000)/(98,000 − 12,000) × (1.0 − 0) + 0 = 0.716.
Z-score: Useful when the actual min and max of the attribute are unknown. $v' = \frac{v - \mu_A}{\sigma_A}$, where μA and σA are the mean and standard deviation of the initial data values. E.g. Let μ = $54,000 and σ = $16,000. Then $73,600 is transformed to (73,600 − 54,000)/16,000 = 1.225.
Decimal scaling: Transforms data into a range between [-1, 1]. $v' = \frac{v}{10^j}$, where j is the smallest integer such that Max(|v'|) < 1. E.g. Suppose that the values of A range from -986 to 917. Divide each value by 1000 (i.e. j = 3): -986 normalizes to -0.986 and 917 normalizes to 0.917.
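As a quick check of the three normalization formulas above, here is a small NumPy sketch; the income array is assumed for illustration and reuses the slide's example figures ($12,000-$98,000 range, μ = $54,000, σ = $16,000, and the -986 to 917 range for decimal scaling).

```python
import numpy as np

# Income values assumed for illustration; they mirror the worked examples above.
income = np.array([12_000, 54_000, 73_600, 98_000], dtype=float)

# Min-max normalization to the new range [0.0, 1.0].
v_minmax = (income - income.min()) / (income.max() - income.min()) * (1.0 - 0.0) + 0.0

# Z-score normalization with the slide's parameters (mu = 54,000, sigma = 16,000).
v_zscore = (income - 54_000.0) / 16_000.0

# Decimal scaling on the slide's second example: values ranging from -986 to 917.
a = np.array([-986.0, 917.0])
j = int(np.ceil(np.log10(np.abs(a).max() + 1)))   # smallest j with max(|v'|) < 1
v_decimal = a / 10**j

print(v_minmax.round(3))    # 73,600 -> about 0.716
print(v_zscore.round(3))    # 73,600 -> about 1.225
print(j, v_decimal)         # j = 3; -986 -> -0.986, 917 -> 0.917
```

For whole feature matrices, scikit-learn's MinMaxScaler and StandardScaler implement the first two methods directly.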
What are the different data discretization methods? Discretization
Data discretization transforms numeric data by mapping values to interval or concept labels.
Attribute types:
Categorical - Nominal: Categories are mutually exclusive and unordered. E.g. Sex (male/female), Blood Group (A/B/AB/O).
Categorical - Ordinal: Categories are mutually exclusive and ordered. E.g. Disease Stage (mild/moderate/severe).
Numerical - Continuous: Takes any value in a range of values. E.g. Weight in kg, Height in cm.
Numerical - Discrete: Integer values, typically counts. E.g. Days sick per year.
Discretization techniques: Binning, Histogram analysis, Cluster analysis, Decision tree analysis, Correlation analysis. For nominal data: Concept hierarchy.

Binning: A top-down unsupervised splitting technique based on a specified number of bins.
Histogram: An unsupervised method to partition the values of an attribute into disjoint ranges called buckets or bins.
Cluster Analysis: A clustering algorithm can be applied to discretize a numerical attribute by partitioning the values of that attribute into clusters or groups (unsupervised, top-down split or bottom-up merge). Partition the dataset into clusters based on similarity. Effective if data is clustered, but not if data is "smeared". Cluster analysis using k-means (Lecture 6).
Decision tree analysis: Uses a top-down splitting approach. Supervised: makes use of the class label (e.g. cancerous vs. benign). Uses entropy to determine the split point (discretization point: the resulting partition contains as many tuples of the same class as possible).
Correlation analysis: Uses a bottom-up merge approach. Supervised: makes use of the class label (e.g. spam vs. genuine). ChiMerge: Find the best neighboring intervals (those having similar distributions of classes, i.e., low χ2 values) to merge.

Discretization (Concept hierarchy generation for categorical data)
Nominal attributes have a finite (but possibly large) number of distinct values, with no ordering among the values. E.g. geographic_location, job_category, and item_type. Concept hierarchies can be used to transform data into multiple levels of granularity. Concept hierarchy formation: Recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) with higher-level concepts (such as youth, adult, or senior).

Four methods for the generation of concept hierarchies:
1. Specification of a partial ordering of attributes explicitly at the schema level by users or experts. E.g. street < city < state < country
2. Specification of a portion of a hierarchy by explicit data grouping. E.g. {Urbana, Champaign, Chicago} < Illinois
3. Specification of a set of attributes: the system automatically generates the partial ordering by analysis of the number of distinct values; the attribute with the most distinct values is placed at the lowest level of the hierarchy. E.g. street < town < country
4. Specification of only a partial set of attributes. E.g.
only street < town, not others Basic Data Analytics Methods COS10022 Data Science Principles What are the different data visualisation methods? Survey and Visualize Aisch, G 2012, ‘Using Data Visualization to Find Insights in Data’, in Data Journalism Handbook (ed.),O’Reilly Media Link: http://datajournalismhandbook.org/1.0/en/understanding_data_7.html Four (4) important types of data visualization: Tables are very powerful in dealing with a relatively small number of data points. Charts allow mapping multiple dimensions of the data to visual properties of geometric shapes. Maps can powerfully connect data to the physical world. Graphs (networks) show the interconnections between various types of real world objects. What is EDA? Why is EDA important? Part A: Exploratory Data Analysis (EDA) EDA is an approach to analyzing datasets to summarize their main characteristics, often with visual methods. Why is EDA important? Gain new insight Explore data structures Detect missing data Check significant variables Examine relationship between variables Select an appropriate model Check model assumptions Learn to interpret descriptive statistics Descriptive Statistics Descriptive statistics quantitatively describe the main features of data. Main data features: Measures of central tendency – represent a ‘center’ around which measurements are distributed. E.g. mean and median Measures of variability – represent the ‘spread’ of the data from the ‘center’. E.g. standard deviation Measures of relative standing – represent the ‘relative position’ of specific measurements in the data. E.g. quantiles Learn to interpret descriptive statistics Descriptive Statistics Mean Median Sum all the numbers and divide The exact middle value. by their count X = (x1 + x2 + … + x3) /n When the count is odd, find the middle value of sorted data. E.g. Mean = (2 + 3 + 4 + 5 + 6) /5 E.g. For Data 1, the median is 4. =4 When the count is even, find the means of the middle two values. E.g. For Data 2, the median is (3+4)/2 = 3.5. Learn to interpret descriptive statistics Descriptive Statistics Mean VS. Median When data distribution is skewed, median is more meaningful than mean. When data has outliers, median is more robust. The blue data point is the outlier in data 2. For Data 1, Mean = 4, Median = 4 For Data 2, Mean = 4.8, Median = 4 Learn to interpret descriptive statistics Descriptive Statistics Standard Deviation (SD) Computation steps: Compute mean Compute the deviation of each measurement from the mean Mean = 4 Square the deviations Deviations: -2, -1, 0, 1, 2 Sum the squared deviations Squared deviations: 4, 1, 0, 1, 4 Divide by (count-1) Sum = 10 Compute the square root. Standard deviation = √(10/4) = 1.58 Learn to interpret descriptive statistics Descriptive Statistics Quartiles 1st quartile is the measurement with 25% measurements smaller and 75% larger – lower quartile (Q1). 2nd quartile is the median. 3rd quartile is the measurements with 75% measurements smaller and 25% larger – upper quartile (Q3). Inter quartile range (IQR) is the difference between Q3 and Q1, i.e. Q3-Q1. Why is Data Visualisation important? Visualization before Analysis Importance of graphs in statistical analyses: Anscombe’s Quartet consists of four datasets with nearly identical statistical properties. 
Statistical Property : Value
Mean of x : 9
Variance of x : 11
Mean of y : 7.50 (to 2 decimal places)
Variance of y : 4.12 or 4.13 (to 2 decimal places)
Correlation between x and y : 0.816
Linear regression line : y = 3.00 + 0.50x (to 2 decimal places)
One might conclude that these four datasets are quite similar. However…

Why is Data Visualisation important? Visualization before Analysis
Each dataset is plotted as a scatterplot, and the fitted lines are the result of applying linear regression models. The figures show that: The regression line fits Dataset 1 reasonably well. Dataset 2 is definitely nonlinear. Dataset 3 exhibits a linear trend, with one apparent outlier at x = 13. For Dataset 4, with only points at two x values, it is not possible to determine whether the linearity assumption is proper.

What are questions that can be answered using hypothesis testing?
Statistics can help answer the following data analytics questions:
Model Building: What are the best input variables for the model? Can the model predict the outcome given the input?
Model Evaluation: Is the model accurate? Does the model perform better than an obvious guess? Does the model perform better than other models?
Model Deployment: Is the prediction sound? Does the model have the desired effect (e.g. reducing cost)?

Hypothesis Testing
A common technique used to assess the difference of the means from two samples of data, or the significance of the difference. (Figure: distributions of two samples of data.) The basic concept is to form an assertion and test it with data. The common assumption is that there is no difference between the two samples. Statisticians refer to this as the null hypothesis (H0). The alternative hypothesis (HA) is that there is a difference between the two samples. A hypothesis test leads to either rejecting the H0 in favour of the HA, or not rejecting the H0. Difference of Means.

What causes Type I and Type II errors? Type I and Type II Errors
A hypothesis test may result in two types of errors.
Type I Error: Rejection of the null hypothesis when the null hypothesis is TRUE.
Type II Error: Acceptance of the null hypothesis when the null hypothesis is FALSE.
The probability of the Type I error is denoted by the Greek letter α. The probability of the Type II error is denoted by the Greek letter β.
The significance level is equivalent to the probability of the Type I error. For a significance level such as α = 0.05, if the H0 is TRUE (µ1 = µ2), there is a 5% chance that the observed T value based on the sample data will be large enough to reject the H0. By selecting an appropriate significance level, the probability of committing a Type I error can be defined before any data is collected and analyzed. (Discussion question: Is a Type I or a Type II error more dangerous? Justify your answer.) To reduce the probability of a Type II error to a reasonable level, it is often necessary to increase the sample size.

Association Rules Mining
What is association rules mining? Overview
Association rules / Market Basket Analysis: An unsupervised learning method. A descriptive, not predictive, method. Used to discover interesting hidden relationships in a large dataset. The disclosed relationships are represented as rules or frequent itemsets.
Commonly used for mining transactions in database. What are the applications of association rules? Overview Some questions that association rules can answer: Which products tend to be purchased together? Of those customers who are similar to this person, what products do they tend to buy? To analyze customer buying habits by finding Of those customers who have purchased this associations and correlations between the product, what other similar products do they different items that customers place in their tend to view or purchase? shopping baskets. Understand the logics behind association rules. Overview The general logic behind association rules: 1. A large collection of transactions (depicted as three stacks of receipts, in which each transaction consists of one or more items). 2. Association rules go through the items being purchased to see what items are frequently bought together and to discover a list of rules that describe the purchasing behavior. 3. The rules suggest that when cereal is purchased, 90% of the time milk is purchased; when bread is purchased, 40% of the time milk is purchased also; when milk is purchased, 23% of the time cereal is also purchased. Understand rules, itemset, Apriori property, frequent itemset Overview Rules Each rule is in the form X Y Means when Item X is observed, Item Y is also observed. Itemset A collection of items or individual entities that contain some kind of relationship. An itemset containing k items is called a k-itemset. k-itemset = {item 1, item 2,…,item k} Examples: A set of retail items purchased together in one transaction. A set of hyperlinks clicked on by one use in a single session. Understand rules, itemset, Apriori property, frequent itemset Overview 1-itemsets Apriori Algorithm The most fundamental algorithms for generating association rules. 2-itemsets One major component of Apriori is support. Given an Itemset L, the support of L is the percentage of transactions that contain L. If 80% of all transactions contain itemset {bread}, then the support of {bread} is 0.8. If 60% of all transactions contain itemset {bread, butter}, then the support of {bread, butter} is 0.6. Frequent Itemset Items that appear together “often enough” (i.e. meets the minimum support criterion). If the minimum support is set at 0.7, {bread} is considered a frequent itemset; whereas {bread, butter} is not considered as a frequent itemset. Understand rules, itemset, Apriori property, frequent itemset Overview Apriori Property Frequent Also called downward closure property. Itemset If an item is considered frequent, then any subset of the frequent itemset must also be frequent. If 60% of the transactions contain {bread, jam}, then at least 60% of all the transactions will contain {bread} or {jam}; or It the support of {bread, jam} is 0.6, the support of {bread} or {jam} is at least 0.6. If itemset {B, C, D} is frequent, then all the subset of this itemset, shaded, must also be frequent itemsets. Metrics used to evaluate association rules Overview Association Rules Rule evaluation Metrics An implication expression of the form X Support (s): No. of transactions that contain Y, where X and Y are non-overlapping both X and Y out of total no. of transactions. itemsets. E.g. A support of 2% means that 2% of all the E.g. {Milk, Diaper} {Beer} transactions under analysis show that {Milk, Diaper} and {Beer} are purchased together. Generating association rules: Step 1: Find frequent itemsets whose Confidence (c): No. 
of transactions that occurrences exceed a predefined contain both X and Y out of total no. of minimum support threshold. transactions that contains X. Step 2: Derive association rules from those frequent itemsets (with the E.g. A confidence of 60% means that 60% of constraint of minimum confidence customers who purchased {Milk, Diaper} also threshold). bought {Beer} Understand Apriori algorithm Apriori Algorithm Creating Frequent Sets Let’s define: Ck as a candidate itemset of size k Lk as a frequent itemset of size k Main steps of iteration are: 1. Find frequent itemset Lk-1 (starting from L1) 2. Join step: Ck is generated by joining Lk-1 with itself (cartesian product Lk-1 x Lk-1) 3. Prune step (Apriori Property): Any (k−1) size itemset that is not frequent cannot be a subset of a frequent k size itemset, hence should be removed from Ck 4. Frequent set Lk has been achieved Understand Apriori algorithm Apriori Algorithm Illustrating Apriori Principle Any subset of a frequent itemset must also be frequent. Itemsets that do not meet the minimum support threshold are pruned away. Understand the metrics used to evaluate the appropriateness of association rules Evaluation of Candidate Rules The process of creating association rules is two-staged. First, a set of candidate rule based on frequent itemsets is generated. If {Bread, Egg, Milk, Butter} is the frequent itemset, candidate rules will look like: {Egg, Milk, Butter} {Bread} {Bread, Milk, Butter} {Egg} {Bread, Egg} {Milk, Butter} Etc. Second, the appropriateness of these candidate rules are evaluated using: Confidence Lift Leverage Understand the metrics used to evaluate the appropriateness of association rules Evaluation of Candidate Rules Confidence The measures of certainty or trustworthiness associated with each discovered rule. Mathematically, the percent of transactions that contain both X and Y out of all the transactions that contain X. Confidence( X Y ) = Support( X∪Y) support( X ) E.g. if {bread, eggs, milk} has support of 0.15 and {bread, eggs} also has a support of 0.15, the confidence of rule {bread, eggs} {milk} is 1. This means 100% of the time a customer buys bread and eggs, milk is brought as well. The rule is therefore correct for 100% of the transactions containing bread and eggs. Understand the metrics used to evaluate the appropriateness of association rules Evaluation of Candidate Rules Confidence A relationship may be thought of as interesting when the algorithm identifies the relationship with a measure of confidence greater than or equal to the predefined threshold (i.e. the minimum confidence). Problem with Confidence: Given a rule X Y, confidence considers only the antecedent (X) and the co- occurrence of X and Y. Cannot tell if a rule contains true implication of the relationship or if the rule is purely coincidental. Understand the metrics used to evaluate the appropriateness of association rules Evaluation of Candidate Rules Lift Measures how many times more often X and Y occur together than expected if they are statistically independent of each other. A measure of how X and Y are really related rather than coincidentally happening together. Lift( X ⇒ Y ) = support( X ∪ Y ) support( X ) ∗support( Y ) Lift = 1 if X and Y are statistically independent Lift > 1 indicates the degree of usefulness of the rule A larger value of lift suggests a greater strength of the association between X and Y. Understand the metrics used to evaluate the appropriateness of association rules Evaluation of Candidate Rules Lift E.g. 
Assuming 1000 transactions, If {milk, eggs} appears in 300, {milk} in 500, and {eggs} in 400 of the transactions, then Lift(milk eggs) = 0.3 / (0.5 * 0.4) = 1.5 If {milk, bread} appears in 400, {milk} in 500, and {bread} in 400 of the transactions, then Lift(milk bread) = 0.4/(0.5*0.4) = 2.0 Therefore it can be concluded that milk and bread have a stronger association than milk and eggs. Understand the metrics used to evaluate the appropriateness of association rules Evaluation of Candidate Rules Leverage Measure the difference in the probability of X and Y appearing together in the dataset compared to what would be expected if X and Y were statistically independent of each other. Leverage( X Y ) = Support( X ∪ Y ) - Support( X ) ∗ Support( Y ) Leverage = 0 if X and Y are statistically independent Leverage > 0 indicates the degree of relationship between X and Y, A larger leverage value indicates a stronger relationship between X and Y. Understand the metrics used to evaluate the appropriateness of association rules Evaluation of Candidate Rules Leverage E.g. Assuming 1000 transactions, If {milk, eggs} appears in 300, {milk} in 500, and {eggs} in 400 of the transactions, then Leverage(milk eggs) = 0.3 - 0.5*0.4 = 0.1 If {milk, bread} appears in 400, {milk} in 500, and {bread} in 400 of the transactions, then Leverage (milk bread) = 0.4 - 0.5*0.4 = 0.2 It again confirms that milk and bread have a stronger association than milk and eggs. What are the limitations of confidence? Evaluation of Candidate Rules Confidence is able to identify trustworthy rules, but it cannot tell whether a rule is coincidental. Measures such as lift and leverage not only ensure interesting rules are identified but also filter out the coincidental rules. Support, confidence, lift and leverage ensures the discovery of interesting and strong rules from sample dataset. What are the applications of association rules? Applications of Association Rules The term market basket analysis refers to a specific implementation of association rules. For better merchandising – products to include/exclude from inventory each month Placement of products Cross-selling Promotional programs—multiple product purchase incentives managed through a loyalty card program What are the applications of association rules? Applications of Association Rules Input: the simple point-of-sale transaction data Output: Most frequent affinities among items Example: according to the transaction data… “Customer who bought a laptop computer and a virus protection software, also bought extended service plan 70 percent of the time." How do you use such a pattern/knowledge? Put the items next to each other for ease of finding Promote the items as a package (do not put one on sale if the other(s) are on sale) Place items far apart from each other so that the customer must walk the aisles to search for it, and by doing so potentially seeing and buying other items What are the applications of association rules? Applications of Association Rules Recommender systems – Amazon, Netflix: Clickstream analysis from web usage log files Website visitors to page X click on links A,B,C more than on links D,E,F In medicine: relationships between symptoms and illnesses; diagnosis and patient characteristics and treatments (to be used in medical DSS); genes and their functions (to be used in genomics projects).. What are the ways to improve the efficiency of Apriori? 
Diagnostics Approaches to improve Apriori’s efficiency: Partitioning: Any itemset that is potentially frequent in a transaction database must be frequent in at least one of the partitions of the transaction database. Sampling: This extracts a subset of the data with a lower support threshold and uses the subset to perform association rule mining. Transaction reduction: A transaction that does not contain frequent k-itemsets is useless in subsequent scans and therefore can be ignored. Hash-based itemset counting: If the corresponding hashing bucket count of a k-itemset is below a certain threshold, the k-itemset cannot be frequent. Dynamic itemset counting: Only add new candidate itemsets when all of their subsets are estimated to be frequent. Model Selection, K-Means Clustering DBSCAN COS10022 Data Science Principles What is a model? Model Selection A model is an abstraction from reality; an attempt to understand the reality. Modeling involves observing certain events happening in a real-world situation and attempting to construct models that emulate this behavior with a set of rules and conditions. COS10022 Data Science Principles – Lecture 06 119 What is a model? Model Selection Often in a model, all extraneous detail has been removed or abstracted. Consequently, we must pay attention to these abstracted details after a model has been analyzed to see what might have been overlooked. In statistical modeling (which we’ll be most concerned with in this unit), the terms ‘model’, ‘algorithm’, ‘analytical technique’, and ‘analytical method’ are often used interchangeably. COS10022 Data Science Principles – Lecture 06 120 What are the differences between supervised, unsupervised and semi-supervised learning? Model Selection COS10022 Data Science Principles – Lecture 06 121 What are the differences between supervised, unsupervised and semi-supervised learning? Model Selection Brownlee, J 2016, ‘Supervised and Unsupervised Machine Learning Algorithms’, Machine Learning Mastery Link: http://machinelearningmastery.com/supervised-and-unsupervised-machine-learning-algorithms/ Supervised vs Unsupervised vs Semi-supervised models Supervised models take a set of input variables (X) and an output variable (Y) and try to learn the mapping function from the inputs to the output: Training set: Inputs + Outputs (i.e. the correct answers) Test set: Inputs + … (the model needs to guess the correct answers) Two main categories of supervised models: 1. Classification (categorical output, e.g. “Fail”, “Pass”) 2. Regression (real values/numerical output, e.g. 0.05, $250) COS10022 Data Science Principles – Lecture 06 122 What are the differences between supervised, unsupervised and semi-supervised learning? Model Selection Brownlee, J 2016, ‘Supervised and Unsupervised Machine Learning Algorithms’, Machine Learning Mastery Link: http://machinelearningmastery.com/supervised-and-unsupervised-machine-learning-algorithms/ Classification Regression COS10022 Data Science Principles – Lecture 06 123 What are the differences between supervised, unsupervised and semi-supervised learning? Model Selection Brownlee, J 2016, ‘Supervised and Unsupervised Machine Learning Algorithms’, Machine Learning Mastery Link: http://machinelearningmastery.com/supervised-and-unsupervised-machine-learning-algorithms/ Supervised vs Unsupervised vs Semi-supervised models Unsupervised models work with a set of input variables (X) but no output variable. 
The goal is to find the underlying and ‘interesting’ structure or distribution in the data in order to learn more about the data. No training and test set. No correct answer and no “teacher”. Two main categories of unsupervised models: 1. Clustering 2. Association COS10022 Data Science Principles – Lecture 06 124 What are the differences between supervised, unsupervised and semi-supervised learning? Model Selection Brownlee, J 2016, ‘Supervised and Unsupervised Machine Learning Algorithms’, Machine Learning Mastery Link: http://machinelearningmastery.com/supervised-and-unsupervised-machine-learning-algorithms/ Clustering Dimensional Reduction COS10022 Data Science Principles – Lecture 06 125 What are the differences between supervised, unsupervised and semi-supervised learning? Model Selection Brownlee, J 2016, ‘Supervised and Unsupervised Machine Learning Algorithms’, Machine Learning Mastery Link: http://machinelearningmastery.com/supervised-and-unsupervised-machine-learning-algorithms/ Supervised vs Unsupervised vs Semi-supervised models Semi-supervised models address situations where there are a set of input variables (X) with partially available values of the output variable (Y). An example is a model that must automatically classify images from a photo archive where only some of the images are labelled (e.g. dog, cat, person) while the majority are not labelled. A possible strategy: Apply a supervised model to train the labelled data, and then Make the best prediction on the unlabeled output using the same supervised model. Once trained, apply the supervised model to make prediction on the test data. COS10022 Data Science Principles – Lecture 06 126 What are the differences between supervised, unsupervised and semi-supervised learning? Model Selection Brownlee, J 2016, ‘Supervised and Unsupervised Machine Learning Algorithms’, Machine Learning Mastery Link: http://machinelearningmastery.com/supervised-and-unsupervised-machine-learning-algorithms/ Semi-Supervised Learning Workflow Principle of semi-supervised learning: 1. A model (e.g. a classifier) is first trained on the few available labelled training data. 2. This model is then used to classify and thus label the many unlabelled data available. 3. The newly labelled data are combined with the originally available labelled ones to retrain the model with many more data, and thus hopefully to obtain a better model. COS10022 Data Science Principles – Lecture 06 127 What are the objectives of k-means clustering? Exploratory model: K-Means Clustering Exploratory = Unsupervised Clustering methods aim at discovering natural grouping of objects of interests (i.e. customers, images, documents, etc.). Generally, this objective is achieved through: 1. Finding the similarities between the objects based on their attributes/properties/variables. 2. Group similar objects into clusters. COS10022 Data Science Principles – Lecture 06 128 What are the objectives of k-means clustering? Exploratory model: K-Means Clustering Popular methods: k-means clustering Use cases: Customer segmentation (i.e. grouping customers according to the similarity of their behaviors or spending patterns) Automatic identification of abnormal activities in CCTV videos Summarize news articles and many more …. 
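Returning to the semi-supervised workflow outlined above (train on the labelled few, pseudo-label the rest, retrain), here is a minimal self-training sketch; the toy data, the ten-labels-per-class setup, and the logistic-regression base model are all assumptions made purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy two-class data; the data, labelling rate, and base model are assumptions.
X = rng.normal(size=(200, 2))
y_true = (X[:, 0] + X[:, 1] > 0).astype(int)

# Pretend only 10 examples per class carry a label; the rest are "unlabelled".
labelled = np.zeros(len(X), dtype=bool)
labelled[np.where(y_true == 0)[0][:10]] = True
labelled[np.where(y_true == 1)[0][:10]] = True
X_lab, y_lab = X[labelled], y_true[labelled]
X_unlab = X[~labelled]

# Step 1: train a first model on the few labelled points.
model = LogisticRegression().fit(X_lab, y_lab)

# Step 2: use it to pseudo-label the unlabelled points.
pseudo = model.predict(X_unlab)

# Step 3: combine real and pseudo-labels and retrain on the larger set.
X_all = np.vstack([X_lab, X_unlab])
y_all = np.concatenate([y_lab, pseudo])
model = LogisticRegression().fit(X_all, y_all)

print("accuracy on all points:", round(model.score(X, y_true), 3))
```

scikit-learn also ships a SelfTrainingClassifier that iterates this idea, only accepting pseudo-labels above a confidence threshold.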
Understand the k-means clustering algorithm
K-Means Clustering: the algorithm
The K-Means clustering model assumes the following:
A collection of M objects or data points, each with n attributes/properties/variables.
A pre-determined number k of clusters to be found (i.e. K-Means requires that you decide the number of clusters you need).
[Figure: example dataset with M = 10 data points and n = 3 attributes]
Given a dataset with two attributes (n = 2), an object or data point corresponds to a point (xi, yi) in a Cartesian plane, where x and y denote the two attributes and i = 1, 2, …, M.
For a given cluster containing m data points (m ≤ M), the centroid is the point in the Cartesian plane that corresponds to the mean value of that cluster.
COS10022 Data Science Principles – Lecture 06 130

What are the steps of k-means clustering?
K-Means Clustering: the algorithm
The basic K-Means clustering algorithm consists of 4 main steps:
Step 1: Choose the value of k and the k initial guesses for the centroids (also known as the 'centers of mass').
Step 2: Compute the distance from each data point (xi, yi) to each centroid. Assign each point to the closest centroid. This association defines the first k clusters.
Step 3: Compute a new centroid for each cluster defined in Step 2.
Step 4: Repeat Steps 2 and 3 until the algorithm converges to an answer.
The initialization process in Step 1 can be achieved using two different methods:
Forgy: set the positions of the k centroids to k randomly chosen data points.
Random partition: assign a cluster randomly to each observation and compute the initial centroids in a manner similar to Step 3.
COS10022 Data Science Principles – Lecture 06 131

What are the steps of k-means clustering?
K-Means Clustering: the algorithm
Step 1: Choose the value of k and the k initial guesses for the centroids (also known as the 'centers of mass').
Example: Assume k = 3.
COS10022 Data Science Principles – Lecture 06 132

What are the steps of k-means clustering?
K-Means Clustering: the algorithm
Step 2: Compute the distance from each data point (xi, yi) to each centroid. Assign each point to the closest centroid. This association defines the first k clusters.
The distance d between a data point (xi, yi) and a centroid (xC, yC) is calculated using the Euclidean distance measure:
d = √((xi − xC)² + (yi − yC)²)
[Figure: a data point, a centroid, and the distance between them]
COS10022 Data Science Principles – Lecture 06 133

What are the steps of k-means clustering?
K-Means Clustering: the algorithm
Step 3: Compute a new centroid for each cluster defined in Step 2.
For each cluster, the coordinates of its new centroid are calculated as the ordered pair of the arithmetic means of the coordinates of the m data points in that cluster:
(x_newC, y_newC) = ( (Σ_{i=1..m} xi) / m , (Σ_{i=1..m} yi) / m )
COS10022 Data Science Principles – Lecture 06 134

What are the steps of k-means clustering?
K-Means Clustering: the algorithm
Step 4: Repeat Steps 2 and 3 until the algorithm converges to an answer.
* Convergence is reached when the computed centroids no longer change.
Observe how the coordinates of the centroids no longer change between Iteration 9 and the converged step.
COS10022 Data Science Principles – Lecture 06 135

What is the difference between the Forgy and random partition methods?
Initializing k-means Clustering
Forgy Method: Set the positions of the k centroids to k randomly chosen data points.

What is the difference between the Forgy and random partition methods?
Initializing k-means Clustering
Random Partition Method: Randomly assign each observation to a cluster, then compute the initial centroids from those random clusters (as in Step 3).
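The four steps and the two initialization methods can be pulled together in a rough sketch. This is a minimal NumPy implementation written for illustration only (the toy data, function names, and convergence check are my own assumptions, and empty clusters are not handled); it is not the unit's reference implementation.

```python
# Rough NumPy sketch of the 4-step K-Means algorithm with Forgy and
# random-partition initialization; data and names are illustrative only.
import numpy as np

def init_centroids(points, k, method="forgy", seed=0):
    rng = np.random.default_rng(seed)
    if method == "forgy":
        # Forgy: use k randomly chosen data points as the initial centroids.
        idx = rng.choice(len(points), size=k, replace=False)
        return points[idx].astype(float)
    # Random partition: randomly assign each observation to a cluster, then
    # compute the initial centroids as the cluster means (as in Step 3).
    labels = rng.integers(0, k, size=len(points))
    return np.array([points[labels == j].mean(axis=0) for j in range(k)])

def k_means(points, k, method="forgy", max_iter=100):
    centroids = init_centroids(points, k, method)                  # Step 1
    for _ in range(max_iter):
        # Step 2: Euclidean distance from every point to every centroid,
        # then assign each point to its closest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: the new centroid of each cluster is the mean of its points.
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when the centroids no longer change (convergence).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Tiny two-attribute example (M = 6 data points, k = 2 clusters):
data = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 0.8],
                 [8.0, 8.0], [8.5, 7.5], [7.8, 8.2]])
labels, centroids = k_means(data, k=2)
print(labels)      # cluster index assigned to each data point
print(centroids)   # final centroid coordinates
```

Swapping method="forgy" for method="random_partition" in the call changes only how the initial centroids in Step 1 are chosen; Steps 2 to 4 are identical for both.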
Given two points, learn to calculate the distance between them.
K-Means Clustering: Distance Calculation
[Figures: worked examples of the Euclidean distance calculation, repeated across four slides]
For example, the distance between the data point (1, 2) and the centroid (4, 6) is d = √((1 − 4)² + (2 − 6)²) = √25 = 5.

What are the methods to determine the best value of k?
K-Means Clustering: choosing the value of k
As mentioned before, the K-Means clustering model assumes that you already know the 'right' number k of clusters to be found before executing the clustering model.
In practice, the optimal value of k can be determined by either:
a 'reasonable' guess;
predefined requirements, e.g.