Information about Data Mining | King Saud University

Saudi Electronic University
26/12/2021
College of Computing and Informatics
Data Mining and Data Warehousing

Module 1
Chapter 1 (Introduction)
Chapter 2 (Getting to Know Your Data)

Week Learning Outcomes
- Define data mining and its applications.
- Recognize the kinds of data and patterns that can be mined.
- Describe data objects and attribute types.

Chapter 1: Introduction (1.1 - 1.5)
- Why Data Mining?
- What Is Data Mining?
- What Kinds of Data Can Be Mined?
- What Kinds of Patterns Can Be Mined?
- Which Technologies Are Used?
- What Kinds of Applications Are Targeted?
- Major Issues in Data Mining

Chapter 2: Getting to Know Your Data (2.1 - 2.3)
- Data Objects and Attribute Types
- Basic Statistical Descriptions of Data
- Data Visualization

Why Data Mining?
- The rapid growth of data: data collection and availability, driven by digital transformation and automation across all fields.
  - Business: Web, e-commerce, transactions, stocks, …
  - Science: remote sensing, bioinformatics, scientific simulation, …
  - Society and everyone: news, digital cameras, YouTube
- As a result, we are drowning in data; data mining helps find knowledge within the data.
- "Necessity is the mother of invention": data mining is the automated analysis of massive data sets.

What Is Data Mining?
- Data mining has gone by multiple names, depending on the time and the field: knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
- Data mining is the process of extracting knowledge from data, regardless of the reasons for doing so and of what the knowledge is used for.
- In other words, it is the extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from data.

What Is Data Mining? Watch out!
Do not label everything as data mining. It is easy to overstate what data mining is and to include tasks that merely resemble it, such as:
- Simple search and query processing
- Expert systems that do not rely on data

Knowledge Discovery (KDD) Process
[Figure: the KDD process. Databases pass through data cleaning and data integration into a data warehouse; selection yields task-relevant data, followed by data mining and pattern evaluation.]

Knowledge Discovery (KDD) Process
- Data cleaning (to remove noise and inconsistent data)
- Data integration (where multiple data sources may be combined)
- Data selection (where data relevant to the analysis task are retrieved from the database)
- Data transformation (where data are transformed and consolidated into forms appropriate for mining by performing summary or aggregation operations)
- Data mining (an essential process where intelligent methods are applied to extract data patterns or knowledge)
- Pattern evaluation (to identify the truly interesting patterns representing knowledge based on interestingness measures; see Section 1.4.6)
- Knowledge presentation (where visualization and knowledge representation techniques are used to present mined knowledge to users)

What Kinds of Data Can Be Mined?
- Database-oriented data sets and applications
  - Relational database, data warehouse, transactional database
- Advanced data sets and advanced applications
  - Data streams and sensor data
  - Time-series data, temporal data, sequence data (incl. bio-sequences)
  - Structured data, graphs, social networks, and multi-linked data
  - Object-relational databases
  - Heterogeneous databases and legacy databases
  - Spatial data and spatiotemporal data
  - Multimedia databases
  - Text databases
  - The World-Wide Web

What Kinds of Patterns Can Be Mined? Data Mining Techniques Used
1) Generalization
- Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions.
2) Association and Correlation Analysis
- Frequent patterns (or frequent itemsets), e.g., which items are frequently purchased together in your supermarket?
- Association, correlation vs.
  causality
- A typical association rule: Bread -> Milk [0.5%, 75%] (support, confidence)
- Are strongly associated items also strongly correlated?
- Are all patterns interesting?

What Kinds of Patterns Can Be Mined? Data Mining Techniques Used
3) Classification (supervised learning)
- Build a model that can classify objects based on their characteristics.
- The model learns to do so by looking at previous examples (referred to as training data).
- Applications include credit card fraud detection and patient diagnosis.
4) Cluster Analysis (unsupervised learning)
- Group data objects by their characteristics so that objects within a group are similar and objects belonging to different groups are dissimilar.
- Applications include grouping online shoppers or news readers to better target them with advertisements.

What Kinds of Patterns Can Be Mined? Data Mining Techniques Used
5) Outlier Analysis
- Outlier: a data object that does not comply with the general behavior of the data.
- Find the characteristics that a group of data objects share; objects that do not closely share those characteristics are considered outliers.
- Noise vs. outliers.
- Outlier analysis can use classification or clustering techniques.
- Applications include transaction fraud detection and rare-event analysis.

Which Technologies Are Used?
[Figure: data mining at the confluence of machine learning, pattern recognition, statistics, applications, visualization, algorithms, database technology, and high-performance computing.]

Why Confluence of Multiple Disciplines?
- Tremendous amount of data
  - Algorithms must be highly scalable to handle volumes such as terabytes of data.
- High dimensionality of data
  - Microarray data may have tens of thousands of dimensions.
- High complexity of data
  - Data streams and sensor data
  - Time-series data, temporal data, sequence data
  - Structured data, graphs, social networks, and multi-linked data
  - Heterogeneous databases and legacy databases
  - Spatial, spatiotemporal, multimedia, text, and Web data
  - Software programs, scientific simulations

Applications of Data Mining
- Web page analysis: from web page classification and clustering to the PageRank and HITS algorithms
- Collaborative analysis and recommender systems
- Basket data analysis to targeted marketing
- Biological and medical data analysis: classification, cluster analysis (microarray data analysis), biological sequence analysis, biological network analysis
- Data mining and software engineering (e.g., IEEE Computer, Aug. 2009 issue)
- From major dedicated data mining systems/tools (e.g., SAS, MS SQL-Server Analysis Manager, Oracle Data Mining Tools) to invisible data mining

Major Issues in Data Mining (1)
- Mining Methodology
  - Mining various and new kinds of knowledge
  - Mining knowledge in multi-dimensional space
  - Data mining: an interdisciplinary effort
  - Boosting the power of discovery in a networked environment
  - Handling noise, uncertainty, and incompleteness of data
  - Pattern evaluation and pattern- or constraint-guided mining
- User Interaction
  - Interactive mining
  - Incorporation of background knowledge
  - Presentation and visualization of data mining results

Major Issues in Data Mining (2)
- Efficiency and Scalability
  - Efficiency and scalability of data mining algorithms
  - Parallel, distributed, stream, and incremental mining methods
- Diversity of Data Types
  - Handling complex types of data
  - Mining dynamic, networked, and global data repositories
- Data Mining and Society
  - Social impacts of data mining
  - Privacy-preserving data mining
  - Invisible data mining

Data Objects and Attribute Types
- Data sets are
  made up of data objects.
- A data object represents an entity. In a sales database, the objects may be customers, store items, and sales; in a medical database, the objects may be patients; in a university database, the objects may be students, professors, and courses.
- Data objects are typically described or represented by attributes.
- Data objects can also be referred to as samples, examples, instances, data points, or objects.

Attributes
- An attribute is a data field representing a characteristic or feature of a data object.
- The nouns attribute, dimension, feature, and variable are often used interchangeably in the literature.
  - The term dimension is commonly used in data warehousing.
  - Machine learning literature tends to use the term feature.
  - Statisticians prefer the term variable.
  - Data mining and database professionals commonly use the term attribute, and so does this course.

Attribute Types
The type of an attribute is determined by the set of possible values the attribute can have. These are the four types:
1) Nominal Attributes
2) Binary Attributes
3) Ordinal Attributes
4) Numeric Attributes
- Interval-Scaled Attributes
- Ratio-Scaled Attributes

Attribute Types
1) Nominal Attributes: each value represents some kind of category, code, or state, so nominal attributes are also referred to as categorical. The values do not have any meaningful order. E.g., hair color, marital status, occupation, ID numbers, zip codes.
2) Binary Attributes: a binary attribute is a nominal attribute with only two categories or states, 0 or 1, where 0 typically means that the attribute is absent and 1 means that it is present. Binary attributes are referred to as Boolean if the two states correspond to true and false. E.g.
  medical test result or gender.

Attribute Types
3) Ordinal Attributes: an ordinal attribute has possible values with a meaningful order or ranking among them, but the magnitude between successive values is not known. E.g.:
- Size = {small, medium, large}
- Grades = {A, B, C, D, F}
- Army rankings
- … etc.

Attribute Types
4) Numeric Attributes: a numeric attribute is quantitative; that is, it is a measurable quantity, represented in integer or real values. Numeric attributes can be interval-scaled or ratio-scaled.
- Interval-scaled attributes are measured on a scale of equal-size units and have no true zero point. E.g., temperature in °C or °F, calendar dates.
- Ratio-scaled attributes are numeric attributes with an inherent zero point. E.g., area, weight, height, length, counts, monetary quantities. The ratio between the attribute values of two data objects is meaningful and can be calculated.

Discrete vs. Continuous Attributes
- Discrete Attribute (Nominal, Binary, and Ordinal)
  - Has only a finite or countably infinite set of values
  - E.g., zip codes, profession, or the set of words in a collection of documents
  - Sometimes represented as integer variables
- Continuous or Numeric Attribute (Ratio and Interval)
  - Has real numbers as attribute values
  - E.g., temperature, height, or weight
  - In practice, real values can only be measured and represented using a finite number of digits
  - Continuous attributes are typically represented as floating-point variables

Basic Statistical Descriptions of Data
- Motivation: to better understand the data.
- How? By measuring the data's central tendency and distribution (variation and spread).
- Measuring central tendency: mean, median, mode, and midrange.
- Measuring data dispersion (or distribution): range, max, min, quantiles, outliers, variance, and standard deviation.

Basic Statistical Descriptions of Data
Measuring the Central Tendency
Mean
- The mean is the average of all the data values and marks the center of the data.
- It is calculated by dividing the sum of all values by the sample size: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
Trimmed mean
- The mean can also be calculated on trimmed data, after removing the extreme values.
Weighted average or weighted arithmetic mean
- Differs from the regular mean by giving each value a weight that reflects its significance or importance: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$

Measuring the Central Tendency: Example
Suppose we have the following values for salary (in thousands of dollars), shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
Mean
- $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{696}{12} = 58$
Trimmed mean
- In this example, remove 30, 36, and 110, then recalculate: $\bar{x} = \frac{47 + 50 + 52 + 52 + 56 + 60 + 63 + 70 + 70}{9} = \frac{520}{9} \approx 57.8$

Basic Statistical Descriptions of Data
Measuring the Central Tendency
Median
- After sorting the data, the median is:
  - the middle value, if the number of values is odd;
  - the sum of the two middle values divided by 2, otherwise.
- When the number of observations is large, sorting can be computationally expensive. However, the median can be approximated without sorting.

Basic Statistical Descriptions of Data
Measuring the Central Tendency
Median
- Assume the data are grouped into intervals according to their $x$ values and that the frequency of each interval is known.
- Let the interval that contains the median frequency be the median interval.
- The median of the entire data set (e.g., the median salary) can be approximated by interpolation using the formula $\text{median} \approx L_1 + \left( \frac{N/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}} \right) \times \text{width}$
- $L_1$ is the lower boundary of the median interval, $N$ is the total number of sample values, $(\sum \text{freq})_l$ is the sum of the frequencies of all of the intervals that are lower than the median interval, $\text{freq}_{\text{median}}$ is the frequency of the median interval, and width is the width of the median interval.
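The formulas above can be checked with a few lines of Python. This is only a minimal sketch: the function names are illustrative, and the weights and the grouped-data figures are hypothetical values chosen for the example, not taken from the slides. Only the salary list and the trimming choice (drop the two smallest values and the largest) come from the example above.

```python
# Salary data from the example above (in thousands of dollars).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]


def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)


def trimmed_mean(values, drop_low, drop_high):
    """Mean after dropping the drop_low smallest and drop_high largest values."""
    ordered = sorted(values)
    kept = ordered[drop_low:len(ordered) - drop_high]
    return mean(kept)


def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


def grouped_median(lower_bound, n_total, freq_below, freq_median, width):
    """Interpolated median: L1 + ((N/2 - (sum freq)_l) / freq_median) * width."""
    return lower_bound + ((n_total / 2 - freq_below) / freq_median) * width


print(mean(salaries))                # 58.0
print(trimmed_mean(salaries, 2, 1))  # 520 / 9, about 57.8
# Hypothetical values and weights, for illustration only.
print(weighted_mean([10, 20], [1, 3]))       # 17.5
# Hypothetical grouped data: median interval starts at 20 and is 10 wide,
# N = 100, 30 values fall below the interval, 40 fall inside it.
print(grouped_median(20, 100, 30, 40, 10))   # 25.0
```

The trimmed mean here takes explicit counts for each end because the slide example trims asymmetrically; a symmetric trim would pass the same count twice.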
Basic Statistical Descriptions of Data
Measuring the Central Tendency
Mode
- The mode is the value that occurs most frequently in the data.
- Sometimes multiple values share the same highest frequency, so data can be unimodal or multimodal (e.g., bimodal, trimodal):
  - only one value with the highest frequency = unimodal;
  - two values with the highest, equal frequency = bimodal;
  - three values with the highest, equal frequency = trimodal.
- For unimodal data that are moderately skewed, the mode can also be approximated using the empirical relation $\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$

Basic Statistical Descriptions of Data
Measuring the Central Tendency
Midrange
- The midrange is another measure of central tendency. It is simply the average of the min and max values of the data.
- It is easy to compute using the SQL aggregate functions max() and min().
- When data have a symmetric, unimodal distribution, the mean, median, and mode all coincide at the same center value. But real data usually are not symmetric!

Measuring the Central Tendency: Example
Suppose we have the following values for salary (in thousands of dollars), shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
- Median: the total number of values is 12, so the median is the sum of the two middle values divided by 2: (52 + 56) / 2 = 54.
- Mode: the values 52 and 70 each occur twice in the given data, so 52 and 70 are the modes (bimodal).
- Midrange: the min value in the data is 30 and the max value is 110, thus midrange = (30 + 110) / 2 = 70.

Basic Statistical Descriptions of Data
Measuring the Central Tendency
- Data in real-world applications tend to have an asymmetric distribution, that is, to be positively or negatively skewed.
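The median, mode, and midrange of the salary example can be reproduced with Python's standard library. This is a minimal sketch with illustrative helper names (`modes`, `midrange`). The final comparison relies on a common rule of thumb, assumed here rather than stated in the slides: a mean larger than the median hints at positive skew, although it does not prove it.

```python
from collections import Counter
from statistics import mean, median

# Salary data from the example above (in thousands of dollars).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]


def modes(values):
    """Return every value that attains the highest frequency, in sorted order."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


def midrange(values):
    """Average of the smallest and largest values."""
    return (min(values) + max(values)) / 2


print(median(salaries))    # 54.0
print(modes(salaries))     # [52, 70]
print(midrange(salaries))  # 70.0

# Rule of thumb (an assumption, not a proof): mean > median suggests
# that the large value 110 pulls the distribution's tail to the right.
print(mean(salaries) > median(salaries))  # True
```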
[Figure: symmetric, positively skewed, and negatively skewed data distributions.]

Basic Statistical Descriptions of Data
Measuring the Dispersion (Distribution) of Data
- Quartiles, outliers, and boxplots
  - Quartiles: Q1 (25th percentile), Q3 (75th percentile)
  - Inter-quartile range: IQR = Q3 - Q1
  - Five-number summary: min, Q1, median, Q3, max
  - Boxplot: the ends of the box are the quartiles; the median is marked; whiskers are added, and outliers are plotted individually
  - Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1
- Variance and standard deviation (sample: s, population: σ)
  - Variance (algebraic, scalable computation): $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \mu^2$
  - The standard deviation s (or σ) is the square root of the variance $s^2$ (or $\sigma^2$)

Dispersion (Distribution) of Data
Popular plots for visualizing data distribution:
- Boxplot: graphic display of the five-number summary
- Histogram: the x-axis shows values, the y-axis shows frequencies
- Quantile plot: each value $x_i$ is paired with $f_i$, indicating that approximately $100 f_i$ % of the data are $\le x_i$
- Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
- Scatter plot: each pair of values is a pair of coordinates and is plotted as a point in the plane

Boxplot
- Five-number summary of a distribution: Minimum, Q1, Median, Q3, Maximum
- Boxplot
  - Data are represented with a box.
  - The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR.
  - The median is marked by a line within the box.
  - Whiskers: two lines outside the box, extended to the Minimum and Maximum.
  - Outliers: points beyond a specified outlier threshold, plotted individually.

Histogram
- Histograms (or frequency histograms) are at least a century old and are widely used.
- The height of each bar indicates the frequency (i.e., count) of the values that fall within the range of the bar.
- When the attribute is nominal, the resulting graph is more commonly known as a bar chart.
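The dispersion measures introduced above (five-number summary, IQR, the 1.5 x IQR outlier rule, and the two equivalent forms of the population variance) can be sketched as follows for the salary data. The quartile convention is an assumption: `statistics.quantiles` with `method="inclusive"` is used here, and other conventions, including the one in the textbook, give slightly different Q1 and Q3 values. The function names are illustrative.

```python
from statistics import quantiles

# Salary data from the earlier example (in thousands of dollars).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the 'inclusive' quartile method."""
    ordered = sorted(values)
    q1, q2, q3 = quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, q2, q3, ordered[-1]


def iqr_outliers(values):
    """Values more than 1.5 x IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


def population_variance(values):
    """Compute the population variance with both forms of the formula."""
    n = len(values)
    mu = sum(values) / n
    by_definition = sum((x - mu) ** 2 for x in values) / n
    by_shortcut = sum(x * x for x in values) / n - mu ** 2
    return by_definition, by_shortcut


print(five_number_summary(salaries))  # (30, 49.25, 54.0, 64.75, 110)
print(iqr_outliers(salaries))         # [110]
var_def, var_short = population_variance(salaries)
print(var_def, var_short)             # both about 379.17
print(var_def ** 0.5)                 # standard deviation, about 19.47
```

The shortcut form needs only the running sums of $x_i$ and $x_i^2$, which is why the slides call the computation scalable: both sums can be accumulated in a single pass or merged across partitions.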
Quantile Plot
- Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences).
- Plots quantile information: for data values $x_i$ sorted in increasing order, $f_i$ indicates that approximately $100 f_i$ % of the data are below or equal to the value $x_i$.

Quantile-Quantile (Q-Q) Plot
- Graphs the quantiles of one univariate distribution against the corresponding quantiles of another.
- View: is there a shift in going from one distribution to another?
- The example shows the unit price of items sold at Branch 1 vs. Branch 2 for each quantile. Unit prices of items sold at Branch 1 tend to be lower than those at Branch 2.

Scatter Plot
- Provides a first look at bivariate data to see clusters of points, outliers, etc.
- Each pair of values is treated as a pair of coordinates and plotted as a point in the plane.

Scatter Plot: Correlation
[Figure: scatter plots showing positively correlated, negatively correlated, and uncorrelated data.]

Data Visualization
- Data visualization aims to communicate data clearly and effectively through graphical representation.
- It has been used extensively in many applications, for example, at work for reporting, managing business operations, and tracking the progress of tasks.
- It is used to discover data relationships that are otherwise not easily observable by looking at the raw data.
- It provides visual proof of derived computer representations.

Data Visualization
Categorization of visualization methods:
- Pixel-oriented visualization techniques
- Geometric projection visualization techniques
- Icon-based visualization techniques
- Hierarchical visualization techniques
- Visualizing complex data and relations

Pixel-Oriented Visualization Techniques
- Each window represents an attribute (a feature).
- Each data object is represented by one pixel in each window.
- The color density of a pixel is proportional to the value.
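The pixel-oriented mapping just described can be sketched without any plotting library by computing the intensity grid for each window. Everything here is an illustrative assumption: the customer records are hypothetical, intensities are min-max normalized to [0, 1], and the objects are sorted by one attribute so that the same pixel position refers to the same object in every window.

```python
def normalize(values):
    """Min-max scale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def pixel_windows(records, attributes, sort_by, width):
    """Build one intensity grid (list of rows) per attribute.

    Records are ordered by `sort_by`, so pixel k in every window
    belongs to the same data object.
    """
    ordered = sorted(records, key=lambda r: r[sort_by])
    windows = {}
    for attr in attributes:
        intensities = normalize([r[attr] for r in ordered])
        windows[attr] = [
            intensities[i:i + width] for i in range(0, len(intensities), width)
        ]
    return windows


# Hypothetical customer records, for illustration only.
customers = [
    {"income": 70, "age": 35},
    {"income": 30, "age": 25},
    {"income": 110, "age": 60},
    {"income": 50, "age": 40},
]

windows = pixel_windows(customers, ["income", "age"], sort_by="income", width=2)
for name, grid in windows.items():
    print(name, grid)
```

Comparing the two grids position by position shows whether high income tends to coincide with high age, which is the kind of relationship the pixel-oriented technique is meant to expose. An actual display would map each intensity to a pixel color.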
[Figure: pixel-oriented visualization of four attributes: (a) income, (b) credit limit, (c) transaction volume, (d) age.]

Laying Out Pixels in Circle Segments
- To save space and show the connections among multiple dimensions, space filling is often done in a circle segment.
[Figure: (a) representing a data record in a circle segment; (b) laying out pixels in a circle segment.]

Geometric Projection Visualization Techniques
- Visualization of geometric transformations and projections of the data
- Methods
  - Direct visualization
  - Scatterplot and scatterplot matrices
  - Landscapes
  - Projection pursuit technique: helps users find meaningful projections of multidimensional data
  - Prosection views
  - Hyperslice
  - Parallel coordinates

[Figure: scatterplot matrices and direct visualization.]

[Figure: 3D scatterplot and 2D scatterplot with Cartesian coordinates.]

Icon-Based Visualization Techniques
- Visualization of the data values as features of icons
- Typical visualization methods
  - Chernoff faces
  - Stick figures
- General techniques
  - Shape coding: use shape to represent certain information encoding
  - Color icons: use color icons to encode more information
  - Tile bars: use small icons to represent the relevant feature vectors in document retrieval

[Figure: icon-based visualization techniques.]

Hierarchical Visualization Techniques
- For a large data set with high dimensionality, it is usually difficult to visualize all dimensions simultaneously.
- Hierarchical visualization techniques partition all dimensions into subsets or subspaces.
- The subspaces are then visualized in a hierarchical manner.
- Methods
  - Dimensional stacking
  - Worlds-within-Worlds
  - Tree-map
  - Cone trees
  - InfoCube

Hierarchical Visualization Techniques: Dimensional Stacking
[Figure: dimensional stacking of attributes 1 to 4.]
- Partitioning of the n-dimensional attribute space into 2-D subspaces, which are "stacked" into each other
- Partitioning of the attribute value ranges into classes
- The important attributes should be used on the outer levels
- Adequate for data with ordinal attributes of low cardinality
- But difficult to display more than nine dimensions
- Important to map dimensions appropriately
February 16, 2025. Data Mining: Concepts and Techniques

Hierarchical Visualization Techniques: Dimensional Stacking
- Visualization of oil mining data with longitude and latitude mapped to the outer x-, y-axes and ore grade and depth mapped to the inner x-, y-axes

Hierarchical Visualization Techniques: InfoCube
- A 3-D visualization technique where hierarchical information is displayed as nested semi-transparent cubes
- The outermost cubes correspond to the top-level data, while the subnodes or the lower-level data are represented as smaller cubes inside the outermost cubes, and so on

Visualizing Complex Data and Relations
- Visualizing non-numerical data: text and social networks
- Tag cloud: visualizing user-generated tags
  - The importance of a tag is represented by font size/color
- Besides text data, there are also methods to visualize relationships, such as visualizing social networks
- Newsmap: Google News Stories in

Required Reading
1. Data Mining: Concepts and Techniques, Chapter 1 (Introduction)
2. Data Mining: Concepts and Techniques, Chapter 2 (Getting to Know Your Data)

Recommended Reading
1. Data Mining: Practical Machine Learning Tools and Techniques, Chapter 1 (What is it all about?)
2. Data Mining: Practical Machine Learning Tools and Techniques, Chapter 2 (Input: concept, instances, attributes)

This presentation is based mainly on the textbook Data Mining: Concepts and Techniques (3rd ed.).

Thank You
