Topic 6b Similarity Lec PDF
Summary
This document discusses clustering, similarity, and dissimilarity measures. It explains how to calculate dissimilarity between objects with different attribute types, and examines various clustering approaches. The document also details nominal and binary attributes, their dissimilarity measures, and data matrix representations.
Full Transcript
TOPIC 6 – PART 2 CLUSTERING: SIMILARITY

OBJECTIVES
§ To introduce the basic concepts of clustering
§ To discuss how to compute the dissimilarity between objects of different attribute types
§ To examine several clustering techniques: the partitioning approach and the hierarchical approach

(Slide image source: https://discuss.cryosparc.com/t/using-particles-from-cluster-mode-in-3d-va-for-refinement-fails/3665/2)

SIMILARITY AND DISSIMILARITY

Similarity
§ Numerical measure of how alike two data objects are
§ Value is higher when objects are more alike
§ Often falls in the range [0, 1]

Dissimilarity (e.g., distance)
§ Numerical measure of how different two data objects are
§ Lower when objects are more alike
§ Minimum dissimilarity is often 0; the upper limit varies

Proximity refers to either a similarity or a dissimilarity, and it is used to create clusters. A cluster is a collection of data objects such that the objects within a cluster are similar to one another and dissimilar to the objects in other clusters.

MEASURING THE QUALITY OF CLUSTERING

§ Similarity is expressed in terms of a distance function, typically a metric d(i, j).
§ The definitions of distance functions are usually rather different for different attribute types: numeric, boolean, categorical (nominal), ordinal, ratio, or vector.
§ It is hard to define "similar enough" or "good enough" for a cluster; the answer is typically highly subjective.

DATA MATRIX AND DISSIMILARITY MATRIX

Suppose we have n objects (such as persons, items, or courses) described by p attributes (also called measurements or features), such as age, height, weight, or gender.

Data matrix
§ Stores the n data objects in the form of a relational table: an n-by-p matrix whose rows represent the n objects and whose columns represent the p attributes:

    [ x11 ... x1f ... x1p ]
    [ ... ... ... ... ... ]
    [ xi1 ... xif ... xip ]
    [ ... ... ... ... ... ]
    [ xn1 ... xnf ... xnp ]

§ The matrix represents n data points with p dimensions: the objects are x1 = (x11, x12, ..., x1p), x2 = (x21, x22, ..., x2p), and so on, where xij is the value of the jth attribute for object xi.
§ Two-mode: the matrix is made up of two kinds of entities, namely rows (for objects) and columns (for attributes).
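To make the data-matrix-to-dissimilarity-matrix step concrete, here is a minimal sketch; the helper name `dissimilarity_matrix` and the plain list-of-lists representation are illustrative choices, and the distance function is left pluggable so any of the measures in this topic can be used.

```python
# Sketch: build an n-by-n dissimilarity matrix from an n-by-p data matrix.
# `dist` is any pairwise distance function over two rows of the data matrix.

def dissimilarity_matrix(data, dist):
    """data: n-by-p data matrix (list of rows); dist: pairwise distance function."""
    n = len(data)
    # Only the lower triangle is needed: d(i, i) = 0 and d(i, j) = d(j, i).
    return [[dist(data[i], data[j]) for j in range(i)] + [0.0] for i in range(n)]
```

Later sections supply concrete choices of `dist` for nominal, binary, numeric, ordinal, and vector attributes.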
Dissimilarity matrix
§ Stores dissimilarity values for pairs of objects: a collection of the proximities available for all pairs of the n objects, often represented by an n-by-n table:

    [ 0                           ]
    [ d(2,1)  0                   ]
    [ d(3,1)  d(3,2)  0           ]
    [ ...     ...     ...  0      ]
    [ d(n,1)  d(n,2)  ...  ...  0 ]

§ Covers the same n data points but registers only their differences; d(i, j) becomes larger the more objects i and j differ.
§ A triangular matrix, since d(i, j) = d(j, i) and d(i, i) = 0.
§ Single-mode: it contains only one kind of entity (dissimilarities).

NOMINAL ATTRIBUTES

A nominal attribute can take on two or more states, e.g., red, yellow, blue, green. The dissimilarity between two objects i and j can be computed in two ways.

Method 1: Simple matching

    d(i, j) = (p − m) / p

where m is the number of matches (attributes on which i and j take the same state) and p is the total number of attributes describing the objects.

Method 2: Use a large number of binary attributes, creating a new binary attribute for each of the M nominal states.

Example (one nominal attribute):

    Object identifier | Zone Code (nominal)
    1 | Code-A
    2 | Code-B
    3 | Code-C
    4 | Code-A

Form a dissimilarity matrix by calculating the distance d(i, j) = (p − m)/p for every pair:

    [ 0                         ]
    [ d(2,1)  0                 ]
    [ d(3,1)  d(3,2)  0         ]
    [ d(4,1)  d(4,2)  d(4,3)  0 ]
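The simple-matching computation is short enough to sketch directly; the function name `simple_matching` is an illustrative choice, and the formula d(i, j) = (p − m)/p is the one defined above.

```python
# Simple matching dissimilarity for nominal attributes: d(i, j) = (p - m) / p,
# where m is the number of matching attributes and p the total number of attributes.

def simple_matching(obj_i, obj_j):
    p = len(obj_i)
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)  # number of matches
    return (p - m) / p

# Objects 1 and 2 from the zone-code table: one attribute, no match.
print(simple_matching(("Code-A",), ("Code-B",)))  # 1.0
```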
For objects 1 and 2, m = 0 because their Zone Codes differ, and p = 1 because Zone Code is the only attribute, so d(2,1) = (1 − 0)/1 = 1. Repeating for every pair gives:

    [ 0          ]
    [ 1  0       ]
    [ 1  1  0    ]
    [ 0  1  1  0 ]

NOMINAL ATTRIBUTES: MORE THAN ONE ATTRIBUTE

    Object identifier | Zone Code (nominal) | Task Code (nominal)
    1 | Code-A | Code-P
    2 | Code-B | Code-P
    3 | Code-C | Code-R
    4 | Code-A | Code-Q

In this example p = 2, for the attributes Zone Code and Task Code, and the distance is calculated over both. Objects 1 and 2 match only on Task Code, so m = 1 and d(2,1) = (2 − 1)/2 = 0.5. Likewise:

    d(3,1) = (2 − 0)/2 = 1
    d(4,1) = (2 − 1)/2 = 0.5
    d(3,2) = (2 − 0)/2 = 1
    d(4,2) = (2 − 0)/2 = 1
    d(4,3) = (2 − 0)/2 = 1

Dissimilarity matrix (Zone Code and Task Code together):

    [ 0              ]
    [ 0.5  0         ]
    [ 1    1    0    ]
    [ 0.5  1    1  0 ]

(Equivalently, the 0/1 dissimilarity matrices computed per attribute can be averaged entry by entry to give the same combined matrix.)

BINARY ATTRIBUTES

A binary attribute is a nominal attribute with only two states (0 and 1). A contingency table for binary data between objects i and j:

                   Object j
                     1    0
    Object i   1     q    r
               0     s    t

where
§ q is the number of attributes that equal 1 for both objects i and j,
§ r is the number of attributes that equal 1 for object i but 0 for object j,
§ s is the number of attributes that equal 0 for object i but 1 for object j,
§ t is the number of attributes that equal 0 for both objects i and j.

The total number of attributes is p = q + r + s + t.
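The counts q, r, s, t just defined can be computed directly from two 0/1 vectors. The sketch below (function names are illustrative choices) also evaluates the symmetric and asymmetric binary distances and the Jaccard coefficient used in this topic, demonstrated on the Jack/Mary example that follows.

```python
# Contingency counts for two binary vectors, plus the distance measures built on them.

def contingency(i, j):
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)  # both 1
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)  # 1 in i, 0 in j
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)  # 0 in i, 1 in j
    t = sum(1 for a, b in zip(i, j) if a == 0 and b == 0)  # both 0
    return q, r, s, t

def d_symmetric(i, j):   # d = (r + s) / (q + r + s + t)
    q, r, s, t = contingency(i, j)
    return (r + s) / (q + r + s + t)

def d_asymmetric(i, j):  # d = (r + s) / (q + r + s); negative matches t ignored
    q, r, s, _ = contingency(i, j)
    return (r + s) / (q + r + s)

def jaccard(i, j):       # sim = q / (q + r + s) = 1 - d_asymmetric
    q, r, s, _ = contingency(i, j)
    return q / (q + r + s)

# Jack and Mary over the asymmetric attributes Fever, Cough, Test-1..Test-4.
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
print(contingency(jack, mary))             # (2, 0, 1, 3)
print(round(d_asymmetric(jack, mary), 2))  # 0.33
```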
BINARY ATTRIBUTES: DISTANCE MEASURES

Distance measure for symmetric binary attributes (the outcomes of the two states are equally valuable, so t counts as much as q):

    d(i, j) = (r + s) / (q + r + s + t)

An example of a symmetric binary attribute is gender (female or male).

Distance measure for asymmetric binary attributes (the outcomes of the states are not equally important, so the number of negative matches t is unimportant and is ignored):

    d(i, j) = (r + s) / (q + r + s)

An example of an asymmetric binary attribute is a medical test (a positive result is the more important state).

Jaccard coefficient (a similarity measure for asymmetric binary attributes):

    sim(i, j) = q / (q + r + s) = 1 − d(i, j)

Example:

    Name | Gender | Fever | Cough | Test-1 | Test-2 | Test-3 | Test-4
    Jack | M | Y | N | P | N | N | N
    Mary | F | Y | N | P | N | P | N
    Jim  | M | Y | P | N | N | N | N

Gender is a symmetric attribute; the remaining attributes are asymmetric binary. Let the values Y and P be set to 1 and the value N be set to 0. Using only the asymmetric attributes with d(i, j) = (r + s)/(q + r + s):

    Name | Fever | Cough | Test-1 | Test-2 | Test-3 | Test-4
    Jack | 1 | 0 | 1 | 0 | 0 | 0
    Mary | 1 | 0 | 1 | 0 | 1 | 0
    Jim  | 1 | 1 | 0 | 0 | 0 | 0

    d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
    d(Jack, Jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
    d(Jim, Mary)  = (1 + 2) / (1 + 1 + 2) = 0.75

To find the distance between Jack and Mary step by step:
1. Tabulate the contingency table for Jack and Mary: q = 2 (Fever, Test-1), r = 0, s = 1 (Test-3), t = 3.
2. Apply the formula: d(Jack, Mary) = (0 + 1)/(2 + 0 + 1) = 0.33.
Repeat to get d(Jack, Jim) and d(Mary, Jim).

DISTANCE ON NUMERIC ATTRIBUTES

Let i = (xi1, xi2, ..., xip) and j = (xj1, xj2, ..., xjp) be two objects described by p numeric attributes. The most popular distance measure is the Euclidean distance:

    d(i, j) = sqrt( (xi1 − xj1)² + (xi2 − xj2)² + ... + (xip − xjp)² )

Manhattan (or city block) distance:

    d(i, j) = |xi1 − xj1| + |xi2 − xj2| + ... + |xip − xjp|
Minkowski distance, a popular generalization of both:

    d(i, j) = ( |xi1 − xj1|^h + |xi2 − xj2|^h + ... + |xip − xjp|^h )^(1/h)

where i = (xi1, xi2, ..., xip) and j = (xj1, xj2, ..., xjp) are two p-dimensional data objects and h is the order (h = 1 gives the Manhattan distance, h = 2 the Euclidean distance).

Properties of a metric:
§ d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (positive definiteness)
§ d(i, j) = d(j, i) (symmetry)
§ d(i, j) ≤ d(i, k) + d(k, j) (triangle inequality)

NUMERIC ATTRIBUTES: EXAMPLE

Data matrix:

    point | attribute1 | attribute2
    x1 | 1 | 2
    x2 | 3 | 5
    x3 | 2 | 0
    x4 | 4 | 5

Dissimilarity matrix (Euclidean distance):

          x1     x2     x3     x4
    x1    0
    x2    3.61   0
    x3    ??     5.10   0
    x4    4.24   ??     5.39   0

For example, d(x1, x2) = sqrt((1 − 3)² + (2 − 5)²) = 3.61. Exercise: find d(x3, x1) and d(x4, x2).

ORDINAL ATTRIBUTES

An ordinal attribute can be discrete or continuous. Order is important (e.g., rank), so it can be treated like an interval-scaled attribute. Example: Grade (A+, A, A−, B+, B, B−, C+, C, C−, D+, D, E, F).

Suppose f is an attribute from a set of ordinal attributes describing n objects, that the value of f for the ith object is xif, and that f has Mf ordered states representing the ranking 1, ..., Mf. The dissimilarity computation with respect to f involves the following steps:

1. Replace each xif by its rank rif ∈ {1, ..., Mf}.
2. Since each ordinal attribute can have a different number of states, map the range of each attribute onto [0.0, 1.0] so that each attribute has equal weight, by replacing the rank rif of the ith object in the fth attribute with the normalized rank:

    zif = (rif − 1) / (Mf − 1)

3. Compute the dissimilarity on the zif using methods for interval-scaled (numeric) attributes.

Example:

    Object identifier | Lab test-2 (ordinal) | Rank rif (increasing order) | Normalized zif
    1 | excellent | 3 | (3 − 1)/(3 − 1) = 1
    2 | fair      | 1 | (1 − 1)/(3 − 1) = 0
    3 | good      | 2 | (2 − 1)/(3 − 1) = 0.5
    4 | excellent | 3 | (3 − 1)/(3 − 1) = 1

Exercise: find d(2,1), d(3,1), d(4,1), ..., d(4,3) using the Euclidean distance on the zif values.
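The Minkowski family and the rank normalization above can be sketched in a few lines; the function names are illustrative choices, and the printed values are checked against the worked numbers in these examples.

```python
# Minkowski distance of order h; h = 1 is Manhattan, h = 2 is Euclidean.

def minkowski(x, y, h):
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1.0 / h)

def euclidean(x, y):
    return minkowski(x, y, 2)

# Numeric example: d(x1, x2) for x1 = (1, 2) and x2 = (3, 5).
print(round(euclidean((1, 2), (3, 5)), 2))  # 3.61

# Ordinal normalization: z = (r - 1) / (M - 1) for rank r out of M ordered states.
def normalize_rank(r, M):
    return (r - 1) / (M - 1)

ranks = {"fair": 1, "good": 2, "excellent": 3}      # M = 3 states
z = [normalize_rank(ranks[v], 3) for v in
     ("excellent", "fair", "good", "excellent")]    # objects 1..4
print(z)                                            # [1.0, 0.0, 0.5, 1.0]
print(euclidean((z[1],), (z[0],)))                  # d(2,1) = 1.0
```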
ORDINAL ATTRIBUTES: WORKED EXAMPLE

The distance measure is the Euclidean distance on the normalized ranks (z = 1, 0, 0.5, 1 for objects 1 to 4):

    d(2,1) = sqrt((0 − 1)²)   = 1.0
    d(3,1) = sqrt((0.5 − 1)²) = 0.5
    d(4,1) = sqrt((1 − 1)²)   = 0
    d(3,2) = sqrt((0.5 − 0)²) = 0.5
    d(4,2) = sqrt((1 − 0)²)   = 1.0
    d(4,3) = sqrt((1 − 0.5)²) = 0.5

Dissimilarity matrix:

    [ 0                 ]
    [ 1.0  0            ]
    [ 0.5  0.5  0       ]
    [ 0    1.0  0.5  0  ]

COSINE SIMILARITY

A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as a keyword) or phrase in the document. Other vector objects include gene features in micro-arrays. Applications: information retrieval, biologic taxonomy, gene feature mapping, and so on.

Cosine measure: if d1 and d2 are two vectors (e.g., term-frequency vectors), then

    cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||)

where · indicates the vector dot product and ||d|| is the length of vector d (the Euclidean norm):

    ||x|| = sqrt(x1² + x2² + ... + xn²)

Example: find the similarity between documents 1 and 2.
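Before working the example by hand, the cosine measure can be sketched in plain Python (the helper name `cosine_similarity` is an illustrative choice):

```python
import math

# cos(d1, d2) = (d1 . d2) / (||d1|| ||d2||), with ||d|| the Euclidean norm.

def cosine_similarity(d1, d2):
    dot = sum(a * b for a, b in zip(d1, d2))    # dot product d1 . d2
    norm1 = math.sqrt(sum(a * a for a in d1))   # ||d1||
    norm2 = math.sqrt(sum(b * b for b in d2))   # ||d2||
    return dot / (norm1 * norm2)

# Term-frequency vectors for documents 1 and 2 from the example.
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
print(round(cosine_similarity(d1, d2), 2))  # 0.94
```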
    d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
    d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

    d1 · d2 = 5×3 + 0×0 + 3×2 + 0×0 + 2×1 + 0×1 + 0×0 + 2×1 + 0×0 + 0×1 = 25
    ||d1|| = sqrt(5² + 0² + 3² + 0² + 2² + 0² + 0² + 2² + 0² + 0²) = sqrt(42) = 6.481
    ||d2|| = sqrt(3² + 0² + 2² + 0² + 1² + 1² + 0² + 1² + 0² + 1²) = sqrt(17) = 4.123

    cos(d1, d2) = 25 / (6.481 × 4.123) = 0.94

References
1. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 3rd Edition, Morgan Kaufmann, 2012.
2. Pang-Ning Tan, Michael Steinbach and Vipin Kumar, Introduction to Data Mining, Addison Wesley, 2019.

THANK YOU
Assoc. Prof. Dr Shuzlina Abdul Rahman | Dr Sofianita Mutalib | Siti Nur Kamaliah Kamarudin