Topic 6b Similarity Lec PDF
Summary
This document discusses clustering, similarity, and dissimilarity measures. It explains how to calculate dissimilarity between objects with different attribute types, and examines various clustering approaches. The document also details nominal and binary attributes, their dissimilarity measures, and data matrix representations.
Full Transcript
TOPIC 6 – PART 2 CLUSTERING: SIMILARITY

OBJECTIVES
§ To introduce the basic concepts of clustering
§ To discuss how to compute the dissimilarity between objects of different attribute types
§ To examine several clustering techniques: the partitioning approach and the hierarchical approach

(Slide image source: https://discuss.cryosparc.com/t/using-particles-from-cluster-mode-in-3d-va-for-refinement-fails/3665/2)

SIMILARITY AND DISSIMILARITY

Similarity
§ Numerical measure of how alike two data objects are
§ Value is higher when objects are more alike
§ Often falls in the range [0, 1]

Dissimilarity (e.g., distance)
§ Numerical measure of how different two data objects are
§ Lower when objects are more alike
§ Minimum dissimilarity is often 0; the upper limit varies

Proximity refers to either a similarity or a dissimilarity, and it is used to create clusters. A cluster is a collection of data objects such that the objects within a cluster are similar to one another and dissimilar to the objects in other clusters.

MEASURING THE QUALITY OF CLUSTERING

§ Similarity is expressed in terms of a distance function, typically a metric d(i, j).
§ The definitions of distance functions are usually rather different for different attribute types: numeric, boolean, categorical (nominal), ordinal, ratio, or vector.
§ It is hard to define "similar enough" or "good enough" for a cluster; the answer is typically highly subjective.

DATA MATRIX AND DISSIMILARITY MATRIX

Suppose we have n objects (such as persons, items, or courses) described by p attributes (also called measurements or features), such as age, height, weight, or gender.

Data matrix
§ Stores the n data objects in the form of a relational table: an n-by-p matrix whose rows represent the n objects and whose columns represent the p attributes:

    [ x11 ... x1f ... x1p ]
    [ ... ... ... ... ... ]
    [ xi1 ... xif ... xip ]
    [ ... ... ... ... ... ]
    [ xn1 ... xnf ... xnp ]

§ The matrix represents n data points with p dimensions: the objects are x1 = (x11, x12, ..., x1p), x2 = (x21, x22, ..., x2p), and so on, where xij is the value of the jth attribute for object xi.
§ Two-mode: the matrix is made up of two kinds of entities, namely rows (for objects) and columns (for attributes).
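To make the data-matrix-to-dissimilarity-matrix step concrete, here is a minimal sketch; the helper name `dissimilarity_matrix` and the plain list-of-lists representation are illustrative choices, and the distance function is left pluggable so any of the measures in this topic can be used.

```python
# Sketch: build an n-by-n dissimilarity matrix from an n-by-p data matrix.
# `dist` is any pairwise distance function over two rows of the data matrix.

def dissimilarity_matrix(data, dist):
    """data: n-by-p data matrix (list of rows); dist: pairwise distance function."""
    n = len(data)
    # Only the lower triangle is needed: d(i, i) = 0 and d(i, j) = d(j, i).
    return [[dist(data[i], data[j]) for j in range(i)] + [0.0] for i in range(n)]
```

Later sections supply concrete choices of `dist` for nominal, binary, numeric, ordinal, and vector attributes.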
Dissimilarity matrix
§ Stores dissimilarity values for pairs of objects: a collection of the proximities available for all pairs of the n objects, often represented by an n-by-n table:

    [ 0                           ]
    [ d(2,1)  0                   ]
    [ d(3,1)  d(3,2)  0           ]
    [ ...     ...     ...  0      ]
    [ d(n,1)  d(n,2)  ...  ...  0 ]

§ Covers the same n data points but registers only their differences; d(i, j) becomes larger the more objects i and j differ.
§ A triangular matrix, since d(i, j) = d(j, i) and d(i, i) = 0.
§ Single-mode: it contains only one kind of entity (dissimilarities).

NOMINAL ATTRIBUTES

A nominal attribute can take on two or more states, e.g., red, yellow, blue, green. The dissimilarity between two objects i and j can be computed in two ways.

Method 1: Simple matching

    d(i, j) = (p − m) / p

where m is the number of matches (attributes on which i and j take the same state) and p is the total number of attributes describing the objects.

Method 2: Use a large number of binary attributes, creating a new binary attribute for each of the M nominal states.

Example (one nominal attribute):

    Object identifier | Zone Code (nominal)
    1 | Code-A
    2 | Code-B
    3 | Code-C
    4 | Code-A

Form a dissimilarity matrix by calculating the distance d(i, j) = (p − m)/p for every pair:

    [ 0                         ]
    [ d(2,1)  0                 ]
    [ d(3,1)  d(3,2)  0         ]
    [ d(4,1)  d(4,2)  d(4,3)  0 ]
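The simple-matching computation is short enough to sketch directly; the function name `simple_matching` is an illustrative choice, and the formula d(i, j) = (p − m)/p is the one defined above.

```python
# Simple matching dissimilarity for nominal attributes: d(i, j) = (p - m) / p,
# where m is the number of matching attributes and p the total number of attributes.

def simple_matching(obj_i, obj_j):
    p = len(obj_i)
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)  # number of matches
    return (p - m) / p

# Objects 1 and 2 from the zone-code table: one attribute, no match.
print(simple_matching(("Code-A",), ("Code-B",)))  # 1.0
```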
For objects 1 and 2, m = 0 because their Zone Codes differ, and p = 1 because Zone Code is the only attribute, so d(2,1) = (1 − 0)/1 = 1. Repeating for every pair gives:

    [ 0          ]
    [ 1  0       ]
    [ 1  1  0    ]
    [ 0  1  1  0 ]

NOMINAL ATTRIBUTES: MORE THAN ONE ATTRIBUTE

    Object identifier | Zone Code (nominal) | Task Code (nominal)
    1 | Code-A | Code-P
    2 | Code-B | Code-P
    3 | Code-C | Code-R
    4 | Code-A | Code-Q

In this example p = 2, for the attributes Zone Code and Task Code, and the distance is calculated over both. Objects 1 and 2 match only on Task Code, so m = 1 and d(2,1) = (2 − 1)/2 = 0.5. Likewise:

    d(3,1) = (2 − 0)/2 = 1
    d(4,1) = (2 − 1)/2 = 0.5
    d(3,2) = (2 − 0)/2 = 1
    d(4,2) = (2 − 0)/2 = 1
    d(4,3) = (2 − 0)/2 = 1

Dissimilarity matrix (Zone Code and Task Code together):

    [ 0              ]
    [ 0.5  0         ]
    [ 1    1    0    ]
    [ 0.5  1    1  0 ]

(Equivalently, the 0/1 dissimilarity matrices computed per attribute can be averaged entry by entry to give the same combined matrix.)

BINARY ATTRIBUTES

A binary attribute is a nominal attribute with only two states (0 and 1). A contingency table for binary data between objects i and j:

                   Object j
                     1    0
    Object i   1     q    r
               0     s    t

where
§ q is the number of attributes that equal 1 for both objects i and j,
§ r is the number of attributes that equal 1 for object i but 0 for object j,
§ s is the number of attributes that equal 0 for object i but 1 for object j,
§ t is the number of attributes that equal 0 for both objects i and j.

The total number of attributes is p = q + r + s + t.
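The counts q, r, s, t just defined can be computed directly from two 0/1 vectors. The sketch below (function names are illustrative choices) also evaluates the symmetric and asymmetric binary distances and the Jaccard coefficient used in this topic, demonstrated on the Jack/Mary example that follows.

```python
# Contingency counts for two binary vectors, plus the distance measures built on them.

def contingency(i, j):
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)  # both 1
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)  # 1 in i, 0 in j
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)  # 0 in i, 1 in j
    t = sum(1 for a, b in zip(i, j) if a == 0 and b == 0)  # both 0
    return q, r, s, t

def d_symmetric(i, j):   # d = (r + s) / (q + r + s + t)
    q, r, s, t = contingency(i, j)
    return (r + s) / (q + r + s + t)

def d_asymmetric(i, j):  # d = (r + s) / (q + r + s); negative matches t ignored
    q, r, s, _ = contingency(i, j)
    return (r + s) / (q + r + s)

def jaccard(i, j):       # sim = q / (q + r + s) = 1 - d_asymmetric
    q, r, s, _ = contingency(i, j)
    return q / (q + r + s)

# Jack and Mary over the asymmetric attributes Fever, Cough, Test-1..Test-4.
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
print(contingency(jack, mary))             # (2, 0, 1, 3)
print(round(d_asymmetric(jack, mary), 2))  # 0.33
```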
BINARY ATTRIBUTES: DISTANCE MEASURES

Distance measure for symmetric binary attributes (the outcomes of the two states are equally valuable, so t counts as much as q):

    d(i, j) = (r + s) / (q + r + s + t)

An example of a symmetric binary attribute is gender (female or male).

Distance measure for asymmetric binary attributes (the outcomes of the states are not equally important, so the number of negative matches t is unimportant and is ignored):

    d(i, j) = (r + s) / (q + r + s)

An example of an asymmetric binary attribute is a medical test (a positive result is the more important state).

Jaccard coefficient (a similarity measure for asymmetric binary attributes):

    sim(i, j) = q / (q + r + s) = 1 − d(i, j)

Example:

    Name | Gender | Fever | Cough | Test-1 | Test-2 | Test-3 | Test-4
    Jack | M | Y | N | P | N | N | N
    Mary | F | Y | N | P | N | P | N
    Jim  | M | Y | P | N | N | N | N

Gender is a symmetric attribute; the remaining attributes are asymmetric binary. Let the values Y and P be set to 1 and the value N be set to 0. Using only the asymmetric attributes with d(i, j) = (r + s)/(q + r + s):

    Name | Fever | Cough | Test-1 | Test-2 | Test-3 | Test-4
    Jack | 1 | 0 | 1 | 0 | 0 | 0
    Mary | 1 | 0 | 1 | 0 | 1 | 0
    Jim  | 1 | 1 | 0 | 0 | 0 | 0

    d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
    d(Jack, Jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
    d(Jim, Mary)  = (1 + 2) / (1 + 1 + 2) = 0.75

To find the distance between Jack and Mary step by step:
1. Tabulate the contingency table for Jack and Mary: q = 2 (Fever, Test-1), r = 0, s = 1 (Test-3), t = 3.
2. Apply the formula: d(Jack, Mary) = (0 + 1)/(2 + 0 + 1) = 0.33.
Repeat to get d(Jack, Jim) and d(Mary, Jim).

DISTANCE ON NUMERIC ATTRIBUTES

Let i = (xi1, xi2, ..., xip) and j = (xj1, xj2, ..., xjp) be two objects described by p numeric attributes. The most popular distance measure is the Euclidean distance:

    d(i, j) = sqrt( (xi1 − xj1)² + (xi2 − xj2)² + ... + (xip − xjp)² )

Manhattan (or city block) distance:

    d(i, j) = |xi1 − xj1| + |xi2 − xj2| + ... + |xip − xjp|
Minkowski distance, a popular generalization of both:

    d(i, j) = ( |xi1 − xj1|^h + |xi2 − xj2|^h + ... + |xip − xjp|^h )^(1/h)

where i = (xi1, xi2, ..., xip) and j = (xj1, xj2, ..., xjp) are two p-dimensional data objects and h is the order (h = 1 gives the Manhattan distance, h = 2 the Euclidean distance).

Properties of a metric:
§ d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (positive definiteness)
§ d(i, j) = d(j, i) (symmetry)
§ d(i, j) ≤ d(i, k) + d(k, j) (triangle inequality)

NUMERIC ATTRIBUTES: EXAMPLE

Data matrix:

    point | attribute1 | attribute2
    x1 | 1 | 2
    x2 | 3 | 5
    x3 | 2 | 0
    x4 | 4 | 5

Dissimilarity matrix (Euclidean distance):

          x1     x2     x3     x4
    x1    0
    x2    3.61   0
    x3    ??     5.10   0
    x4    4.24   ??     5.39   0

For example, d(x1, x2) = sqrt((1 − 3)² + (2 − 5)²) = 3.61. Exercise: find d(x3, x1) and d(x4, x2).

ORDINAL ATTRIBUTES

An ordinal attribute can be discrete or continuous. Order is important (e.g., rank), so it can be treated like an interval-scaled attribute. Example: Grade (A+, A, A−, B+, B, B−, C+, C, C−, D+, D, E, F).

Suppose f is an attribute from a set of ordinal attributes describing n objects, that the value of f for the ith object is xif, and that f has Mf ordered states representing the ranking 1, ..., Mf. The dissimilarity computation with respect to f involves the following steps:

1. Replace each xif by its rank rif ∈ {1, ..., Mf}.
2. Since each ordinal attribute can have a different number of states, map the range of each attribute onto [0.0, 1.0] so that each attribute has equal weight, by replacing the rank rif of the ith object in the fth attribute with the normalized rank:

    zif = (rif − 1) / (Mf − 1)

3. Compute the dissimilarity on the zif using methods for interval-scaled (numeric) attributes.

Example:

    Object identifier | Lab test-2 (ordinal) | Rank rif (increasing order) | Normalized zif
    1 | excellent | 3 | (3 − 1)/(3 − 1) = 1
    2 | fair      | 1 | (1 − 1)/(3 − 1) = 0
    3 | good      | 2 | (2 − 1)/(3 − 1) = 0.5
    4 | excellent | 3 | (3 − 1)/(3 − 1) = 1

Exercise: find d(2,1), d(3,1), d(4,1), ..., d(4,3) using the Euclidean distance on the zif values.
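The Minkowski family and the rank normalization above can be sketched in a few lines; the function names are illustrative choices, and the printed values are checked against the worked numbers in these examples.

```python
# Minkowski distance of order h; h = 1 is Manhattan, h = 2 is Euclidean.

def minkowski(x, y, h):
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1.0 / h)

def euclidean(x, y):
    return minkowski(x, y, 2)

# Numeric example: d(x1, x2) for x1 = (1, 2) and x2 = (3, 5).
print(round(euclidean((1, 2), (3, 5)), 2))  # 3.61

# Ordinal normalization: z = (r - 1) / (M - 1) for rank r out of M ordered states.
def normalize_rank(r, M):
    return (r - 1) / (M - 1)

ranks = {"fair": 1, "good": 2, "excellent": 3}      # M = 3 states
z = [normalize_rank(ranks[v], 3) for v in
     ("excellent", "fair", "good", "excellent")]    # objects 1..4
print(z)                                            # [1.0, 0.0, 0.5, 1.0]
print(euclidean((z[1],), (z[0],)))                  # d(2,1) = 1.0
```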
ORDINAL ATTRIBUTES: WORKED EXAMPLE

The distance measure is the Euclidean distance on the normalized ranks (z = 1, 0, 0.5, 1 for objects 1 to 4):

    d(2,1) = sqrt((0 − 1)²)   = 1.0
    d(3,1) = sqrt((0.5 − 1)²) = 0.5
    d(4,1) = sqrt((1 − 1)²)   = 0
    d(3,2) = sqrt((0.5 − 0)²) = 0.5
    d(4,2) = sqrt((1 − 0)²)   = 1.0
    d(4,3) = sqrt((1 − 0.5)²) = 0.5

Dissimilarity matrix:

    [ 0                 ]
    [ 1.0  0            ]
    [ 0.5  0.5  0       ]
    [ 0    1.0  0.5  0  ]

COSINE SIMILARITY

A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as a keyword) or phrase in the document. Other vector objects include gene features in micro-arrays. Applications: information retrieval, biologic taxonomy, gene feature mapping, and so on.

Cosine measure: if d1 and d2 are two vectors (e.g., term-frequency vectors), then

    cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||)

where · indicates the vector dot product and ||d|| is the length of vector d (the Euclidean norm):

    ||x|| = sqrt(x1² + x2² + ... + xn²)

Example: find the similarity between documents 1 and 2.
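Before working the example by hand, the cosine measure can be sketched in plain Python (the helper name `cosine_similarity` is an illustrative choice):

```python
import math

# cos(d1, d2) = (d1 . d2) / (||d1|| ||d2||), with ||d|| the Euclidean norm.

def cosine_similarity(d1, d2):
    dot = sum(a * b for a, b in zip(d1, d2))    # dot product d1 . d2
    norm1 = math.sqrt(sum(a * a for a in d1))   # ||d1||
    norm2 = math.sqrt(sum(b * b for b in d2))   # ||d2||
    return dot / (norm1 * norm2)

# Term-frequency vectors for documents 1 and 2 from the example.
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
print(round(cosine_similarity(d1, d2), 2))  # 0.94
```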
    d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
    d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

    d1 · d2 = 5×3 + 0×0 + 3×2 + 0×0 + 2×1 + 0×1 + 0×0 + 2×1 + 0×0 + 0×1 = 25
    ||d1|| = sqrt(5² + 0² + 3² + 0² + 2² + 0² + 0² + 2² + 0² + 0²) = sqrt(42) = 6.481
    ||d2|| = sqrt(3² + 0² + 2² + 0² + 1² + 1² + 0² + 1² + 0² + 1²) = sqrt(17) = 4.123

    cos(d1, d2) = 25 / (6.481 × 4.123) = 0.94

References
1. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 3rd Edition, Morgan Kaufmann, 2012.
2. Pang-Ning Tan, Michael Steinbach and Vipin Kumar, Introduction to Data Mining, Addison Wesley, 2019.

THANK YOU
Assoc. Prof. Dr Shuzlina Abdul Rahman | Dr Sofianita Mutalib | Siti Nur Kamaliah Kamarudin