Data Science COMP5122M Data Linkage PDF

Summary

This document is a presentation on data linkage. It covers the world of data, data linkage examples, the benefits and history of linking data, linking variables, the record linkage process (data cleaning and standardisation, blocking methods, comparison and classification), deterministic and probabilistic classification methods, evaluation, privacy-preserving record linkage, and spatial linkage. It also mentions existing tools: deterministic linkage algorithms in Excel, R, Python, and other programming languages, and dedicated tools such as RecordLinkage (R package), Tailor, and Febrl.

Full Transcript

Data Science COMP5122M
Data Linkage
Anna Palczewska & Roy Ruddle

Overview
- The world of data
- Data linkage examples
- Process of data linkage
- Spatial linkage
- Tools

The world of data

Making use of data
[Diagram labels]
- Individual patient: Diagnosis = ?, Prognosis = ?, Best treatment = ?
- Precision cohort identification
- Group: outcome analysis, treatment comparison, disease progression

Making use of data: other examples
- Health Care: hospital admissions, GP practices, health centres, social centres, schools, pharmacies, patient's information, hospital trials
- Life Sciences (drug discovery): biological, chemical and toxicological activity, environmental studies, in vitro / in vivo tests
- Social Sciences (crime analysis): geographic profiling, geodemographic information, offender and victim analysis
- Economics: customer satisfaction, customer behaviour, geodemographic information, competitors' plans, social media analysis

Data linkage
Data linkage brings together information from two or more records from independent sources when the records are believed to belong to the same individual, family, event or place.
Other names for data linkage: data matching, record linkage, object identification, identity uncertainty, merge-purge process, entity resolution.

Why data linkage?
- Data source cleaning (removing duplicates): de-duplication, internal data linkage
- Merge records into larger datasets
- Clean and enrich data for mining and analysis
- Create person-oriented statistics (longitudinal study)
- Geocode matching for spatial analysis of health and geographical information

Benefits of linking data
- Improved data quality and integrity
- Making better use of available data
- Privacy and consent
- Communication benefits
- Research benefits

History of data linkage
- 1946: Halbert L. Dunn, "Record Linkage", American Journal of Public Health
- 1959: Howard Borden Newcombe, "Automatic Linkage of Vital Records", Science
- 1969: Ivan Fellegi and Alan Sunter, "A Theory for Record Linkage", Journal of the American Statistical Association

Linking variables
- Unique identifiers. Problems: they are not always unique and are not present in all databases.
- A combination of variables: names, addresses, DoB, gender, ethnicity, time, geographical location, picture, description, ...

Process of record linkage
GP Practice records:
ID  NN     Name         DoB          Address            PostCode  GP Practice
23  222-2  David Smith  12/08/1976   10 Lake Road       LS1 1OP   E12345

A&E records (a dash marks a missing value):
ID  NN     Name         DoB          Address            PostCode  A&E
01  222-2  David Smith  12 Aug 1076  Flat 10 Lake Road  LS1 1OP   LS123
01  -      Dave Smith   12/08/1976   -                  LS1 1OP   LS11

Data cleaning and standardisation
Problems found in the example records:
- typographical errors (spelling errors, variation of names)
- different coding schemes (male/female, M/F)
- missing data
- data changing over time

ID  NN     Name         DoB          Address                  PostCode  GP Practice
01  222-2  David Smith  12 Aug 1976  Flat 10 Lake Road Leeds  LS1 1OP   LS123
01  222-2  Dave Smith   12/08/1976   11a Street Lane          LS11 9OL  LS1123

Standardisation steps:
- reformatting values to a common format
- removing punctuation
- phonetic encoding (Soundex, Metaphone, NYSIIS), e.g. Peter, Pete -> p233; Anna, Ana -> a566
- name and address standardisation
- nickname and abbreviation lookups

Blocking methods
Linking 1 million records with 5 million records gives 5 x 10^12 (5 trillion) record pairs.
Assume 1 comparison takes 1 ms: 57870.3 days = 1902.5 months = 158.54 years.

Blocking methods
- reduce the large number of comparisons
- remove candidate record pairs that are not matches
- compare only record pairs that have the same value (blocking key) for the blocking variable

Blocking methods
ID  NN     Name         DoB          Address            PostCode  GP Practice
01  222-2  David Smith  12 Aug 1976  Flat 10 Lake Road  LS1 1OP   LS123
01  222-2  Dave Smith   12/08/1976   11a Street Lane    LS1 1OP   LS1123
[Slide annotations on the table: "Blocking variable" and "Blocking key"]
- Traditional blocking
- Sorted neighbourhood approach
- Q-gram blocking
- Canopy clusters

Comparison and classification

Classification
Deterministic linkage
- Exact match on specified common fields
- Easiest, quickest linkage strategy
- Results in errors due to non-matches
Probabilistic linkage
- Statistically estimates the likelihood that two records describe the same individual/entity, even if they disagree on some fields
- Computationally complicated
- Fewer non-matches
Artificial intelligence approaches

Deterministic linkage
Applied to the example GP Practice and A&E records:
1. If NN agrees, then match.
2. If NN does not agree and any two of {Name, DoB, Address} agree, then match.

Probabilistic linkage
[Example GP Practice and A&E records, with weights w1, w2, w3, w4 attached to the compared fields]
$w_t = \sum_{i=1}^{k} w_i, \qquad w_i = \frac{m_i}{u_i}$
where m_i is the probability that a common variable agrees on a matched pair and u_i is the probability that a common variable agrees on an unmatched pair. In the Fellegi-Sunter model the logarithm of this ratio is normally used, so that the field weights can be added.

Artificial intelligence approaches
- supervised: requires training data with correct matches. Machine learning methods used: decision trees, SVM, ANN.
- active and semi-supervised learning: require training data with correct matches. The classifier is built iteratively. Active learning uses a human to help classify difficult cases.
- unsupervised: does not require a training dataset. Clustering methods group pairs of records based on their similarities.

Record linkage evaluation
- all possible record pairs
- generated candidate record pairs (by blocking)
- true matches: pairs of records correctly classified
- false matches: a wrong match (false positive)
- missed matches: a missed true match (false negative)

Record linkage evaluation
Reduction ratio: $rr = 1 - \frac{|\text{candidate pairs}|}{|\text{all pairs}|}$
Pair completeness: $pr = \frac{|\text{true matches among candidate pairs}|}{|\text{all true matches}|}$
Precision: $prec = \frac{|\text{true matches}|}{|\text{all classified matches}|}$
Recall: $recall = \frac{|\text{true matches}|}{|\text{all true matches}|}$

Privacy-preserving record linkage
A secure way to link records from two or more organisations (e.g. governmental agencies and health institutions).

Privacy-preserving record linkage - example
http://www.cherel.org.au/how-record-linkage-works

Spatial linkage
Geographical location:
- Direct georeference (GPS, surveys): a point on a map defined by coordinates, a line, or a polygon (boundaries)
- Indirect georeference: postal addresses, postal codes and place names. It does not include explicit coordinates.

UK geographies
- Census geography
- Postal geography
- Health geography
- Electoral geography
- Administrative geography
- Other: Local Education Authority, Built-up areas, National Parks, Police Force Areas, Fire and Rescue Authorities

Census geography
Geography         Population min  Population max  Household min  Household max
Output Area (OA)  100             625             40             250
Lower SOA         1000            3000            400            1200
Middle SOA        5000            15000           2000           6000
- OAs are the lowest geographical level at which census estimates are provided
- OAs are built from clusters of adjacent unit postcodes
- OAs are subject to change due to changes in population, postcodes and local authority areas

UK health geographies
[figure]

UK geographies vertical linkage
[figure]

Spatial linkage - example
[figure]

Lookup tables
- ONS Code History Database (CHD)
- Postcode lookup files: ONS Postcode Directory, NHS Postcode Directory, and https://data.gov.uk/
- Lookup tables between geographies
Methods:
1. Exact-fit: when one geography falls within the boundary of another geography
2. Best-fit: when the boundary of one geography straddles the boundary of another geography (based on the population-weighted centroid or the mean grid reference of all the addresses)

Geographical Information Systems
GIS:
- are designed to capture, store, manipulate, analyse, manage, and present all types of spatial or geographical data
- enable people to more easily see, analyse, and understand patterns and relationships
- create map overlays from which we can extract the features of one dataset that fall within the spatial extent of another dataset
- are used for geocoding (e.g. linking an address to a physical location on the earth)
A GIS calculates geographic coordinates before an address can be displayed on a map.

Existing tools
Deterministic linkage: sort-merge algorithms in Excel, R, Python, and other programming languages; SQL SELECT with joins [https://www.youtube.com/watch?v=HyZtBGXLN00]
Probabilistic linkage:
- RecordLinkage: the R package with blocking, phonetic encoding, string comparison methods
- Tailor: the Record Linkage Toolbox (Java), includes standard blocking, sorted neighbourhood, string comparison and phonetic encoding
- Febrl: implemented in Python, includes techniques for data cleaning and standardisation, blocking, string comparison and classification
Mapping: GeoConvert, MapInfo, QGIS, ArcGIS

Data linking process for research - summary
https://www.youtube.com/watch?v=smnnD9ZXwP0

References
- Peter Christen, Data Matching, Springer Heidelberg New York Dordrecht London, 2012
- ONS Open Geography Portal, http://geoportal.statistics.gov.uk/

Thank you
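
Worked sketch (not part of the original slides)
The traditional blocking step, the two deterministic linkage rules, and the reduction ratio described in the transcript can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the presenters' implementation: the records are assumed to be already cleaned and standardised, PostCode is assumed to be the blocking variable, the ID suffixes "01a"/"01b" and the third A&E record are hypothetical, and all function names are made up for this example.

```python
from collections import defaultdict

# Example records adapted from the slides. Assumption: values are already
# standardised (lower-cased names, ISO dates). None marks a missing value.
gp_records = [
    {"id": "23", "nn": "222-2", "name": "david smith", "dob": "1976-08-12",
     "address": "10 lake road", "postcode": "LS1 1OP"},
]
ae_records = [
    # The slides give both A&E rows ID 01; the a/b suffixes are hypothetical.
    {"id": "01a", "nn": "222-2", "name": "david smith", "dob": "1076-08-12",
     "address": "flat 10 lake road", "postcode": "LS1 1OP"},
    {"id": "01b", "nn": None, "name": "dave smith", "dob": "1976-08-12",
     "address": None, "postcode": "LS1 1OP"},
    # Hypothetical extra record in another postcode, to show blocking at work.
    {"id": "02", "nn": "999-9", "name": "anna jones", "dob": "1980-01-01",
     "address": "11a street lane", "postcode": "LS11 9OL"},
]


def agrees(a, b, field):
    """A field agrees only if it is present and identical in both records."""
    return a[field] is not None and a[field] == b[field]


def deterministic_match(a, b):
    """Rule 1: NN agrees. Rule 2: otherwise, any two of name/DoB/address agree."""
    if agrees(a, b, "nn"):
        return True
    return sum(agrees(a, b, f) for f in ("name", "dob", "address")) >= 2


def candidate_pairs(left, right, key):
    """Traditional blocking: pair only records that share the blocking key."""
    blocks = defaultdict(list)
    for record in right:
        blocks[record[key]].append(record)
    for record in left:
        for other in blocks.get(record[key], []):
            yield record, other


def reduction_ratio(n_candidates, n_all_pairs):
    """rr = 1 - |candidate pairs| / |all pairs|."""
    return 1 - n_candidates / n_all_pairs


pairs = list(candidate_pairs(gp_records, ae_records, "postcode"))
matches = [(a["id"], b["id"]) for a, b in pairs if deterministic_match(a, b)]
rr = reduction_ratio(len(pairs), len(gp_records) * len(ae_records))

print(matches)  # [('23', '01a')]
print(round(rr, 3))
```

Blocking removes the pair with the different postcode before any comparison is made. The second A&E record is left unmatched because its NN is missing and only DoB agrees, which illustrates the non-match errors that the slides attribute to deterministic linkage.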
