KAD Summary PDF
Vrije Universiteit Amsterdam
Summary
This document provides a summary of knowledge and data, focusing on concepts like data, information, and knowledge. It also discusses knowledge graphs and formal systems, and provides an overview of using Protégé to create axioms with OWL.
KAD summary
Knowledge and Data (Vrije Universiteit Amsterdam)

Knowledge & Data

Data, information, and knowledge:
• (Raw) data: individual facts that are out of context, have no meaning, and are difficult to understand
• Information: a set of data in context, with relevance to one or more people at a point in time or for a period of time
• Knowledge: the fact or condition of knowing something with familiarity gained through experience or association
• Pyramid: data to information to knowledge
• To make any sense of raw data and values, they must be processed and given a context
   o This increases their usefulness
• Deciding what can be done with the information and data requires knowledge
   o This also increases usefulness
• Knowledge = information + rules
• Knowledge can be tacit or explicit
   o Tacit knowledge, AKA implicit knowledge
      ▪ The knowledge a person retains in their mind
   o Explicit knowledge, AKA formal knowledge
      ▪ The knowledge that has been formalized, codified, and stored
      ▪ Allows for interpretation of data, making data science easier
• Data preparation accounts for about 80% of the work of a data scientist
   o Includes building training sets, cleaning and organizing data, collecting data sets, mining data for patterns, refining algorithms, and other activities to prepare and clean the data
• The goal is a predictable inference
   o This is a two-way process
   o Formal knowledge can help us to interpret data and make it reusable for other purposes
   o Formal knowledge is necessary to efficiently interpret data

Knowledge graphs:
• The most common way to write down and represent data (in AI)
• A useful way of representing data, information, and knowledge …
   o That are heterogeneous
   o In such a way that other systems/people can interpret a piece of data correctly
   o By making the semantics/meaning of a piece of information explicit
   o Using graphs and networks
   o Explicitly on the web
• Have edges, nodes, and relationships between items
• On the web: government data, social web data, medical data, museum data, research data
• Tim Berners-Lee: the inventor of the web and the semantic web
• Data landscape as a network of data access points
   o To achieve this, we must solve 2 problems
      ▪ Integrate (semantically) heterogeneous …
      ▪ Integrate (physically) …
• Steps to make knowledge graphs (4Ps or proposals):
   o Give all things (that you want to know or can talk about) a name
   o The names are addresses on the web
      ▪ Give each thing a URI (uniform resource identifier)
   o Relations form a graph between things
      ▪ P1 + P2 + P3 = a (global) graph of Linked Data
   o Make explicit the meaning (semantics) of things (explicit semantics or formal knowledge)
      ▪ Not just the data, but its underlying model as well
         1. Assign types to things
         2. Assign types to relations
         3. Organize types in a hierarchy
• Rules for calculating with that knowledge
• Semantics = predictable inferences
   o When I can predict what you will infer when I send you something
• Sub-symbolic AI involves neural networks and deep learning, while symbolic AI involves knowledge representation and graphs.

Knowledge Graphs & Formal Foundation

From databases to information:
• By splitting the data, the knowledge and data can be more focused and explicit
   o By adding more relations to one node
• More knowledge and semantics (explicit)
   o Domain and range
   o Subclasses
• We need an unambiguous language

Formal systems:
• AKA logic
• Describing things you want to discuss
• Elements
   o Syntax
      ▪ Which expressions are legal and well-formed?
      ▪ A sentence consists of 2 terms with a comparator between them
      ▪ A term is either a natural number, a variable, or a complex term
      ▪ A complex term is an operator (+, -, x, /) applied to 2 terms
         ▪ In infix notation with parentheses
      ▪ Together these rules describe the valid arithmetic sentences
   o Semantics
      ▪ What do the legal expressions mean?
      ▪ The meaning of each sentence with respect to interpretations
      ▪ Truth is defined in terms of assignments (functions) for variables
      ▪ Example: x + 2 >= y is true w.r.t. an assignment where Iv(x) = 7 (and Iv(y) is at most 9)
      ▪ Entailment: predictable inference
         ▪ If we know that x + y < 6, can we conclude that x < 10? (Yes for natural numbers, since x <= x + y < 6)
   o Calculus
      ▪ How to determine the meaning of legal expressions
• A concept is an abstraction or generalization from experience, or the result of a transformation of existing ideas
• An assignment is a model of a knowledge base if it's a model of all its axioms

Propositional logic as a formal system:
• A declarative sentence or a proposition is a statement that is either true or false
• Argument abstraction
• Properties
   o Syntax
   o Semantics
   o Calculus
• Connectives (and, or, either or, not, if then, if and only if)
• Constructs (for all, there exists, …)
• Infix notation: q -> p
• Prefix notation: -> , [q], [p]
• Truth value semantics
   o Formulas of propositional logic are used to express declarative statements
• Semantic entailment
   o A formula is semantically entailed by a set of formulas if every valuation that makes all formulas in the set true also makes that formula true (left true makes right true in the truth table)
   o The core of logical reasoning
   o Counterexample: a valuation that makes the premises true but the conclusion false
   o If there exists a counterexample, then there is no entailment

Simple knowledge graph logic:
• We need an unambiguous language
   o Which statements are correctly written, what are the statements supposed to mean, …?
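Looking back at semantic entailment in propositional logic, the truth-table check can be sketched in a few lines of Python. This is a minimal sketch, not a full logic library: formulas are modeled as plain Python functions over a valuation dictionary, and the names `entails` and `implies` are hypothetical helpers introduced only for this illustration.

```python
from itertools import product


def implies(a, b):
    """Truth function for 'if a then b'."""
    return (not a) or b


def entails(premises, conclusion, variables):
    """Check semantic entailment by enumerating every valuation.

    Returns (True, None) if every valuation that makes all premises true
    also makes the conclusion true; otherwise (False, counterexample).
    """
    for values in product([False, True], repeat=len(variables)):
        valuation = dict(zip(variables, values))
        if all(p(valuation) for p in premises) and not conclusion(valuation):
            return False, valuation  # premises true, conclusion false
    return True, None


# Premises: q -> p and q. Conclusion: p.
rule = lambda v: implies(v["q"], v["p"])
fact = lambda v: v["q"]
goal = lambda v: v["p"]

holds, counterexample = entails([rule, fact], goal, ["p", "q"])

# Without the fact q, the rule alone does not entail p.
holds_without_fact, witness = entails([rule], goal, ["p", "q"])
```

In the second call a counterexample exists (a valuation where the rule is true but p is false), so there is no entailment, which matches the rule stated in the notes.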
• Triples: transforming the data to formal syntaxes in a web of data
   o A definition of the data
• Simple semantics
   o A set of triples is entailed by a knowledge graph if it's a subgraph of the knowledge graph

Knowledge Graph Logic & KGs on the Web

• An interpretation is a model of a knowledge base if it's a model of all its triples
• A triple is entailed by a knowledge base if it's true in all models of the knowledge base
• Informally, a model is an interpretation in which the triples in question are true

Linking data:
• Linked (Open) Data
   o AKA LOD
   o Open data: about licenses to allow reuse
   o Linked data: about technology for interoperability
• Principles 1-3
   o We need the data to be connected and accessible on the web
   o Represented in a known data format (querying)
• Principle 4
   o We need a shared model and defined formal semantics
   o Predictable inferencing

Technology: standards, standards, standards (query languages)
• Uniform Resource Identifier (URI)
• The Hypertext Transfer Protocol (HTTP)
• Namespaces
• Resource Description Framework (RDF)
• …
• SPARQL (pronounced "sparkle")
• OWL
• Ontology editors

RDF: Resource Description Framework
• A language (data model) to describe things
• Extends the linking structure of the web to use URIs to name the relationship between things as well as the two ends of the link (a triple)
• Allows structured and semi-structured data to be mixed, exposed, and shared across different applications
• A data model for data interchange on the web
• Facilitates data merging even if the underlying schemas/models differ
• The links form a directed and labeled graph
• The graph view is often used in visual explanations
• Components:
   o All information is expressed as triples: two-place predicates
   o The triples consist of a subject, a predicate, and an object (s, p, o)
      ▪ The first one = subject
      ▪ The middle one = predicate
      ▪ The last one = object
      ▪ The three words = fact
   o Triple = statement or fact or axiom
   o The elements of an RDF triple are either URI references, blank nodes (variables), or literals (strings or other values)
• Resources are identified by URIs, or URIs denote resources
• URIs can only refer to a resource
   o They are NOT the resource
   o Multiple URIs can denote the same resource (one resource can have multiple identifiers that point to the same thing)
• Internationalized Resource Identifiers (IRIs) are URIs that allow Unicode characters
• RDF: URI != URL
• Use @prefix declarations to abbreviate the URLs/URIs in the code to make it simpler and shorter
• Literals
   o Used to represent "literal" data values
   o All literals have a data type (e.g. string or integer)
   o Data types are also resources, referenced via URIs
   o Default: if no data type is specified, then the data type is assumed to be xsd:string
   o One can specify the language of a string using a language tag
   o The language tag or data type always goes at the end of the literal
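The triple and literal rules above can be made concrete with a small Python sketch. It uses only the standard library rather than an RDF toolkit; the `Literal`, `Triple`, and `literal` names, the `http://example.org/` namespace, and the sample values are all assumptions made up for illustration.

```python
from typing import NamedTuple, Optional, Union

XSD = "http://www.w3.org/2001/XMLSchema#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX = "http://example.org/"  # hypothetical namespace, illustration only


class Literal(NamedTuple):
    value: str
    datatype: str          # data types are themselves identified by URIs
    lang: Optional[str]    # language tag such as "en", or None


def literal(value: str, datatype: Optional[str] = None,
            lang: Optional[str] = None) -> Literal:
    """Build a literal; with no data type given it defaults to xsd:string."""
    if lang is not None:
        # In RDF 1.1, language-tagged strings have the datatype rdf:langString.
        return Literal(value, RDF + "langString", lang)
    return Literal(value, datatype or XSD + "string", None)


class Triple(NamedTuple):
    subject: str                 # a URI (blank nodes are left out of this sketch)
    predicate: str               # a URI
    object: Union[str, Literal]  # a URI or a literal


# Made-up statements about a made-up resource.
triples = [
    Triple(EX + "book1", EX + "title", literal("1984")),
    Triple(EX + "book1", EX + "title", literal("1984", lang="en")),
    Triple(EX + "book1", EX + "pageCount", literal("328", XSD + "integer")),
]
```

Keeping the data type as a URI string mirrors the point that data types are resources too; a real project would more likely rely on an RDF library than on hand-rolled tuples.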
• Graphs
   o A set of triples
   o Example: a graph that contains two triples
   o In practice, many RDF graphs have URIs themselves, in which case the RDF graphs are not really graphs but actually hypergraphs
   o RDF graphs are often represented as directed labelled graphs
• Why use HTTP URIs?
   o They have a global scope and are grounded in society
   o Unique throughout the web, which helps avoid name clashes
   o They are also addresses, which exploit the well-functioning machinery of web browsing
   o Data can be tracked by following the resource identifiers found in triples
• Why triples?
   o Any information format can be transformed into triples (very simple)
   o Relationships are made explicit and they are elements in their own right
   o Unlike database columns and binary predicates
   o The predicate is an element in the triple and can be described in RDF
   o "Self-documenting"
• Why graphs?
   o A single but highly versatile format
   o Since RDF graphs are just a set of triples, basic set operations are well-defined
   o Merging two RDF graphs?
      ▪ Take their union
      ▪ Tabular data: table dimensions must match
      ▪ Trees: a node can only have one parent
   o Extending an RDF graph?
      ▪ Add more triples
      ▪ No need to redefine the existing triples; just add the new ones
• Blank nodes
   o BNodes
   o Resources without a URI
   o Used when a resource is unknown or has no (natural) identifier
   o In logical terms, these are existentially quantified variables
   o Need to extend the syntax of the formal semantics
• Triple grammar
   o Literals and BNodes may not appear in every position of a triple
   o Literals are just values (no relationships from literals allowed)
   o Blank nodes in predicate position are "too meaningless" and confusing
• Properties of predicates
   o Very useful for modeling
   o Turns the graph into a hypergraph
   o The information is only about the general predicate and not its specific use in a specific triple
• Serializations (… syntaxes)
   o RDF/XML: historical and outdated
   o Turtle: a convenient and human readable/writable syntax
   o N-Triples: one unabbreviated triple per line (easy parsing)
   o "Graph-aware": TriX, TriG, JSON-LD, N-Quads
   o RDFa: as annotations on HTML elements

Turtle:
• Comments start with the hash character (#)
• Full URIs are surrounded by angle brackets (< >)
• Statements are triples, which are terminated by a period
• Use 'a' to abbreviate rdf:type
   o Pattern: subject a Class .
• Namespace prefixes are declared with @prefix
• A default namespace can be declared as well
   o @prefix dbpedia: …
   o @prefix : …
• Literal values are enclosed in double quotes
   o Possible with language or type information
• Numbers and Booleans can be written without quotes
• Shorthand:
   o Instead of repeating the subject, statements can share a subject with ';'
   o Instead of repeating the subject and predicate, statements can share both with ','
   o Two ways of writing BNodes
      ▪ Written with an underscore prefix (_:…)
      ▪ Written with brackets […]

RDF Schema:
• Observations
   o Properties are first-class citizens
   o No strict distinction between …
   o …
• Additional stuff
   o Specify the human-readable label for a resource
   o Comment on a resource
   o Refer to another resource
• rdf:type rdfs:domain --> rdfs:Resource
• rdf:type rdfs:range --> rdfs:Class
• rdfs:subPropertyOf rdfs:range --> rdf:Property
• rdfs:label rdfs:range --> rdfs:Literal (in the rdfs namespace, not rdf)
• Summary
   o Without formal semantics, the web of data is meaningless
   o Distinction between classes, properties, and instances (schema vs. data)
   o RDFS reserved symbols: rdfs:Class, rdfs:subClassOf, rdfs:domain, rdfs:range, …
   o Entailment rules are expressed using reserved symbols
   o Inferencing is the application of entailment rules to formulas to produce new facts
   o RDFS is not very expressive

RDFS and other vocabularies
• RDF vocab
   o The RDF vocabulary and RDFS reserved terms needed for the data model
      ▪ Explained before
   o Friend of a Friend (FOAF)
      ▪ Not a formal language/semantics
   o Dublin Core (DC and DCTerms)
      ▪ About documents
   o Schema.org
      ▪ Used by search engine companies to index datasets
   o DBpedia Ontology
      ▪ Originally from Wikipedia
   o WordNet
      ▪ Describes words
   o Thesauri
      ▪ Standard terminology in a particular domain
   o MeSH
      ▪ Medical Subject Headings
   o SKOS
      ▪ Vocabulary for thesaurus modeling
• Formal semantics vs. social semantics
   o RDF, RDFS, and OWL have formally defined semantics
      ▪ With reference to graphs (RDF)
      ▪ With reference to sets (RDFS/OWL)
   o FOAF and others have informal semantics
      ▪ Defined in textual descriptions
      ▪ Defined in their usage online

Example: (domain: book)
• Instances
   o To Kill a Mockingbird (Harper Lee)
   o 1984 (George Orwell)
   o The Handmaid's Tale (Margaret Atwood)
• Classes
   o Author
   o Genre
   o Book
   o Novel
• Properties
   o hasAuthor
   o hasGenre
   o hasEbook
• Triples
   o TKAMB hasAuthor Harper Lee
   o Harper Lee a (rdf:type) Author
   o THMT a (rdf:type) Novel
   o hasAuthor rdfs:domain Book
   o THMT a (rdf:type) Book
   o 1984 hasGenre sci-fi

Advanced Modeling & Inferencing in OWL
Class Axioms & Property Types

OWL:
• The Web Ontology Language
• Features for defining classes and properties
• Built on description logics (DL)
• Extension of RDFS semantics and syntax
• Restriction of RDFS semantics and syntax
• Strict separation of instances, classes, and properties
   o But there's punning

OWL class restrictions:
• A class is a set of individuals
• Examples
   o owl:equivalentClass
   o owl:complementOf
   o owl:unionOf
   o …
• owl:Restriction
   o An owl:Class defined by describing conditions on the individuals it contains
   o The class of things that … (the condition)
• Two types of restrictions
   o Necessary conditions
      ▪ The members of the class must have this property, object, or condition
      ▪ Uses rdfs:subClassOf to infer subclass relations and property values of known class members
   o Necessary and sufficient conditions
      ▪ Anything that satisfies this condition is automatically a member of this class
      ▪ Uses owl:equivalentClass to infer class equivalence and class membership of individuals
• Properties to describe the conditions
   o Existential -> owl:someValuesFrom
      ▪ All members of a class have at least one value from the specified class
   o Universal -> owl:allValuesFrom
      ▪ All members of a class have only values from the specified class (all)
   o Specific value -> owl:hasValue
      ▪ All members of a class have this (specific) instance as value
   o Cardinality -> owl:minQualifiedCardinality, owl:maxQualifiedCardinality, owl:qualifiedCardinality
      ▪ All members of a class have min/max/exactly n values (from the specified class)
   o Is reflexive -> owl:hasSelf (Boolean)
      ▪ All members of a class have a relation with themselves (local reflexivity)
• Be careful when naming the range of property restrictions (it should not be shared with other properties unless that is intended)
• Assertions: identity and negation
   o State that two individuals are different or the same
   o You can say that a property does not hold between two individuals

Using Protégé for class axioms with OWL:
• Make entities
   o Drag the subclasses
• Start the reasoner to get implied triples
   o Install the needed packages (Pellet?)
• Save it as a .ttl file in case it crashes
• At least x: min x
• Do the conditions in "Equivalent To"
• Punning:

Data Integration

Steps:
• Build the ontology in Protégé
• Import the Turtle file into GraphDB
• Get some interesting data in CSV
• http://prefix.cc/ : look for prefix URLs
• SPARQL
   o Construct: get triples
   o Insert: insert the triples into the KG
   o For debugging the query or finding the problem, go to www.dbpedia.org/sparql
• Federated queries:
   o Surface keywords
   o Parts of the SPARQL query that have a "service" and "filter"
   o Add multiple services and filters to the query
• Help: https://hub.compute.vu.nl/lander
• DBpedia and Wikidata: practice writing queries for the final project (numeric values)
   o http://query.wikidata.org/sparql
   o Use Wikidata numeric identifiers in the SPARQL query

Final Project

Deadlines:
• Milestone 1: Oct 13
   o Can be reused in the final report
   o Does not have to be complete, but gives good feedback for more complete data
• Milestone Pitch: Oct 17
   o Everything so far is graded (for the final report)
   o More like a pass/fail grade with feedback
• Final report: Oct 27
   o 8-12 pages
   o Can go below/over the limit if it's complete
   o The visuals must be in the appendix

Notes:
• Goals:
   o Design and implement a data analysis pipeline with KAD technology (OWL, RDF)
   o Document the process and present the result
• Deadline 1: Milestone
   o What kind of question do you plan to answer?
      ▪ I plan to make a website for the cats …
   o Who is this application meant for?
      ▪ The stakeholders
      ▪ What kind of information does this investigation serve?
   o Show the design
      ▪ What type of analysis do you want to have and use?
      ▪ What kind of visualizations
         ▪ Can be a table, flowchart, etc.
   o Consider at least 2 ontologies
      ▪ Motivate your design choice
      ▪ Simple or complex ontologies
   o Use at least 2 external sources of data
      ▪ E.g. in RDF or in another format (CSV, XML)
      ▪ Anything that includes converting to the right format
      ▪ For the SPARQL endpoint
   o Ontology engineering
      ▪ Describe the domain and scope (must match the visualization)
      ▪ Describe the methodology and how the ontology is constructed (how it's implemented)
      ▪ Show a conceptualization of the domain (classes, properties, relations)
      ▪ 15 classes, 5 properties
• Deadline 2: Milestone Pitch
   o In one of the practical sessions
   o At least one member per group
   o 2 min max
   o What is the idea of the data visualization?
   o What ontologies will you (re)use?
   o What datasets?
   o What is the status of the report?
      ▪ About your own ontology
• Deadline 3: Final Report
   o An ontology created by the members
      ▪ At least 3 class restrictions (necessary / necessary and sufficient)
      ▪ Integrated in some way
      ▪ Submitted as a Turtle file
   o Write the code to create the visualization
      ▪ Upload the final screenshots
   o Update and complete the contents of the milestone
   o Describe the steps of integration and how/what was integrated
   o Include the screenshots showing that it works
   o Include the Jupyter Notebook
      ▪ Pandas description: like a README file (the steps to do)
   o Justification section
      ▪ Be specific about who did what
      ▪ Nothing big, but preferably equal work
   o References
      ▪ Allowed to reuse existing content from the web, but cite it correctly
      ▪ No copyright violations
   o Requirements
      ▪ PDF report (8-12 pages) with visualizations and texts
      ▪ Turtle file of the final ontology
      ▪ Jupyter Notebook + README file
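For the "converting to the right format" step of the external data requirement, and the CSV step under Data Integration, one possible starting point is a small CSV-to-triples conversion. This is only a sketch under stated assumptions: the CSV columns, the `http://example.org/` namespace, and the `hasTitle` property are invented for illustration (only `hasAuthor`, `Book`, and `Author` come from the book example in these notes), and the course may expect a dedicated RDF library or GraphDB's own import tools instead.

```python
import csv
import io

EX = "http://example.org/"  # hypothetical namespace, illustration only
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical CSV input reusing the book example from the notes.
CSV_TEXT = """id,title,author
tkamb,To Kill a Mockingbird,Harper Lee
thmt,The Handmaid's Tale,Margaret Atwood
"""


def rows_to_triples(csv_text):
    """Turn each CSV row into triples; objects are tagged as URI or literal."""
    triples = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        book = EX + row["id"]
        author = EX + row["author"].replace(" ", "_")
        triples.add((book, RDF_TYPE, ("uri", EX + "Book")))
        triples.add((book, EX + "hasTitle", ("literal", row["title"])))
        triples.add((book, EX + "hasAuthor", ("uri", author)))
        triples.add((author, RDF_TYPE, ("uri", EX + "Author")))
    return triples


def to_ntriples(triples):
    """Serialize as N-Triples: one unabbreviated triple per line."""
    lines = []
    for s, p, (kind, value) in sorted(triples):
        if kind == "uri":
            obj = "<" + value + ">"
        else:
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            obj = '"' + escaped + '"'
        lines.append("<" + s + "> <" + p + "> " + obj + " .")
    return "\n".join(lines)


graph = rows_to_triples(CSV_TEXT)
ntriples = to_ntriples(graph)
```

Because the result is a set of triples, adding a second data source amounts to taking the union of two such sets before serializing.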