Web Technologies Summary PDF
Summary
This document provides a summary of web technologies, including a history of the web, different web architectures, and an overview of key concepts like HTML5, CSS, XML, and web 2.0. It also discusses semantic web technologies, web search, security, and future trends.
Full Transcript
Web Technologies Summary

Contents
History
Web Architectures
HTML5 and the Open Web Platform
CSS
XML and Related Technologies
Web 2.0 Patterns and Technologies
Semantic Web and Web 3.0
Web Search and SEO
Security, Privacy and Trust
Future Trends

History
- Reading wheel
  o Comparable to modern tabbed browsing
  o Seen as a predecessor of hypertext
- Dewey Decimal Classification
  o 10 classes * 10 divisions * 10 sections (+ decimals)
    ▪ E.g. 395 → class 3, division 9, section 5
- As We May Think (Vannevar Bush, 1945) Add paper content !!!
  o Associative indexing instead of hierarchical indexing (like the mind)
  o “Origin” of hypertext
  o Memex (memory extender) → hypertext machine
    ▪ Trails (cross-references) between microfilms (pieces of information)
    ▪ Trail blazers → job for trail makers (idea)
- Hypertext (Ted Nelson, 1965)
  o Xanadu (first hypertext project)
    ▪ Transclusion (referencing/embedding parts of a document in another)
    ▪ Bidirectional links
  o Digital document constrained to printable content (PAPER UNDER GLASS)
- Hypertext Editing System (HES)
  o Limitations
    ▪ Unidirectional links
    ▪ Non-overlapping links (no separate links for overlapping content in a doc)
    ▪ Only embedded links (links cannot be created and stored separately from the document)
  o File Retrieval and Editing System (FRESS)
    ▪ Follow-up project
    ▪ First introduction of undo
- oNLine System (NLS) → vision of future interactive computing
  o First practical use of hypertext
  o Computer mouse
  o Remote collaboration
  o …
- Hypermedia → extension of hypertext with other media types
  o Aspen Moviemap (early hypermedia system)
    ▪ Pictures taken every 10 feet while driving through the city
    ▪ Similar concept now used in Google Street View
- HyperCard (Apple, 1987)
  o Early widespread hypermedia system
  o Information stored in cards, arranged into stacks
  o Links can be defined between cards
  o May contain text, pictures, audio, video
  o HyperTalk programming language
- ARPANET (Advanced Research Projects Agency NETwork) (1969)
  o Early version of the Internet
  o First operational packet switching network
    ▪ Messages passed via other connected machines → no direct connection
  o Applications: email, FTP
  o Network over mostly USA university machines (+ a few in Europe)
- Transmission Control Protocol (TCP) (1974)
  o Replacement for the Network Control Protocol (NCP)
  o “Assume that hardware is unreliable, build reliability into software”
  o Protocol for Packet Network Interconnection
  o Transition ARPANET → TCP/IP
- TCP/IP (1978) → 4 layers
  o Application layer: HTTP, FTP, POP, …
  o Transport layer: TCP, UDP, …
  o Internet layer: addressing hosts and packet routing
  o Link layer
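The four TCP/IP layers can be pictured as nested envelopes: each layer wraps what it receives from the layer above. The Python sketch below is only a toy illustration of that idea. The bracketed "headers" and the function names `encapsulate` and `decapsulate` are invented for readability and do not correspond to real wire formats or to any library API.

```python
# Toy model of TCP/IP layering. Each layer wraps the data handed down by the
# layer above. The "[layer|...]" header syntax is made up for this sketch.

LAYERS = ["application", "transport", "internet", "link"]


def encapsulate(message: str) -> str:
    """Wrap a message top-down, from the application layer to the link layer."""
    frame = message
    for layer in LAYERS:
        frame = f"[{layer}|{frame}]"
    return frame


def decapsulate(frame: str) -> str:
    """Unwrap a frame bottom-up, from the link layer to the application layer."""
    for layer in reversed(LAYERS):
        prefix = f"[{layer}|"
        if not (frame.startswith(prefix) and frame.endswith("]")):
            raise ValueError(f"malformed {layer} frame")
        frame = frame[len(prefix):-1]
    return frame


if __name__ == "__main__":
    sent = encapsulate("GET / HTTP/1.1")
    print(sent)               # outermost wrapper is the link layer
    print(decapsulate(sent))  # receiver recovers the application message
```

The sender walks the list top-down and the receiver walks it in reverse, which is why the link-layer wrapper ends up outermost.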
- World Wide Web (WWW)
  o Originally: The Information Mine (Tim) → networked hypertext system over ARPANET to share information at CERN
  o By 1990
    ▪ HyperText Transfer Protocol (HTTP)
    ▪ HyperText Markup Language (HTML)
    ▪ HTTP server software
    ▪ Web browser (WorldWideWeb)
  o Network-enabled version of the HES model
    ▪ Unidirectional links (no bidirectional links)
    ▪ No transclusion
    ▪ No external (non-embedded) links
- Mobile Web
  o Access the Web from anywhere at any time
- Web 2.0
  o No new technology!
  o User becomes author (wikis, tagging, …)
- Web 3.0
  o Semantic Web / Machine-interpretable Web
  o Add explicit semantics to web resources
  o Use of ontologies
  o Potential reasoning over web resources
- Internet of Things / Web of Things
  o Physical objects with embedded computing functionality with active/passive participation in the Web
  o Ubiquitous Computing, Disappearing Computing, Pervasive Computing

Web Architectures
- Basic Client-Server Web Architecture
  o Effect of going to http://www.vub.be:
    1) Use the Domain Name System (DNS) to get the IP address for www.vub.be
    2) Create a TCP connection to the IP address
    3) Send an HTTP request message over the TCP connection
    4) Visualise the received HTTP response message in the browser
- Web Server
  o Tasks
    1) Set up connection
    2) Receive and process HTTP request
    3) Fetch resource
    4) Create HTTP response
    5) Logging
  o Prominent web servers: nginx, Apache HTTP Server
- HTTP Protocol
  o Communication always initiated by client
  o Stateless protocol (no sessions)
  o HTTPS scheme for encrypted connections
- Uniform Resource Identifiers (URI)
  o Uniform Resource Locator (URL)
    ▪ Information about the exact location of a resource
    ▪ scheme, host, path (e.g. https://vub.academia.edu/BeatSigner)
    ▪ resource moved → URL different → Persistent URLs (PURLs)
  o Uniform Resource Name (URN)
    ▪ Unique and location-independent name
    ▪ scheme name, namespace identifier, namespace-specific string (e.g.
urn:ISBN:3837027139)
- HTTP Message Format
  o Request
    ▪ Start line
    ▪ Methods
      GET: get resource
      HEAD: get header only
      POST: send data (in the body)
      PUT: store request body
      TRACE: get the “final” request (after it has potentially been modified by proxies)
      OPTIONS: get list of (HTTP) methods supported by the server
      DELETE: delete a resource
  o Response
    ▪ Start line
    ▪ Status codes
      100-199: informational
      200-299: success
      300-399: redirection
      400-499: client error
      500-599: server error
  o Header fields
    ▪ General, specific, entity and extension headers:
      Accept → media types the client will accept
      User-Agent → type of client
      Keep-Alive / Persistent → for performance (otherwise a new TCP connection for every webpage element)
      Content-Type → body’s media type
      If-Modified-Since (with GET → conditional GET) → resource returned only if modified since specified date
  o Media types (MIME types) → define message body content (for processing)
    ▪ e.g. text/plain, text/html, image/jpeg, …
  o Message information (viewable in browser developer console)
- Proxies
  o Web proxy between client and server
  o Server for client, client for server
  o Used for: firewalls, content filters, transcoding, content routers, …
- Caches
  o Proxy server to reduce server load if multiple clients share the same cache
  o Cache hierarchies (communicate with Internet Cache Protocol, ICP)
  o Special HTTP cache control header fields: expires, max-age, no-cache
  o Validators: last-modified time, entity tags
  o Advantages:
    ▪ Reduced latency and bandwidth
    ▪ Reduced server load
    ▪ Transparent
  o Disadvantages:
    ▪ Additional hardware required
    ▪ Might get outdated data out of the cache
    ▪ Server has less control
- Tunnelling
  o Transmit one protocol encapsulated inside another protocol
  o Often used to open a firewall to protocols that would otherwise be blocked
- Gateways
  o Glue/translator between client and server
  o Application server → combination of gateway and destination server
- Session management → because HTTP is stateless, possible implementations:
  o IP
address (often not unique to user)
  o Browser login (HTTP authentication headers, send user info with each request)
  o URL rewriting (add information to URL in each request)
  o Hidden form fields (like URL rewriting, but in body)
  o Cookies
- Cookies
  o Piece of information assigned to the client on their first visit
  o Key-value pairs
  o Sent via Set-Cookie HTTP response headers
  o Sent back every time the same server is accessed
  o Potential privacy issues
    ▪ Persistent cookies with long lifetime
    ▪ Third-party cookies for user tracking across websites
- HTML
  o Dominant markup language for webpages
- Dynamic Web Content
  o Server side
    ▪ Common Gateway Interface (CGI)
      Certain requests forwarded via CGI to a program
      Program processes request and creates answer
      Problem: poor performance (new process and new database connections for each request)
      Solution: FastCGI with persistent processes and process pools
    ▪ Java Servlets
      Java class that extends the abstract HttpServlet class
      Loaded by a servlet container
      Relevant requests are forwarded to servlet instance for further processing
      Disadvantage: whole page must be defined within servlet
    ▪ Jakarta Server Pages (JSP)
      Add program code through scriptlets and markup to existing HTML pages
      Interpreted on the fly or compiled into Java servlets
    ▪ Node.js
      Server-side JavaScript
      Handle requests, database, sessions, …
      High modularity (packages, frameworks, …)
  o Client side
    ▪ JavaScript
      Interpreted scripting language for client-side processing
      Embedded in HTML or in a separate file
    ▪ Java Applets
      Program delivered to the client side in the form of Java bytecode and run in a sandbox (JVM)
      Advantages: most recent version in browser, high security
      Disadvantages: requires Java plug-in, advanced functionality only for signed applets
      Replaced by Java Web Start (JavaWS)
  o Web Application Frameworks
    ▪ Software framework to support development of dynamic websites, web applications, web services and web resources.
    ▪ Database access, templating frameworks, session management, code reuse
    ▪ Faster, more robust development process
    ▪ Model-View-Controller (MVC) design pattern
  o No mix of application logic and view
  o Model
    ▪ Data (state) and business logic
    ▪ Multiple views can be defined for a single model
    ▪ Model state changes → views notified
  o View
    ▪ Renders data of the model
    ▪ Notifies controller about changes
  o Controller
    ▪ Processes interactions with the view
    ▪ Transforms view interactions into operations on the model (state modification)

HTML5 and the Open Web Platform
- HyperText Markup Language (HTML)
  o Application of Standard Generalized Markup Language (SGML)
  o Markup tags to define the structure and presentation of an HTML doc (webpage)
  o Nested tags, tags with attributes
  o HTML → Document Object Model (DOM) by browser
    ▪ Standard to create, read, update and delete HTML elements
  o Hyperlinks to connect different HTML documents (unidirectional, embedded)
- History
  o HTML 1.0
  o HTML 2.0
    ▪ Browser war between Netscape and Internet Explorer
  o HTML 3.2
    ▪ Developed only by W3C
    ▪ Tables, visual appearance elements (not the original idea for HTML!!)
  o HTML 4.0
    ▪ Unicode (internationalisation)
    ▪ CSS
    ▪ W3C stopped developing HTML
  o XHTML 1
    ▪ XML application of HTML
    ▪ Draconian error handling (parser stops at the first error instead of forgivingly rendering the document)
  o XHTML 2.0
    ▪ Revolutionary changes, but broke backwards compatibility, meaning that existing HTML pages would have had to fix all the errors that were previously ignored
  o HTML5
    ▪ Developed by Web Hypertext Application Technology Working Group (WHATWG) and W3C
    ▪ HTML – Living Standard → current standard
    ▪ Continually developed by community
- Problems
  o Mix of content, structure and presentation
  o Forgiving browsers rendering HTML docs with errors
- XHTML
  o HTML as XML application (instead of SGML)
  o Strict adherence to standard
- HTML5 Design Principles
  o Compatibility
    ▪ Evolve the language (backward compatibility)
  o Utility
    ▪ Separation of content and presentation
    ▪ Solve real problems (pragmatic approach)
  o Interoperability
    ▪ Interoperable browser behaviour
    ▪ Identical error handling across browsers
  o Universal Access
    ▪ Work across platforms, devices and media
    ▪ Accessible to users with disabilities
  o Simple is better
  o Avoid external plug-ins
- Open Web Platform APIs
  o Standard way for accessing specific functionality
  o E.g.
SVG, RDFa, Geolocation, JavaScript, File API, Fullscreen, …
- HTML5 Markup
  o Added structural and media tags
  o Removed presentation tags
  o Improved forms → simple HTML client-side form validation
  o Video, audio, 2D and 3D graphics, vector graphics
- Web Storage API
  o localStorage (same-origin policy → no time limit)
  o sessionStorage (per window → deleted when browser window is closed)
  o Replaces cookies for larger amounts of data
- Web workers
  o Execute JavaScript in background (otherwise page non-responsive)
  o Avoid complexity → independent JavaScript contexts + event-driven message passing
- WebSocket API
  o Bidirectional, full-duplex socket connection
  o Allows server-initiated updates
  o HTTP → WebSocket protocol upgrade (Connection: Upgrade and Upgrade: websocket in header)
- Geolocation API
- Offline Web Application
- Fullscreen API
- Screen Orientation API
- Page Visibility API
- Battery Status API
- Vibration API
- Web notification API
- …

CSS
- Cascading Style Sheets (CSS)
  o Separation of presentation (CSS) and content (HTML)
  o Enable multiple presentations of the same content
  o Versions:
    ▪ CSS1
    ▪ CSS2 → relative, absolute, fixed positioning
    ▪ CSS3 → 2D and 3D transformations, transitions, flex, media queries, … divided into separate modules
- Inclusion
  o Inline style
    ▪ Mixes content and presentation
  o Internal style sheet
    ▪ In style tag in head of HTML doc
  o External style sheet
    ▪ Link to stylesheet

XML and Related Technologies
- eXtensible Markup Language (XML)
  o Standardised text format for (semi-)structured information
  o Meta markup language → tool for defining other markup languages
  o Ordered labelled tree (think HTML structure)
  o Simple, general, accepted
  o Not a programming language
  o Not a database (lacks database management system features)
- Evolution
  o Descendant of Standard Generalized Markup Language
  o “SGML-Lite”
- XML Specification
  o Grammar for XML documents (tag placement, legal element names, …)
  o General tools (parsers, editors, programming APIs)
- XML Tree Document
Structure (7 node types)
  o Root node
  o Element node
  o Attribute node
  o Text node
  o Comment node
  o Processing instruction node
  o Namespace node (useful if more than one specification is used)
- Well-Formedness and Validity
  o Well-formed → follows XML specification (correct nesting, valid names, …)
  o Valid → follows Document Type Definition (DTD) or XML Schema
    ▪ Custom specification by developer
    ▪ Can be checked with accompanying parser
- How XML is different from HTML
  o Tool for specifying markup languages
  o Not a presentation language
  o Supports applications outside of web browsing as well
  o Must be well-formed (and valid if a DTD or schema is used)
  o Readability > conciseness
  o Matching tags are case-sensitive
- XHTML
  o Reformulation of HTML as an XML application
  o Stricter than HTML (title first element in head, lowercase, namespace declared, …)
- XML Technologies
  o XPath and XPointer
    ▪ Addressing XML elements and parts of elements
  o XSL (Extensible Stylesheet Language)
    ▪ Transforming XML documents
  o XLink (XML Linking Language)
  o XQuery (XML Query Language)
  o Document Type Definition (DTD) and XML Schema
    ▪ Schema definitions for XML documents
    ▪ DTD → limited expressive power
    ▪ XML Schema → datatypes, inheritance, …
  o SAX (Simple API for XML)
    ▪ Event-based programming API for reading XML documents
  o DOM (Document Object Model)
    ▪ Programming API to access and manipulate XML documents as tree structures
- XPath
  o Expression language to address elements of an XML doc
  o Location path → sequence of location steps separated by slashes
- XSLT (Extensible Stylesheet Language Transformations)
  o Expression-based language based on functional programming concepts
  o Most important part of XSL
  o Uses XPath for navigation
  o Pattern matching to select parts of documents
  o Templates to perform transformations
- XPointer (XML Pointer Language)
  o Address points or ranges
  o Uses XPath
  o Relative addressing, links to elements without anchors
- XLink (XML Linking Language)
  o Create links in XML docs
  o Links can be
defined in separate documents
  o Simple links (think HTML) and extended links (associate arbitrary number of resources)
  o Annotea project (uses XLink for managing external annotations) used in Amaya Web Browser
- SAX (Simple API for XML)
  o Scans doc, invokes callback methods at start doc, end doc, start tag, end tag, character data, processing instruction
  o Less memory than DOM parser
- Document Object Model (DOM)
  o Language-neutral API for accessing and manipulating XML documents as a tree structure
  o Whole doc must be read and parsed before it can be used in a DOM application
  o Types of DOM core interfaces
    ▪ Generic node interface
    ▪ Node-specific interfaces
- XML for Data Interchange
  o General way to query data from different systems (e.g. via XQuery)
  o Connect applications running on different operating systems and computers with different architectures
    ▪ XML Remote Procedure Call (XML-RPC)
    ▪ Simple Object Access Protocol (SOAP)
      Successor of XML-RPC
      Used for accessing Big Web Services
- XML Remote Procedure Call (XML-RPC)
  o Advantages:
    ▪ Understood by different applications (XML-based lingua franca)
    ▪ HTTP carrier protocol
    ▪ Based on HTTP and XML standards → easy to implement
  o Disadvantages:
    ▪ Slower than specialised protocols used in closed networks
  o GOMES (GUI for Object Model multi-user Extended fileSystem)
    ▪ Uses XML-RPC to communicate with OMES (coded in Oberon)
- eXtensible Information Management Architecture (XIMA)
  o Generic database interface
  o XSLT stylesheet chosen based on device type (User-Agent HTTP header field)
  o HTML for PC browsers, WML for old phone browsers, VXML for voice browsers, …

Web 2.0 Patterns and Technologies
- Web 2.0
  o New generation of web apps
  o User-generated content
  o Data as a driving force
  o Collective intelligence
  o Web as a platform (instead of static pages)
  o Not a new technology
- Main ingredients
  o Social Web
    ▪ prosumer → producer + consumer (e.g.
wikis, blogs, social media)
    ▪ Democracy
  o Rich Internet Applications (RIAs)
    ▪ Desktop in the browser
    ▪ Highly interactive applications (e.g. Google Docs)
    ▪ Based on AJAX
  o Service-Oriented Architectures (SOAs)
    ▪ Enable sharing of information and services between different Web 2.0 applications
- The Long Tail
  o New economic model: combine infinite shelf space with shared real-time public opinions and buying trends
  o Major part of web content consists of small sites → provide tools to address the long tail (as opposed to only the head, i.e. popular content)
- Wikis
  o Any user can create new pages or edit existing pages
  o Democracy-based control of content
  o Reliability never guaranteed
- Blogs
  o Chronologically ordered list of information
  o Delivering news and getting in touch with the community
- Flickr
  o Image hosting and sharing website
  o User-generated taxonomy (folksonomy)
  o Images may be added to multiple albums (which a filesystem lacks)
- Folksonomies
  o Folk + taxonomies
  o User-generated taxonomy
  o Social tagging (e.g. Instagram, Annotea)
- Social Implications of Web 2.0
  o Data ownership and copyright issues
  o Collective intelligence (wisdom of crowds)
  o Controlled media (e.g. CNN) → collaborative communities (e.g. Twitter)
  o New crediting models
  o Everybody has a (big) voice
  o The kindness of strangers (video content !!)
- The Programmable Web
  o Based on HTTP
  o Data encoded in XML (or JSON, plain text, HTML, binary)
- Rich Internet Applications (RIAs)
  o Bring desktop to browser
  o Highly responsive (async and partial content updates)
  o Rich Graphical User Interface (GUI)
- Asynchronous Partial Updates
  o Asynchronously update parts of a resource (instead of the whole resource)
  o Initiated by client (keypress, state change, …)
  o Updates cannot be initiated by the server if HTTP is used!
  o AJAX
- Asynchronous JavaScript and XML (AJAX)
  o (nowadays JSON instead of XML)
  o Not a technology by itself
    ▪ HTML and CSS for visualisation
    ▪ JavaScript with DOM for dynamic change of the information presented
    ▪ Method to asynchronously exchange data between client and server
    ▪ Client-side AJAX engine (deals with asynchronous message handling)
  o XMLHttpRequest Object
    ▪ onreadystatechange (register callback)
    ▪ readyState (response status server)
      0 (uninitialized) → created, but not initialised
      1 (open) → created, but send method not called
      2 (sent) → send called, HTTP response header received
      3 (receiving) → response not yet available
      4 (loaded) → response available, data received
    ▪ responseText, responseBody, responseXML (server response)
  o Advantages
    ▪ Reduced load time and higher responsiveness
    ▪ State can be maintained
  o Disadvantages
    ▪ Not possible to bookmark any particular state of an application
    ▪ Content might not be crawled
    ▪ Cannot be used in browsers with disabled JavaScript functionality
- Service-Oriented Architecture (SOA)
  o Architecture that modularises functionality as interoperable services
  o Software as a service
- Representational State Transfer (REST)
  o Architectural style for distributed hypermedia systems
  o Application is a RESTful service if it follows these constraints:
    ▪ Separation of concerns between client and server
    ▪ Uniform interface
      Identification of resources (like URIs on the Web)
      Manipulation of resources on the server via representation on the client side
      Self-describing messages (like media type on the Web)
      Hypermedia for application state change (like hypertext links to related resources)
    ▪ Stateless (no client-side state stored on server)
    ▪ Cacheability (responses state whether they are cacheable or not)
    ▪ Layering (proxies can be transparently added)
    ▪ Code on demand (server can send application logic to the client), optional
- Web Services
  o Web-based client-server communication over HTTP
  o Big Web Services
    ▪ Web Service Description
Language (WSDL)
      XML application to describe a Web Service’s functionality
      Complex (usually generated with third-party service)
    ▪ Universal Description, Discovery and Integration (UDDI)
      Yellow pages for WSDL
      Global registry describing available business services
      Very complex
    ▪ Simple Object Access Protocol (SOAP)
      XML-based communication protocol
      Defines an envelope for transporting XML messages
      Often sent via HTTP POST requests
      Advantages:
        o Platform and language independent
        o SOAP over HTTP → fewer issues with proxies and firewalls
      Disadvantages:
        o HTTP reduced to simple transport protocol
        o No caching
    ▪ Web Service Stack → contains many other protocols
  o RESTful Web Services
    ▪ Simple web service implemented using HTTP
    ▪ RESTful web service definition
      URI + supported datatypes + supported HTTP methods
    ▪ One-to-one mapping of CRUD operations: POST (create), GET (read), PUT (update), DELETE (delete)
  o Really Simple Syndication (RSS)
    ▪ Format (in XML) that is used to read and write frequently updated information on the web (e.g.
blog entries, news channel)

Semantic Web and Web 3.0
- The Semantic Web
  o Meaning of data on the web also discovered by machine without human intervention
  o Web of Documents → Web of Data
    ▪ Web as a decentralized database (knowledge base)
    ▪ Machine-accessible data
    ▪ Interconnected (like current web)
    ▪ Machine-readable metadata for existing web content
    ▪ Combination of data from different sources to derive new facts
    ▪ Machines use logical reasoning to infer facts that are not explicitly recorded
  o Crucial component of Web 3.0 / Giant Global Graph
- Semantic Web Stack
  o Architecture of the Semantic Web
  o URI/IRI
    ▪ Unique identification of semantic web resources
  o Unicode
    ▪ Representing/manipulating text in different languages
  o XML
    ▪ Interchange of structured data over the Web
  o XML Namespaces
    ▪ Uniquely qualify markup from multiple sources (integration)
  o Resource Description Framework (RDF)
    ▪ Define RDF triples and represent resource information in a graph structure
    ▪ Instance level
  o RDF Schema (RDFS)
    ▪ Create hierarchies of classes and properties
    ▪ Class level
  o Web Ontology Language (OWL)
    ▪ Language to define vocabularies
    ▪ Extends RDFS with more advanced features (e.g.
cardinality)
    ▪ Enables reasoning based on description logic
  o SPARQL
    ▪ Query language to query any RDF-based data
  o Rule Interchange Format (RIF) and Semantic Web Rule Language (SWRL)
    ▪ Describe relations that cannot be described in OWL
  o Unifying Logic
    ▪ Logical reasoning (infer new facts and check consistency)
  o Proof
    ▪ Explain logical reasoning steps
  o Cryptography
    ▪ Protect RDF data via encryption
    ▪ Validate the source of facts by digitally signing RDF data
  o Trust
    ▪ Authentication of sources and trustworthiness of derived facts
  o User Interface
    ▪ User interfaces for semantic web applications
- Resource Description Framework (RDF)
  o Describes data and metadata about specific subjects, structure of data sets, relationships between bits of data
  o RDF statement (triple) consists of {subject, predicate/property, object/value}
  o Subjects, predicates and objects are all resources
    ▪ Subject → URI reference or blank node
    ▪ Predicate → URI reference defining the relationship
    ▪ Object → URI reference, literal or blank node
  o Stored in relational databases or triplestores
  o Advantages:
    ▪ Simple
    ▪ Enables merging data from different data models (only URI needed)
    ▪ Same resource can be annotated by different people
    ▪ Well-defined standard
- RDF Graph
  o Set of RDF statements can be represented as a directed labelled graph
  o Nodes → specific instances (because RDF is instance-based)
  o Anonymous resources don’t have explicit identifiers → blank node
- RDF Reification
  o RDF triple is not a resource
  o Reify statement → make resource out of statement as a blank node
    ▪ (blanknode, isSubject, originalStatementSubject)
    ▪ (blanknode, isObject, originalStatementObject)
    ▪ (blanknode, Property, originalStatementProperty)
- RDF Schema (RDFS)
  o Vocabulary description language for RDF
  o Define common concepts and relationships
    ▪ Classes and subclasses
    ▪ Properties and subproperties
    ▪ Domain and range of a property
    ▪ seeAlso, isDefinedBy
    ▪ Label, comment, …
  o Provides basic elements
for the definition of ontologies
  o Advantages:
    ▪ Richer expressiveness with RDFS
    ▪ Simple reasoning
    ▪ Many existing tools to deal with RDFS
  o Disadvantages: cannot express
    ▪ Requirement
    ▪ Cardinality
    ▪ Symmetry
  o RDF(S)/XML Serialisation
    ▪ Standard is hard
    ▪ RDF Notation 3 (N3) (short non-XML serialisation)
    ▪ RDF Turtle Notation (removes unnecessary features from N3)
  o RDF applications
    ▪ Annotea project (defines RDF Schema for types of annotations that can be used to annotate webpages)
    ▪ RSS
    ▪ Dublin Core (widely used to describe digital media, also in HTML)
- SPARQL Query Language
  o Extract information as URIs, literals, blank nodes or subgraphs
  o Syntax similar to SQL
- Jena Semantic Web Framework for Java
- Protégé (free open-source platform to create, manipulate and visualize ontologies)
- Friend of a Friend (FOAF)
  o First social semantic web application
  o Describe a social network without a central database
- Semantic Wikis
  o Use semantic web technologies to provide machine-processable wiki content
  o Much richer query interface
  o Existing semantic wikis: DBpedia (semantic web version of Wikipedia)
- Linked Data
  o Link different data sources on the Web
  o Linked Open Data cloud project
    ▪ RDF triples from currently 1000+ datasets
- Semantic Desktops
  o Apply semantic web technologies to personal information management (PIM)
    ▪ Inter-application data sharing
    ▪ Enhancement of limited filesystem functionality
- Microformats
  o Add semantics to (X)HTML pages (through classes)
  o Some search engines pay attention to different types of microformats
- RDF in Attributes (RDFa)
  o Add a set of attribute extensions to (X)HTML for embedding RDF metadata
  o Search engines process certain RDFa metadata (e.g.
product information)
- Microdata
  o HTML5 in-house support
  o Add machine-readable metadata (semantics) to HTML5 documents in the form of key/value pairs
  o Can be used by crawlers, search engines and browsers for a richer browsing experience
  o Alternative to microformats and RDFa

Web Search and SEO
- Search Engine Result Page (SERP)
  o Organic and non-organic (paid) search results
- History
  o Starts at Bush’s Memex
  o Archie (first internet search engine)
  o W3Catalog (first web search engine, manually maintained)
  o JumpStation (first engine to combine crawling, indexing and searching)
  o A lot of new search engines
  o Early web search solutions
    ▪ Full-text search (web crawler + indexer)
    ▪ Manually maintained classification
- Information Retrieval (performance measures)
  o Precision: ratio of right picks to all picks
  o Recall: ratio of right picks to all right ones
  o F-score: 2 * (precision * recall) / (precision + recall)
- Boolean Model
  o Index of keywords x documents with true or false depending on whether a keyword appears in a document
  o Uses boolean operators to find result set
  o Advantages:
    ▪ Easy to implement, scalable
    ▪ Fast query processing
  o Disadvantages:
    ▪ No ranking of output
    ▪ Often user needs to learn special syntax to search for phrases
  o Variants (like the inverted index) form the basis of many search engines
- Web Search Engines
  o Most based on traditional information retrieval techniques
  o Deal with:
    ▪ Immense amount of data
    ▪ Hyperlinked resources
    ▪ Dynamic content with frequent updates
    ▪ Self-organized web resources
  o Query answer time
- Web Crawler
  o Used to create index of webpages to be used by search engine
  o Must deal with the following issues:
    ▪ Freshness (updated frequently)
    ▪ Quality (priority for high-quality pages)
    ▪ Scalability (should be able to increase crawl rate by adding more servers)
    ▪ Distribution (run in a distributed manner)
    ▪ Robustness (deal with page errors and crawler traps)
    ▪ Efficiency (resources used in most efficient way)
    ▪ Crawl rates (pay attention to
existing web server policies)
- Pre-1998 Web Search
  o Ranking based on on-page factors (IR): poor quality of search results (order)
  o PageRank → absolute quality of a page
    ▪ Query-independent
- PageRank
  o High PageRank → many pages linking to it, high rankers linking to it
  o Total score = IR score * PageRank
  o Algorithm:
    ▪ Rank of a page = sum over every page linking to it of (rank of linking page / number of its outgoing links)
    ▪ Slow formula → transformed into matrix multiplication
    ▪ Power method to find R ($R_{t+1} = H R_t$)
  o Dangling Pages (Rank Sink)
    ▪ Pages without outgoing links become PageRank sinks
    ▪ Solution: add artificial outgoing links to all pages (including itself) (stochastic adjustment)
  o Strongly Connected Pages (Graph)
    ▪ Add new probabilities between all pages
    ▪ Prob d → follow hyperlink structure
    ▪ Prob 1 − d → choose random page
    ▪ Matrix G represents a random surfer
- Google Search Central
  o Services and information about a website
  o Site configuration (submission of sitemap, crawler access, URLs of indexed pages)
  o Site performance (search queries, countries, devices, …)
  o Enhancements (core web vitals, mobile usability)
  o Security issues
- XML Sitemaps
  o List of URLs that should be crawled and indexed (with frequencies)
  o Just a suggestion (not guaranteed to be added to index)
  o Additional metadata might be provided to search engines
- Search Engine Marketing (SEM)
  o Aims to increase visibility of a website
    ▪ Search engine optimisation (SEO)
    ▪ Paid search advertising
    ▪ Social media marketing
  o SEO should not be decoupled from content, structure, design, …
  o SEO is a continuous process (rapidly changing environment)
- Search Engine Optimisation (SEO)
  o On- and off-page factors
  o Difference between optimisation and breaking search engine rules (white hat and black hat optimisations)
  o Break rules → to Google hell (to the supplemental index)
  o Positive On-Page factors
    ▪ Use of keywords in relevant places (title tag, URL, domain name, header tags, …)
    ▪ Mobile usability (mobile-first
indexing)
    ▪ Fast page load times
    ▪ Provide metadata
    ▪ Quality of HTML code
    ▪ Security and accessibility
    ▪ Uniqueness of content across the website
    ▪ Flat website structure (minimise link depth)
    ▪ Use keyword-rich anchor texts
    ▪ Avoid PageRank leakage
    ▪ Increase number of pages
    ▪ Think about hidden content (RIA)
    ▪ Consistent webpage addressing
  o Negative On-Page factors
    ▪ Links to “bad neighbourhoods”
    ▪ Link selling
    ▪ Keyword stuffing
    ▪ Hidden content (same colour as background)
    ▪ Cloaking (different content for spider and user)
    ▪ Malware being hosted on the page
    ▪ Duplicate or similar content
    ▪ Slow page load time
    ▪ Copyright violations
  o Positive Off-Page factors
    ▪ Links from pages with high PageRank
    ▪ Keywords in anchor text of inbound links
    ▪ Links from topically relevant sites
    ▪ High clickthrough rate from search engine
    ▪ High number of shares
    ▪ Site age (implying stability)
    ▪ Domain expiration date (implying longevity)
  o Negative Off-Page factors
    ▪ Not accessible to crawlers
    ▪ High bounce rate (user hits the back button right after entering the page)
    ▪ Link buying
    ▪ Link farms
    ▪ Spamdexing (adding links to your page from other pages); solution: ‘nofollow’ links, which SEO experts in turn exploit for PageRank sculpting (controlling the flow of PageRank within a website)
    ▪ Links from bad neighbourhoods?
    ▪ Duplicate content (from competitor)?
- Non-organic Search
  o Cost per impression or cost per click
  o Not independent from organic search
    ▪ E.g. 
Landing page
  o Google Ads → non-organic web search service
    ▪ Ads or search boost

Security, Privacy and Trust
- Security Aspects
  o Authenticity (knowing the sender or receiver of data)
  o Privacy (keeping information private)
  o Integrity (ensuring information is not changed when transferred)
- HTTP Authentication
  o Native authentication functionality offered by HTTP
    ▪ Server can first respond to a request with an authentication challenge
  o HTTP is extensible → supports other authentication protocols (like TLS) and itself offers
    ▪ Basic access authentication (Base64 encoding of username:password)
    ▪ Digest access authentication
  o Security realms → grouped protected resources with different sets of authorised users or groups of users
- Basic Access Authentication
  o Base64 Encoding
    ▪ Represents binary data in a portable format
    ▪ Takes a sequence of bytes and breaks it into 6-bit chunks
    ▪ Each 6-bit chunk is represented by a character from a 64-character alphabet
    ▪ Not secure (easily reversible) → was not made for encryption
  o Web Server Configuration
    ▪ Create password file
    ▪ Put a .htaccess file with the configuration into the directory that has to be protected
  o Basic Access Authentication is not secure
    ▪ Username and password sent almost in cleartext
    ▪ Easy to do replay attacks
  o Potential solutions
    ▪ Combine BAA with encrypted data transfer → does not prevent replay attacks
    ▪ Digest Access Authentication
- Digest Access Authentication
  o One-way digest computed and sent to the server
    ▪ With hash function (irreversible)
  o Server sends a special token (nonce) that changes with every request
    ▪ Incorporated in the digest
    ▪ → server sees if a nonce is used again
    ▪ Pre-emptive authorization → server sends the next nonce in advance, so the client can send the computed hash with the original request
- Transport Layer Security
  o Cryptographic protocol to ensure secure network communication
  o Situated at the application layer or presentation layer
  o Types of authentication
    ▪ Unilateral authentication (server side)
    ▪ Mutual authentication (client and server side)
- Cryptography
  o 
Cipher (coding scheme) used with a key to create ciphertext out of plaintext
  o Cryptanalysis → get information out of ciphertext without having access to the secret information or key
- Symmetric Key Cryptography
  o Same key for encoding and decoding
  o Only the key is secret (algorithms are public)
  o Enumeration attack → tries all keys
  o Problem:
    ▪ The common key has to be shared secretly
    ▪ Repeated for every pair of communicators
    ▪ Insecure channels
    ▪ Storage?
- Public Key (Asymmetric) Cryptography
  o Asymmetric pair of keys
    ▪ Publicly available key for encoding
    ▪ Secret key for decoding
  o Each party has a single public key used to encode messages for that party
  o That party can decode those messages with their private key
  o No need to secretly share keys anymore
  o RSA cipher
    ▪ Public key cipher for encryption and signing
    ▪ Keys generated based on multiplication of large prime numbers
    ▪ Factorising the product is practically infeasible (unless in possession of an extremely powerful computer, such as a quantum computer)
  o Problem:
    ▪ Much slower than symmetric ciphers
  o Hybrid solution:
    ▪ Public key encryption used in the setup phase to securely exchange a pair of symmetric keys
    ▪ Afterwards a secure channel is established based on the symmetric keys
- Digital Signatures
  o Used for authenticity and integrity (message unchanged) of a message
  o Sender computes a digest of the message, encodes it with their private key and sends it along with the message
  o Receiver decodes the received digest with the sender’s public key, computes the digest of the received message and compares the two
- Digital Certificate
  o Information about a person/company that is digitally signed by a certificate authority (CA)
- HTTP Secure (HTTPS)
  o Combines HTTP with asymmetric, symmetric and certificate-based cryptography
  o https:// URL prefix
- Email Security
  o Emails are sent as unencrypted plain text
  o Stored on multiple intermediary servers before reaching the target
    ▪ Easy to intercept
  o Third-party tools such as Pretty Good Privacy
  o Email spam → phishing attacks with botnets
  o Botnets → infected machines that can be remotely 
controlled
    ▪ Distributed denial of service attacks (DDoS) → attack on a server
- Firewalls
  o Artificial bottlenecks to block specific ports, filter and block content, protect private intranets
- Privacy
  o Continuous logging of requests
  o Stored by the server
  o Possible to build user profiles with these logs
  o Internet Archive (all information ever published)
- Web Log
  o IP address
  o URL accessed
  o Request time
  o Referrer link (where you came from) → potentially sensitive information
  o Browser type
  o …
  o Site owners can use various tools to analyse log files
  o Use of the data must be mentioned in the privacy policy
- Cookies Revisited
  o Persistent cookies (similar to IP, but more precise)
  o Third-party cookies can be used to build user profiles → used for targeted ads or sold
  o Shouldn’t be used for authentication (cookie poisoning)
- Web Bugs
  o User tracking
  o Embedded in a webpage
  o Informs the server whenever the page is accessed
  o Can be embedded in emails, Word docs, …
- Other services with privacy issues
  o Google Earth (military bases)
  o Google Street View (individual privacy violation)
  o Free Google (and other company) services
- Google Analytics
  o Tool for web administrators to analyse web traffic on their website

Future Trends
- The Future of the Web
  o Web of documents → Web of structured data and services
    ▪ Semantic web and linked data
    ▪ Cloud computing
  o The internet as One Global Machine
    ▪ Automatic reasoning
    ▪ Interoperability of services
  o The Mobile Web
    ▪ Access information from anywhere at any time
    ▪ Feed the machine
    ▪ Teach the machine new relationships between data
  o Internet of Things (Web of Things)
    ▪ Integration of physical objects with the global machine
    ▪ Sensory input data to digital space
    ▪ Augmented reality
  o Personal data storage moves from the personal computer to the global machine
  o User Interfaces for the global machine
    ▪ Personalised filtering and recommendations (risk of filter bubble)
    ▪ Cross-media browsing
- New forms of User Interfaces
  o Is the concept of a document still relevant? 
  o Semantic zooming into concepts
  o Natural user interfaces (augmented reality interfaces)
  o Linked data to overcome limitations of existing document-centric desktop interfaces
- Solid (Social Linked Data)
  o Decentralising the web (Tim Berners-Lee)
  o Set of tools and conventions based on Linked Data principles (URI, RDF(S), OWL, SPARQL)
  o Gives users control over their data (true data ownership)
    ▪ Personal online datastores (pods): users decide where their pods are hosted
    ▪ Decoupling of applications and data: data can be reused across applications, which avoids vendor lock-in
- General Infrastructure for Mixed Reality
  o W3C WebXR Device API
    ▪ Cross-platform compatibility
    ▪ Web-based applications running in the browser
  o W3C Web of Things (WoT) Thing Description
    ▪ Metadata and interfaces of IoT devices & passive things
  o Solid project
    ▪ Decentralized Solid Pods in addition to the web
    ▪ Collaborative authoring and sharing of information
- RSL Hypermedia Metamodel and iServer
  o Resources, Selectors, Layers
  o Cross-media platform
- Personal Information Management (PIM)
  o Keeping, organizing and re-finding information
  o Digital and/or physical → seamless combination
  o OC2 PIM model (based on the RSL hypermedia metamodel)
  o Explicit and implicit associations between entities
- MindXpres Presentation Platform
  o Flexible representation of presentations
    ▪ Use of structural RSL links
    ▪ Separate content and structure
    ▪ Based on HTML5, CSS, JS
  o Content-based approach
  o Transclusion (content reuse)
  o Non-linear navigation
  o Rich media types
  o Associative linking
- RSL-based associative file system
  o Files in multiple folders
  o Links between stored media
  o Cross-media transclusion
- RSL-based hypermedia engine
  o Translates files from legacy applications to RSL
  o Compatible with hypermedia-enabled applications
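
Worked example (PageRank, see Web Search and SEO): the power method, the handling of dangling pages and the random surfer described in the PageRank notes can be sketched in a few lines of Python. This is only an illustrative sketch and not a reference implementation: the function name `pagerank`, the value d = 0.85, the tolerance and the four-page example graph are all assumptions made for the example. Here d is the probability of following a hyperlink, the remaining probability 1 - d is a jump to a random page, and the rank held by pages without outgoing links is spread evenly over all pages.

```python
def pagerank(links, d=0.85, tol=1e-10, max_iter=200):
    """Iterate R_{t+1} = G R_t until the ranks stop changing.

    links maps each page to the list of pages it links to.
    Assumption: every link target also appears as a key of links.
    """
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(max_iter):
        # Rank sitting on dangling pages (no outgoing links) is shared by
        # all pages, as if they linked to every page including themselves.
        dangling = sum(rank[p] for p in pages if not links[p])
        # Random jump (probability 1 - d) plus the redistributed dangling rank.
        new = {p: (1.0 - d) / n + d * dangling / n for p in pages}
        # Each page passes its rank on in equal parts over its outgoing links.
        for p in pages:
            for q in links[p]:
                new[q] += d * rank[p] / len(links[p])
        delta = sum(abs(new[p] - rank[p]) for p in pages)
        rank = new
        if delta < tol:
            break
    return rank


# Hypothetical four-page web: D has no incoming and no outgoing links.
example = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []}
ranks = pagerank(example)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

The ranks always sum to 1, so they can be read as the probability that the random surfer is on a given page. In a full search engine this query-independent score would still be combined with the IR score of each page.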