Web Technologies Summary PDF

Contents
  History
  Web Architectures
  HTML5 and the Open Web Platform
  CSS
  XML and Related Technologies
  Web 2.0 Patterns and Technologies
  Semantic Web and Web 3.0
  Web Search and SEO
  Security, Privacy and Trust
  Future Trends

History
- Reading wheel
  o comparable to modern tabbed browsing
  o seen as a predecessor of hypertext
- Dewey Decimal Classification
  o 10 classes * 10 divisions * 10 sections (+ decimals)
    ▪ E.g. 395 → class 3, division 9, section 5
- As We May Think (Vannevar Bush, 1945) Add paper content !!!
  o Associative indexing instead of hierarchical indexing (like the mind)
  o “Origin” of hypertext
  o Memex (memory extender) → hypertext machine
    ▪ Trails (cross-references) between microfilms (pieces of information)
    ▪ Trail blazers → job for trail makers (idea)
- Hypertext (Ted Nelson, 1965)
  o Xanadu (first hypertext project)
    ▪ Transclusion (referencing/embedding parts of a document in another)
    ▪ Bidirectional links
  o Digital document constrained to printable content (PAPER UNDER GLASS)
- Hypertext Editing System (HES)
  o Limitations
    ▪ Unidirectional links
    ▪ Non-overlapping links (no separate links for overlapping content in doc)
    ▪ Only embedded links (links cannot be created or stored separately from the document)
  o File Retrieval and Editing System (FRESS)
    ▪ Follow-up project
    ▪ First introduction of undo
- oNLine System (NLS) → vision of future interactive computing
  o first practical use of hypertext
  o computer mouse
  o remote collaboration
  o ...
- Hypermedia → extension of hypertext with other media types
  o Aspen Moviemap (early hypermedia system)
    ▪ Pictures taken every 10 feet while driving through the city
    ▪ Similar concept now used in Google Street View
- HyperCard (Apple, 1987)
  o Early widespread hypermedia system
  o Information stored in cards, arranged into stacks
  o Links can be defined between cards
  o May contain text, pictures, audio, video
  o HyperTalk programming language
- ARPANET (Advanced Research Projects Agency NETwork) (1969)
  o Early version of the Internet
  o First operational packet switching network
    ▪ Messages passed via other connected machines → no direct connection
  o Applications: Email, FTP
  o Network over mostly USA university machines (+ a few in Europe)
- Transmission Control Protocol (TCP) (1974)
  o Replacement for the network control protocol (NCP)
  o “Assume that hardware is unreliable, build reliability into software”
  o Protocol for Packet Network Interconnection
  o Transition ARPANET → TCP/IP
- TCP/IP (1978) → 4 layers
  o Application layer: HTTP, FTP, POP, …
  o Transport layer: TCP, UDP, …
  o Internet layer: addressing hosts and packet routing
  o Link layer
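
The four-layer idea can be illustrated with a small sketch. It is not a real network stack: the bracketed "headers" and the function names `encapsulate` and `decapsulate` are made up for illustration. It only shows that each layer wraps the data handed down from the layer above on the sending side and unwraps it in reverse order on the receiving side.

```python
# Toy model of TCP/IP layering; headers are illustrative labels, not real protocol headers.
LAYERS = ["application", "transport", "internet", "link"]  # top to bottom


def encapsulate(payload: bytes) -> bytes:
    """Sender side: every layer prepends its own (fake) header, top to bottom."""
    frame = payload
    for layer in LAYERS:
        frame = f"[{layer}]".encode() + frame
    return frame


def decapsulate(frame: bytes) -> bytes:
    """Receiver side: strip the headers again, bottom (link) to top (application)."""
    for layer in reversed(LAYERS):
        header = f"[{layer}]".encode()
        if not frame.startswith(header):
            raise ValueError(f"missing {layer} header")
        frame = frame[len(header):]
    return frame


if __name__ == "__main__":
    wire = encapsulate(b"GET / HTTP/1.1")
    print(wire)               # outermost header is the link layer
    print(decapsulate(wire))  # original application data
```

The outermost header belongs to the lowest layer, which is why an intermediate machine that only routes packets never needs to look at the application data.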
- World Wide Web (WWW)
  o Originally: The Information Mine (Tim) → networked hypertext system over ARPANET to share information at CERN
  o By 1990
    ▪ HyperText Transfer Protocol (HTTP)
    ▪ HyperText Markup Language (HTML)
    ▪ HTTP server software
    ▪ Web browser (WorldWideWeb)
  o Network-enabled version of the HES model
    ▪ Unidirectional links (no bidirectional links)
    ▪ No transclusion
    ▪ No external (non-embedded) links
- Mobile Web
  o Access the Web from anywhere at any time
- Web 2.0
  o No new technology!
  o User becomes author (wikis, tagging, …)
- Web 3.0
  o Semantic Web / Machine-interpretable Web
  o Add explicit semantics to web resources
  o Use of ontologies
  o Potential reasoning over web resources
- Internet of Things / Web of Things
  o Physical objects with embedded computing functionality with active/passive participation in the Web
  o Ubiquitous Computing, Disappearing Computing, Pervasive Computing

Web Architectures
- Basic Client-Server Web Architecture
  o Effect of going to http://www.vub.be:
    1) Use the Domain Name System (DNS) to get the IP address for www.vub.be
    2) Create a TCP connection to the IP address
    3) Send an HTTP request message over the TCP connection
    4) Visualise the received HTTP response message in the browser
- Web Server
  o Tasks
    1) Set up connection
    2) Receive and process HTTP request
    3) Fetch resource
    4) Create HTTP response
    5) Logging
  o Prominent web servers: nginx, Apache HTTP Server
- HTTP Protocol
  o Communication always initiated by client
  o Stateless protocol (no sessions)
  o HTTPS scheme for encrypted connections
- Uniform Resource Identifiers (URI)
  o Uniform Resource Locator (URL)
    ▪ Information about the exact location of the resource
    ▪ scheme, host, path (e.g. https://vub.academia.edu/BeatSigner)
    ▪ resource moved → URL different → Persistent URLs (PURLs)
  o Uniform Resource Name (URN)
    ▪ Unique and location-independent name
    ▪ scheme name, namespace identifier, namespace-specific string (e.g.
      urn:ISBN:3837027139)
- HTTP Message Format
  o Request
    ▪ Start line
    ▪ Methods
      - GET: get resource
      - HEAD: get header
      - POST: send data (in the body)
      - PUT: store request body
      - TRACE: get the “final” request (after it has potentially been modified by proxies)
      - OPTIONS: get list of (HTTP) methods supported by the server
      - DELETE: delete a resource
  o Response
    ▪ Start line
    ▪ Status codes
      - 100-199: informational
      - 200-299: success
      - 300-399: redirection
      - 400-499: client error
      - 500-599: server error
  o Header fields
    ▪ General, specific, entity and extension headers:
      - Accept → media types the client will accept
      - User-Agent → type of client
      - Keep-Alive / Persistent → for performance (otherwise a new connection for every webpage element)
      - Content-Type → body’s media type
      - If-Modified-Since (with GET → conditional GET) → resource returned only if modified since the specified date
  o Media types (MIME types) → define message body content (for processing)
    ▪ e.g. text/plain, text/html, image/jpeg, …
  o Message information (viewable in browser developer console)
- Proxies
  o Web proxy between client and server
  o Server for client, client for server
  o Used for: firewalls, content filters, transcoding, content routers, …
- Caches
  o Proxy server to reduce server load if multiple clients share the same cache
  o Cache hierarchies (communicate with Internet Cache Protocol, ICP)
  o Special HTTP cache control header fields: expires, max-age, no-cache
  o Validators: Last-Modified time, entity tags
  o Advantages:
    ▪ Reduced latency and bandwidth
    ▪ Reduced server load
    ▪ Transparent
  o Disadvantages
    ▪ Additional hardware required
    ▪ Might get outdated data out of the cache
    ▪ Server has less control
- Tunnelling
  o Transmit one protocol encapsulated inside another protocol
  o Often used to open a firewall to protocols that would otherwise be blocked
- Gateways
  o Glue/translator between client and server
  o Application server → combination of gateway and destination server
- Session management → because HTTP is stateless, possible implementations:
  o IP
    address (often not unique to user)
  o Browser login (HTTP authentication headers, send user info with each request)
  o URL rewriting (add information to URL in each request)
  o Hidden form fields (like URL rewriting, but in body)
  o Cookies
- Cookies
  o Piece of information assigned to the client on their first visit
  o Key-value pairs
  o Sent via Set-Cookie HTTP response headers
  o Sent back every time the same server is accessed
  o Potential privacy issues
    ▪ Persistent cookies with long lifetime
    ▪ Third-party cookies for user tracking across websites
- HTML
  o Dominant markup language for webpages
- Dynamic Web Content
  o Server side
    ▪ Common Gateway Interface (CGI)
      - Certain requests forwarded via CGI to a program
      - Program processes request and creates answer
      - Problems: poor performance (new database connections)
      - Solution: FastCGI with persistent processes and process pools
    ▪ Java Servlets
      - Java class that extends the abstract HttpServlet class
      - Loaded by a servlet container
      - Relevant requests are forwarded to servlet instance for further processing
      - Disadvantage: whole page must be defined within servlet
    ▪ Jakarta Server Pages (JSP)
      - Add program code through scriptlets and markup to existing HTML pages
      - Interpreted on the fly or compiled into Java servlets
    ▪ Node.js
      - Server-side JavaScript
      - Handle requests, database, sessions, …
      - High modularity (packages, frameworks, …)
  o Client side
    ▪ JavaScript
      - Interpreted scripting language for client-side processing
      - Embedded in HTML or separate file
    ▪ Java Applets
      - Program delivered to the client side in the form of Java bytecode and runs in a sandbox (JVM)
      - Advantages: most recent version in browser, high security
      - Disadvantages: requires Java plug-in, advanced functionality only for signed applets
      - Replaced by Java Web Start (JavaWS)
  o Web Application Frameworks
    ▪ Software framework to support development of dynamic websites, web applications, web services and web resources.
    ▪ Database access, templating frameworks, session management, code reuse
    ▪ Faster, more robust development process
    ▪ Model-View-Controller (MVC) design pattern
  o No mix of application logic and view
  o Model
    ▪ Data (state) and business logic
    ▪ Multiple views can be defined for a single model
    ▪ Model state changes → view notified
  o View
    ▪ Renders data of the model
    ▪ Notifies controller about changes
  o Controller
    ▪ Processes interactions with the view
    ▪ Transforms view interactions into operations on the model (state modification)

HTML5 and the Open Web Platform
- HyperText Markup Language (HTML)
  o Application of Standard Generalized Markup Language (SGML)
  o Markup tags to define the structure and presentation of an HTML doc (webpage)
  o Nested tags, tags with attributes
  o HTML → Document Object Model (DOM) by browser
    ▪ Standard to create, read, update and delete HTML elements
  o Hyperlinks to connect different HTML documents (unidirectional, embedded)
- History
  o HTML 1.0
  o HTML 2.0
    ▪ browser war between Netscape and Internet Explorer
  o HTML 3.2
    ▪ Developed only by W3C
    ▪ Tables, visual appearance elements (not the original idea for HTML!!)
  o HTML 4.0
    ▪ Unicode (internationalisation)
    ▪ CSS
    ▪ W3C stopped developing HTML
  o XHTML 1
    ▪ XML application of HTML
    ▪ Draconian error handling (handle errors in HTML docs)
  o XHTML 2.0
    ▪ Revolutionary changes, but broke backwards compatibility, meaning that existing HTML pages had to fix all the errors that were previously ignored
  o HTML5
    ▪ Developed by Web Hypertext Application Technology Working Group (WHATWG) and W3C
    ▪ HTML – Living Standard → current standard
    ▪ Continually developed by community
- Problems
  o Mix of content, structure, presentation
  o Forgiving browsers rendering HTML docs with errors
- XHTML
  o HTML as XML application (instead of SGML)
  o Strict adherence to standard
- HTML5 Design Principles
  o Compatibility
    ▪ Evolve the language (backward compatibility)
  o Utility
    ▪ Separation of content and presentation
    ▪ Solve real problems (pragmatic approach)
  o Interoperability
    ▪ Interoperable browser behaviour
    ▪ Identical error handling across browsers
  o Universal Access
    ▪ Work across platforms, devices and media
    ▪ Accessible to users with disabilities
  o Simple is better
  o Avoid external plug-ins
- Open Web Platform APIs
  o Standard way of accessing specific functionality
  o E.g.
    SVG, RDFa, Geolocation, JavaScript, File API, Fullscreen, …
- HTML5 Markup
  o Added structural and media tags
  o Removed presentation tags
  o Improved forms → simple HTML client-side form validation
  o Video, audio, 2D and 3D graphics, vector graphics
- Web Storage API
  o localStorage (same-origin policy → no time limit)
  o sessionStorage (per window → deleted when browser window is closed)
  o replaces cookies for large data
- Web workers
  o Execute JavaScript in background (otherwise page non-responsive)
  o Avoid complexity → independent JavaScript contexts + event-driven message passing
- WebSocket API
  o Bidirectional, full-duplex socket connection
  o Allows server-initiated updates
  o HTTP → WebSocket protocol upgrade (Connection: Upgrade and Upgrade: websocket in header)
- Geolocation API
- Offline Web Application
- Fullscreen API
- Screen Orientation API
- Page Visibility API
- Battery Status API
- Vibration API
- Web Notifications API
- …

CSS
- Cascading Style Sheets (CSS)
  o Separation of presentation (CSS) and content (HTML)
  o Enable multiple presentations of the same content
  o Versions:
    ▪ CSS1
    ▪ CSS2 → relative, absolute, fixed positioning
    ▪ CSS3 → 2D and 3D transformations, transitions, flex, media queries, … divided into separate modules
- Inclusion
  o Inline style
    ▪ Mixes content and presentation
  o Internal style sheet
    ▪ In a style tag in the head of the HTML doc
  o External style sheet
    ▪ Link to stylesheet

XML and Related Technologies
- eXtensible Markup Language (XML)
  o standardised text format for (semi-)structured information
  o meta markup language → tool for defining other markup languages
  o ordered labelled tree (think HTML structure)
  o simple, general, accepted
  o not a programming language
  o not a database (lacks database management system features)
- Evolution
  o descendant of Standard Generalized Markup Language
  o “SGML-Lite”
- XML Specification
  o Grammar for XML documents (tag placement, legal element names, …)
  o General tools (parsers, editors, programming APIs)
- XML Tree Document
  Structure (7 node types)
  o Root node
  o Element node
  o Attribute node
  o Text node
  o Comment node
  o Processing instruction node
  o Namespace node (useful when using more than one specification)
- Well-Formedness and Validity
  o Well formed → follows XML specification (correct nesting, valid names, …)
  o Valid → follows Document Type Definition (DTD) or XML Schema
    ▪ Custom specification by developer
    ▪ Can be checked with accompanying parser
- How XML is different from HTML
  o Tool for specifying markup languages
  o Not a presentation language
  o Supports applications outside of web browsing as well
  o Must be well-formed and valid
  o Readability > conciseness
  o Matching tags are case sensitive
- XHTML
  o Reformulation of HTML into an XML application
  o Stricter than HTML (title first element in head, lowercase, namespace declared, …)
- XML Technologies
  o XPath and XPointer
    ▪ Addressing XML elements and parts of elements
  o XSL (Extensible Stylesheet Language)
    ▪ Transforming XML documents
  o XLink (XML Linking Language)
  o XQuery (XML Query Language)
  o Document Type Definition (DTD) and XML Schema
    ▪ Schema definitions for XML documents
    ▪ DTD → limited expressive power
    ▪ XML Schema → datatypes, inheritance, …
  o SAX (Simple API for XML)
    ▪ Event-based programming API for reading XML documents
  o DOM (Document Object Model)
    ▪ Programming API to access and manipulate XML documents as tree structures
- XPath
  o Expression language to address elements of an XML doc
  o Location path → sequence of location steps separated by slashes
- XSLT (Extensible Stylesheet Language Transformations)
  o Expression-based language based on functional programming concepts
  o Most important part of XSL
  o Uses XPath for navigation
  o Pattern matching to select parts of documents
  o Templates to perform transformations
- XPointer (XML Pointer Language)
  o Address points or ranges
  o Uses XPath
  o Relative addressing, links to elements without anchors
- XLink (XML Linking Language)
  o Create links in XML docs
  o Links can be
    defined in separate documents
  o Simple links (think HTML) and extended links (associate arbitrary number of resources)
  o Annotea project (uses XLink for managing external annotations) used in Amaya Web Browser
- SAX (Simple API for XML)
  o Scans doc, invokes callback methods at start doc, end doc, start tag, end tag, character data, processing instruction
  o Less memory than DOM parser
- Document Object Model (DOM)
  o Language-neutral API for accessing and manipulating XML documents as a tree structure
  o Whole doc must be read and parsed before it can be used in a DOM application
  o Types of DOM core interfaces
    ▪ Generic node interface
    ▪ Node-specific interfaces
- XML for Data Interchange
  o General way to query data from different systems (e.g. via XQuery)
  o Connect applications running on different operating systems and computers with different architectures
    ▪ XML Remote Procedure Call (XML-RPC)
    ▪ Simple Object Access Protocol (SOAP)
      - Successor of XML-RPC
      - Used for accessing Big Web Services
- XML Remote Procedure Call (XML-RPC)
  o Advantages:
    ▪ Understood by different applications (XML-based lingua franca)
    ▪ HTTP carrier protocol
    ▪ Based on HTTP and XML standards → easy to implement
  o Disadvantages:
    ▪ Slower than specialised protocols used in closed networks
  o GOMES (GUI for Object Model multi-user Extended fileSystem)
    ▪ Uses XML-RPC to communicate with OMES (coded in Oberon)
- eXtensible Information Management Architecture (XIMA)
  o generic database interface
  o XSLT stylesheet chosen based on device type (User-Agent HTTP header field)
  o HTML for PC browsers, WML for old phone browsers, VXML for voice browsers, …

Web 2.0 Patterns and Technologies
- Web 2.0
  o New generation of web apps
  o User-generated content
  o Data as a driving force
  o Collective intelligence
  o Web as a platform (instead of static pages)
  o Not a new technology
- Main ingredients
  o Social Web
    ▪ prosumer → producer + consumer (e.g.
      wikis, blogs, social media)
    ▪ democracy
  o Rich Internet Applications (RIAs)
    ▪ Desktop in the browser
    ▪ Highly interactive applications (e.g. Google Docs)
    ▪ Based on AJAX
  o Service-Oriented Architectures (SOAs)
    ▪ Enable sharing of information and services between different Web 2.0 applications
- The Long Tail
  o New economic model: combine infinite shelf space with shared real-time public opinions and buying trends
  o Major part of web content consists of small sites → provide tools to address the long tail (as opposed to only the head, i.e. popular content)
- Wikis
  o Any user can create new pages or edit existing pages
  o Democracy-based control of content
  o Reliability never guaranteed
- Blogs
  o Chronologically ordered list of information
  o Delivering news and getting in touch with community
- Flickr
  o Image hosting and sharing website
  o User-generated taxonomy (folksonomy)
  o Images may be added to multiple albums (which a filesystem lacks)
- Folksonomies
  o Folk + taxonomies
  o User-generated taxonomy
  o Social tagging (e.g. Instagram, Annotea)
- Social Implications of Web 2.0
  o Data ownership and copyright issues
  o Collective intelligence (wisdom of crowds)
  o Controlled media (e.g. CNN) → collaborative communities (e.g. Twitter)
  o New crediting models
  o Everybody has a (big) voice
  o The kindness of strangers (video content !!)
- The Programmable Web
  o Based on HTTP
  o Data encoded in XML (or JSON, plain text, HTML, binary)
- Rich Internet Applications (RIAs)
  o Bring desktop to browser
  o Highly responsive (async and partial content updates)
  o Rich Graphical User Interface (GUI)
- Asynchronous Partial Updates
  o Asynchronously update parts of a resource (instead of the whole resource)
  o Initiated by client (keypress, state change, …)
  o Updates cannot be initiated by the server if HTTP is used!
  o AJAX
- Asynchronous JavaScript and XML (AJAX)
  o (nowadays JSON instead of XML)
  o Not a technology by itself
    ▪ HTML and CSS for visualisation
    ▪ JavaScript with DOM for dynamic change of information presented
    ▪ Method to asynchronously exchange data between client and server
    ▪ Client-side AJAX engine (deals with asynchronous message handling)
  o XMLHttpRequest Object
    ▪ onreadystatechange (register callback)
    ▪ readyState (response status server)
      - 0 (uninitialized) → created, but not initialised
      - 1 (open) → created, but send method not called
      - 2 (sent) → send called, HTTP response header received
      - 3 (receiving) → response not yet available
      - 4 (loaded) → response available, data received
    ▪ responseText, responseBody, responseXML (server response)
  o Advantages
    ▪ Reduced load time and higher responsiveness
    ▪ State can be maintained
  o Disadvantages
    ▪ Not possible to bookmark any particular state of an application
    ▪ Content might not be crawled
    ▪ Cannot be used in browsers with disabled JavaScript functionality
- Service-Oriented Architecture (SOA)
  o Architecture that modularises functionality as interoperable services
  o Software as a service
- Representational State Transfer (REST)
  o Architectural style for distributed hypermedia systems
  o Application is a RESTful service if it follows these constraints:
    ▪ Separation of concerns between client and server
    ▪ Uniform interface
      - Identification of resources (like URIs on the Web)
      - Manipulation of resources on the server via representation on the client side
      - Self-describing messages (like media type on the Web)
      - Hypermedia for application state change (like hypertext links to related resources)
    ▪ Stateless (no client-side state stored on server)
    ▪ Cacheability (responses state whether they are cacheable or not)
    ▪ Layering (proxies can be transparently added)
    ▪ Code on demand (server can send application logic to the client), optional
- Web Services
  o Web-based client-server communication over HTTP
  o Big Web Services
    ▪ Web Service Description
      Language (WSDL)
      - XML application to describe a Web Service’s functionality
      - Complex (usually generated with third party service)
    ▪ Universal Description, Discovery and Integration (UDDI)
      - Yellow pages for WSDL
      - Global registry describing available business services
      - Very complex
    ▪ Simple Object Access Protocol (SOAP)
      - XML-based communication protocol
      - Defines an envelope for transporting XML messages
      - Often sent via HTTP POST requests
      - Advantages:
        o Platform and language independent
        o SOAP over HTTP → fewer issues with proxies and firewalls
      - Disadvantages:
        o HTTP reduced to simple transport protocol
        o No caching
    ▪ Web Service Stack → contains many other protocols
  o RESTful Web Services
    ▪ Simple web service implemented using HTTP
    ▪ RESTful web service definition: URI + supported datatypes + supported HTTP methods
    ▪ One-to-one mapping of CRUD operations: POST (create), GET (read), PUT (update), DELETE (delete)
  o Really Simple Syndication (RSS)
    ▪ Format (in XML) that is used to read and write frequently updated information on the web (e.g.
      blog entries, news channel)

Semantic Web and Web 3.0
- The Semantic Web
  o Meaning of data on the web also discovered by machines without human intervention
  o Web of Documents → Web of Data
    ▪ Web as a decentralized database (knowledge base)
    ▪ Machine-accessible data
    ▪ Interconnected (like current web)
    ▪ Machine-readable metadata for existing web content
    ▪ combination of data from different sources to derive new facts
    ▪ machines use logical reasoning to infer facts that are not explicitly recorded
  o Crucial component of Web 3.0 / Giant Global Graph
- Semantic Web Stack
  o Architecture of the Semantic Web
  o URI/IRI
    ▪ Unique identification of semantic web resources
  o Unicode
    ▪ Representing/manipulating text in different languages
  o XML
    ▪ Interchange of structured data over the Web
  o XML Namespaces
    ▪ Uniquely qualify markup from multiple sources (integration)
  o Resource Description Framework (RDF)
    ▪ Define RDF triples and represent resource information in a graph structure
    ▪ Instance level
  o RDF Schema (RDFS)
    ▪ Create hierarchies of classes and properties
    ▪ Class level
  o Web Ontology Language (OWL)
    ▪ Language to define vocabularies
    ▪ Extends RDFS with more advanced features (e.g.
      cardinality)
    ▪ Enables reasoning based on description logic
  o SPARQL
    ▪ Query language to query any RDF-based data
  o Rule Interchange Format (RIF) and Semantic Web Rule Language (SWRL)
    ▪ Describe relations that cannot be described in OWL
  o Unifying Logic
    ▪ Logical reasoning (infer new facts and check consistency)
  o Proof
    ▪ Explain logical reasoning steps
  o Cryptography
    ▪ Protect RDF data via encryption
    ▪ Validate the source of facts by digitally signing RDF data
  o Trust
    ▪ Authentication of sources and trustworthiness of derived facts
  o User Interface
    ▪ User interfaces for semantic web applications
- Resource Description Framework (RDF)
  o Describes data and metadata about specific subjects, structure of data sets, relationships between bits of data
  o RDF statement (triple) consists of {subject, predicate/property, object/value}
  o Subjects, predicates and objects are all resources
    ▪ Subject → URI reference or blank node
    ▪ Predicate → URI reference defining the relationship
    ▪ Object → URI reference, literal or blank node
  o Stored in relational databases or triplestores
  o Advantages:
    ▪ Simple
    ▪ Enables merging data from different data models (only URI needed)
    ▪ Same resource can be annotated by different people
    ▪ Well-defined standard
- RDF Graph
  o Set of RDF statements can be represented as a directed labelled graph
  o Nodes → specific instances (because RDF is instance based)
  o Anonymous resources don’t have explicit identifiers → blank node
- RDF Reification
  o RDF triple is not a resource
  o Reify statement → make a resource out of the statement as a blank node
    ▪ (blanknode, isSubject, originalStatementSubject)
    ▪ (blanknode, isObject, originalStatementObject)
    ▪ (blanknode, Property, originalStatementProperty)
- RDF Schema (RDFS)
  o Vocabulary description language for RDF
  o Define common concepts and relationships
    ▪ Classes and subclasses
    ▪ Properties and subproperties
    ▪ Domain and range of a property
    ▪ seeAlso, isDefinedBy
    ▪ Label, comment, …
  o Provides basic elements
    for the definition of ontologies
  o Advantages:
    ▪ Richer expressiveness with RDFS
    ▪ Simple reasoning
    ▪ Many existing tools to deal with RDFS
  o Disadvantages: cannot express
    ▪ Requirement
    ▪ Cardinality
    ▪ Symmetry
  o RDF(S)/XML Serialisation
    ▪ Standard is hard
    ▪ RDF Notation 3 (N3) (short non-XML serialisation)
    ▪ RDF Turtle Notation (removes unnecessary features from N3)
  o RDF applications
    ▪ Annotea project (defines RDF Schema for types of annotations that can be used to annotate webpages)
    ▪ RSS
    ▪ Dublin Core (widely used to describe digital media, also in HTML)
- SPARQL Query Language
  o Extract information as URIs, literals, blank nodes or subgraphs
  o Syntax similar to SQL
- Jena Semantic Web Framework for Java
- Protégé (free open-source platform to create, manipulate and visualize ontologies)
- Friend of a Friend (FOAF)
  o First social semantic web application
  o Describe a social network without a central database
- Semantic Wikis
  o Use semantic web technologies to provide machine-processable wiki content
  o Much richer query interface
  o Existing semantic wikis: DBpedia (semantic web version of Wikipedia)
- Linked Data
  o Link different data sources on the Web
  o Linked Open Data cloud project
    ▪ RDF triples from currently 1000+ datasets
- Semantic Desktops
  o Apply semantic web technologies to personal information management (PIM)
    ▪ Inter-application data sharing
    ▪ Enhancement of limited filesystem functionality
- Microformats
  o Add semantics to (X)HTML pages (through classes)
  o Some search engines pay attention to different types of microformats
- RDF in Attributes (RDFa)
  o Add a set of attribute extensions to (X)HTML for embedding RDF metadata
  o Search engines process certain RDFa metadata (e.g.
    product information)
- Microdata
  o HTML5 in-house support
  o Add machine-readable metadata (semantics) to HTML5 documents in the form of key/value pairs
  o Can be used by crawlers, search engines and browsers for a richer browsing experience
  o Alternative to microformats and RDFa

Web Search and SEO
- Search Engine Result Page (SERP)
  o Organic and non-organic (paid) search results
- History
  o Start at Bush’s Memex
  o Archie (first internet search engine)
  o W3Catalog (first web search engine, manually maintained)
  o JumpStation (first engine to combine crawling, indexing and searching)
  o A lot of new search engines
  o Early web search solutions
    ▪ Full-text search (web crawler + indexer)
    ▪ Manually maintained classification
- Information Retrieval (performance measures)
  o Precision: ratio of right picks to all picks
  o Recall: ratio of right picks to all right ones
  o F-score: 2 * (precision * recall) / (precision + recall)
- Boolean Model
  o Index of keywords x documents with true or false for whether a keyword appears in a document
  o Uses boolean operators to find result set
  o Advantages:
    ▪ Easy to implement, scalable
    ▪ Fast query processing
  o Disadvantages:
    ▪ No ranking of output
    ▪ Often user needs to learn special syntax to search for phrases
  o Variants (like the inverted index) form the basis of many search engines
- Web Search Engines
  o Most based on traditional information retrieval techniques
  o Deal with:
    ▪ Immense amount of data
    ▪ Hyperlinked resources
    ▪ Dynamic content with frequent updates
    ▪ Self-organized web resources
  o Query answer time
- Web Crawler
  o Used to create index of webpages to be used by search engine
  o Must deal with the following issues:
    ▪ Freshness (updated frequently)
    ▪ Quality (priority for high-quality pages)
    ▪ Scalability (should be able to increase crawl rate by adding more servers)
    ▪ Distribution (run in a distributed manner)
    ▪ Robustness (deal with page errors and crawler traps)
    ▪ Efficiency (resources used in most efficient way)
    ▪ Crawl rates (pay attention to
      existing web server policies)
- Pre-1998 Web Search
  o Ranking based on on-page factors (IR): poor quality of search results (order)
  o PageRank → absolute quality of a page
    ▪ Query-independent
- PageRank
  o High PageRank → many pages linking to it, high rankers linking to it
  o Total score = IR score * PageRank
  o Algorithm:
    ▪ Rank of a page = sum over every page linking to it of (rank of that page / its number of outgoing links)
    ▪ Slow formula → transformed into matrix multiplication
    ▪ Power method to find R (R_{t+1} = H R_t)
  o Dangling Pages (Rank Sink)
    ▪ Pages without outgoing links become PageRank sinks
    ▪ Solution: add artificial outgoing links to all pages (including itself) (stochastic adjustment)
  o Strongly Connected Pages (Graph)
    ▪ Add new probabilities between all pages
    ▪ Prob d → follow hyperlink structure
    ▪ Prob 1 – d → choose random page
    ▪ Matrix G represents a random surfer
- Google Search Central
  o Services and information about websites
  o Site configuration (submission of sitemap, crawler access, URLs of indexed pages)
  o Site performance (search queries, countries, devices, …)
  o Enhancements (core web vitals, mobile usability)
  o Security issues
- XML Sitemaps
  o List of URLs that should be crawled and indexed (with frequencies)
  o Just a suggestion (not guaranteed to be added to index)
  o Additional metadata might be provided to search engines
- Search Engine Marketing (SEM)
  o Aims to increase visibility of a website
    ▪ Search engine optimisation (SEO)
    ▪ Paid search advertising
    ▪ Social media marketing
  o SEO should not be decoupled from content, structure, design, …
  o SEO is a continuous process (rapidly changing environment)
- Search Engine Optimisation (SEO)
  o On- and off-page factors
  o Difference between optimisation and breaking search engine rules (white hat and black hat optimisations)
  o Break rules → to Google hell (to the supplemental index)
  o Positive On-Page factors
    ▪ Use of keywords in relevant places (title tag, URL, domain name, header tags, …)
    ▪ Mobile usability (mobile-first
      indexing)
    ▪ Fast page load times
    ▪ Provide metadata
    ▪ Quality of HTML code
    ▪ Security and accessibility
    ▪ Uniqueness of content across the website
    ▪ Flat website structure (minimise link depth)
    ▪ Use keyword-rich anchor texts
    ▪ Avoid PageRank leakage
    ▪ Increase number of pages
    ▪ Think about hidden content (RIA)
    ▪ Consistent webpage addressing
  o Negative On-Page factors
    ▪ Links to “bad neighbourhoods”
    ▪ Link selling
    ▪ Keyword stuffing
    ▪ Hidden content (same colour as background)
    ▪ Cloaking (different content for spider and user)
    ▪ Malware being hosted on the page
    ▪ Duplicate or similar content
    ▪ Slow page load time
    ▪ Copyright violations
  o Positive Off-Page factors
    ▪ Links from pages with high PageRank
    ▪ Keywords in anchor text of inbound links
    ▪ Links from topically relevant sites
    ▪ High clickthrough rate from search engine
    ▪ High number of shares
    ▪ Site age (implying stability)
    ▪ Domain expiration date (implying longevity)
  o Negative Off-Page factors
    ▪ Not accessible to crawlers
    ▪ High bounce rate (back button when entering page)
    ▪ Link buying
    ▪ Link farms
    ▪ Spamdexing (adding links to your page from other pages)
      - Solution: ‘nofollow’ links
      - Exploited by SEO experts: PageRank sculpting (control flow of PageRank within website)
    ▪ Links from bad neighbourhoods?
    ▪ Duplicate content (from competitor)?
- Non-organic Search
  o Cost per impression or cost per click
  o Not independent from organic search
    ▪ E.g.
Landing page
  o Google Ads → non-organic web search service
    ▪ Ads or search boost

Security, Privacy and Trust
- Security Aspects
  o Authenticity (knowing the sender or receiver of data)
  o Privacy (keeping information private)
  o Integrity (ensuring information is not changed when transferred)
- HTTP Authentication
  o Native authentication functionality offered by HTTP
    ▪ Server can first respond to a request with an authentication challenge
  o HTTP is extensible → supports other authentication protocols (like TLS) and natively offers
    ▪ Basic access authentication (Base64-encoded username:password)
    ▪ Digest access authentication
  o Security realms → grouped protected resources with different sets of authorised users or groups of users
- Basic Access Authentication
  o Base64 Encoding
    ▪ Represents binary data in a portable format
    ▪ Takes a sequence of bytes and breaks it into 6-bit chunks
    ▪ Each 6-bit chunk is represented by a character from a 64-character alphabet
    ▪ Not secure (easily reversible) → was not made for encryption
  o Web Server Configuration
    ▪ Create password file
    ▪ Put a .htaccess file with the configuration into the directory that has to be protected
  o Basic Access Authentication is not secure
    ▪ Username and password sent almost in cleartext (only Base64-encoded)
    ▪ Easy to do replay attacks
  o Potential solutions
    ▪ Combine BAA with encrypted data transfer → does not prevent replay attacks
    ▪ Digest Access Authentication
- Digest Access Authentication
  o One-way digest computed and sent to the server instead of the password
    ▪ Computed with a hash function (irreversible)
  o Server sends a special token (nonce) that changes with every request
    ▪ Incorporated in the digest
    ▪ → server detects if a nonce is used again (prevents replay attacks)
    ▪ Pre-emptive authorization → server sends the next nonce in advance, so the client can send the computed digest with its original request
- Transport Layer Security
  o Cryptographic protocol to ensure secure network communication
  o Sits on top of the transport layer (TCP): application layer in the TCP/IP model, presentation layer in OSI terms
  o Types of authentication
    ▪ Unilateral authentication (server side)
    ▪ Mutual authentication (client and server side)
- Cryptography
  o
Cipher (coding scheme) used with a key to create ciphertext out of plaintext
  o Cryptanalysis → get information out of ciphertext without having access to secret information or key
- Symmetric Key Cryptography
  o Same key for encoding and decoding
  o Only the key is secret (algorithms are public)
  o Enumeration (brute-force) attack → tries all keys
  o Problem:
    ▪ secretly share the common key
    ▪ repeated for every pair of communicators
    ▪ insecure channels
    ▪ storage?
- Public Key (Asymmetric) Cryptography
  o Asymmetric pair of keys
    ▪ Publicly available key for encoding
    ▪ Secret key for decoding
  o Each party has a single public key used to encode messages for that party
  o That party can decode those messages with their private key
  o No need to secretly share keys anymore
  o RSA cipher
    ▪ Public key cipher for encryption and signing
    ▪ Keys generated based on multiplication of large prime numbers
    ▪ Factorisation of the product is practically infeasible (barring a sufficiently powerful quantum computer)
  o Problem:
    ▪ Much slower than symmetric ciphers
  o Hybrid solution:
    ▪ Public key encryption used in the setup phase to securely exchange a pair of symmetric keys
    ▪ Afterwards a secure channel is established based on symmetric keys
- Digital Signatures
  o Used for authenticity of message and integrity of message (unchanged)
  o Sender computes a digest of the message, encodes it with their private key and sends this signature along with the message
  o Receiver decodes the signature with the sender's public key, recomputes the digest from the received message and compares the two
- Digital Certificate
  o Information about a person/company that is digitally signed by a certificate authority (CA)
- HTTP Secure (HTTPS)
  o Combines HTTP with asymmetric, symmetric and certificate-based cryptography
  o https:// URL prefix
- Email Security
  o Emails sent as unencrypted plain text
  o Stored on multiple intermediary servers before reaching target
    ▪ Easy to intercept
  o Third party tools such as Pretty Good Privacy (PGP)
  o Email spam → phishing attacks with botnets
  o Botnets → infected machines that can be remotely
controlled
    ▪ Distributed denial of service attacks (DDoS) → server attack
- Firewalls
  o Artificial bottlenecks to block specific ports, filter and block content, protect private intranets
- Privacy
  o Continuous logging of requests
  o Stored by server
  o Possible to make user profiles with these logs
  o Internet Archive (all information ever published)
- Web Log
  o IP address
  o URL accessed
  o Request time
  o Referrer link (where you came from) → potentially sensitive information
  o Browser type
  o …
  o Site owners can use various tools to analyse log files
  o Use of data to be mentioned in privacy policy
- Cookies Revisited
  o Persistent cookies identify users (similar to IP address, but more precise)
  o Third party cookies can be used to build a user profile → targeted ads, profiles sold
  o Shouldn’t be used for authentication (cookie poisoning)
- Web Bugs
  o User tracking
  o Embedded in webpage
  o Informs server whenever page is accessed
  o Can be embedded in emails, Word docs, …
- Other services with privacy issues
  o Google Earth (military bases)
  o Google Street View (individual privacy violation)
  o Free Google (and other company) services
- Google Analytics
  o Tool for web administrators to analyse web traffic on their website

Future Trends
- The Future of the Web
  o Web of documents → Web of structured data and services
    ▪ Semantic web and linked data
    ▪ Cloud computing
  o The internet as One Global Machine
    ▪ Automatic reasoning
    ▪ Interoperability of services
  o The Mobile Web
    ▪ Access information from anywhere anytime
    ▪ Feed the machine
    ▪ Teach the machine new relationships between data
  o Internet of Things (Web of Things)
    ▪ Integration of physical objects with the global machine
    ▪ Sensory input data to digital space
    ▪ Augmented reality
  o Personal data storage moves from personal computer to global machine
  o User Interfaces for global machine
    ▪ Personalised filtering and recommendations (risk of filter bubble)
    ▪ Cross-media browsing
- New forms of User Interfaces
  o Concept of doc still relevant?
  o Semantic zooming into concepts
  o Natural user interfaces (augmented reality interfaces)
  o Linked data to overcome limitations of existing document-centric desktop interfaces
- Solid (Social Linked Data)
  o Decentralized web (Tim Berners-Lee)
  o Set of tools and conventions based on Linked Data principles (URI, RDF(S), OWL, SPARQL)
  o Gives users control over their data (true data ownership)
    ▪ Personal online datastores (pods): users decide where their pods are hosted
    ▪ Decoupling of applications and data: data can be reused across applications, which avoids vendor lock-in
- General Infrastructure for Mixed Reality
  o W3C WebXR Device API
    ▪ Cross-platform compatibility
    ▪ Web-based applications running in browser
  o W3C Web of Things (WoT) Thing Description
    ▪ Metadata and interfaces of IoT devices & passive things
  o Solid project
    ▪ Decentralized Solid Pods in addition to the web
    ▪ Collaborative authoring and sharing of information
- RSL Hypermedia Metamodel and iServer
  o Resource, Selectors, Layers
  o Cross-media platform
- Personal Information Management (PIM)
  o Keeping, organizing and re-finding information
  o Digital and/or physical → seamless combination
  o OC2 PIM model (based on RSL hypermedia metamodel)
  o Explicit and implicit associations between entities
- MindXpres Presentation Platform
  o Flexible representation of presentations
    ▪ Use of structural RSL links
    ▪ Separate content and structure
    ▪ Based on HTML5, CSS, JS
  o Content-based approach
  o Transclusion (content reuse)
  o Non-linear navigation
  o Rich media types
  o Associative linking
- RSL-based associative file system
  o Files in multiple folders
  o Links between stored media
  o Cross-media transclusion
- RSL-based Hypermedia engine
  o Translate files from legacy applications to RSL
  o Compatible with hypermedia-enabled applications
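
Worked example for the PageRank notes in the Web Search and SEO part: the notes describe rank shares summed over incoming links, dangling pages patched with artificial links to every page, and a random jump that is taken with probability 1 - d. The sketch below is one possible reading of those three steps, not Google's implementation. The function name `pagerank`, the toy graph `web`, the damping value d = 0.85 and the fixed iteration count are all assumptions made for illustration.

```python
def pagerank(links, d=0.85, iterations=100):
    """Power-method PageRank on a dict {page: [pages it links to]}.

    Assumes every link target also appears as a key in `links`.
    """
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        # Random jump: with probability 1 - d the surfer picks any page.
        new_rank = {p: (1.0 - d) / n for p in pages}
        for page, outgoing in links.items():
            # Dangling page: treat it as linking to every page, itself included.
            targets = outgoing if outgoing else pages
            # With probability d the surfer follows one of the page's links.
            share = d * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web; D has no outgoing links (a dangling page).
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []}
ranks = pagerank(web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Each pass of the outer loop corresponds to one multiplication R_{t+1} = G R_t; the matrix is never built explicitly, which keeps the sketch short but would need a sparse-matrix formulation for a realistically sized graph.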
