How Web Works.pdf
Internet Programming: how the web works
Dr Kafi Rahman, PhD
Email: [email protected]
Associate Professor @CS, Truman State University

Agendas
The fundamental protocols that make the web possible
How the domain name system works
Why HTTP is more than just a four-letter abbreviation
How browsers and servers work to exchange and interpret HTML

Internet protocols
A protocol is a set of rules that partners use when they communicate. TCP/IP, from last lecture, is an essential internet protocol! These protocols have been implemented in every operating system and make fast web development possible. If web developers had to keep track of packet routing, transmission details, domain resolution, checksums, and more, it would be hard to get around to the matter of actually building websites.

The TCP/IP Internet protocols were originally abstracted as a four-layer stack. Later abstractions subdivide it further into five or seven layers. Since we focus on the top layer, we will use the earliest and simplest four-layer network model.

A Layered Architecture

Link Layer
The link layer is the lowest layer, responsible for both the physical transmission of data across media and establishing logical links. It handles issues like packet creation, transmission, reception, error detection, collisions, line sharing, and more. One link-layer term that sometimes comes up in the Internet context is the MAC (media access control) address.

Internet Layer
The Internet layer (sometimes also called the IP layer) routes packets between communication partners across networks. It provides "best effort" communication: it sends a message toward its destination but expects no reply and provides no guarantee that the message will arrive intact, or at all. The Internet uses Internet Protocol (IP) addresses, which are numeric codes that uniquely identify destinations on the Internet. Every device connected to the Internet has such an IP address. There are two types of IP addresses: IPv4 and IPv6.
In IPv4, four 8-bit integers separated by periods (.) encode the address. IPv6 uses eight 16-bit integers and has over a billion billion times the number of addresses in IPv4.

IP Addresses (cont)

Transport Layer
The transport layer ensures transmissions arrive in order and without error. First, the data is broken into packets formatted according to the Transmission Control Protocol (TCP). Each data packet has a header that includes a sequence number, so the receiver can put the original message back in order. The receiver acknowledges each packet's successful arrival back to the sender (ACK). If a packet is lost (no ACK arrives for that packet), the sender retransmits it. This means that messages sent over TCP are delivered reliably and in order.

Transport Layer: User Datagram Protocol (UDP)
Sometimes we do not want guaranteed transmission of packets. Consider a live multicast of a soccer game, for example. Millions of subscribers may be streaming the game, and the broadcaster can't afford to track and retransmit every lost packet. A small loss of data in the feed is acceptable, and the customers will still see the game. An Internet protocol called User Datagram Protocol (UDP) is used in these scenarios in lieu of TCP. Other examples of UDP services include Voice over IP (VoIP), many online games, and the Domain Name System (DNS).

Application Layer
The application layer is the level of protocols familiar to most web developers. Application layer protocols implement process-to-process communication. There are many application layer protocols. A few that are useful to web developers include:
HTTP. The Hypertext Transfer Protocol is used for web communication.
SSH. The Secure Shell Protocol allows remote command-line connections to servers.
FTP. The File Transfer Protocol is used for transferring files between computers.
POP/IMAP/SMTP. Email-related protocols for transferring and storing email.
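Returning briefly to the IPv4 and IPv6 formats described earlier (four 8-bit integers versus eight 16-bit integers), the sizes can be checked with Python's standard ipaddress module. This is only a small sketch: the two sample addresses are assumptions taken from ranges reserved for documentation, and the helper names address_bits and address_space_ratio are made up for illustration.

```python
import ipaddress

# Sample addresses are assumptions (documentation-only ranges), not from the lecture.
v4 = ipaddress.ip_address("192.0.2.1")
v6 = ipaddress.ip_address("2001:db8::1")


def address_bits(addr):
    """Return the width of the address in bits: 4 x 8 for IPv4, 8 x 16 for IPv6."""
    return addr.max_prefixlen


def address_space_ratio():
    """Return how many times larger the IPv6 address space is than the IPv4 one."""
    return 2 ** 128 // 2 ** 32


print(v4.version, address_bits(v4))      # 4 32
print(v6.version, address_bits(v6))      # 6 128
print(v6.exploded)                       # all eight 16-bit groups written out in hex
print(address_space_ratio() > 10 ** 18)  # True: more than a billion billion times larger
```

The ratio works out to 2^96, which is comfortably above the "billion billion" (10^18) figure quoted on the slide.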
What is SMTP?
SMTP stands for Simple Mail Transfer Protocol, and it is responsible for sending email messages. This protocol is used by email clients and mail servers to exchange emails between computers. A mail client and the SMTP server communicate with each other over a connection established through a particular email port. Both entities use SMTP commands and replies to process your outgoing emails.

What is POP3?
POP3 stands for Post Office Protocol version 3, which provides access to an inbox stored on an email server. It performs the download and delete operations for messages. Thus, when a POP3 client connects to the mail server, it retrieves all messages from the mailbox. Then it stores them on your local computer and deletes them from the remote server.

What is IMAP?
The Internet Message Access Protocol (IMAP) allows you to access and manage your email messages on the email server. This protocol permits you to manipulate folders, permanently delete messages, and efficiently search through messages.

Domain Name System
It is challenging to recall long strings of numbers. The Domain Name System (DNS) resolves domain names to IP addresses. By separating the domain name of a server from its IP address, a site can move to a different host without changing its name. Since the entire request-response cycle can take less than a second, it is easy to forget that DNS requests are happening at all.

Name Levels (Top Level)
The rightmost portion of the domain name (to the right of the rightmost period) is called the top-level domain. For the top level of a domain, we are limited to two broad categories, plus a third reserved for other use:
Generic top-level domain (gTLD)
Country code top-level domain (ccTLD)
.arpa (used for reverse DNS lookups)

Generic top-level domain (gTLD)
Generic top-level domains (gTLDs) include the famous .com and .org. There are 3 subtypes of gTLD.
Unrestricted. TLDs include .com, .net, .org, and .info.
Sponsored. TLDs including .gov, .mil, .edu, and others.
Starting in June 2012, ICANN invited companies to launch new TLDs in order to provide more choice. Since then, over 1000 new TLDs have been created, including .art, .cash, .cool, .jobs, .tax, and so on.

Country code top-level domain
Country code top-level domains (ccTLDs) are under the control of the countries they represent, which is why each is administered differently. In the United Kingdom, for example, businesses must register subdomains to co.uk rather than second-level domains directly, whereas in Canada, .ca domains can be obtained by any person, company, or organization living or doing business in Canada. Other countries have peculiar extensions with commercial viability (such as .tv) and have begun allowing unrestricted use to generate revenue.

Name Registration
Q: How then are domain names assigned?
A: Special organizations or companies called domain name registrars manage the registration of domain names. These domain name registrars are given permission to do so by the appropriate generic top-level domain (gTLD) registry and/or a country code top-level domain (ccTLD) registry. The nonprofit Internet Corporation for Assigned Names and Numbers (ICANN) oversees the management of top-level domains, accredits registrars, and coordinates other aspects of DNS.

Domain name registration process

Address Resolution

Uniform Resource Locators
Uniform Resource Locators (URLs) allow clients to request particular resources (files) from the server. URLs consist of two required components: the protocol used to connect and the domain (or IP address) to connect to.

Uniform Resource Locators (optional)
Optional components of the URL are: the path (which identifies a file or directory to access on that server), the port to connect to, a query string, and a fragment identifier.

Port (URL)
A port is a type of software connection point used by the underlying TCP/IP protocol and the connecting computer.
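As a rough illustration of the required and optional URL components listed above, the sketch below splits a URL with Python's standard urllib.parse module. The host and port follow the lecture's funwebdev.com example; the path, query string, and fragment are hypothetical values added only for the demonstration, and url_components is a made-up helper name.

```python
from urllib.parse import urlsplit


def url_components(url):
    """Split a URL into the components named on the slides."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,    # required
        "domain": parts.hostname,    # required (may also be an IP address)
        "port": parts.port,          # optional; None when the URL names no port
        "path": parts.path,          # optional
        "query": parts.query,        # optional; the text after "?"
        "fragment": parts.fragment,  # optional; the text after "#"
    }


# The path, query, and fragment below are hypothetical.
full = url_components("http://funwebdev.com:8080/catalog/index.php?page=2#top")
print(full["protocol"], full["domain"], full["port"])  # http funwebdev.com 8080
print(full["path"], full["query"], full["fragment"])   # /catalog/index.php page=2 top

# A URL with only the two required components leaves the rest empty.
minimal = url_components("http://funwebdev.com")
print(minimal["port"], repr(minimal["path"]))          # None ''
```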
If no port is specified, the protocol determines which port to use. For instance, port 80 is the default port for web-related HTTP requests. The syntax is to add a colon after the domain, then specify an integer port number: http://funwebdev.com:8080/ would connect on port 8080.

Query String (URL)
Query strings will be covered in depth when we learn more about HTML forms and server-side programming. They are a critical way of passing information, such as user form input, from the client to the server. In URLs, they are encoded as key-value pairs delimited by & symbols and preceded by the ? symbol. An example query string for passing name and password is shown in the figure.

Fragment (URL)
The last part of a URL is the optional fragment. This is used as a way of requesting a portion of a page. Browsers will see the fragment in the URL (denoted by #), seek out the matching anchor in the HTML, and scroll the page down to it. "Back to top" links are a common use of fragments.

Hypertext Transfer Protocol
HTTP is an essential part of the web. An HTTP client opens a TCP connection to the server on port 80 (by default). The server waits for the request, and then responds with headers, a response code, and an optional message (which can include files).

HTTP Headers
Headers are sent in the request from the client and received in the response from the server. Headers are one of the most powerful aspects of HTTP and, unfortunately, few developers spend any time learning about them.
Request headers include data about the client machine: Host, User-Agent, cache settings, and more. The most common requests are the GET and POST requests, along with the HEAD request.
Response headers have information about the server answering the request and the data being sent: Server, Last-Modified, Content-Type, encoding, etc.

The most common type of HTTP request is the GET request. It asks for the resource located at a specified URL to be retrieved.
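To make the shape of a GET request more concrete, here is a minimal sketch that builds the text an HTTP/1.1 client might send for a given URL. It assumes a plain http:// URL and the default port of 80; the function name build_get_request, the User-Agent value, and the sample URL's path and query are all hypothetical. Note that the fragment is used only by the browser and is not included in the request.

```python
from urllib.parse import urlsplit


def build_get_request(url, user_agent="ExampleClient/0.1"):
    """Return the raw text of a simple HTTP/1.1 GET request for an http:// URL."""
    parts = urlsplit(url)

    # The request line carries the path and query string, never the fragment.
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query

    # The Host header names the domain, plus the port when it is not the default 80.
    host = parts.hostname
    if parts.port is not None and parts.port != 80:
        host += f":{parts.port}"

    lines = [
        f"GET {target} HTTP/1.1",
        f"Host: {host}",
        f"User-Agent: {user_agent}",
        "Connection: close",
        "",  # blank line marks the end of the headers
        "",
    ]
    return "\r\n".join(lines)


# Hypothetical path and query on the lecture's example host.
print(build_get_request("http://funwebdev.com:8080/index.php?name=ann#top"))
```

The printed text starts with the request line `GET /index.php?name=ann HTTP/1.1`, followed by the request headers and a blank line.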
GET Request
Whenever you click on a link, type in a URL in your browser, or click on a bookmark, you are usually making a GET request. Data can be transmitted through a GET request with a query string.

POST Request
The other common request method is the POST request. This method is normally used to transmit data to the server using an HTML form.

Response Codes
Response codes are integer values returned by the server as part of the response header.

Code: Description
200: OK. The request was successful.
301: Moved Permanently. Tells the client that the requested resource has permanently moved.
304: Not Modified. If the client requested a resource with appropriate Cache-Control headers, the response might say that the resource on the server is no newer than the one in the client cache.
401: Unauthorized. Some web resources are protected and require the user to provide credentials to access the resource.
404: Not Found. The 404 code is one of the few known to web users. Many browsers will display an HTML page with the 404 code when the requested resource was not found.
414: Request URI Too Long. A 414 response code likely means that too much data is being submitted via the URL.
500: Internal Server Error. This error provides almost no information to the client except to say the server has encountered an error.

Web Browsers
The user experience for a website is unlike the user experience for traditional desktop software. Users do not download software; they visit a URL, which results in a web page being displayed. Although a typical web developer might not build a browser or develop a plugin, they must understand the browser's crucial role in web development.

Web Browsers (Fetching a web page)
Seeing a single web page is facilitated by the browser, which requests the initial HTML page, then parses the returned HTML to find all the resources referenced from within it (like images, style sheets, and scripts).
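The parsing step can be sketched with Python's standard html.parser module. This is only an illustration of the idea, not how a real browser is implemented: the ResourceFinder class, the find_resources helper, and the sample page are all hypothetical, and the sketch looks only at img, script, and stylesheet link tags.

```python
from html.parser import HTMLParser


class ResourceFinder(HTMLParser):
    """Collects the URLs of images, style sheets, and scripts referenced by a page."""

    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and attrs.get("src"):
            self.resources.append(attrs["src"])
        elif tag == "link" and attrs.get("rel") == "stylesheet" and attrs.get("href"):
            self.resources.append(attrs["href"])


def find_resources(html):
    """Return the referenced resource URLs in the order they appear in the HTML."""
    finder = ResourceFinder()
    finder.feed(html)
    return finder.resources


# A hypothetical page: three referenced resources plus an ordinary link.
page = """<html><head>
<link rel="stylesheet" href="styles.css">
<script src="app.js"></script>
</head><body>
<img src="logo.png" alt="logo">
<a href="about.html">About</a>
</body></html>"""

print(find_resources(page))  # ['styles.css', 'app.js', 'logo.png']
```

Each URL in the returned list would need its own HTTP request, while the ordinary link to about.html is not fetched until the user clicks it.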
Only when all the files have been retrieved is the page fully loaded for the user. A single web page can reference dozens of files and requires many HTTP requests and responses.

Fetching a web page diagram

Browser Rendering
The algorithms within browsers to download, parse, lay out, fetch assets, and create the final interactive page for the user are commonly referred to collectively as the rendering of the page.

Browser Caching
Once a web page has been downloaded from the server, it's possible that the user, a short time later, wants to see the same web page and refreshes the browser or re-requests the URL. Although some content might have changed, the majority of the referenced files are likely to be unchanged, so they needn't be redownloaded. Browser caching has a significant impact in reducing network traffic.

Web Servers
A web server is, at a fundamental level, nothing more than a computer that responds to HTTP requests. Real-world websites typically have many web servers configured together in web farms. Regardless of the physical characteristics of the server, one must choose an application stack to run a website. This application stack will include an operating system, web server software, a database, and a scripting language to process dynamic requests.

WAMP software stack
We will be using the WAMP software stack, which refers to the Windows operating system, Apache web server, MySQL database, and PHP scripting language. The LAMP software stack is a popular variation where the Linux operating system is used. The Apple OSX MAMP software stack is nearly identical to LAMP, since OSX is a Unix implementation and includes all the tools available in Linux.

Questions?