-------------------------
Week 11 Notes for CST8165
-------------------------
-Ian! D. Allen  -  idallen@idallen.ca  -  www.idallen.com

Remember - knowing how to find out an answer is more important than
memorizing the answer.  Learn to fish!  RTFM!  (Read The Fine Manual)

Keep up on your readings (Course Outline: average 4 hours/week homework)

Review:
------
- handling multiple simultaneous connections
- reducing indentation levels to make code readable
- testing strategies
- from Kurose/Ross: http://teaching.idallen.com/cst8165/07f/notes/kurose/
  (HTTP slides)
  - includes HTTP slides showing Request and Response headers

------------------------------------------------------------------------------

HTTP - Hyper Text Transfer Protocol
----
First used in 1990

http://tools.ietf.org/html/rfc2616 (HTTP 1.1 - June 1999 - 176 pages)

- a "PULL" protocol - receiver initiates (SMTP is "PUSH" protocol)

HTTP design issues by Tim Berners-Lee
-------------------------------------
http://www.w3.org/Protocols/DesignIssues.html

Q: Why did Tim Berners-Lee choose "Internet Protocol" instead of RPC for HTTP?

Q: Name one advantage and one disadvantage of coding HTTP using RPC.

Q: Does the HTTP server need to keep state information about the client?

Q: Why is the stateless nature of HTTP a problem for such things as search
   systems?  How does Tim say the problems can be mitigated?

Many of Tim's original methods (e.g. "PORT") didn't make it into the final
HTTP specification.

HTTP protocol consists of Requests and Responses
------------------------------------------------
http://tools.ietf.org/html/rfc2616
    Requests  - Section 5
    Responses - Section 6

Unlike SMTP, the HTTP protocol is much more "symmetric" - the format of what
the client sends to the server looks a lot like what the server sends back
to the client.  You can both upload and download using HTTP.

An HTTP "Request" goes from client to server (from your web browser to the
remote server).
A Request consists of a series of header lines of the form "name: data"
ending at an empty line (a line with just CRLF), followed by an (often
optional) body.

An HTTP "Response" comes back from the server to you (from the server to
your web browser).  Unlike SMTP, the Response has the same header and body
structure as the Request.

Q: What is an "HTTP Request"?  an "HTTP Response"?

Q: What is the format/structure of HTTP Requests and Responses?

Sniffing Browser HTTP Requests and Responses
--------------------------------------------
Since HTTP is a text-based protocol, you can use "netcat" to connect
directly to an HTTP server, send a simple Request, and see what responses
come back.  Note the need for a blank line to end the Request:

  * $ nc -v google.ca 80
    google.ca [64.233.161.104] 80 (www) open
  * GET / HTTP/1.0
  *
    HTTP/1.0 302 Found
    Location: http://www.google.ca/
    Cache-Control: private
    Set-Cookie: PREF=ID=4bacaba254d7fab1:TM=1174172556:LM=1174172556:
       S=F5pnjX7gt4IYGP2n; expires=Sun, 17-Jan-2038 19:14:07 GMT;
       path=/; domain=.google.com
    Content-Type: text/html
    Server: GWS/2.1
    Content-Length: 218
    Date: Sat, 17 Mar 2007 23:02:36 GMT
    Connection: Keep-Alive

    302 Moved

    302 Moved

    The document has moved here.

Sample HTTP "HEAD" and "GET" session:
    http://teaching.idallen.com/cst8165/07f/notes/http_session.txt

Q: How can I use netcat to pull a Response from a remote HTTP server?

To see what lines a browser sends to an HTTP server, you can use Ethereal;
or, for a quick dump, start a fake netcat HTTP server on a spare port
(e.g. 55555), then start up your browser, connect to
http://localhost:55555/foobar and see what your netcat server reports:

  * $ nc -v -l -p 55555 localhost     # Debian/Ubuntu
  * $ nc -v -l localhost 55555        # RedHat/Mandrake
    connect to [127.0.0.1] from localhost [127.0.0.1] 40757
    GET /foobar HTTP/1.1
    Host: localhost:55555
    User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.12)
       Gecko/20060216 Debian/1.7.12-1.1ubuntu2
    Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,
       text/plain;q=0.8,image/png,*/*;q=0.5
    Accept-Language: en-ca,en-us;q=0.9,en-gb;q=0.7,en;q=0.6,fr-ca;q=0.4,
       fr-fr;q=0.3,fr;q=0.1
    Accept-Encoding: gzip,deflate
    Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
    Keep-Alive: 300
    Connection: keep-alive

At this point, you can type into the fake HTTP server netcat session and
send HTTP Response lines back to your browser:

  * HTTP/1.1 200 this is my reply to the browser
  * Content-Type: text/plain
  *
  * ab
  * cd
  * ef
  * gh
  * ^C (interrupt)

Your browser will show the above text.

Q: How can I use netcat to show a Request from an HTTP client?

Fetching a raw web page: wget
-----------------------------
You can use "wget" to fetch the raw HTML from a web page, and options also
let you see the header lines:

  $ wget http://idallen.com/
  $ wget -O output_file -S http://idallen.com/
  $ wget -O output_file --save-headers http://idallen.com/
  $ wget --header="Host: teaching.idallen.com" http://idallen.com/

Q: How can I download the raw HTML from a web page to my current directory?

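The netcat client session shown earlier can also be approximated in a few
lines of Python.  This is only a sketch: the function names build_request
and fetch are made up for illustration, and the code assumes the server
closes the connection (or the timeout expires) after sending its Response.
It shows the one detail that is easy to get wrong by hand: every line ends
in CRLF and an extra empty line ends the Request.

```python
import socket


def build_request(method, path, host=None, version="HTTP/1.0"):
    """Return the bytes of a minimal HTTP Request that has no body."""
    lines = [f"{method} {path} {version}"]
    if host is not None:
        lines.append(f"Host: {host}")
    # Each line ends in CRLF; the extra empty line ends the header section.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


def fetch(host, port=80, path="/", timeout=10):
    """Send one GET and return the raw Response bytes (headers and body)."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_request("GET", path, host=host))
        try:
            while True:
                data = sock.recv(4096)
                if not data:        # the server closed the connection
                    break
                chunks.append(data)
        except socket.timeout:
            pass                    # server kept the connection open
    return b"".join(chunks)
```

Calling fetch("google.ca") and printing the result should show a Status
Line, header lines, an empty line, and a body, much like the netcat
transcript; the exact headers depend on the server.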
HTTP is stateless; need session tracking
----------------------------------------
Unlike protocols such as SMTP, FTP, TELNET, etc., HTTP is completely
"stateless".  Nothing in the protocol links one request with another.
Any need for "state", e.g. login credentials, shopping cart data, etc.,
has to be handled outside the protocol.

http://publib.boulder.ibm.com/infocenter/wchelp/v5r6m1/index.jsp?topic=/com.ibm.commerce.admin.doc/concepts/csesmsession_mgmt.htm

  "Web browsers and e-commerce sites use HTTP to communicate. Since HTTP
  is a stateless protocol (meaning that each command is executed
  independently without any knowledge of the commands that came before
  it), there must be a way to manage sessions between the browser side
  and the server side."

http://java.sun.com/blueprints/qanda/client_tier/session_state.html

  "What are the client-tier mechanisms for storing session state?"
    - cookies
    - URL rewriting
    - hidden form fields

  "We do not recommend storing session state directly on the client using
  URL rewriting. [...] This section describes how to store session state
  directly on the client for those who choose to ignore these guidelines."

  "We do not recommend storing session state directly on the client using
  cookies. [...] This section describes how to store session state
  directly on the client for those who choose to ignore these guidelines."

- Recommendation: don't save the actual state in the cookie or URL,
  save a session ID only:

http://java.sun.com/blueprints/qanda/web_tier/session_state.html

  "A web container provides session management to the JSP pages and
  servlets it contains by way of interface HttpSession. Typically, the
  container will try to use a cookie to save user session state on the
  client. If the client refuses to accept the cookie for some reason (the
  user has disabled cookies, an intervening firewall filters cookies,
  etc.), the container will usually try to implement session management
  by using URL rewriting.
  URL rewriting works in cases where cookies will not, even in browsers
  that don't implement cookies, but suffer from other problems. Rewritten
  URLs tend to be long and ugly, are expensive to produce for pages with
  many links, and usually don't "bookmark" well. Furthermore, rewritten
  URLs usually can't be used with legacy web pages, because the URLs in
  the links in those pages are static."

http://java.boot.by/wcd-guide/ch04s04.html

  "Given a scenario, describe which session management mechanism the Web
  container could employ, how cookies might be used to manage sessions,
  how URL rewriting might be used to manage sessions, and write servlet
  code to perform URL rewriting."

http://www.brics.dk/~amoeller/WWW/javaweb/sessions.html
  - URL rewriting
  - hidden form fields
  - cookies

Q: T/F The recommended way to save HTTP state is to keep your state
   information in a client cookie

Q: T/F The recommended way to save HTTP state is to save only a state
   session ID in a client cookie

Q: Why is session tracking needed on top of HTTP?

Q: What is an HTTP "session"?

Q: Name and describe briefly two of three possible ways to implement
   implicit HTTP session tracking

Reading the HTTP RFC 2616
-------------------------
http://tools.ietf.org/html/rfc2616
ftp://ftp.rfc-editor.org/in-notes/rfc2616.txt

Standards:      http://www.w3.org/Protocols/
Errata:         http://skrb.org/ietf/http_errata.html
                http://purl.org/NET/http-errata
Issues:         http://greenbytes.de/tech/webdav/draft-lafon-rfc2616bis-issues.html
Mail Archives:  http://lists.w3.org/Archives/Public/ietf-http-wg/

- HTTP is usually over TCP/IP, but any reliable protocol will do (p.13)

Q: Does HTTP require a reliable protocol, or can it run over something
   unreliable such as UDP?

- 1.0 required separate connections per request
- 1.1 big change: allows chaining multiple requests per connection (p.14)

Q: What big change did HTTP 1.1 bring to the HTTP "one connection per
   request" model of HTTP 1.0?

p.15
- ABNF extended with a "#rule" for comma-separated lists:
      ( *LWS element *( *LWS "," *LWS element ))
  becomes
      1#element
- implied *LWS can appear between any adjacent tokens or strings in the
  grammar

Q: Describe what this ABNF HTTP rule means:   2#3("foo")

p.15-16
- HTTP ABNF grammar is unaffected by LWS between tokens
- HTTP 1.1 lines can continue ("fold") onto multiple lines if the
  continuation line begins with a space or horizontal tab
- within a header field value, the only CRLF allowed is the one that is
  part of a continuation line
- if you want a real CRLF, or a non-ISO-8859-1 character, in a header
  field, encode it as RFC2047 (MIME)

Q: How can you fold a long line in HTTP 1.1?

p.17
- must double-quote special characters used in message headers
- some fields allow comments in parentheses ()

Q: What do HTTP comments look like in message headers?

- unlike SMTP, HTTP has a version number!  (p.17)

- URI "absolute" vs. "relative" paths (p.19, 36):

  "URIs in HTTP can be represented in absolute form or relative to some
  known base URI [11], depending upon the context of their use. The two
  forms are differentiated by the fact that absolute URIs always begin
  with a scheme name followed by a colon."  p.19

  An Absolute URI starts with "http:" and a Relative URI is anything else.
  Inside a web page, Relative URIs can have some forms not allowed in an
  HTTP Request.  For the HTTP Request, Section 5.1.2 says you only have two
  real choices (the leading slash is required on the Relative URI):

      Absolute URI:  http://idallen.com/foo.txt
      Relative URI:  /foo.txt                      # an absolute path

- proxy servers require ("MUST") absolute URIs ("http://...") (p.36)
- note that "absolute URI" is not the same as Unix "absolute path";
  - for a Request, a "relative URI" must be an "absolute path" and start
    with a slash

  "To allow for transition to absoluteURIs in all requests in future
  versions of HTTP, all HTTP/1.1 servers MUST accept the absoluteURI form
  in requests, even though HTTP/1.1 clients will only generate them in
  requests to proxies."

Q: Give examples of HTTP absolute and relative URIs used in Requests.

Q: Can a relative Request-URI (client Request to server) begin without a
   slash, i.e. can it be a relative pathname "foo.html"?  (5.1.2 p.36)

Q: Can an HTTP client request an empty URI?  (5.1.2)

Q: T/F HTTP is moving towards always using absolute URIs.  (p.37)

- path part of URI is case-sensitive; the host and scheme names are not (p.20)

  "When comparing two URIs to decide if they match or not, a client SHOULD
  use a case-sensitive octet-by-octet comparison of the entire URIs, with
  these exceptions:"  p.20

Q: Which parts of an absolute URI are case-sensitive?

- The HTTP protocol does not place any a priori limit on the length of a URI.
  - server may issue 414 (Request-URI Too Long) status (p.19)

Q: What is the maximum length of a URI, as given in the HTTP spec?

- HTTP headers can describe:
  - "content encoding" - a property of the original entity (p.23)
    - e.g. "gzip"
  - "transfer coding" - a property of the HTTP message (p.24)
    - e.g. "chunked" (transfer content in separate chunks, p.25)
    - may change how the entity is transferred

Q: What is the difference between the "content encoding" header and the
   "transfer coding" header?

- HTTP relaxes CRLF rule - allows consistent CR or LF or CRLF in text
  (but not in control sequences!) - 3.7.1 p.27

Q: T/F HTTP permits a client to send just CR or LF when communicating with
   an HTTP server (e.g. when sending a GET or HEAD request).

- HTTP Request/Response messages do not use SMTP "continuation" method
  - message headers continue until an empty line: CRLF CRLF (p.31)

Q: T/F The same generic HTTP message type is used both to send messages
   from client to server and from server to client.  (section 4.1)

Q: How do HTTP clients and servers detect the end of a series of message
   header fields (section 4.1)?

Q: Is the CRLF at the end of the message headers optional?

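The "headers continue until an empty line" rule, together with the folding
rule for continuation lines, can be sketched in Python as follows.  This is
a minimal illustration, not a complete parser: the function name
parse_message is made up, a repeated field name simply overwrites the
earlier value, and field names are lower-cased because HTTP treats them as
case-insensitive.

```python
def parse_message(raw):
    """Split raw HTTP message bytes into (start_line, headers, body).

    headers is a dict mapping lower-cased field names to their values.
    """
    # The first empty line (CRLF CRLF) separates the headers from the body.
    head, sep, body = raw.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no empty line found: headers are incomplete")
    lines = head.decode("iso-8859-1").split("\r\n")
    start_line = lines[0]

    # Unfold: a line that begins with a space or tab continues the
    # previous header line.
    unfolded = []
    for line in lines[1:]:
        if line[:1] in (" ", "\t") and unfolded:
            unfolded[-1] += " " + line.strip()
        else:
            unfolded.append(line)

    headers = {}
    for line in unfolded:
        name, colon, value = line.partition(":")
        if not colon:
            raise ValueError(f"malformed header line: {line!r}")
        headers[name.strip().lower()] = value.strip()
    return start_line, headers, body
```

The same function works on a Request or a Response, since both use the
same generic message format; only the first line differs.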
- leading empty lines preceding a Request or Response SHOULD be ignored
  (section 4.1, p.31)

Q: Determine if google.ca, yahoo.ca, and facebook.com adhere to the above
   leading-blank-line SHOULD clause in section 4.1, p.31
   - nc -v google.ca http   OR   telnet google.ca http

- multiple message-header fields with the same name are allowed
  - but only if the entire field-value is a comma-separated list
  - should behave as if they were all on one long field (p.32)

Q: T/F You can always send multiple identical message header fields; the
   HTTP protocol says they will be concatenated.

- message body MUST NOT be included unless specifically allowed (p.33)
  - responses to "HEAD" MUST NOT include a message body (p.33)

Q: T/F All HTTP Responses may include an optional message body.

- HTTP Request and Response messages have the same general format:

     Request       = Request-Line              ; Section 5.1
                     *(( general-header        ; Section 4.5
                      | request-header         ; Section 5.3
                      | entity-header ) CRLF)  ; Section 7.1
                     CRLF
                     [ message-body ]          ; Section 4.3

     Response      = Status-Line               ; Section 6.1
                     *(( general-header        ; Section 4.5
                      | response-header        ; Section 6.2
                      | entity-header ) CRLF)  ; Section 7.1
                     CRLF
                     [ message-body ]          ; Section 7.2

- "general header" fields apply to the message, not to the entity being
  transferred, and they can only be extended by a protocol version change
  (p.35)
- "request header fields" - section 5.3 p.38
  - can only be extended with a protocol change
- "response header fields" - section 6.2 p.39
  - can only be extended with a protocol change
- unknown fields are treated as "entity header" fields
  - you can have custom "entity header" fields without a protocol change

Q: T/F HTTP "general header fields" can appear in both Requests and
   Responses

Q: T/F Unrecognized HTTP header fields are presumed to apply to the entity
   being transferred; they become "entity header" fields

- unlike SMTP (HELO and helo), the HTTP "method token" (e.g. "GET") is
  case-sensitive and must be UPPER CASE ONLY (p.36)
  - but HTTP header field names in HTTP messages are not case-sensitive!
    (p.31)

Q: T/F HTTP allows the use of either "HEAD" or "head" in a Request Line

- servers MUST support at least GET and HEAD (p.36)

Q: What method tokens are the minimum required of an HTTP server?

- A big change made from HTTP 1.0 to HTTP 1.1 was the requirement that
  HTTP 1.1 Requests MUST include the "Host:" header to indicate the
  network location of the web server with which you want to communicate.
  (5.1.2 p.37, 9.0 p.51, 14.23 p.129, 19.6.1.1 p.171)
- With the HTTP 1.1 "Host:" header, a single IP address can now serve
  multiple different web sites, each of which is at the same IP address
  but has a unique network location.
- the network location in an absolute URI over-rides the "Host:" header
  (p.38)
- an unrecognized network location MUST produce a 400 Response

Q: If a client Request contains a host name in both the URI and the Host:
   header, which one has priority?

Q: T/F If a URI or "Host:" header field specify a host name that is not
   recognized on this server, the server MUST forward the request to the
   other host name. (5.2 p.38)

Q: List the names of the mandatory request header field(s) for HTTP 1.1

Q: T/F If you give the host name in a URI using HTTP 1.1, you don't need
   to send the Host: header field, the name in the URI is sufficient.

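The host-selection rules above (section 5.2) can be sketched as one small
Python function.  Everything named here is hypothetical: resolve_host,
known_hosts and default_host are illustration-only names, and the headers
argument is assumed to be a dict with lower-cased field names.  Returning
None stands for "answer with a 400 Response".  Ports in the host name are
not handled.

```python
from urllib.parse import urlsplit


def resolve_host(request_line, headers, known_hosts, default_host=None):
    """Return the host name this Request is for, or None for a 400."""
    parts = request_line.split(" ")
    if len(parts) != 3:
        return None
    _method, uri, version = parts

    if uri.lower().startswith("http://"):
        # An absolute URI over-rides any "Host:" header.
        host = urlsplit(uri).netloc
    elif "host" in headers:
        host = headers["host"]
    elif version == "HTTP/1.1":
        # HTTP 1.1 Requests MUST include the "Host:" header.
        return None
    else:
        # Assumption: an older client sent no host; fall back to a default.
        host = default_host

    # Host names are not case-sensitive; unrecognized hosts get a 400.
    if host is None or host.lower() not in known_hosts:
        return None
    return host.lower()
```

For example, with known_hosts = {"idallen.com", "teaching.idallen.com"},
the Request line "GET http://teaching.idallen.com/foo.txt HTTP/1.1" selects
teaching.idallen.com even if the Host: header says idallen.com.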
HTTP Status Code and Reason Phrase - section 6.1.1 p.39
----------------------------------
- a 3 digit Status Code, machine-readable, followed by a human Reason Phrase
- only first digit has an assigned meaning (one of five) p.40
- five "classes" of response, based on the first digit (p.40)
  - 1xx: Informational - Request received, continuing process
  - 2xx: Success - The action was successfully received,
    understood, and accepted
  - 3xx: Redirection - Further action must be taken in order to
    complete the request
  - 4xx: Client Error - The request contains bad syntax or cannot
    be fulfilled
  - 5xx: Server Error - The server failed to fulfill an apparently
    valid request

Q: What are the five possible meanings of the first digit of an HTTP
   response?

Q: T/F The Reason Phrases given in the HTTP RFC are recommendations only;
   they MAY be changed or replaced with local equivalents without
   affecting the protocol.

Q: T/F HTTP 1.1 clients do not need to understand the meaning of all the
   registered three-digit HTTP 1.1 status codes.

Q: T/F An HTTP client MUST understand all five classes (first digit) of
   Status Codes.

Q: If an HTTP server returns an unrecognized status code to a client, what
   SHOULD the client do with the response? (6.1.1 p.41)

Entity (section 7 p.42)
------
- the "entity" is the thing being transferred, e.g. image, text, etc.
- "entity headers" give information about the entity being transferred
  - may include "extension header" fields
  - unrecognized extension headers SHOULD be ignored
- entity body has a length header and so is 8-bit clean (unlike SMTP)
  - but a transfer coding (chunking) may have been applied to assist transit
- The sender of an HTTP 1.1 message SHOULD give the Content-Type
  - but if not (and only if not), the recipient MAY guess it by inspection
    (7.2.1 p.43)

Q: T/F In the HTTP 1.1 protocol, senders MUST provide the entity
   Content-Type header field.

Q: T/F A recipient may over-ride the Content-Type by inspecting the entity
   being transferred (or its URI).

Q: If no Content-Type is specified, what type is assumed? (7.2.1)

- the entity-length of a message is calculated *before* transfer encodings
  have been applied (i.e. it is the actual length of the entity,
  regardless of how it might be altered to be transferred)
- The Content-Length header, if present, MUST represent *both* the
  entity-length and the actual transfer-length. (4.4 p.33)
- You MUST NOT send a Content-Length field if you apply a Transfer
  Encoding: the Transfer Encoding might change the size, and the Transfer
  Encoding method itself will specify the length.

Q: T/F The Content-Length, if present, is both the real size of the item
   being sent and the size of the actual data being transferred.

Persistent Connections (HTTP 1.1 - section 8.1 p.44)
---------------------------------
- a significant upgrade from HTTP 1.0 - Persistent Connections
- HTTP 1.1 connections default to persistent, even upon error (8.1.2)
- persistent TCP connections have many advantages:
  - fewer TCP handshakes - reduced CPU, memory, latency
  - allow pipelining multiple requests without waiting for responses
  - longer connections allow better TCP congestion control
  - allows HTTP to evolve more gracefully
    - errors don't cause the connection to close
    - no penalty for trying a feature then dropping back to previous
      version

Q: T/F HTTP implementations MUST implement persistent connections. (8.1.1)

Q: T/F A persistent connection MUST drop on an error condition. (8.1.2)

Q: Describe three of four advantages of persistent TCP connections (8.1)

- a "Connection:" header field can ask for explicit connection closing:
      Connection: close

Q: How can you signal the end of an HTTP 1.1 persistent connection?

Q: T/F You signal the end of an HTTP session using the same keyword as
   SMTP - QUIT.

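Because Content-Length gives the transfer-length of the body, a receiver
can tell where one message stops and the next one starts when several
messages arrive back-to-back on one persistent connection.  The sketch
below assumes every message carries a Content-Length header (no chunked
transfer coding) and treats a missing header as a zero-length body; the
name split_messages is made up for illustration.

```python
def split_messages(buffer):
    """Split bytes holding back-to-back HTTP messages into (head, body) pairs.

    Assumes no transfer coding: each body is exactly Content-Length bytes.
    """
    messages = []
    while buffer:
        # Headers end at the first empty line.
        head, sep, rest = buffer.partition(b"\r\n\r\n")
        if not sep:
            raise ValueError("incomplete headers")
        length = 0
        for line in head.split(b"\r\n")[1:]:
            name, _, value = line.partition(b":")
            if name.strip().lower() == b"content-length":
                length = int(value.strip().decode("ascii"))
        if len(rest) < length:
            raise ValueError("incomplete body")
        # The body is exactly 'length' bytes; the next message follows it.
        messages.append((head, rest[:length]))
        buffer = rest[length:]
    return messages
```

Note that the body is cut by byte count, not by searching for a delimiter,
so a body that itself contains an empty line is handled correctly.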
- persistent connections require that all messages have a self-defined
  message length, so you know where the next message begins
  - you can't just end the message by closing the connection

Q: Why do persistent connections need message lengths?

- clients should not pipeline non-idempotent methods or non-idempotent
  sequences of methods, to avoid inconsistent state if the connection
  drops in the middle and the same request has to be sent again

Q: Why not pipeline non-idempotent methods? (8.1.2.2 p.46)

- HTTP does not define any time-out for persistent connections
  (actually, I can't find any time-out for *anything*!)
- connection close events may happen at any time (asynchronous)
- clients SHOULD limit to 2 the number of persistent connections to a server

Premature Server Close - 8.2.4 p.50
----------------------
- an issue with Internet protocols is: if the connection drops, when and
  how often do you try to get it going again?  Try too often and you may
  contribute to network congestion.
- HTTP "MAY" use "binary exponential backoff" of T = R * 2**N (p.50)

Q: T/F HTTP clients MAY double their wait times on each retry against an
   HTTP server.
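
The doubling in T = R * 2**N can be sketched in Python as below.  This is a
loose illustration of the backoff idea, not the exact step-by-step
procedure in section 8.2.4.  It assumes R is an estimate of the round-trip
time to the server and N is the number of previous retries; the names
backoff_delays and retry, and the send_request callable, are hypothetical.

```python
import time


def backoff_delays(rtt, max_retries):
    """Return the wait times T = R * 2**N for N = 0 .. max_retries-1."""
    return [rtt * 2 ** n for n in range(max_retries)]


def retry(send_request, rtt, max_retries, sleep=time.sleep):
    """Call send_request() until it succeeds, doubling the wait each time."""
    for delay in backoff_delays(rtt, max_retries):
        try:
            return send_request()
        except ConnectionError:
            sleep(delay)        # wait T seconds before the next attempt
    # Final attempt: let any error propagate to the caller.
    return send_request()
```

With rtt=0.5 and max_retries=4 the waits are 0.5, 1.0, 2.0 and 4.0
seconds, so a server that stays down is contacted less and less often.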