A Central Caching Proxy Server for WWW users at the University of Melbourne


Daniel O'Callaghan, Department of Information Technology Services, The University of Melbourne, Parkville, Vic 3052, Phone: +61 3 9344 8128, Fax: +61 3 9347 4308, E-mail: <danny@www.unimelb.edu.au>, Home Page: Daniel O'Callaghan [HREF 1]
Keywords: WorldWideWeb, WWW, Proxy, Cache, HTTP

Introduction

This paper explores the benefits of caching WWW transfers by an institution, to reduce traffic on external links, and to provide faster access to popular documents. The University of Melbourne has an institutional WWW proxy server, and is preparing to deploy second-level proxy servers in strategic locations throughout the University.

Background

The World Wide Web is enjoying a period of exponential growth throughout the Internet. The impact of such growth is discussed in Challenges for Web Information Providers [HREF 2] by John December. In the ten months from 1st January, 1994 to 31st October, 1994, WWW traffic on the Internet grew from 251 Gbytes to 2,005 Gbytes, an eight-fold increase - three doublings in roughly 43 weeks, hence a doubling period of approximately 14 weeks (Merit NSFNet data [HREF 3]). The Campus Wide Information System initiative at the University of Melbourne has produced a similar growth pattern of general WWW use, but with a doubling period of 5-6 weeks. Clearly, this growth rate cannot be sustained forever, but we cannot predict when it will begin to taper. Figure 1 shows the number of successful document requests made to the proxy in the six months from September, 1994 to February, 1995, graphed on a logarithmic scale, with the X-axis extended to the end of 1995. The prediction of 1 million requests per week by August 1995 is rather frightening.

To address the issues of bandwidth utilisation and cost of the traffic, the Department of Information Technology Services at the University of Melbourne has established a central WWW server to act as a proxy server for all WWW clients on the campus. The proxy server acts as a cache, keeping copies of fetched documents on disk, and supplying the local copy when appropriate, rather than fetching the original from the overseas site. Use of this system has resulted in a significant reduction in network traffic, and faster response times for University of Melbourne WWW users.
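
As an illustration of the mechanics, the only difference between a direct HTTP/1.0 request and one sent via a proxy is the form of the request line: the proxied form carries the absolute URL, which tells the proxy which remote document to fetch or serve from its cache. The Python sketch below is purely illustrative (the URL is arbitrary), not part of any production software.

    # Illustrative only: the two request forms an HTTP/1.0 client can emit.
    def direct_request(path):
        # Sent straight to the origin server: a bare path.
        return "GET %s HTTP/1.0\r\n\r\n" % path

    def proxied_request(url):
        # Sent to the proxy: the absolute URL, so the proxy can locate
        # the document remotely or in its own cache.
        return "GET %s HTTP/1.0\r\n\r\n" % url

    print(direct_request("/hypertext/WWW/"))
    print(proxied_request("http://info.cern.ch/hypertext/WWW/"))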

Methods

The caching proxy in use is the CERN [HREF 4] httpd [HREF 5], developed by CERN in Switzerland. It provides proxies for the http 1.0, ftp and gopher protocols, as well as acting as a document server. The CERN httpd provides a number of options for fine-tuning the characteristics of the cache, and allows transactions to be logged to separate files according to whether the document was retrieved from the remote site or from the cache.

Initially, the cache size was set to 1.0 GB. HTTP, gopher and ftp documents were kept in the cache for 2 months or until space was needed for newer documents. The time to live for ftp documents in the cache was changed to 7 days on 10th February, 1995, to test the hypothesis that ftp documents used more disk space than was warranted by the traffic savings.
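
A sketch of the resulting expiry policy is given below. It is purely illustrative (Python): the actual policy is implemented through the CERN httpd's own configuration directives, and this merely models the per-protocol time-to-live rules in force after 10th February.

    from datetime import datetime, timedelta

    # Time to live per protocol, as described above.
    TTL = {
        "http":   timedelta(days=60),   # roughly 2 months
        "gopher": timedelta(days=60),   # roughly 2 months
        "ftp":    timedelta(days=7),    # reduced from 2 months on 10/2/95
    }

    def is_expired(scheme, cached_at, now):
        # A cached document expires once it outlives its time to live.
        return now - cached_at > TTL.get(scheme, timedelta(days=60))

    print(is_expired("ftp", datetime(1995, 2, 10), datetime(1995, 2, 20)))   # True
    print(is_expired("http", datetime(1995, 2, 10), datetime(1995, 2, 20)))  # False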

Results

WWW proxy statistics

In the week 29/1/95-3/2/95, University of Melbourne WWW users requested 41,000 documents totalling approximately 550 MB from WWW servers outside the University. 13,000 requests (31.7%) were satisfied from the University's proxy cache, 2,300 requests (5.6%) were satisfied by the clients' own caches (by users running Netscape), and 25,700 requests (62.7%) had to be satisfied by retrieving the original document. Approximately 75% of users in the University use the proxy to access the WWW.

By comparison, the week 5/3/95-11/3/95 saw 83,446 requests for external documents totalling 1,013 MB. 36,301 requests (43.5%) were met by the proxy cache, 3,991 requests (4.8%) were met by clients' caches, and 43,154 requests (51.7%) were satisfied by retrieving the original document.
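
The quoted hit rates follow directly from the raw counts; the short Python script below makes the arithmetic explicit.

    # Recomputing the percentages for the week 5/3/95-11/3/95 from the
    # counts quoted above.
    total = 83446
    for label, count in [("proxy cache hits", 36301),
                         ("client cache hits", 3991),
                         ("remote fetches", 43154)]:
        print("%-18s %5.1f%%" % (label, 100.0 * count / total))
    # proxy cache hits    43.5%
    # client cache hits    4.8%
    # remote fetches      51.7%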

The cache hit rate is dependent on the size of the cache and on the number of users using the proxy server, as Table 1 shows.

--------------------------------------------------------------------
| Date              Number   Cache Hits    Cache Hits   Cache Size |
|                  of users (% requests)   (% bytes)      (MB)     |
| 4th Dec,  1994      217       23           17           320      |
| 21st Jan, 1995      290       31           22         1,022      |
| 11th Mar, 1995      683       44           32           963      |
--------------------------------------------------------------------
Table 1: Growth in cache usage and effectiveness.

The data in the table are also affected by the decision, on 10th February, 1995, to concentrate on caching http traffic rather than treating all protocols equally. The time to live of an ftp document in the cache was reduced to 7 days, while the time to live of http and gopher documents was maintained at 2 months. At the time, the cache contained 204 MB of ftp documents, 67 MB of gopher documents and 790 MB of http documents.

----------------------------------------------------------------------------
| Method   Cache Hits    Cache Hits  Cache           Megabytes  Megabytes  |
|          (% requests)  (% bytes)   holdings (MB)   requested  from cache |
| http        37.7         27.3              790        512.0      138.5   |
| gopher      10.7          6.2               67         20.6        1.3   |
| ftp          7.1          5.5              204         74.5        1.0   |
| All         34.1         24.3            1,061        607.1      143.8   |
----------------------------------------------------------------------------
Table 2: Cache statistics for the period 5/2/95-9/2/95, and cache holdings at 02:42 on 10/2/95.

The rationale for reducing the time to live of ftp documents in the cache was based on a calculation of the dollar cost versus the savings of caching the documents.

If one assumes a cost of $0.55 per MB for fetched documents, and $0.02 per MB per week for document storage, one can calculate the net benefit of caching in purely economic terms. In the week 5/2/95-9/2/95, http cache savings amounted to $76.18 for a storage cost of $15.80, while ftp cache savings were $2.20 for a storage cost of $4.08. Clearly, ftp documents cost more to store than the savings they generate warrant. Converting the 204 MB devoted to ftp documents to http cache would be predicted to realise an additional $18 in http savings for a loss of $2.20 in ftp savings. It should be noted that these calculations do not include the intangible benefit of faster transfer rates, but since only 7% of ftp fetches are served from the cache, reducing this figure further will have little effect on net performance from the user's point of view.
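
The calculation generalises to a simple weekly cost model, made explicit in the Python sketch below. The rates are the assumed figures above, and the http inputs are taken from Table 2; the sketch reproduces the $76.18 and $15.80 quoted in the text.

    # Weekly cost model: savings from cache hits versus storage cost.
    FETCH_COST_PER_MB = 0.55         # $ per MB fetched from a remote site
    STORE_COST_PER_MB_WEEK = 0.02    # $ per MB per week of cache storage

    def weekly_net_benefit(mb_served_from_cache, mb_held_in_cache):
        savings = mb_served_from_cache * FETCH_COST_PER_MB
        storage = mb_held_in_cache * STORE_COST_PER_MB_WEEK
        return savings, storage, savings - storage

    # http figures from Table 2: 138.5 MB served from cache, 790 MB held.
    print(weekly_net_benefit(138.5, 790.0))   # (76.175, 15.8, 60.375)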


----------------------------------------------------------------------------
| Method   Cache Hits    Cache Hits  Cache           Megabytes  Megabytes  |
|          (% requests)  (% bytes)   holdings (MB)   requested  from cache |
| http        49.1         36.8              890        826.1      304.0   |
| gopher       5.7          2.6               73         59.2        1.6   |
| ftp          8.0          1.5                7        111.6        1.7   |
| All         44.2         31.9              970        996.9      307.3   |
----------------------------------------------------------------------------
Table 3: Statistics for the week 5/3/95-11/3/95.

Table 3 shows figures from four weeks after those in Table 2. HTTP requests were satisfied by the cache in 49.1% of cases, and the byte hit rate rose from 27.3% to 36.8%. The overall cache hit rate in bytes was 31.9%, up from 24.3%, a significant improvement. As an aside, the number of users using the proxy rose from 353 to 629 in the same period, still representing only 5-10% of the potential users of the WWW in the University.

Load on the Server

The graph in Figure 1 predicts that the users of the University of Melbourne proxy server will request 1 million documents per week by August, 1995. Assuming a 50-hour working week, to average out peak and off-peak times, this figure implies a sustained connection rate of over 5 per second, and higher still during peak periods. The combination of the current platform (DEC AXP 3000/300, 64 MB RAM, 6 GB HDD) with the CERN proxy software can probably handle a sustained load of no more than 5 connections per second: the CERN daemon forks a separate process for each request, consuming 3 MB of RAM and a significant amount of other system resources. Clearly, if the proxy service at the University of Melbourne is not to collapse under the demand, a strategy for spreading the load must be developed quickly.
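
The arithmetic behind the projection is straightforward, as the Python sketch below shows. Note that the 5-second mean service time is an assumption introduced here for illustration, not a measured figure.

    # Projected load: 1 million requests over a 50-hour working week.
    requests_per_week = 1000000
    working_seconds = 50 * 3600

    rate = float(requests_per_week) / working_seconds
    print("%.1f connections/second sustained" % rate)       # ~5.6/s

    # If each request occupied a forked 3 MB process for an assumed
    # 5 seconds, steady state would need rate * 5 concurrent processes -
    # more process memory than the current 64 MB platform possesses.
    mean_service_time = 5.0
    concurrent = rate * mean_service_time
    print("~%.0f processes, ~%.0f MB of RAM" % (concurrent, concurrent * 3))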

Future Directions - Spreading the load

To spread the load of the proxy, a cascade of proxy servers is planned, and users are already configuring their browsers so that they will take advantage of the local proxies once these are deployed. Seven local proxy servers have been defined by name, with all seven names currently pointing to the central proxy server. When the new local proxies are deployed, the names will be pointed at these machines rather than at the central server. To maintain a consistent University-wide cache, the local proxies will use the central proxy, rather than contacting the remote site directly. Unfortunately, this does not remove the load of forking from the central server, because each of the local proxies will send a "get if modified" request each time a client requests a document, to ensure that the cached copy is up to date. This process is outlined in Figure 2, and sketched in code below it.

Figure 2: The Flow of requests from a client, through two proxy servers, to the document source.
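
For concreteness, a local proxy's revalidation of a cached document against the central proxy might look like the following Python sketch. The proxy hostname, port and URL are assumptions for illustration, not the University's actual configuration.

    import socket

    # A conditional request: the central proxy answers "304 Not Modified"
    # (with no body) if the local proxy's cached copy is still current.
    CENTRAL_PROXY = ("proxy.unimelb.edu.au", 8080)   # assumed name and port

    request = ("GET http://info.cern.ch/ HTTP/1.0\r\n"
               "If-Modified-Since: Sun, 05 Mar 1995 00:00:00 GMT\r\n"
               "\r\n")

    sock = socket.create_connection(CENTRAL_PROXY)
    sock.sendall(request.encode("ascii"))
    status_line = sock.recv(1024).decode("latin-1").split("\r\n")[0]
    print(status_line)   # e.g. "HTTP/1.0 304 Not Modified"
    sock.close()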

When the central proxy has multiple local proxies underneath it, the central proxy must handle the total number of client requests, while each local proxy handles only its own clients.

The solutions to handling this server load problem are:

  1. A more powerful central server.
  2. A method of spreading the load of the central server across multiple machines.
  3. A better proxy application program which uses multi-threading techniques rather than forking new processes.
Option 1 can be very costly in terms of hardware required. Option 2 is worth considering, although it, too, requires additional hardware. Option 3 is the best solution, as it makes more efficient use of computing resources.

Distribution of Requests.

Option 2 above is most effectively brought about by sharing the load according to the domain of the requested document. Figure 3 shows the spread of top-level domains of 30,000 document requests by University of Melbourne users in a one-week period.

Figure 3: Distribution of proxy requests during a 1 week period at the University of Melbourne. The top level domain indicated is that of the requested document.

The domains .com, .edu and .au are the biggest sources of documents, and modifying the proxy code to select a proxy according to the domain of the requested document would permit the load to be spread over four or more top-level machines. Such a topology, illustrated in Figure 4, would enable neighbouring institutions to share a set of proxy servers, taking advantage of the combined caches without swamping a single parent proxy server with every request issued by users in the group of institutions. An alternative proposal is a system in which neighbouring proxy servers can query each other for documents; such a system has been developed - see Harvest Cache below [HREF 6].

Figure 4: A scheme for spreading the proxy load across servers dedicated to separate domains.

The modifications to the proxy code to perform the appropriate discrimination based on the domain of the requested document have not yet been made, but they are not anticipated to be difficult; a sketch of the intended selection logic is given below.
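
The Python sketch that follows illustrates the idea; the per-domain proxy hostnames are invented for illustration only.

    from urllib.parse import urlparse

    # Choose a parent proxy from the top-level domain of the requested
    # document's host; fall back to a default for all other domains.
    DOMAIN_PROXIES = {
        "com": "proxy-com.unimelb.edu.au",
        "edu": "proxy-edu.unimelb.edu.au",
        "au":  "proxy-au.unimelb.edu.au",
    }
    DEFAULT_PROXY = "proxy.unimelb.edu.au"

    def select_parent_proxy(url):
        host = urlparse(url).hostname or ""
        tld = host.rsplit(".", 1)[-1]
        return DOMAIN_PROXIES.get(tld, DEFAULT_PROXY)

    print(select_parent_proxy("http://www.unimelb.edu.au/"))  # -> proxy-au...
    print(select_parent_proxy("http://www.w3.org/"))          # -> default proxy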

Alternative proxy server programs and algorithms

The CERN httpd is not the only WWW proxy server, but it is the most widely used. Others include:
Lagoon [HREF 7]

Developed at the Technische Universiteit Eindhoven. The authors themselves state that Lagoon is simply an alternative to the CERN daemon.

web-proxy

Duane Wessels <wessels@colorado.edu> of the University of Colorado has developed web-proxy, a proxy program which uses multi-threading techniques to handle multiple simultaneous connections within a single daemon process, rather than forking a new copy of the proxy process for each client connection. The reduction in overhead on the proxy computer is approximately 80%. Thus, the projected maximum sustainable load for the University of Melbourne would rise from 5 connections per second to nearly 30 connections per second, based on figures quoted in Dr Wessels' PhD thesis [HREF 8]. Dr Wessels also promotes the idea of long-term and short-term caches, based on a study of the time between first retrieval and cache retrieval of http documents, and he has developed a mechanism for communication between document servers and caching proxy servers.
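
The architectural difference is easy to show in miniature. The Python sketch below (a stub handler, not Dr Wessels' actual code) services every connection with a lightweight thread inside a single daemon process, rather than forking.

    import socket
    import threading

    def handle(conn):
        try:
            conn.recv(4096)                           # read the request
            conn.sendall(b"HTTP/1.0 200 OK\r\n\r\n")  # stub reply
        finally:
            conn.close()

    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("", 8080))
    server.listen(128)
    while True:
        conn, _addr = server.accept()
        # Threads share the daemon's address space, so the per-connection
        # cost is far below that of a forked copy of the whole proxy.
        threading.Thread(target=handle, args=(conn,)).start()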

Harvest Cache

The University of Colorado and the University of Southern California have collaborated to develop the Harvest Cache [HREF 6], a proxy-cache application which allows parent and neighbour proxy servers to query each other for documents. A document is fetched from the closest neighbour holding it, where closeness is calculated from the network round-trip time of a 'ping'. This algorithm allows for distributed top-level proxy servers without resorting to configuring second-level proxy servers to select a top-level proxy by the domain of the requested document's location. The same team has developed an httpd accelerator [HREF 6], a multi-threaded front-end which can reduce the load on any WWW server by a factor of up to 200.
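
In outline, neighbour selection queries every neighbour and fetches from the fastest positive responder. The Python sketch below is illustrative only: Harvest uses its own UDP query protocol, and the message format, port number and hostnames here are invented.

    import socket
    import time

    NEIGHBOURS = ["cache1.example.edu.au", "cache2.example.edu.au"]

    def fastest_holder(url, timeout=0.5):
        # Returns the neighbour holding the document with the lowest
        # round-trip time, or None if no neighbour holds it.
        best, best_rtt = None, timeout
        for host in NEIGHBOURS:
            s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
            s.settimeout(best_rtt)          # wait only for a faster reply
            start = time.time()
            try:
                s.sendto(b"QUERY " + url.encode("ascii"), (host, 3130))
                reply, _ = s.recvfrom(512)
                rtt = time.time() - start
                if reply.startswith(b"HIT") and rtt < best_rtt:
                    best, best_rtt = host, rtt
            except socket.timeout:
                pass
            finally:
                s.close()
        return best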

Conclusions

This paper gives an insight into the necessity of caching WWW documents, and into the difficulties of installing a centralised WWW proxy server for a large institution. Proxying and caching will play a very important part in preventing the Internet from collapsing under its own weight as the WWW user base and document base grow at an alarming rate. Institutional and regional proxies will become necessary, but the design of these systems must allow for at least a 10-fold increase in demand per year, based on NSFNet statistics, or a 100-fold increase based on the University of Melbourne's growth in demand. The deficiencies in the present generation of proxy applications have also been discussed. Fortunately, the next generation of applications shows promise of being able to sustain the load which a large user base will demand of an institutional proxy server.

Hypertext References

HREF 1
http://www.unimelb.edu.au/~danny/ - Dr Daniel O'Callaghan's home page
HREF 2
http://sunsite.unc.edu/cmc/mag/1994/oct/webip.html - Challenges for Web Information Providers by John December
HREF 3
ftp://nic.merit.edu/statistics/ - NSFNet Statistics Archive
HREF 4
http://www.cern.ch/ - The Home Page of CERN
HREF 5
http://www.w3.org/hypertext/WWW/Daemon/User_3.0/ - CERN httpd Documentation
HREF 6
http://excalibur.usc.edu/ - Harvest Cache and HTTPD Accelerator Project
HREF 7
http://www.win.tue.nl/lagoon/ - Lagoon, from TU-Eindhoven
HREF 8
http://morse.colorado.edu/~wessels/Proxy/ - Web Proxy, by Duane Wessels

Copyright

© Southern Cross University, 1995. Permission is hereby granted to use this document for personal use and in courses of instruction at educational institutions provided that the article is used in full and this copyright statement is reproduced. Permission is also given to mirror this document on WorldWideWeb servers. Any other usage is expressly prohibited without the express permission of Southern Cross University.