Transparently Guaranteeing Fault Tolerance, Geographic Affinity, and Load Balanced Mirror Sites

Brian Wingenroth, Web Development Specialist, Johns Hopkins University Press [HREF1], Baltimore, Maryland, US. brian@muse.jhu.edu

Thomas Maszczenski, Unix Systems Administrator, Johns Hopkins University Press [HREF1], Baltimore, Maryland, US. thom@muse.jhu.edu

Abstract

In theory, content providers use mirror sites to enhance the reliability, efficiency and flexibility of their web sites. However, simply setting up a mirror site guarantees none of these benefits. Each mirror site is configured with a different Internet Protocol (IP) address and possibly a different domain name. The only way one can access the mirror is by knowing its IP address a priori. Various solutions to this problem have been proposed, yet each has unique flaws. This paper presents another solution that utilises communication between mirror sites to enable transparent failover in the event of a single site failure, transparent redirection of a user to the closest copy of the web site, and load balancing to distribute requests so that no single site is disproportionately overburdened. We propose a solution that demands nothing of the end user and very little from the content provider. Our software solution can be implemented at any web site, realising the potential benefits of a mirror site.

Introduction

Geographic distribution of web content via mirror sites is vital to ensure availability and speed of a resource. At Project Muse™ [HREF2] [HREF3] we have two mirror sites: one at Johns Hopkins University in the United States and another at the University of Queensland in Australia. By having these geographically dispersed content sources, we aspire to not only guarantee availability in the event of a failure without asking anything of the user, but also to transparently redirect the user to the closest mirror site, thereby providing the most responsive browsing experience. Contemporary internet protocols do not inherently provide adequate means to achieve these goals.

Currently, when a user visits a web site, the browser issues a request containing a Uniform Resource Locator (URL). Within this URL are a server name and a document path. The inherent limitation of this exchange is in the server name. Locating the machine referenced by this server name is a task for the Domain Name System (DNS). In the DNS, the server name commonly points to one Internet Protocol (IP) address. If a site is down, the only way for a user to locate the mirror site is to know either the domain name or the IP address of the mirror site. While the DNS can be configured such that the domain name points to multiple IP addresses, the result of this configuration is that the IP addresses are issued to users in a round-robin fashion (Albitz and Liu, 2001). This means that if we have n servers containing our content, every nth user will be directed to the same machine and each machine will receive one-nth (1/n) of the requests. While this is an improvement over having every request serviced by a single machine, it suffers from a few shortcomings. If one machine fails, every request directed to that machine will fail. Consider a web site with four replicated mirror sites. With one failure, even if there are three running machines with current and perfectly usable copies of the web site available, one-fourth of the requests to this web site will fail, since exactly one-fourth of the responses from the DNS will point to the IP address of the machine that is down.

Keeping with this sample web site for a moment, let us say that two of these machines are in the US, one is in Australia, and one is in Tanzania. The round-robin technique used in the DNS ignores geography completely. If there were four users in Sydney accessing our web site consecutively, only one of these four would connect to the mirror site in Australia. Two would access the site in the US, and the other the site in Tanzania.

What we want is the ability to do three things: first, to have the DNS fail over to a second IP address in the event either mirror site is down. This would guarantee availability of our web site as long as a single mirror site is available. Second, we want to redirect users to the closest copy of our site based on their geographic location. For example, users in Australia should reap the benefits of our having a mirror in Australia. Finally, it would be ideal for us to balance incoming requests to distribute the load among the mirror sites. After identifying further motivation for our work, we review previous work on these problems and address each of these issues in detail.

Motivation

The current state of the web precludes us from achieving the results we desire. Specifically, in the DNS, we have the option of issuing multiple IP addresses in round-robin order, but this is suboptimal for the reasons discussed above. Furthermore, this in no way guarantees the most responsive connection for a user.

To determine the geographic origin of the user's request, the best we can hope for is a reverse hostname lookup. In the access logs for a web site we store the IP address of every visitor. A reverse hostname lookup yields the name registered in the DNS for that address. This is useful in determining a geographic location when the hostname ends with a country code top-level domain (TLD). Alternatively, we can utilise the various WHOIS databases [HREF13] to supplement the data of a reverse hostname lookup so that we know the country through which the user is connecting. With this information we can assign certain mirror sites to service the requests from certain countries. While useful for the obvious cases, e.g. servicing all Australian users with a mirror in Australia and all users from North America with a mirror in the United States, it does not take into account any performance metric at all. Any user in between would be assigned almost haphazardly to a mirror that is in no way proven as the best choice. Any method of slicing the world in half is arbitrary at best and can in no way guarantee an optimal connection.
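The TLD-based approach described above can be sketched in a few lines. This is a minimal illustration, not our production software: the mirror hostnames in the mapping table are our two real sites, but the idea of keying a mirror assignment off the country-code TLD of a reverse lookup is exactly the "obvious cases only" heuristic being critiqued here.

```python
# Sketch: assign a mirror based on the country-code TLD obtained from a
# reverse hostname lookup. The TLD-to-mirror table is illustrative and
# covers only the "obvious cases"; everyone else gets a default mirror.
import socket

TLD_TO_MIRROR = {
    "au": "muse.uq.edu.au",   # Australian users -> Australian mirror
    "us": "muse.jhu.edu",     # US users -> US mirror
}
DEFAULT_MIRROR = "muse.jhu.edu"   # arbitrary fallback for everyone else

def mirror_for_hostname(hostname):
    """Pick a mirror from the hostname's top-level domain, if recognised."""
    tld = hostname.rsplit(".", 1)[-1].lower()
    return TLD_TO_MIRROR.get(tld, DEFAULT_MIRROR)

def mirror_for_ip(ip):
    """Reverse-resolve an IP address and pick a mirror; fall back on failure."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
    except OSError:
        return DEFAULT_MIRROR
    return mirror_for_hostname(hostname)
```

Note the weakness this makes explicit: any client whose reverse lookup ends in a generic TLD such as .com falls through to an arbitrary default, with no performance metric consulted.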

Load balancing has been studied extensively; detailed discussions can be found elsewhere (see [HREF5], [HREF6]).

Related Work

One solution to the geographic affinity problem, proposed by the Center for Networking and Distributed Systems at Johns Hopkins University (CNDS), takes advantage of a DNS implementation that consults the most responsive of two authoritative servers (Amir et al., 1998). This proposal depends on every implementation of the DNS mimicking the behaviour of BIND (Berkeley Internet Name Domain), which records the round-trip time (RTT) to each authoritative nameserver and continues using the authoritative nameserver with the lowest RTT (Albitz and Liu, 2001). Using this approach requires creating an authoritative source of our DNS record at each mirror site. The hope is that clients closer to a mirror in Australia would consult the authoritative DNS server in Australia. This proposal is very dependent on a particular implementation of the DNS and thus not broadly applicable.

Other work at the CNDS has attempted to address this problem in more depth. Amir et al. (1999) address the three points of transparent failover, geographic affinity and load balancing. Their solution does address transparent failover in a similar fashion to ours, but requires a more demanding architecture. Their solution to the geographic affinity problem is based on their work discussed above; consequently it suffers the same limitations. Amir et al. (2002) again address failover. In the event of failure, the solution they propose dynamically allocates the publicly known IP address for the web site to an available host. A problematic prerequisite to this proposal, as they point out, is that any machine capable of assuming the publicly known IP address must exist on the same network. This is not always feasible -- particularly when the mirror sites are located on different continents.

Karaul et al. (1998) propose addressing these problems on the client-side. They argue that a scalable solution to finding the best mirror site is best achieved by the client using their WebSeAl software. Any client side solution suffers the limitation that it requires widespread adoption of a particular piece of software. Though this adoption can be achieved either by individual users installing the software or by browser vendors including it in their products, it requires a content provider to depend on client-side adoption.

One comprehensive solution that addresses failover, the geographic affinity problem and load balancing is provided by Akamai Technologies, Inc. Akamai addresses these problems with "more than 13,500 servers in more than 1,000 networks in 66 countries" (Akamai, 2002). By placing a content server with each customer's web site in the local network of many ISPs, they can guarantee a content source near every user. Our work attempts to address these problems in a manner that, while more intensive on the part of the content provider, can be implemented by anyone with distributed sources of content.

There are other commercial solutions to the geographic affinity problem. Digital Envoy [HREF11], Quova [HREF7], MaxMind [HREF9] and CyberSource [HREF10] are four such companies with products that address geo-location. The goal of these products is to pinpoint the physical location of the user to some certain granularity -- be it country, city, or a longitude and latitude. This information is valuable for detecting credit card fraud or charging subscriptions to overseas customers [HREF8], but geo-location solutions only reduce the problem to one of slicing the world into arbitrary sections. There is no guarantee of delivering the closest copy.

The remainder of this paper will address the first two of these issues. We explicate our solution to the transparent DNS failover problem in detail. A novel solution is proposed to address the geographic affinity problem. Load balancing is an issue we have left to others due to space considerations. See [HREF5], [HREF6] for excellent treatment of this topic.

Transparent DNS Failover

Identifying ways to assure availability of a domain name in the event of a server failure is not new (see [HREF4] for a similar solution), but we detail our version here since it differs from any we have seen. Assume there are two servers in different geographic locations. Each of these servers polls the other server, checking that the replicated resource is available. It would be insufficient to merely ping a server, as the server could be alive while the service is failing. Upon recognising a failure, the DNS record is updated to point the domain name to the active server's IP address. This is an easily automated task, assuring that within minutes of a failure, the failover can occur without a user noticing the change. The only requirement for implementing such a solution is that the resource provider must control the DNS record for its domain names. The DNS is hosted authoritatively in both locations (Albitz and Liu, 2001).
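The monitoring loop can be sketched as follows. This is a minimal illustration, assuming a nameserver that accepts RFC 2136 dynamic updates via BIND's nsupdate utility; the hostnames, addresses and health-check URL are placeholders, not our production configuration. Note that the health check fetches a known page over HTTP rather than pinging, so a live host with a failed web service is still detected as down.

```python
# Sketch of a failover monitor run at each mirror site. Assumes the content
# provider controls the zone and the nameserver accepts RFC 2136 dynamic
# updates (pushed here via BIND's nsupdate). Names/addresses are placeholders.
import subprocess
import urllib.request

def service_is_up(url, timeout=10):
    """Fetch a known page; merely pinging would miss a dead web service."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def failover_target(peer_up, peer_ip, our_ip):
    """Decide which address the public name should point to."""
    return peer_ip if peer_up else our_ip

def update_dns(name, ip, ttl=300):
    """Repoint the public name with an RFC 2136 dynamic update."""
    script = (f"update delete {name} A\n"
              f"update add {name} {ttl} A {ip}\n"
              "send\n")
    subprocess.run(["nsupdate"], input=script, text=True, check=True)
```

In practice each site would run this check on a short interval and issue the update only when the peer's state changes, avoiding needless churn in the zone.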

The only downtime for the site is the amount of time it takes for the change in the DNS to propagate across the Internet. This time is bounded by the time-to-live (TTL) set in the DNS record for the domain.

Geographic Affinity

A solution to the geographic affinity problem directs a user to the closest mirror site. Our metric for closeness is not physical distance, but the shortest round-trip time. There are two assumptions that are worth noting. One is to recall the structure of the Internet. All hosts on the Internet are interconnected -- much like a graph. Ergo, there are many paths to get from one host to another. Given two mirror sites, we are interested in the fastest route from a user to any one mirror. The second assumption is that all users who connect to the Internet do so through a service provider, be it a commercial ISP, a campus network, or a corporate network. The utility of this assumption lies in recognising that it is not essential to find a precise route to an individual user. It is sufficient to find the route to their service provider's network. In other words, it is not necessary for us to pinpoint that a user on a corporate network in Sydney is in building B, floor 2. Knowing that this user is connected through the company's network is enough.

Tracking the "distance" between networks is straightforward using a tool like traceroute. The data provided by traceroute gives us a list of routers in the path each network packet traverses and the RTT from the source host to each of these routers. From these data we can record a reasonable estimate of the RTT between each network in the route.
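Extracting these data is a matter of parsing the traceroute output. The sketch below assumes the common Unix output format (hop number, hostname, parenthesised IP address, three RTT probes) and keeps the first probe's RTT per hop; hops that time out ("* * *") are skipped. The exact format varies between traceroute implementations, so a production parser would need to be more forgiving.

```python
# Sketch: parse traceroute output into (hop, router IP, RTT in ms) records.
# Assumes the common Unix format; unreachable hops ("* * *") are skipped.
import re

HOP_RE = re.compile(
    r"^\s*(\d+)\s+\S+\s+\((\d+\.\d+\.\d+\.\d+)\)\s+([\d.]+)\s*ms")

def parse_traceroute(output):
    """Return a list of (hop_number, router_ip, first_rtt_ms) tuples."""
    hops = []
    for line in output.splitlines():
        m = HOP_RE.match(line)
        if m:
            hops.append((int(m.group(1)), m.group(2), float(m.group(3))))
    return hops
```

The last tuple in the returned list is the endpoint of the trace -- the router whose RTT we record in the link table as our estimate of the distance to that network.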

The Database

Database Schema
Figure 1 - The database schema

Recording the RTT data is kept purposefully simple in a database. We require two tables: a network table to record distinct networks (or IP addresses), and a link table to record the RTT between any two networks (see figure 1). Populating these tables is straightforward. We already have a list of IP addresses available in our log files -- this populates our network table. For our purposes we can consider a single IP address functionally equivalent to a network. This assumption is justified because the endpoint of a traceroute is likely to be not the target host itself but the router through which our target address and others on the same network access the Internet. When a traceroute never reaches the target IP address, we take the data for the last router reached. We assume that this router will be the endpoint of a traceroute from any of the mirror sites. Thus, finding the smallest RTT to this router is equivalent to finding the closest copy of the web site.

For each entry in the network table, we execute a traceroute from each of the n mirror sites. By parsing the result of the traceroute we obtain enough data to populate the link table. Each entry in the link table has an expiration date, after which the traceroute is executed again to maintain the most current data.
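The schema and the "closest copy" query can be sketched concretely. This is an illustrative reconstruction of the two tables in figure 1 using SQLite; the column names, sample addresses and RTT values are placeholders, not our actual schema or measurements.

```python
# Illustrative sketch of the two-table schema from figure 1, in SQLite.
# Column names and the sample data are placeholders.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE network (
        ip      TEXT PRIMARY KEY           -- client network (or lone address)
    );
    CREATE TABLE link (
        mirror  TEXT NOT NULL,             -- mirror site that ran the traceroute
        ip      TEXT NOT NULL REFERENCES network(ip),
        rtt_ms  REAL NOT NULL,             -- measured round-trip time
        expires TEXT NOT NULL,             -- re-run the traceroute after this date
        PRIMARY KEY (mirror, ip)
    );
""")
conn.execute("INSERT INTO network VALUES ('203.0.113.9')")
conn.executemany("INSERT INTO link VALUES (?, ?, ?, ?)", [
    ("muse.jhu.edu",   "203.0.113.9", 230.0, "2003-12-31"),
    ("muse.uq.edu.au", "203.0.113.9",  25.0, "2003-12-31"),
])

# The closest mirror for a client network is the one with the smallest RTT.
row = conn.execute(
    "SELECT mirror FROM link WHERE ip = ? ORDER BY rtt_ms LIMIT 1",
    ("203.0.113.9",)).fetchone()
```

With n mirrors there are at most n link rows per network, so the closest-copy lookup stays cheap even as the network table grows.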

Architecture

This scheme is a distributed system by its nature. Populating the data is the responsibility of each mirror site. We choose to insert a machine, the request distributor, in front of the web server (see figure 2).
Architecture for a single web site
Figure 2 - Architecture for a single web site
This machine collects the RTT data and communicates with its peers at the other mirror locations to ensure that the data are collected so the client is directed to the closest copy.

Each request distributor receives the web requests intended for the web content server directed to it by the DNS. Knowing the location of all the other existing copies of the resource, it can then query the database to find which copy is closest to an incoming client. The client is redirected to the closest available copy.
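The distributor's redirect decision can be sketched as follows. This is a minimal illustration under stated assumptions: rtts stands in for the result of the link-table query above (mapping mirror hostname to RTT for the client's network), and available is the set of mirrors currently known to be up; the function names are ours, for illustration only.

```python
# Sketch of the request distributor's redirect decision. `rtts` plays the
# role of the link-table query result for the client's network; `available`
# is the set of mirrors the failover monitor currently believes are up.

def closest_mirror(rtts, available):
    """Return the available mirror with the smallest recorded RTT, or None."""
    candidates = {m: rtt for m, rtt in rtts.items() if m in available}
    if not candidates:
        return None
    return min(candidates, key=candidates.get)

def redirect_location(rtts, available, path):
    """Build the HTTP redirect target, or None to serve the request locally."""
    mirror = closest_mirror(rtts, available)
    if mirror is None:
        return None
    return f"http://{mirror}{path}"
```

Restricting the choice to available mirrors is what combines the geographic affinity mechanism with the transparent failover described earlier: a down mirror is never the redirect target, however close it is.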

When a new client, or one for which data are not currently available, requests the web site, the client is simply forwarded on to the copy associated with the machine receiving the request. A message is then sent between geographic sites queueing this client's address so it is included in the data set. This ensures that data are collected for this network. Subsequent requests from this network can then be handled accordingly.

Localised Communication

Communication between peer request distributors utilises web services. When an incoming request is received for which data do not yet exist, the IP address is added to the network table and a message is sent to all peer distribution nodes requesting that a traceroute be executed. The results of this are communicated between peer nodes. Updates to any data are handled similarly using this communication infrastructure.
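One possible shape for these peer messages is sketched below. The field names and JSON serialisation are illustrative assumptions, not the wire format we actually use; the point is only that the message carries the newly seen address and its origin so every peer can queue a traceroute.

```python
# Illustrative peer-to-peer message asking each distributor to traceroute a
# newly seen address. Field names and JSON framing are assumptions.
import json

def make_trace_request(ip, origin):
    """Serialise a request that every peer run a traceroute to `ip`."""
    return json.dumps({"type": "trace_request", "target": ip, "origin": origin})

def handle_message(raw):
    """Dispatch an incoming peer message; returns the local action to take."""
    msg = json.loads(raw)
    if msg["type"] == "trace_request":
        return ("traceroute", msg["target"])   # queue a local traceroute
    return ("ignore", None)
```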

Global Participation

While this scheme is useful for an isolated set of mirror sites, it also has enormous potential on a global scale. Consider the scheme described above augmented by a set of parent nodes (see figure 3).
Global Architecture
Figure 3 - Global Architecture
Local request distributors would pool their data at a parent node (or master server) creating a globalised copy. There would be a set of master servers that communicate amongst themselves maintaining global versions of the network and link tables. An interested participant in the scheme could then query any master server for a copy of the link table containing only rows relevant to the set of replicated servers pertinent to the participant. This optimisation reduces the network traffic between a master server and a local request distributor.

The advantages to a global scheme are compelling. Multiple participants guarantee greater coverage of the Internet and thus a more complete data set. The new client problem is also simplified. There are in practice a set of servers, the root DNS servers for example, from which we can collect RTT data for the link between each network and the well-known site. Thus when RTT data between the mirror site and the client is not present in our data, we can obtain a value for this RTT using at most two links -- one from the mirror site to the well-known site and one from the client to the well-known site.
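The two-link estimate can be written down directly. In this sketch, rtt maps (host, well-known site) pairs to measured RTTs; with several well-known sites available we take the cheapest common intermediary, which gives an upper bound on the true client-to-mirror RTT. The data structure and function name are illustrative.

```python
# Sketch of the two-link RTT estimate: when no direct measurement between a
# mirror and a client exists, bound it by going through a well-known site
# (e.g. a root DNS server) both sides have been measured against.

def estimated_rtt(mirror, client, rtt):
    """Smallest two-link estimate over all common well-known sites.
    `rtt` maps (host, well_known_site) -> measured RTT in ms."""
    common = ({wk for (h, wk) in rtt if h == mirror} &
              {wk for (h, wk) in rtt if h == client})
    return min(rtt[(mirror, wk)] + rtt[(client, wk)] for wk in common)
```

Because each leg is a real measurement, the sum is a usable first approximation until a direct traceroute to the client's network completes and replaces it.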

Globalised Communication Between Nodes

A significant challenge to the globalised version of this scheme is minimising the network traffic required for it to function. Transferring all at once an entire database that could contain an entry for every IP address on the Internet, and an entry for the RTT between every pair of addresses, is certainly prohibitive.

To remedy this, updates are event-driven -- that is, when any node in the scheme receives an update, it is propagated to each node for which a communication path exists. The impact of the data transfer is distributed over time, easing the effects on the network.

Paths of communication exist between a request distributor and a master server, and between the master servers. Master servers communicate with each other to maintain synchronised copies of the data. A versioning system like that explained by Ozsu and Valduriez (1999) uses timestamps to keep the master server copies current. A request distributor, knowing the set of IP addresses of its peer replicas, needs only to keep current a copy of the rows in the link table containing the addresses of its peer replicas. Intermittent polling of a master server should be sufficient for this task.
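Timestamp-based reconciliation between master servers can be sketched simply: for each link row, the copy with the newest timestamp wins. This illustrates the versioning idea only; the in-memory tables and field layout are placeholder assumptions.

```python
# Sketch of timestamp-based reconciliation between master servers. Each
# table maps (mirror, network) -> (rtt_ms, timestamp); the newest row wins.

def merge_links(local, remote):
    """Merge a peer master's link rows into ours, keeping newer timestamps."""
    merged = dict(local)
    for key, (rtt, ts) in remote.items():
        if key not in merged or ts > merged[key][1]:
            merged[key] = (rtt, ts)
    return merged
```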

Future Directions

Maintaining the scalability of this scheme is an important consideration. On the horizon is the widespread adoption of Internet Protocol Version 6 (IPv6). One major impact IPv6 will have is extending the IP address space by a significant factor -- from a space of 4.2 x 10^9 addresses to one containing 3 x 10^38 addresses (Morton, 1997). The effect of this transition is only relevant to our scheme in terms of the size of our data set, that is, the size of our network and link tables will need to grow to accommodate the proliferation of addresses. While this is a notable concern, all data sets containing IP address information will need to scale to meet this requirement.

One additional goal we wish to explore is the implementation of our geographic affinity solution in an Apache handler. Doing so eliminates the hardware requirement of the scheme as it is laid out above. While this ties us to one specific web server, the ideas presented herein are general enough to be implemented for any platform running any web server.

Conclusion

This scheme, as laid out above, requires nothing of an end user and very little of a content provider. To provide transparent failover in the DNS, a content provider needs authority over the domain names for the web site. Software is then installed at each mirror site to monitor the other servers. Upon detection of a failure, the DNS is updated (Vixie et al., 1997) without manual intervention. The total downtime for the web site is thus the time until detection of the failure plus the time necessary for the DNS update to propagate across the Internet. To provide better utilisation of a mirror site, we have proposed a software solution that redirects users to the geographically closest (or most responsive in terms of RTT) copy of the web site. In this scheme the user requests the same well-known URL and this request is transparently serviced by the "best" mirror site. Incorporating load balancing, while not addressed in this work in detail, is straightforward. The mirror sites use the communication infrastructure outlined herein to exchange data about server load and available resources.

Significant features distinguish our solution from previous work. Our solution is simple to implement. It exists completely in software. It is open source [HREF12] and platform independent. Given a sufficient number of participants, it has enormous potential on a global scale. This is a simple, open-source implementation that transforms a set of under-used mirror sites into a reliable, flexible, and efficient globalised web presence.

References

Akamai Technologies, Inc. (2002). Applications for Akamai EdgeScape. Akamai White Paper.

Albitz, Paul and Cricket Liu. (2001). DNS and BIND, Fourth Edition. Sebastopol: O'Reilly & Associates.

Amir, Yair, Alec Peterson, and David Shaw. (1998). "Seamlessly Selecting the Best Copy from Internet-Wide Replicated Web Servers." Proceedings of The 12th International Symposium on Distributed Computing (DISC'98) (formerly WDAG), Andros, Greece.

Amir, Yair and David Shaw. (1999). "WALRUS - a Low Latency, High Throughput Web Service Using Internet-wide Replication." Proceedings of the 19th IEEE ICDCS Workshop on Electronic Commerce and Web-based Applications, pages 31-40, Austin.

Amir, Yair, Ryan Caudy, Ashima Munjal, Theo Schlossnagle, Ciprian Tutu. (2002). "N-Way Fail-Over Infrastructure for Survivable Servers and Routers." Technical Report CNDS-2002-5.

Karaul, Mehmet, Yannis A. Korilis, and Ariel Orda. (1998). "A Market-Based Architecture for Management of Geographically Dispersed, Replicated Web Servers." ICE 98, Charleston, SC.

Morton, David. (1997). Understanding IPv6. http://www.itp-journals.com/nasample/c0655.pdf

Ozsu, M. Tamer and Patrick Valduriez. (1999). Principles of Distributed Database Systems, Second Edition. Upper Saddle River: Prentice Hall.

Vixie, P., S. Thomson, Y. Rekhter, and J. Bound. (1997). Dynamic Updates in the Domain Name System (DNS UPDATE). RFC 2136.

Hypertext References

HREF1
http://www.press.jhu.edu/
HREF2
http://muse.jhu.edu/
HREF3
http://muse.uq.edu.au/
HREF4
http://www.columbia.edu/~alan/akamai.html
HREF5
http://www.onjava.com/pub/a/onjava/2001/09/26/load.html
HREF6
http://www.webtechniques.com/archives/1998/05/engelschall/
HREF7
http://www.quova.com
HREF8
http://www.quova.com/solutions/times_online.html
HREF9
http://www.maxmind.com
HREF10
http://cybersource.com/products_and_services/verification_and_compliance_services/geo-location
HREF11
http://digitalenvoy.com
HREF12
http://www.opensource.org
HREF13
http://www.allwhois.com

Copyright

Brian Wingenroth, Thomas Maszczenski and The Johns Hopkins University Press, © 2003. The authors assign to Southern Cross University and other educational and non-profit institutions a non-exclusive licence to use this document for personal use and in courses of instruction provided that the article is used in full and this copyright statement is reproduced. The authors also grant a non-exclusive licence to Southern Cross University to publish this document in full on the World Wide Web and on CD-ROM and in printed form with the conference papers and for the document to be published on mirrors on the World Wide Web.