Lloyd Sokvitne, Manager, Tasmania Online[HREF1], State Library of Tasmania, 91 Murray St., Hobart, Tasmania, Australia, 7000 lloyd.sokvitne@central.tased.edu.au
This paper discusses the conceptual and practical issues that relate to the preservation of World Wide Web resources. It is argued that the Web deserves preservation and that the technical problems can be overcome. The experience of the State Library of Tasmania through its Our Digital Island preservation Web site is used to illustrate that Web preservation is practical and affordable. National and cooperative action by relevant organisations will be needed to ensure that significant Web preservation occurs in the future.
In this paper I propose to discuss the issues of World Wide Web preservation from a practical and pragmatic point of view. As Eric Wainwright [HREF2], then Deputy Director-General of the National Library of Australia, said about digital preservation in 1995: "There is an enormous amount of research going on, but we can't afford to wait for the results of the research. Actions are needed now." [emphasis from original paper]. The rate of growth in World Wide Web content in the past four years has compounded this need for immediate action.
Disappointingly, there has been little widespread Web preservation activity since the Web's inception, and there appear to be three basic reasons for this. Firstly, it is easy for organisations to dismiss the World Wide Web as unworthy of serious preservation. Secondly, the World Wide Web is seen as technically difficult to capture and preserve. Thirdly, there are no clear lines of responsibility as to who should capture and preserve Web content.
The rationale for preserving content on the World Wide Web is based on the inherent value of this content, and on the importance of the Web as a medium for intellectual and creative activity. Perversely, it is also the range of the content on the Web and the nature of the medium itself that are often used to justify doing little to preserve it.
The concept that the products of human intellectual and creative endeavour should be preserved for future generations is generally accepted by society. In other words there is a value placed on the preservation process. It is when one attempts to provide resources for the processes of preservation that difficult issues begin to emerge. Preservation has a cost as well as a value.
Preserving World Wide Web content has a cost, both in absolute terms and in the reprioritisation that must often occur if an organisation is already involved in other preservation activities. Against this background, it is tempting to say that most of the Web is "ephemera" and that its value does not justify the expense.
Material on the World Wide Web ranges from the sublime to the ridiculous. There are examples of creative genius as well as trivial banality, but it is the banality that receives the widest attention. The phrase 'surfing the Internet' reflects the common view that Web content is light and trivial, and that access to it can be treated as a casual activity rather than a serious endeavour.
Most preservation agencies will agree that it is extremely dangerous for the current generation to decide what type of resource will be important to future generations. While it may be difficult to decide what is valuable in today's environment, using criteria that we understand from our own time, it is impossible to decide what will be valuable in the future, when the criteria used may be totally different. Not only should a range of views as to what is valuable be considered and included, but resources of unlikely value should also be included as contextual matter. The range of such material itself presents a backdrop that improves understanding and adds meaning.
Given the range of available Web content, most would agree that some of the information contained on the Web is of value, adds to our informational and cultural store, and reflects the intellectual activity of the age. Preservation activities capable of coping with this range would already have developed if the medium itself had not posed additional problems.
It is often the case that society reacts slowly to the preservation needs of new media. The loss of moving picture stock from the first decades of this century is a clear example of this. The World Wide Web represents a new methodology for the creation and delivery of ideas, and is quite different from the hard-copy print model that permeates many preservation activities. The precedents from print publishing tend to place importance on physical outputs and a process-driven publishing environment, and downplay the importance of easy, spontaneous, and interactive publishing formats such as those available over the Web.
However, it is these characteristics that make the World Wide Web worth preserving. Self-publishing, graphical content, interactivity, and interconnectivity make the World Wide Web a totally new type of communication medium. The ease of publishing should be seen as an agent for democratisation, not as a force for trivialisation. The multi-dimensional aspects of the World Wide Web offer new tools to express ideas that could not be achieved using a static print format, and that cannot be judged by old standards.
It can also be argued that the rate of change of the World Wide Web is so rapid that it will have evolved into something quite different within just a few years. However, this type of rapid change is a good reason to increase preservation activities, because there is definite value in understanding the evolutionary stages that produced the final product. Preserved examples of how the Web is used at any point in time (today, for example) will have value to future generations in understanding the developmental process that led to its successor.
Finally, the World Wide Web can be seen as just one aspect of a broader digital and electronic preservation problem. Comprehensive preservation policies would be ideal and are under development by bodies such as the National Library of Australia [HREF3]. Some of these strategies attempt to provide a unified approach to a range of digital preservation needs, from born-digital objects to scanned or converted resources. Issues covered often include quality assurance programs, choice of long term storage media, software migration, and long term access and retrieval methodologies.
Partial preservation strategies, based on representative or even opportunistic solutions, have at times become lost in the huge range of tasks needed to resolve these overall problems. However, it is my view that any strategy that provides a partial solution and outcome for any aspect of digital preservation is worth pursuing. Put simply, it is better to do something than to do nothing when the medium is as dynamic as the World Wide Web.
Current print preservation strategies have tended to focus on issues caused by the use of acidic paper in the publishing industry and the need to preserve over a hundred years of print production since the adoption of acidic paper. Paul Conway has described this problem evocatively as dealing with the "slow fires" of deterioration [HREF4]. However, the rate of loss of World Wide Web resources is far more rapid, and could be described as a 'raging inferno'. For example, the recent evidence provided by Wendy Smith in LASIE [HREF5] gives a worryingly high estimate of the rate at which information on the World Wide Web is lost. And as Wendy Smith's article shows, it is the change in content as much as the disappearance of the Web pages themselves that makes the need for preservation urgent.
The experience of the State Library of Tasmania in helping non-profit community groups publish on the Web illustrates this. The publishing guidelines for this service recommend that community groups update their Web pages on a regular basis, at least every two months. The view of the State Library, as adviser to these groups, is that providing only static information on the Web misses the opportunities that Web publishing offers.
It is tempting to discount the practicality of Web preservation because Web resources are dynamic entities that can link to a range of external resources that are themselves dynamic and volatile. In addition, a Web resource may reflect a particular but transitory state of software development, and comprise a complex set of user interactions that must be captured if the page is to be fully understood or appreciated.
Undoubtedly these problems make a World Wide Web page more difficult to preserve than a simple physical object such as a letter or a book, but each of these problems can be managed. At its simplest level, it is physically possible to transfer Web pages or Web sites from their home location onto preservation servers. This can be done by file transfer over the Internet from the publisher to the preservation site, by remote capture software operated by the preservation agency, or even by simple floppy disk or magnetic tape transfer. Ancillary software or resources can be captured as well. Certain organisations [HREF6] are even actively and aggressively capturing Web content on a very large scale using trawling software.
It is easy to summarise the problems of dealing with an evolutionary format such as HTML and with variable and inconsistent delivery mechanisms such as desktop browsers. There have been four versions of HTML, with XML likely to emerge in the near future. World Wide Web file formats can vary from compressed ZIP files to PDF, and embedded content can range from GIFs to QuickTime video to Shockwave. Desktop delivery software ranges from Lynx and Mosaic to Netscape, Internet Explorer, Opera, and many more. These problems are compounded by the fact that no single platform is necessarily forward or backward compatible.
There appear to be three basic ways to solve this problem: repeatedly migrate content onto new software platforms, retain the necessary software/hardware platforms, or develop backward-compatible emulators or viewers. Douglas Kranch (1998) has imaginatively suggested that near-future technology could produce 'digital tablets': self-contained devices that carry both the electronic document and the software to use it. Although none of these options is currently in place, this should not lead to the conclusion that Web preservation is not viable or that preservation should wait until the solutions are produced. Web documents, once preserved, become candidates for future migration; Web documents that are not preserved are lost forever.
It may eventuate that a particular preservation activity preserves a resource that cannot be accessed in the future, or that there may be no demand for that particular resource by future generations. One could say, in this situation, that the preservation activity was not effective because it did not produce a result. However, this is not to say that the preservation process did not have value. It will only take a small proportion of digital resources to successfully translate into the future to make the whole process worthwhile.
An author may wish to preserve a Web page for their short term needs, or the publishing organisation may need to preserve a Web page for its short to medium term business requirements. These processes are important, and a number of significant policy statements on digital preservation over the past few years have placed great importance on the role of authors and publishers in digital preservation. However, real results will be produced by third parties whose business goals relate directly to preservation. Authors and publishers, I believe, can at best assist in the process of preservation by a third party; they do not have the outlook or the resources to produce long term preservation outcomes.
Libraries have had a long standing role in the preservation of cultural and intellectual resources related to the printed word. The format of the printed word has changed over time, from books to data, and the library has kept pace. The World Wide Web may present significant practical problems, but it should not present a conceptual problem to the library in so far as preservation is concerned. Importantly, this role means that the library can devote ongoing resources to the process and ensure the continuity of preservation.
The library as an organisation is also well placed to deal with the copyright issues that must be considered when dealing with Web preservation. A World Wide Web page or site is protected by copyright in just the same way as any other print resource. Issues related to copying and re-use are particularly important for Web preservation because re-use in an electronic networked environment can actively affect the effectiveness or status of the original object. Publishers will be concerned, particularly with the growing commercialisation of Web content, to ensure that their competitive advantages or business outcomes are not adversely affected by the preservation of their Web content on third party sites.
These are issues that the library is well situated to resolve because of its experience in copyright matters and its acceptance of the principles behind copyright. The library has a valuable reputation as an objective third party when it comes to copyright. The library also has a non-commercial status that allows the business sector to believe that library-based Web preservation will not lead to negative commercial outcomes. For example, the National Library of Australia, through the PANDORA Project [HREF7] has had a positive experience in dealing with publishers with relatively few problems in gaining their permission to capture and preserve Web resources.
In Australia, libraries have traditionally been assigned various roles in the preservation process, with the strongest responsibility given to State Libraries and the National Library. Most other library types have been assigned very small roles in this preservation environment. Web preservation will challenge this role assignment because of the magnitude and breadth of Web publishing. All major libraries will need to fulfil preservation functions in the future, and act as part of a well-defined and cooperating group of state and national libraries. A major 1996 report on digital preservation commissioned by the Commission on Preservation and Access and the Research Libraries Group [HREF8] recommended a national system of digital archives. In Australia this would require a re-engineering and resource re-allocation process that would not come easily for many libraries, but it is something that must occur if Web preservation activities are to yield meaningful results.
A number of other organisation types have a strong role in informational and cultural preservation, notably archives, museums, and art galleries. But it appears that, as with libraries, very few such organisations are actively dealing with Web preservation.
There are many facets of Web content that suggest preservation responsibilities can be shared across organisation types. For example, electronic resources delivered over intranets would seem to fall naturally within the scope of archival bodies, while artistic and creative resources on the World Wide Web may be of most value to art galleries. Assigning such responsibilities is a task where initial dialogue between the various organisations should occur as soon as practical. But this is not to suggest that more coordination and planning is the only answer. Any organisation of any type that attempts Web preservation is meeting one of the fundamental tenets of this paper: to do something is better than to do nothing.
Libraries and other organisations must first be able to identify what should be preserved before the process can even begin. Given the size of the Web and its lack of geographic and subject boundaries, libraries will have to carefully consider the scope of their preservation strategy and the extent of resources that can be diverted to such a process. Existing selection policies will need to be expanded to include an assessment and collection strategy for Web preservation.
It is important to note, however, that the Web does offer an alternative. It is possible to adopt a comprehensive collecting strategy that uses robots to traverse the Web and collect everything relevant that they find. For example, the National Library of Sweden's Kulturarw3 Project [HREF9] uses this methodology to capture Swedish material. This approach can avoid difficult decisions as to what to include or exclude.
The sheer quantity of material captured by such an approach does, however, lead to major quality assurance issues. Great care must be taken to ensure that all the linked files and images needed by a given Web site have in fact been captured by the robot, as it may not be possible to go back later and retrieve missing files.
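To make the robot approach concrete, the following minimal sketch (written in Python, with invented function names; production harvesters such as Kulturarw3's are far more sophisticated) fetches pages from a seed URL and follows only the links that stay within the same site:

    # Minimal sketch of a harvesting robot: breadth-first capture of the
    # pages reachable from a seed URL on the same host. A real robot would
    # also honour robots.txt, throttle its requests, and fetch images.
    import re, urllib.request, urllib.parse

    def crawl(seed, limit=100):
        host = urllib.parse.urlparse(seed).netloc
        queue, seen, captured = [seed], {seed}, {}
        while queue and len(captured) < limit:
            url = queue.pop(0)
            try:
                html = urllib.request.urlopen(url).read().decode('utf-8', 'replace')
            except OSError:
                continue  # unreachable page: skip it and move on
            captured[url] = html
            for link in re.findall(r'href="([^"]+)"', html, re.IGNORECASE):
                absolute = urllib.parse.urljoin(url, link).split('#')[0]
                if urllib.parse.urlparse(absolute).netloc == host and absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)
        return captured

Even this toy version shows where the quality assurance problem arises: any page that fails to fetch, and any reference the link-extraction step misses, silently drops out of the capture.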
The experience of the State Library of Tasmania suggests that the full and successful capture of a Web site, so that it operates correctly on the preservation server, is often very difficult. To address this problem, robot-based capture software will need to be very sophisticated, flexible enough to handle a wide variety of Web resources, and constantly updated to reflect changes in the Web publishing environment.
This makes the approach suitable only for organisations large enough to develop and maintain such complex software. It also defers the tasks of providing quality retrieval for the captured sites and of altering Web site coding where needed to ensure that each site actually operates correctly on the preservation server. For most libraries and preservation bodies, targeted preservation is the best way to be absolutely certain that the desired Web resource is captured, captured correctly, and that it can be located, retrieved, and used in the future.
The major task for a library is to accept the size of the Web and to develop appropriate policies based on that library's technical capabilities. Interactive Web documents, performance systems, chat rooms, listservs, and database driven resources all present specific problems for Web preservation. These problems can be resolved: the content of a chat room can be recorded, a database can be copied and retained, a performance system can be copied complete with required software, etc.
Sampling and selective preservation strategies are likely to go hand in hand with organisational perspectives that result in the targeting of specific subject or geographic areas, or specific resource types. Libraries, for example, may need to focus initially on Document Like Objects (DLOs) and then move on to appropriate capture methods for other formats when technical resources permit.
Ideally, a national strategy could ensure cohesive coverage across such individual strategies. But until a national and cooperative collection strategy emerges, each organisation should follow its own needs and its own perceptions of its likely client base.
Once the desired Web content has been captured and preserved, the library must tackle the problem of description and access. The ongoing retrieval of preserved material can follow two basic approaches: Web resource descriptions can be incorporated into traditional library retrieval systems (eg library catalogues), or resources can be described using the emerging retrieval systems based on Dublin Core and other metadata schemas.
The range and characteristics of Web resources make metadata-based description and retrieval tools preferable, because of the lower unit cost of description and the simpler match of Dublin Core metadata elements to World Wide Web formats. The average cost of MARC cataloguing a monograph would be prohibitive if applied to Web resources. The growth of cross-searching tools will allow these Dublin Core metadata-based resources to be interrogated and integrated with the range of resources contained in library catalogues and other retrieval systems.
The nature of the physical access provided to the user for a preserved Web resource needs special mention. If preserved successfully, a preserved Web resource will present to the user in exactly the same way as the original did on the World Wide Web. It may therefore be easy for the user to become confused and not realise that they are using a special or restricted part of the Web. The Web site that contains preserved material may look the same, but it will not necessarily act the same. Certain activities, such as email links or links to outside resources, may not operate at all, or may not operate correctly, because they refer to the open Internet.
Part of the answer to this 'boundary' problem will rely on the strategies developed by the preservation agency to ensure that access to preserved Web documents is through specified channels. This will ensure that the user is informed of the special nature of the resource that they are about to access. Software-based warnings or information messages can alert the user to the special nature of a preserved document. Ideally, special software could also intercept normal network-based actions (a hyperlink to a non-preserved resource, an email link, etc.) called from a preserved Web page, stop that action, and inform the user of what has happened.
The methods devised to ensure appropriate access may also help a library to meet whatever copyright restrictions apply to access to that preserved material, or to its delivery across open networks. It may be that certain Web resources cannot be made available over open networks, or must be accessed only in certain physical locations, or with concurrent user limitations. Certain Web resources may also only become accessible after a set period from capture, or after the original resource is no longer available.
Our Digital Island began as a State Library of Tasmania project to identify the issues related to Web preservation and produce a realistic set of strategies to deal with those issues. These outcomes have been achieved and a range of processes are now established in the State Library workplace to ensure that Web preservation is part of normal work-flows. An operational Web site is in place, called 'Our Digital Island' [HREF10], where preserved Tasmanian Web content can be accessed.
The fundamental principle behind the Project was that it was better to preserve a small amount of Tasmanian Web content than to preserve nothing. In addition, it was a requirement that the Web preservation process could be normalised without significant technological development or expense, and that the expertise to operate the service should be available from current staff.
The need to minimise specific or costly technological or software development was based only in part on resource limitations. It was the Project's belief that software development in the Web arena must be able to cope with rapid obsolescence. A simple way to ensure that current software can be easily changed in the future is to ensure there is not a major investment in that current software. It is not easy to discard or replace an expensive software system after only one or two years of use.
A major outcome of the project has been a Web preservation selection policy. This policy acknowledges the range of material that should be collected and provides strategies that cover both long term comprehensive capture as well as short term selective capture. This selection policy [HREF11] is available from the Web site and will continually evolve and develop as experience is gained.
Web sites were initially selected for preservation on a representative basis from an arbitrary number of collecting areas. But as the State Library's experience grows, selection will extend to a much wider range of possible resources, and the policy will evaluate Web sites against wider issues than just the range of sites already captured.
The technological issues encountered during the early phases of the Project focussed on capture software capabilities and storage requirements. Capture software issues were resolved with the identification of Anawave WebSnake as a simple package that made it relatively easy to capture a remote site.
Anawave WebSnake [HREF12] allows a site to be captured in 'mirror' mode, whereby all linked files are transferred and the capture software recreates the directory structure of the remote site on the local hard drive. Alternatively, sites can be captured in 'off-line' mode, whereby only the necessary files are captured in a flat directory structure. The capture software then resolves all the links contained within the documents to reflect that new flat directory structure. Experience has shown us that both options are necessary, and that certain sites can only be successfully captured one way.
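As a rough illustration of what 'off-line' mode's link resolution involves (a simplified sketch in Python, not WebSnake's actual algorithm; the helper names are invented), nested paths can be flattened into single filenames and every internal reference rewritten to match:

    # Sketch of flat-directory capture: map each nested path to a single
    # local filename and rewrite internal href/src references to match.
    import re

    def flatten(path):
        return path.strip('/').replace('/', '_') or 'index.html'

    def rewrite_links(html):
        def repl(match):
            attr, target = match.group(1), match.group(2)
            if target.startswith(('http://', 'mailto:', '#')):
                return match.group(0)  # external links are left untouched
            return '%s="%s"' % (attr, flatten(target))
        return re.sub(r'(href|src)="([^"]+)"', repl, html, flags=re.IGNORECASE)

    # e.g. <img src="images/logo.gif"> becomes <img src="images_logo.gif">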
The falling cost of hard disk storage has resolved storage issues. In 1995 Web archival possibilities were considered by the State Library of Tasmania, but dismissed due to the high cost of data storage. Today the State Library has access to a dedicated hard disk with 18 GB capacity, and this could be increased tenfold at relatively small cost. Disk data is backed up to tape on a regular basis.
Our Digital Island provides access to preserved Web sites through a specific Web URL containing only preserved material. The Web site provides alphabetical and subject lists of preserved sites, and a specific search engine that is only available from the site. This search engine provides access to both the textual content of the Web pages and to metadata fields. Primary Web pages have Dublin Core and ADMIN Core metadata content added by the State Library as part of the preservation process.
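By way of illustration (the values shown are invented), Dublin Core elements are conventionally embedded in the head of an HTML page as META tags, where a search engine that indexes metadata fields can find them:

    <meta name="DC.Title" content="Example Tasmanian Community Site">
    <meta name="DC.Creator" content="Example Community Group">
    <meta name="DC.Date" content="1999-03-01">
    <meta name="DC.Identifier" content="http://www.example.org.au/">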
Other access issues are addressed by clear descriptions on the Web site itself and by the provision of a warning to users, whenever a primary preserved Web page is loaded, that they are using a preserved Web document. Currently this warning, written in JavaScript and added to each major Web page, is quite intrusive, and we hope to find a better solution in the future. In addition, preserved Web pages are included in the Web server's robots.txt file to ensure that they are not accessed or indexed by external search engines and harvesters.
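For example, if the preserved sites were served from a directory such as /oda/ (a hypothetical path), the relevant robots.txt entry would simply be:

    User-agent: *
    Disallow: /oda/

Compliance with robots.txt is voluntary, so this excludes well-behaved search engines and harvesters rather than enforcing anything.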
A significant aspect of the capture process is the review of captured pages to ensure their successful operation on the preservation server. This involves checking for dead links and images and identifying the cause of any problem (e.g. whether a link was already dead on the home site before it was captured).
Original code is not altered except where the preservation process itself is the cause of the problem. For example, a dead link from the original page is not corrected, but a relative address for an image will be changed if necessary to ensure its operation on the preservation server. There are many ways in which link and source referencing in HTML code can be site or server specific, and the modification of such references is essential to make the site actually operate on our server. We do not attempt to make special-purpose external links (e.g. page counters) valid, nor do we correct such links if they are incorrect or outdated.
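The mechanical part of this review lends itself to simple tooling. The sketch below (a hypothetical Python helper, not the State Library's actual process) lists the relative references in a captured page that do not resolve on the preservation server's disk, leaving the judgement about how to handle each one to a person:

    # Sketch of a review aid: report relative href/src targets in a captured
    # page that are missing from the local capture directory.
    import os, re

    def unresolved_references(page_path):
        with open(page_path) as f:
            html = f.read()
        base = os.path.dirname(page_path)
        missing = []
        for target in re.findall(r'(?:href|src)="([^"]+)"', html, re.IGNORECASE):
            if target.startswith(('http://', 'mailto:', '#')):
                continue  # external links are deliberately not checked
            local = os.path.join(base, target.split('#')[0].split('?')[0])
            if not os.path.exists(local):
                missing.append(target)
        return missing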
Copyright issues are not a major problem in Tasmania because of legal deposit legislation that requires a copy of all printed material in Tasmania to be deposited with the State Library. Printed material is defined in State Legislation to include electronic documents of almost any form. This allows us to legally possess a copy of any Tasmanian Web document on the State Library's Web server. However, the State Library is conscious of commercial sensitivities and is developing specific working relationships with appropriate commercial publishers.
Another significant outcome of the Project has been a management structure that identifies change as a key determinant of ongoing success. A review committee has been established to assess ongoing issues, review emerging trends, and recommend changes as required. This committee reports to Library Management on a six-monthly basis.
The key statistics of Our Digital Island operation to date are as follows:
Total disk space required: 220MB
Average disk space anticipated per site: 3MB
Average time to capture a web site: 10min
Average time to review and correct problems: 35min
Average time to add metadata: 10min
Average time to add a site to Our Digital Island web site: 10min
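Taken together, these figures imply a collection of roughly 70 sites so far (220MB at an average of 3MB per site), and a total staff investment of about 65 minutes per preserved site from capture through to publication.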
The Our Digital Island project has proved that a low-cost program to preserve Web resources is possible without significant software development, technical expertise, or resource allocation. Such a program could be implemented by a range of organisations and would represent a practical beginning to the process of preserving Web resources.
The National Library of Australia has already led the way in Australian Web preservation with its PANDORA project. But there is still scope for increased activity and interaction at a national level to improve the extent and rate at which Web material is preserved. Work remains to be done on these issues.
Anawave (1998). WebSnake [HREF12]
Conway, Paul (1996). Preservation in the Digital World [HREF4]
Commission on Preservation and Access and the Research Libraries Group (1996). Preserving Digital Information: Final Report and Recommendations [HREF8]
Kahle, Brewster (1997). Preserving the Internet. Scientific American [HREF6]
Kranch, Douglas A. (1998). Beyond Migration: Preserving Electronic Documents with Digital Tablets. Information Technology and Libraries, 17(3), 138-146.
National Library of Australia, (1998a). National Strategy for Provision of Access to Australian Electronic Publications: A National Library of Australia Position Paper [HREF3]
National Library of Australia, (1998b). PANDORA Project: Preserving and Accessing Networked DOcumentary Resources of Australia [HREF7]
The Royal Library, National Library of Sweden, (1998?). The Kulturarw3 Heritage Project [HREF9]
Phillips, Margaret E. (1998). Tomorrow's Incunabula: Preservation of Internet Publications. LASIE: Library Automation Systems Information Exchange, 29(3), 5-10.
Smith, Wendy (1998). Lost in Cyberspace: Preservation Challenges of Australian Internet Resources. LASIE: Library Automation Systems Information Exchange [HREF5], 29(2), 6-25.
State Library of Tasmania (1998). Our Digital Island [HREF10]
Wainwright, Eric (1995). "Culture and Cultural Memory: Challenges of an Electronic Era" [HREF2], paper presented at the 2nd National Preservation Office Conference: Multimedia Preservation - Capturing the Rainbow, Brisbane, 28-30 November 1995.
Lloyd Sokvitne © 1999. The author assigns to Southern Cross University and other educational and non-profit institutions a non-exclusive licence to use this document for personal use and in courses of instruction provided that the article is used in full and this copyright statement is reproduced. The author also grants a non-exclusive licence to Southern Cross University to publish this document in full on the World Wide Web and on CD-ROM and in printed form with the conference papers and for the document to be published on mirrors on the World Wide Web.