The Behind the Scenes Mechanics of the Web Archiving Strategy (WAS) Project at the University of Melbourne

Jon-Paul Williams [HREF1], Electronic Recordkeeping Systems Officer, Records Services [HREF2], PO Box 4385, Melbourne University [HREF3], Parkville, Victoria, 3052. jpmw@unimelb.edu.au

Catherine Nicholls [HREF1], Project Manager - Web Archiving Strategy Project, Records Services [HREF2], PO Box 4385, Melbourne University [HREF3], Parkville, Victoria, 3052. cvn@unimelb.edu.au

Abstract

The purpose of this paper is to explore the core components of the University of Melbourne's Web Archiving Strategy Project within the context of a racing car analogy. The racing track represents the organisational context of the project, the cars represent projects within the University (the prime focus of this paper is the project called the 'Web Archiving Strategy Project'), the drivers and team sponsors represent the people involved in the project, the tools and pit crew represent the policies and technical solutions for the project, while the 'next race details' represent where we plan to take the project in the future.

Introduction

The track is long and winding. A series of grandstands sit about half way along the track and they are placed in a prime position to witness both the start and finish of the race. Pit lane is a hive of activity. Crews have worked around the clock to get the elite machines ready. The engines are large and powerful. The cars are brightly coloured and covered in sponsorship logos. Anticipation fills the air. The drivers rev the cars tentatively at first, yet once they begin to spring to life, the revving becomes more insistent. The roar of the engines ignites the crowd and they are away…

The purpose of our poster is to explore the core components of the University of Melbourne’s Web Archiving Strategy within the context of a racing car analogy. The racing track represents the organisational context of the project, the cars represent projects within the University (the prime focus of this poster is the project called the Web Archiving Strategy Project), the drivers and team sponsors represent the people involved in the project, the tools and pit crew represent the policies and technical solutions for the project, while the ‘next race details’ represent where we plan to take the project in the future.

The Track

Established under Statute, the University of Melbourne is a public institution and is accountable to Parliament for its actions, its performance and aspects of its behaviour. Accountability is being able to provide an explanation or justification for events or actions, and for individuals’ actions in relation to those events or actions. [HREF4]

Within the context of the University, inadequate records and recordkeeping can contribute to accountability failures through:

Yet, what does accountability mean in relation to the University web site? How are official University records identified on the web and what are the consequences of not clearly identifying or maintaining them appropriately for evidentiary purposes?

The University defines an official University (static) web page as:

Officially endorsed statements or information about the University that have been published on the University web site in a manner that is compliant with the structural and technical requirements of University web page publishing standards. [REF1]

While an official University (dynamic) web page is defined as:

Officially endorsed information generated by a backend recordkeeping system and presented via a University web site “on the fly” either on the server end or in the client browser.[REF2]

The aim of these definitions is to draw a line in the sand as to what constitutes an official University web page.

Within the context of the University, there are also accountability issues related to the creation and maintenance of official University web pages, the authorisation of these pages and the manner in which content is kept accurate and up to date. Although none of this is being enforced at the moment, there is a general acceptance that in order to preserve the University’s academic and business reputation, it is important that the University’s web content is presented in officially endorsed templates and that the content is kept accurate and current. Therefore, within the context of the University’s web environment, the word ‘accountability’ means different things to different people and this has an impact upon the effectiveness and purpose of the Web Archiving Strategy.

The best way to explore this complexity is to draw on an example. The University Secretary’s Department, among other things, is responsible for administering the minutes of the University’s chief governing body, the University Council. As far as the University Secretary himself (Mr Len Currie) is concerned, the master record of the University Council minutes exists in the signed paper version. The Council minutes that are published on the web are simply copies of the official record, and do not hold the same ‘weight’ as the master paper version. They are information copies provided to fulfil the obligation to make minutes of Council publicly available. If there were a discrepancy between what was published on the web and what existed in the master paper version of the minutes, the University Secretary would consider this to be a web content accuracy issue rather than a records management or corporate accountability issue.

However, with regard to the publication of University Statutes and Regulations on the University web site, the University Secretary takes a different position. Although a signed paper copy exists of each University Statute and Regulation, the only way that people can access the official version of the statutes and regulations is via the University web site. University Statute 1.4.1 states that the University Secretary must publish all statutes and regulations in a form approved by the Council. At present the Council determines that all statutes and regulations are to be published on the University web site.

Therefore, as far as the University Secretary is concerned, the ‘master’ version of all statutes and regulations exists both in paper and on the web site, and both versions carry equal weight. The University Secretary believes that the publication of statutes and regulations on the University web site raises wide-ranging accountability issues including, for example, the University’s obligations under the Commonwealth Trade Practices Act [REF3], which prohibits the University from publishing and disseminating false and misleading information in any recognised format, including the web. Within the University environment, this issue is of particular concern to academic faculties and departments, and to any area of the University web site that provides information that students may rely upon.

This accountability manifests itself through a range of requirements including:

The Car

In 2002, the University of Melbourne established a Web Archiving Working Group. The group had no funding or extra resources and was primarily put together to flesh out some of the issues surrounding web archiving and to develop a business case for the development of the Web Archiving Strategy Project. The Web Archiving Working Group put together two business cases and received funding for the project in 2003. The funding provided for the part time involvement of a Project Manager and recordkeeping expert, the Web Archiving Technician’s salary (part time), and some other equipment and office resources. The project is supported by a Working Group [REF4] and reports to a Steering Group. [REF5]

The Web Archiving Strategy Project has three key phases. The first was a research phase, which involved investigating some of the core issues of the topic and developing a policy draft. This phase is now complete. The second phase (which will begin in July 2004) involves a pilot project which aims to investigate some of the technical and theoretical concepts identified through the earlier research phase. The third phase, which is anticipated to begin next year, involves developing the procedural documentation and implementing the actual strategy in a planned roll-out across the University.

The main goal of the WAS Project for 2003 and early 2004 was to write a policy document that would broadly identify:

The policy draft is currently undergoing an approval process.

The Drivers and Team Sponsors

The key stakeholders in the WAS Project to date have included University Archives, University Web Centre, University Library and the University Records Services unit. Administratively, the University Archives, Web Centre and Library are all located within the Information Division of the University, whereas Records Services sits in the University Secretary’s area, which is part of the Central Administration.

Catherine Nicholls is the Project Manager and recordkeeping expert for the project. Jon-Paul Williams is the Web Archiving Technician. The Steering Group membership includes Suzanne Clark, Martine Booth, Michael Piggott, Donna Mc Rostie and Eve Young. Winston the Web Archiving Duck is the Project’s mascot.

Tools and Pit Crew

One way of potentially dealing with some of the short term version control issues with web pages comes via the University’s proposed Content Management System. Within the University of Melbourne, the Content Management System (CMS) is a term currently used to describe a set of processes and tools used to support the way information is created, managed and published on the University web site.

One of the features of the University’s CMS will be the ability to manage versions of web pages over time. It is anticipated that this will allow non-current versions of the web pages and their metadata to be retrieved, which is a slight improvement on the current situation, where current web pages are often just overwritten with a new version. Often there is no way of retrieving the older version of the web page once it has been updated. Part of the pilot phase of the WAS Project will be to assess how the CMS anticipates managing web pages over time and whether it can manage them as records. We are anticipating that the CMS will only provide short term storage for the web records (possibly up to 2 years at this stage).

The CMS may also, through the application of clever metadata, assist in the long term recordkeeping requirements of the University by automatically identifying web records via metadata, but this area requires more research and development. [REF6] One possibility we intend to explore involves the concept of applying a classification/keyword descriptor to the web page at its creation point. Ideally, if a disposal sentence were attached to a classification/keyword term that was included in the metadata at the creation stage, this would help to identify web records and maybe even open up the possibility of automating some of the web archiving processes. For example, if web pages were assigned a trigger point at the creation stage (via a classification/keyword in the metadata tags) then pages suitable for short or long-term archiving could be searched for and selected automatically.
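
As an illustration only, the idea of attaching a disposal sentence to a classification/keyword term and then selecting pages automatically could be sketched as follows. This is a minimal sketch and not part of the WAS policy draft or the proposed CMS: the keyword terms, the disposal sentences, the `classification` metadata field and all function names are hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical mapping from classification/keyword terms to disposal
# sentences. The terms and retention values are invented for illustration.
DISPOSAL_SENTENCES = {
    "governance-statutes": "permanent",
    "student-course-information": "long term",
    "events-notices": "no retention",
}


@dataclass
class WebPage:
    url: str
    # Metadata assigned at the page's creation point (assumed field names).
    metadata: dict = field(default_factory=dict)


def disposal_sentence(page):
    """Look up the disposal sentence for a page's classification keyword."""
    keyword = page.metadata.get("classification")
    return DISPOSAL_SENTENCES.get(keyword)


def select_for_archiving(pages):
    """Return (url, sentence) pairs for pages whose keyword triggers archiving."""
    selected = []
    for page in pages:
        sentence = disposal_sentence(page)
        # Pages with no keyword, an unknown keyword or "no retention" are skipped.
        if sentence is not None and sentence != "no retention":
            selected.append((page.url, sentence))
    return selected


pages = [
    WebPage("/statutes/s1.html", {"classification": "governance-statutes"}),
    WebPage("/events/open-day.html", {"classification": "events-notices"}),
    WebPage("/misc/untagged.html"),
]
print(select_for_archiving(pages))
```

In this sketch a page with no classification keyword is never selected, which reflects the dependence of any such automation on metadata being applied at the creation stage.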

It is recognised that part of the University’s problem with short term accountability issues for web pages may be dealt with by the CMS. It should be emphasised, however, that the CMS is not a total records management or recordkeeping solution, but it may be a practical way to alleviate some of the current web page version control issues.

The CMS has not been the only technical option investigated for web archiving. Finding technical solutions for the Web Archiving Strategy has been a core objective of this project, because the volume of pages is too large to archive manually over the long term.

Some of the other technical solutions being investigated include PageVault software and PANDORA.

PageVault is a software solution designed by Project Computing (an Australian IT company) to ‘capture and archive’ all ‘unique or novel’ responses delivered from a web server, according to configuration settings specified by a user. The software can be easily configured to ignore responses, based on either content-type or flexible URL pattern matching.
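
The kind of rule described above, ignoring responses by content-type or by URL pattern, can be pictured with a short sketch. This is not PageVault's actual configuration format or API, which is not described here; the rule values and the function name `should_archive` are assumptions made purely for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical ignore rules, invented for illustration.
IGNORED_CONTENT_TYPES = {"image/gif", "text/css"}
IGNORED_URL_PATTERNS = ["/images/*", "*.css", "/search/*"]


def should_archive(url, content_type):
    """Return False if the response matches any ignore rule, else True."""
    # Drop parameters such as "; charset=utf-8" before comparing types.
    base_type = content_type.split(";")[0].strip().lower()
    if base_type in IGNORED_CONTENT_TYPES:
        return False
    # Shell-style wildcards stand in for "flexible URL pattern matching".
    return not any(fnmatchcase(url, pattern) for pattern in IGNORED_URL_PATTERNS)


print(should_archive("/payments/receipt", "text/html; charset=utf-8"))
print(should_archive("/images/logo.png", "image/png"))
```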

All content is treated as a bytestream by PageVault, which can archive any content generated by a web server: html, text, images, Microsoft Office documents, PDF, sound etc. PageVault archives all unique responses, whether static html or dynamic html generated from Perl, cgi-bin, servlets, scripts, server-side includes, databases, etc. Archived material is compressed in transmission to the backend repository, where it is stored in a compressed format. At the University of Melbourne, we envisage using PageVault to capture dynamic web based transactions, in particular the part of the transaction that is generated ‘on the fly’ and shows up on the web page as, for example, a receipt or a confirmation of payment.
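
The general technique of keeping only unique responses and storing them compressed can be sketched in a few lines. This is a minimal sketch under stated assumptions, not PageVault's implementation: the class `ResponseArchive`, the use of SHA-256 to detect repeats and the use of zlib for compression are all choices made here for illustration.

```python
import hashlib
import zlib


class ResponseArchive:
    """Toy in-memory archive that stores each unique response body once."""

    def __init__(self):
        # Maps (url, sha256 digest of body) to the zlib-compressed body.
        self._store = {}

    def capture(self, url, body):
        """Archive the body if it is new for this URL; return True if stored."""
        key = (url, hashlib.sha256(body).hexdigest())
        if key in self._store:
            return False  # identical response already archived
        self._store[key] = zlib.compress(body)
        return True

    def versions(self, url):
        """Return the decompressed bodies archived for a URL, in capture order."""
        return [zlib.decompress(blob)
                for (stored_url, _), blob in self._store.items()
                if stored_url == url]

    def __len__(self):
        return len(self._store)


archive = ResponseArchive()
archive.capture("/receipt", b"Payment confirmed: #1")
archive.capture("/receipt", b"Payment confirmed: #1")  # repeat, not stored again
archive.capture("/receipt", b"Payment confirmed: #2")
print(len(archive))
```

Because the body is handled as raw bytes, the same approach applies whether the response is html, an image or a PDF.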

PANDAS software (PANDORA Digital Archiving System) was developed as a web-based management system to facilitate the processes involved in the archiving and preservation of online Australian publications. PANDAS is designed and built by the National Library of Australia (NLA) and was first implemented in June 2001. A major redevelopment was undertaken following the initial implementation and the redesigned second version was implemented in August 2002.

The risk management criteria, our work with the Content Management System and the exploration of other software solutions are also ways of bringing these web page accountability issues to the attention of the University community. We feel that the only way to get people to take notice of our racing car is, first, to help them see (via the risk assessment) that they may be dealing with web pages that need to be retained for accountability purposes and, secondly, to offer them some sort of short term solution to help manage those pages.

The Next Race

We have come to the realisation that racing the Web Archiving Strategy Project around the University track is an ongoing challenge. Our mission is to advise the University community about the consequences of ignoring recordkeeping issues within the web environment. We have to distinguish between the different levels of accountability and what they mean for web pages, and either identify web records ourselves or provide clear and easy guidelines to help the University community identify them. We are also heavily committed to finding partnerships and working with other stakeholders in order to promote the concept of good records management on the web. As light falls on the racing track, and the last fan has gone home, we hope that the message of web archiving and the WAS Project car’s performance continues to leave its mark! [REF7]

References

[REF1] University of Melbourne - Web Archiving Strategy (WAS) Policy Draft 8.2, April 2004.

[REF2] University of Melbourne - Web Archiving Strategy (WAS) Policy Draft 8.2, April 2004.

[REF3] Part V of the Trade Practices Act 1974 (Cth) contains a number of provisions that are intended to protect consumers. In the University's case, the consumer may be anyone from a student to a competitor, all of whom have standing to complain to the ACCC.

Section 52 of the Act prohibits conduct in trade or commerce that is misleading or deceptive or is likely to mislead or deceive.

Conduct includes words, pictures, actions or, where it is misleading, silence. The conduct could have been exhibited to one party, for example in a letter or a negotiation, or generally, for example in a handbook or advertisement. The use of a disclaimer may be helpful in trying to establish there was no intention to deceive, but does not provide a defense. Where brochures and publications include information that may be relied on, e.g. financial or course information, there should be a use by date. (Source: University of Melbourne: Compliance Advisor, Legal & Compliance).

[REF4] Working Group membership includes 2 faculty representatives, 2 representatives from the Information Division, a general administrative representative as well as Catherine Nicholls and Jon-Paul Williams.

[REF5] The Steering Group includes the University Web Centre Manager, the University Archivist, the Manager Records Services and a senior executive from the Information Division.

[REF6] Prior to the establishment of the Web Archiving Strategy Project, the Metadata Working Group (MWG) of the University’s Information Strategy Advisory Committee (ISAC) proposed a web resource metadata standard consisting of 14 Dublin Core (DC) fields plus some University of Melbourne administrative fields. Further investigation needs to be carried out to align this set with AS 5044:2002, AGLS Metadata Element Set, which is an Australian standard for cross-domain resource description. See: http://www.naa.gov.au/recordkeeping/gov_online/agls/metadata_element_set.html

A significant addition in the AGLS metadata set is the provision of a Function element to record ‘the business function of the organisation to which the resource relates’.

To further our recordkeeping goals and initiatives we will also be investigating the application of the Australian Recordkeeping Metadata Standard, which is currently due for release later this year and is founded on the products and subsequent implementations of the Monash University-led SPIRT Recordkeeping Metadata Research Project. See: http://www.sims.monash.edu.au/research/rcrg/research/spirt/index.html

The Australian RKMS is being developed within the framework of ISO/PDTS 23081 Information and documentation – Records Management Processes – Metadata for Records, 2003-12-30.

[REF7] This paper is based on a paper written by Catherine Nicholls from the University of Melbourne, Australia, which is to be presented at the Association of Canadian Archivists (ACA) 29th Annual Conference, in Montreal, May 2004. The paper is titled 'Creating Road Signs and Encouraging Safe Driving on the Information Superhighway: Accountability and Compliance in the Web Archiving Environment'.

Hypertext References

HREF1 - Contacts List, Records Services, University of Melbourne.
http://www.unimelb.edu.au/records/contact/index.html
HREF2 - Home Page, Records Services, University of Melbourne.
http://www.unimelb.edu.au/records/index.html
HREF3 - Home Page, University of Melbourne.
http://www.unimelb.edu.au/
HREF4 - University of Melbourne - 'Chapter 1 - Overview of Records Management and Recordkeeping' in Records Management Policy and Procedures Manual.
http://www.unimelb.edu.au/records/manual/chapter1.html

Copyright

Jon-Paul Williams and Catherine Nicholls, © 2004. The authors assign to Southern Cross University and other educational and non-profit institutions a non-exclusive licence to use this document for personal use and in courses of instruction provided that the article is used in full and this copyright statement is reproduced. The authors also grant a non-exclusive licence to Southern Cross University to publish this document in full on the World Wide Web and on CD-ROM and in printed form with the conference papers and for the document to be published on mirrors on the World Wide Web.