Building an Institutional Research Repository from the Ground Up: The ARROW Architecture Experience
Dr Andrew Treloar [HREF29],
Project Manager, Strategic Information Initiatives, Information Technology Services [HREF30] &
ARROW [HREF32] Technical Architect &
Adjunct Librarian, Monash University Library [HREF33].
Building 3A, Monash University [HREF31], Victoria, 3800.
Email: Andrew.Treloar@its.monash.edu.au
Abstract
This paper describes the thinking behind the technical architecture for
ARROW - Australian Research Repositories Online to the World (a
DEST-funded project under the Research Information Infrastructure
Framework for Australian Higher Education). The paper begins by
describing the vacant lot - the context in which the project came
about. It then moves on to the design brief for the architect - the
list of requirements. Next comes the resulting architectural drawings -
the broad model and list of functions. In order to turn a blueprint
into reality, one needs building materials - in this case the pieces of
software required. Finally the paper discusses the state of the
building site, and when the 'house' might be open to its first
visitors.
1. Vacant Lot
This is a story about building an institutional research repository.
More specifically, it is about designing the architecture to make this
building possible (later papers will talk about the 'building' itself).
But before the architecture can occur, the right
environment needs to exist. What was the vacant lot that made it
possible for this building to even be thought about? In this case the
vacant lot has two components: a general one and a more specifically
Australian one.
1.1 Overall context
There is a growing interest among academic institutions in
collecting, preserving, reusing and creating value-added services from
digital content produced in and for research, teaching and learning.
The emphasis on research outputs and collaboration, and distance,
flexible and online learning, together with developments in information
technology, has led to an increased awareness that the digital content
being created by members of the academic community is an institutional
asset. This content is also increasingly being
recognised as an institutional challenge, requiring both tactical
management and a strategic response.
At the same time many academic libraries are responding to the
challenges of new technologies by taking the opportunity to redefine
their fundamental role in the creation, distribution and provision of
access to information. Over the past decade libraries have moved almost
completely towards a digital platform for management of the information
(both print and electronic) that they acquire or subscribe to. They
have built significant digital collections of material published by
others, and they are increasingly producing new content themselves [Harboe-Ree et al. (2004)]. Often this content
originates from, or is the intellectual property of, their own institutions.
Meanwhile, all around the world, universities, their libraries,
faculties, research centres and information technology and course
development units, are trying to cope with the digital revolution.
There is a growing recognition and articulation of the convergence that
is occurring among the various digital initiatives in which
universities are engaged, and the opportunities for potential synergies
and more significant outcomes through collaboration and
interoperability.
As one example, the work on COLIS (the Collaborative Online Learning and
Information Services model [HREF3]) at Macquarie University has
focused on testing the feasibility of interoperable standards as a way
of managing interactions between a range of electronic services.
Through the success of the COLIS model, McLean and others have
demonstrated that the new electronic environment can and must comprise
a complex interactive matrix that is dependent on the information
resources mentioned above, as well as on user directories, content and
rights management software, and metadata repositories.
Sally A Rogers, from Ohio State University, argues that the full
array of a university's digital assets and information services should
be broadly defined, and should include the library's catalogue, the
electronic journals, reference databases and other electronic resources
available through the library, as well as institutional repositories
and resources created or collated elsewhere in the university, such as
course material [Rogers (2003)]. She notes the overlapping of such
initiatives as digital collections, course web sites, electronic course
packs and learning objects, the desirability of integration to search
across these repositories and the development of standards to promote
interoperability. Rogers also highlights the potential of increased
interoperability and connectivity to generate innovation in research,
teaching and learning.
1.2 Australian context
It was against this backdrop that the November 2002 report of the
Higher Education Information Infrastructure Advisory Committee (HEIIAC)
of the Australian Government Department of Education, Science and
Training (DEST) [DEST (2002)] identified the following critical
features of an enhanced research infrastructure:
- information infrastructure resources should optimise the efforts
of researchers in the higher education sector to create, manage,
discover, access and disseminate knowledge;
- access to the research information infrastructure should not be
constrained by institutional affiliations, geographic locations or
disciplines of individual researchers;
- collaboration among libraries has improved the effectiveness of
individual institutions, and further collaboration, clear strategies
and a shared vision would significantly improve the coordination of the
national research infrastructure;
- opportunities should be sought for the academic community to
regain control of scholarly publishing; and
- computing and communication technologies provide new
opportunities for the creation, management, storage and dissemination
of information.
The HEIIAC report was primarily concerned with managing the current
problems associated with scholarly communication and publishing, and it
stressed the need to adopt a national collaborative approach. As
already discussed, a range of players embrace scholarly
communication strategies and argue that they should be
incorporated into a more holistic approach to the management of
institutional digital content and intellectual capital.
The merging of these two approaches would yield substantial
benefits to Australian university communities, consistent with the
following statements of principle:
- Australian universities have a commitment to support and promote
their institutions' research activity through the creation and
preservation of digital content, especially institutional repositories
and electronic publishing.
- Australian universities have a commitment to help their
institutions achieve their goals more effectively by assisting with the
integration of digital resources.
- Australian universities have a commitment to collaborating
nationally and internationally in the achievement of a more integrated
approach to the management and interoperability of digital content. [Harboe-Ree and Treloar 2004]
These statements reflect the HEIIAC objectives and place them into
a framework that, if implemented, would improve institutional and
national efficiency and effectiveness. The challenge for HEIIAC was to
turn these principles and objectives into action.
1.3 DEST RII Process
In June of 2003, the Australian Commonwealth Department of
Education, Science and Training issued a call for proposals to "further
the discovery, creation, management and dissemination of Australian
research information in a digital environment" [DEST (2003a)].
This sought to "fund proposals which help promote Australian research
output and help to build the Australian research information infrastructure,
through the development of distributed digital repositories and common
technical services that manage access and authorisation to these."
The guidelines for submissions identified the following
requirements to be met by successful bids:
- The application must provide clear evidence of the overall need
for the project proposed in terms of the strategic and long-term
benefits for the higher education sector in Australia as a whole and
identify the specific outcomes that will be derived.
- The application should indicate relevance to sector-wide needs
and priorities and demonstrate that the proposal is an innovative
approach.
- The application must clearly demonstrate that the proposal is a
cost effective response to an identified problem and will generate
savings or productivity gains through its application.
- The application should detail the nature and degree of
cooperation between collaborating institutions.
- Where relevant, the application should bear in mind future
requirements and outline strategies to sustain the project beyond the
period of Commonwealth funding.
- Institutions should be mindful that in any infrastructure
developed under this project the enabling architecture should be both
effective and reasonably future proof.
In response to this call, 14 projects were submitted, of which four
were funded [DEST (2003b)]. The successful
projects were:
- The Australian Research Repositories Online to the World (ARROW)
[HREF25]
- Australian Digital Theses Program Expansion and Redevelopment
(ADT) [HREF13]
- Towards an Australian Partnership for Sustainable Repositories
(APSR) [HREF27]
- Meta Access Management System (MAMS) [HREF28]
These four projects were funded for a combined total of A$12
million over a period of 3 years, with funding commencing at the start
of 2004 [HREF11].
The focus of this paper will be the architectural design of the ARROW Project.
2. Design Brief
The original design brief was encapsulated in the Summary section of
the ARROW Bid document sent to DEST. This read:
"The ARROW project (ARROW) will identify and test a software
solution or solutions to support best-practice institutional digital
repositories comprising e-prints, digital theses and electronic
publishing. A wide range of digital content types will be managed in
these repositories. The NLA will develop a repository and associated
metadata to support independent scholars (those not associated with
institutions). A complementary activity of ARROW is the development and
testing of national resource discovery services (developed by the NLA)
using metadata harvested from the institutional repositories, and the
exposing of metadata to provide services via protocols and toolkits.
This will include a potential path for the redevelopment of the
Australian Digital Theses (ADT) metadata repository incorporated into
the NLA’s national resource discovery
services.
Initially ARROW will be tested in the four partner institutions,
prior to it being offered more widely across the higher-education
sector. The solution will be open-standards based, or will support open
standards, and will facilitate
interoperability within and between participating institutions."
This is a very high-level statement. What does it mean when fleshed
out a bit? The best way to get an accurate sense of this is to focus on
the content streams that ARROW will have to manage and the content
types it will have to deal with.
2.1 Content Streams
The functions that ARROW will perform can best be characterised in
terms of different content streams.
2.1.1 E-print repositories
An e-print repository stores and makes available (in digital form)
working papers, pre-prints (not yet published in the traditional
literature) and post-prints. E-print repositories have been
proliferating in recent years. Most have been set up by universities,
but many have also been established by scholarly and professional
societies and higher education research centres. Australian
universities running e-prints repositories include The Australian
National University, Monash University, The University of Melbourne,
The University of Queensland, and Queensland University of Technology.
The increased activity around e-prints has been facilitated by the
development of free, open-source software [HREF12] that
manages e-print repositories.
A key feature of these repositories is that content is usually
available on an open-access basis (anyone can read or view it and no
fees are payable). Many e-print repositories also work on a
self-submission basis, with researchers depositing material into the
repository themselves using an online deposit process. The rationale
behind the growing e-prints movement is to reclaim institutional
scholarly output and make it widely accessible internationally, thus
removing barriers to learning and research, and improving its
availability and citation.
2.1.2 Digital thesis repositories
A digital thesis repository stores and makes available online, in
digital form, graduate research output (M.A. by research and Ph.D.).
Digital theses in these repositories are offered on an open access
basis. In Australia the Australian Digital Theses Program
[HREF13] is a national collaborative distributed
database of digitised theses produced at Australian universities.
Twenty-two higher education institutions are participating members of
the Program, which uses deposit-process software [HREF14]
first developed at Virginia Polytechnic
Institute in the United States of America.
2.1.3 Electronic publishing
A growing number of higher education institutions are trying to
establish sustainable publishing alternatives to reclaim the scholarly
output currently published in heavily protected commercial journals and
monographs. Institutional e-presses aim to offer electronic publishing
services and functionality similar to those offered by commercial presses
that publish online, but in a way that is more aligned with institutional
objectives, thereby tackling problems associated with the current
scholarly publishing climate. These problems include pricing and
intellectual property issues, as well as long lead times for
publication and publishing models that do not allow for publication of
media-rich titles.
The activities of an e-press can range from digitising material
originally designed for print and making it available online, through
to the publication (in the sense of making public) of material that was
born digital and that can only be fully represented digitally.
E-presses are more akin to traditional publishing than e-print
repositories in that e-press content tends to be offered on a
subscription and/or pay-per-view basis.
As with e-print repositories, the Australian higher education
sector is experiencing significant activity in this area. Both Monash
University and The Australian National University are establishing
e-presses, and Royal Melbourne Institute of Technology Publishing
[HREF15] has been engaged in electronic
publishing for several years now.
2.1.4 DEST Returns
Each year, Australian universities need to send to DEST information
about their research output for the previous year. In most
universities, this process involves manual data collection using paper forms which are
then keyed into a database or spreadsheet. This is tedious and
susceptible to error. In addition, the end result is a largely static
document with no way to link from the publication information to the
publications themselves.
ARROW wanted to see if it was possible to partially automate the
gathering of publications for the annual Department of Education,
Science and Training returns and storage of both the publications and
required metadata in the institutional ARROW repository. This would
meet the following objectives:
- systematic accumulation of a critical mass of content
- simplification of Department of Education, Science and Training
return creation by universities
- facilitation of the way in which the Department of Education,
Science and Training verifies compliance
ARROW also wanted to see if it would be possible to
enable universities to enter into an ongoing dialogue with their
researchers about the issues associated with academics signing over
copyright in research output.
2.1.5 Independent Scholars
Of course, not all research takes place in a university. Much also
occurs in research institutes of one sort or another, in R&D
centres in corporations or even in informal locations (I call this the
Researcher in the Backyard Shed). Researchers at institutions
without institutional repositories would find it difficult to make
their research visible. As ARROW was seeking to capture and make
visible as much Australian research as possible, it would be useful to
find a way to deal with this potential content stream.
2.2 Content types
2.2.1 Content Type Philosophy
Another part of the design brief process was deciding what content types (as opposed to streams) would
be accepted. The project decided to adopt a variant of the model
developed by MIT in its DSpace [HREF16] implementation.
The DSpace philosophy can be summarised as follows:
- Lots of digital material is already lost
- Most digital material is at risk
- Preserving bits is better than nothing
- It is important to capture as much information as possible
- It will be necessary to evaluate cost/benefit trade-offs
over time
We also decided to be informed by the National Archives of Australia
guidelines on digital formats [HREF26].
Based on this, ARROW decided to accept three types of content:
- Supported
  - The format is recognized, and the hosting institution is
    confident it can make bitstreams of this format usable in the future,
    using whatever combination of techniques (such as migration, emulation,
    etc.) is appropriate given the context of need.
- Known
  - The format is recognized, and the hosting institution will
    promise to preserve the bitstream as-is, and allow it to be retrieved.
    The hosting institution will attempt to obtain enough information to
    enable the format to be upgraded to the 'supported' level.
- Unsupported
  - The format is unrecognised, but the hosting institution will
    undertake to preserve the bitstream as-is and allow it to be retrieved.
On the vexed subject of lossy vs lossless formats, the decision was
made that wherever possible, ARROW would endeavour to store data
objects in lossless digital formats (these are formats that do not
throw away information when compressing the file). Lossy formats
(which do throw away information during compression) might be stored in
addition, or rendered on the fly (where possible). Storage in lossy
formats would be used only as a last resort.
2.2.2 Supported Formats
For Textual content, the supported formats are:
- XML
  - Files with an accompanying DTD or schema preferred. If not,
    then well-formed XML is acceptable.
- Rich Text Format (RTF)
- Adobe PDF
  - NOTE: This content will be migrated to PDF-A once this is
    standardised
- HTML
  - Validating as XHTML. Content that does not validate will need
    to be converted.
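The fallback rule for XML above (well-formed is acceptable when no DTD or schema is supplied) is easy to test mechanically. What follows is a minimal sketch, assuming a Python-based ingest step; the function name is hypothetical and nothing here is taken from the ARROW software itself. Note that it checks well-formedness only: validating against a DTD or schema would need a validating parser, which the Python standard library does not provide.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as well-formed XML (no DTD/schema validation)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# The second document closes <thesis> while <title> is still open.
print(is_well_formed("<thesis><title>Example</title></thesis>"))  # True
print(is_well_formed("<thesis><title>Example</thesis>"))          # False
```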
For Still Images, the supported formats are:
- TIFF (Tagged Image File Format) [HREF34]
- JPEG
  - Store with no compression, migrate to JPEG-2000 over time
- PNG (Portable Network Graphics) [HREF35]
- EPS
- SVG (Scalable Vector Graphics) [HREF36]
For Moving Images, the supported formats are:
For Audio, the supported formats are:
For Multimedia content, the supported formats are:
- SMIL (Synchronized Multimedia Integration Language) [HREF37]
2.2.3 Known Formats
For Textual content the following formats are known:
- Word/Excel/PowerPoint
  - all versions, all operating systems
NOTE: The reason for including Microsoft Office file formats is
simply a recognition of the market reality. If alternatives (such as
StarOffice [HREF39] or OpenOffice [HREF40]) become
more widely deployed in the target
environments for ARROW, this list may well be augmented.
For Still Images the following formats are known:
- GIF
- MrSID (Multi-Resolution Seamless Image Database) [HREF38]
For Moving Images the following formats are known:
- Windows Media
- AVI
- Quicktime video encodings other than MPEG-4
For Audio the following formats are known:
For Multimedia content the following formats are known:
2.2.4 Unsupported Formats
All other formats would be unsupported.
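The three-level scheme can be summarised as a simple lookup. The sketch below is illustrative only: the mapping from file extensions to formats is my assumption (the lists above name formats, not extensions), and a real repository would identify formats by inspecting the content rather than trusting the file name. QuickTime is deliberately left out because its level depends on the encoding, and HTML counts as supported only if it validates as XHTML.

```python
from enum import Enum


class SupportLevel(Enum):
    SUPPORTED = "supported"
    KNOWN = "known"
    UNSUPPORTED = "unsupported"


# Assumed extensions for the formats listed in sections 2.2.2 and 2.2.3.
SUPPORTED_EXTENSIONS = {"xml", "rtf", "pdf", "html", "tif", "tiff",
                        "jpg", "jpeg", "png", "eps", "svg", "smil"}
KNOWN_EXTENSIONS = {"doc", "xls", "ppt", "gif", "sid", "wmv", "avi"}


def support_level(filename: str) -> SupportLevel:
    """Classify a file by extension; anything unrecognised is unsupported."""
    extension = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if extension in SUPPORTED_EXTENSIONS:
        return SupportLevel.SUPPORTED
    if extension in KNOWN_EXTENSIONS:
        return SupportLevel.KNOWN
    return SupportLevel.UNSUPPORTED


for name in ("thesis.PDF", "slides.ppt", "dataset.xyz"):
    print(name, "->", support_level(name).value)
```

Whatever the level, the bitstream itself is always preserved; the level only determines what the hosting institution promises about future usability.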
3. Architectural Drawings
Now that we had a clear design brief it was possible to move on to the
next step: deciding the broad architecture. This involved a series of
iterative steps, as well as a lot of research into what approaches
similar projects overseas had adopted. We ended up defining three
categories of required repository functionality.
3.1 Common Repository
We decided that, if possible, we wanted all the various content types
to be stored in a common repository. This would:
- facilitate linkages between items
- allow for more efficient management of the content and the
infrastructure
- enable exposure of all of an institution's public research output
through a common mechanism
3.2 Content Management and Workflow
In order to get the content into the common repository, we needed a way
to efficiently manage different classes of content contributors and
different content streams. We ended up deciding to define a series of
Content Management and Workflow modules, corresponding to the content
streams discussed under section 2.1. Each of these modules would have
its own content submission forms and workflow. Each would also have
specific functionality to deal with the requirements of that particular
stream type.
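One way to picture this arrangement is as a set of small modules sitting in front of a single shared store. The following is a rough sketch under my own assumptions, not a description of the ARROW implementation: the class names, form fields and workflow step names are all hypothetical, and the workflow steps are simply recorded with the item rather than executed.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CommonRepository:
    """Single store shared by every content stream (section 3.1)."""
    items: List[dict] = field(default_factory=list)

    def store(self, stream: str, metadata: Dict[str, str], workflow: List[str]) -> int:
        self.items.append({"stream": stream, "metadata": dict(metadata),
                           "workflow": list(workflow)})
        return len(self.items) - 1  # list position doubles as a toy item id


@dataclass
class WorkflowModule:
    """One module per content stream, with its own form fields and workflow."""
    stream: str
    form_fields: List[str]
    workflow_steps: List[str]

    def submit(self, repository: CommonRepository, metadata: Dict[str, str]) -> int:
        # Reject submissions that leave any of this stream's form fields empty.
        missing = [name for name in self.form_fields if not metadata.get(name)]
        if missing:
            raise ValueError(f"{self.stream} submission is missing: {', '.join(missing)}")
        return repository.store(self.stream, metadata, self.workflow_steps)


repository = CommonRepository()
eprints = WorkflowModule("e-print", ["title", "author"],
                         ["self-submission", "administrative check"])
etheses = WorkflowModule("e-thesis", ["title", "author", "degree"],
                         ["deposit", "metadata check"])

eprints.submit(repository, {"title": "A working paper", "author": "A. Researcher"})
etheses.submit(repository, {"title": "A thesis", "author": "B. Student", "degree": "PhD"})
print([item["stream"] for item in repository.items])  # ['e-print', 'e-thesis']
```

The point of the design is visible even at this scale: each stream can change its form or workflow without touching the repository, and everything still ends up in one place.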
3.2.1 ePrints
| Objective | Module to submit and manage e-prints. |
| Deliverables | Software, based on the ARROW architecture, that provides no less functionality than the eprints.org software. |
| Issues | Management of content self-submission and administrative management. |
3.2.2 eTheses
| Objective | A module that will manage thesis metadata and submit digital theses. |
| Deliverables | Software, based on the ARROW architecture, that provides no less functionality than the current Australian Digital Theses Program software and includes OAI-PMH compliance. |
| Issues | Data capture from various sources; efficient harvesting from institutional repositories; identification of software; performance and scalability requirements; interactions with other metadata services. |
3.2.3 ePress
| Objective | To create or integrate a module to manage a fully functional electronic press. |
| Deliverables | Software, based on the ARROW architecture, that provides sufficient functionality to run an open-access ejournal electronic press. |
| Issues | Integration of existing electronic press software. |
3.2.4 DEST Research Directory
| Objective | Testing of the feasibility and effectiveness of using an ARROW repository to support the annual Department of Education, Science and Training returns. |
| Deliverables | Repository holding a proportion of the institution's Department of Education, Science and Training 2003 returns. |
| Issues | Management of content submission from academics; embedding use of repository in institution-collection process. |
3.2.5 NLA Repository
| Objective | Installation of an independent scholars' repository at the National Library of Australia. |
| Deliverables | Repository, compliant with the ARROW architecture, adapted for independent scholar submission. |
| Issues | Management of content submission from independent scholars. |
3.2.6 Other applications
We also recognised that the ARROW infrastructure would be potentially
applicable to a wider range of problems. For this reason we left open
the possibility of adding other Content Management and Workflow modules
later on.
3.3 Search and Exposure
The ability to locate appropriate content for citation purposes is a
critical success factor in creating reliable scholarly communication
and increasing the impact of research. ARROW decided to develop a
nationally available resource discovery service to provide access to
Australian research output. The project will establish automated
mechanisms for harvesting and re-purposing metadata from institutions
and individual researchers. This will be done by applying international
standards, specifications and technologies to ensure interoperability.
Resource discovery will be supported by descriptive metadata. Other
types of metadata may also be generated to support digital rights
management, persistent identification, and archiving and preservation
to ensure the longevity of scholarly content. In addition, it will be
possible to search ARROW repositories through a range of discovery
tools (such as education portals or search engines). This exposure will
increase awareness of unique Australian content, both nationally and
internationally. The project will also seek to expose published
Australian research in commercial repositories, such as those created
by large journal publishers.
3.4 OLAD
The end result of the architectural decisions in each of the categories
of Common Repository, Content Management and Workflow, and Search
& Exposure was a layered architecture. The notion of a layered
architecture is not particularly controversial. Such architectures have
been preferred since at least the days of the International Organization
for Standardization's Open Systems Interconnection seven-layer reference model for
network services. In the Digital Library field these sorts of
high-level models are so common that the project group took to
referring to 'obligatory' layered architecture diagrams. Figure 1
therefore is the OLAD (Obligatory Layered Architecture Diagram) for ARROW.

Figure 1: Obligatory Layered Architecture Diagram for ARROW.
4. Building Materials
Now that we had defined the architecture, we had to work out how we
were going to build it. In construction terms, what building materials
were available and what were we going to choose?
4.1 Foundation - the repository
We recognised very early on that the choice of repository was
foundational. Particular repository technologies would in turn determine
the functionality we could provide and the way we could provide it.
Much of the latter half of 2003 was spent in careful analysis of
available candidates, based on a mixture of:
- reading publicly available materials including:
  - system documentation
  - published articles/conference papers
  - online presentations
  - notes from conference sessions
- downloading the software and 'kicking the tyres'
- attending conference sessions (and talking to presenters
afterwards)
- talking to other users to get a less-partisan assessment
As a result of this work, we rapidly settled on two likely candidates:
DSpace and FEDORA.
DSpace [HREF16] is a joint project of MIT
Libraries and Hewlett-Packard to develop a software system that
enables institutions to:
- Capture and describe digital works using customized workflow
processes
- Provide access to an institution's digital works over the web,
so users can search and retrieve items in the collection
- Preserve digital works over the long term
It is being made available under the BSD open source license to
other groups to run as-is, or to modify and extend as needed.
The current version of DSpace (1.1.1 - version 1.2 is anticipated
in April 2004) can best be thought of as a
general-purpose repository application, with a series of both
hard-wired and preferred behaviours. It is designed to provide stable
long-term storage needed to house the digital products of MIT faculty
and researchers. DSpace is intended to have different advantages for
different stakeholder groups:
"For the user: DSpace enables easy remote access and the
ability to read and search DSpace items from one location: the World
Wide Web.
For the contributor: DSpace offers the advantages of digital
distribution and long-term preservation for a variety of formats
including text, audio, video, images, datasets and more. Authors can
store their digital works in collections that are maintained by MIT
communities.
For the institution: DSpace offers the opportunity to provide
access to all the research of the institution through one interface.
The repository is organized to accommodate the varying policy and
workflow issues inherent in a multi-disciplinary environment.
Submission workflow and access policies can be customized to adhere
closely to each community's needs." [HREF17]
While DSpace grew out of the needs of MIT, a group of North
American and European universities are now participating in the DSpace
Federation [HREF18],
which will test the existing software,
and offer suggestions about how to further develop and improve it.
DSpace supports a wide range of content types [HREF19],
and particular installations can easily extend the range available.
FEDORA is both a software platform and an architecture (it stands for
the Flexible Extensible Digital Object and Repository Architecture).
The architecture came out of Digital Library work done in the computer
science field in the late 1990s [Payette and Staples (2002)]. The history
of the FEDORA repository software is described on its website as follows:
"In the summer of 1999 ... the [University of Virginia] Library's
research and development group discovered a paper about Fedora written
by Sandra Payette and Carl Lagoze of Cornell's Digital Library Research
Group. Fedora was designed on the principle that interoperability and
extensibility is best achieved by architecting a clean and modular
separation of data, interfaces, and mechanisms (i.e., executable
programs). With Cornell's help, the Virginia team installed the
research software version of Fedora and began experimenting with some
of Virginia's digital collections. Convinced that Fedora was exactly
the framework they were seeking, the Virginia team reinterpreted the
implementation and developed a prototype that used a relational
database backend and a Java servlet that provided the repository access
functionality. The prototype provided strong evidence that the Fedora
architecture could indeed be the foundation for a practical, scalable
digital library system. In September of 2001 The University of
Virginia received a grant of $1,000,000 from the Andrew W. Mellon
Foundation to enable the Library, in collaboration with Cornell
University, to build a sophisticated digital object repository system
based on the Flexible Extensible Digital Object and Repository
Architecture (Fedora). The Mellon grant was based on the success of the
Virginia prototype, and the vision of a new open-source version of
Fedora that exploits the latest web technologies. Virginia and Cornell
have joined forces to build this robust implementation of the Fedora
architecture with a full array of management utilities necessary to
support it." [HREF41].
Increasingly, the term FEDORA (which was first used
over 5 years ago as an acronym for the architecture) is now being used
to refer to this software implementation. In this latter sense, FEDORA
is "an open source, digital object
repository system using public APIs exposed as web services" [Staples, Wayland and Payette (2003)].
FEDORA can best be thought of as
services-mediation infrastructure, rather than an off-the-shelf
application. It can use web services to call other services as well as
expose its own services using web services standards. Key to the FEDORA
architecture (yes, I know this is like referring to an ATM Machine...)
is its underlying object-based model. FEDORA stores digital content
objects, either as datastreams contained within the repository or as
links to external resources. It also stores disseminators, which are
ways to render these digital content objects. The software maintains
bindings between content objects and their disseminators. Each object
has a default disseminator, but may be able to be disseminated in other
ways. This architecture is extremely flexible, and provides significant
advantages as a platform on which to build other applications.
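The object model just described can be made concrete with a toy example. This is a minimal sketch of the idea only, with hypothetical class and method names; it is not FEDORA's API, and in FEDORA itself disseminators are bound to services rather than to local functions as they are here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Datastream:
    """Content held in the repository (content) or a link to an external resource."""
    mime_type: str
    content: bytes = b""
    external_url: str = ""


@dataclass
class DigitalObject:
    pid: str
    datastreams: Dict[str, Datastream] = field(default_factory=dict)
    # Binding of disseminator names to the code that renders this object.
    disseminators: Dict[str, Callable[["DigitalObject"], str]] = field(default_factory=dict)
    default_disseminator: str = "default"

    def disseminate(self, name: str = "") -> str:
        """Render the object with the named disseminator, or the default one."""
        renderer = self.disseminators[name or self.default_disseminator]
        return renderer(self)


def list_datastreams(obj: DigitalObject) -> str:
    return ", ".join(sorted(obj.datastreams))


def as_plain_text(obj: DigitalObject) -> str:
    return obj.datastreams["text"].content.decode("utf-8")


paper = DigitalObject(pid="demo:1")
paper.datastreams["text"] = Datastream("text/plain", content=b"Hello")
paper.datastreams["pdf"] = Datastream("application/pdf",
                                      external_url="http://example.org/paper.pdf")
paper.disseminators["default"] = list_datastreams
paper.disseminators["plain-text"] = as_plain_text

print(paper.disseminate())              # pdf, text
print(paper.disseminate("plain-text"))  # Hello
```

The flexibility comes from the indirection: a new way of rendering an object is added by binding another disseminator, without changing the stored content.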
Version 1.2 of FEDORA, released in late December 2003, provides
versioning of both objects and their disseminators, as well as a
Java-based Administration GUI.
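The object model just described can be illustrated with a minimal
sketch. This is not FEDORA's API: the class names, field names and the
sample object below are all hypothetical, chosen only to show how
datastreams, disseminators and the bindings between them relate.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Datastream:
    """Content held in the repository, or a link to an external resource."""
    ds_id: str
    mime_type: str
    content: Optional[bytes] = None      # set when stored inside the repository
    external_url: Optional[str] = None   # set when the content lives elsewhere


@dataclass
class DigitalObject:
    """A content object plus the disseminators bound to it (hypothetical model)."""
    pid: str
    datastreams: Dict[str, Datastream] = field(default_factory=dict)
    # Bindings: disseminator name -> function that renders this object.
    disseminators: Dict[str, Callable[["DigitalObject"], str]] = field(default_factory=dict)
    default_disseminator: str = "default"

    def disseminate(self, name: Optional[str] = None) -> str:
        # Fall back to the default disseminator when none is requested.
        chosen = name or self.default_disseminator
        return self.disseminators[chosen](self)


def list_datastreams(obj: DigitalObject) -> str:
    """Default rendering: simply name the datastreams the object holds."""
    return ", ".join(sorted(obj.datastreams))


def citation_view(obj: DigitalObject) -> str:
    """Alternative rendering built from the (assumed) 'DC' datastream."""
    return f"{obj.pid}: {obj.datastreams['DC'].content.decode('utf-8')}"


paper = DigitalObject(pid="demo:1")
paper.datastreams["DC"] = Datastream("DC", "text/xml", content=b"A sample title")
paper.datastreams["PDF"] = Datastream(
    "PDF", "application/pdf", external_url="http://example.org/paper.pdf"
)
paper.disseminators["default"] = list_datastreams
paper.disseminators["citation"] = citation_view

print(paper.disseminate())            # default disseminator
print(paper.disseminate("citation"))  # an alternative disseminator
```

The point of the sketch is the separation of concerns: the same stored
content can be rendered in more than one way without being duplicated.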
There is also a range of other open-source repository projects
underway. The Open Society Institute is currently maintaining a
document which summarises the functionality of many of them [HREF8]. In
addition to DSpace, the current version reviews FEDORA, CDSWare,
MyCoRe, i-Tor, eprints.org and ARNO. Each of these comes out of a
particular response to the challenges of managing large amounts of
digital content, and each has its own strengths and weaknesses.
4.1.4 Selection
At the time of writing, the final selection had not been announced. It
is hoped that by the time of the conference, the announcement will have
been made. This paragraph will then be updated to explain the reasons
behind the selection.
4.2 Framing it up - the application development framework
One of the things that the repository may determine is the choice of
application development framework. This is because some repositories
only allow particular languages to call their Application Programming
Interfaces (APIs). We wanted to be able to code in a variety of
languages (not be restricted to one) and we wanted to be able to expose
repository functionality via Web Services. These two points are
partially inter-related: having web services makes it much easier to
use a range of languages.
4.3 Doors and Windows - the search and exposure layer
As discussed above, we wanted to make items in ARROW repositories as
accessible as possible. We decided to target three very different
technologies.
4.3.1 OAI-PMH
The Open Archives Initiative's Protocol for Metadata Harvesting
(OAI-PMH) was created to facilitate discovery of distributed resources,
such as those contained in a repository. The OAI-PMH achieves this by
providing a simple, yet powerful framework for metadata harvesting.
Harvesters can incrementally gather records contained in OAI-PMH
repositories and use them to create services covering the content of
several repositories [Van de Sompel, Young and Hickey (2003)]. OAI-PMH
is rapidly gathering strength as a way of providing federated resource
discovery services and was seen as essential to the success of ARROW.
The National Library will use OAI-PMH where available (and other
technologies where not) to harvest the metadata from ARROW and other
institutional repositories. These metadata will then be used to provide
national and international resource discovery for Australian research.
This national resource-discovery service will also link with other
national services delivered by the National Library for the Australian
Digital Theses Program and the international Networked Digital Library
of Theses and Dissertations.
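To make the idea of incremental harvesting concrete, here is a minimal
sketch of how a harvester might build an OAI-PMH ListRecords request
and read the response. The verb and argument names (ListRecords,
metadataPrefix, from, resumptionToken) and the XML namespace come from
the OAI-PMH 2.0 specification; the base URL, record identifiers and
token value are invented for illustration, and the sample response is
trimmed to headers only.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, from_date=None, resumption_token=None,
                     metadata_prefix="oai_dc"):
    """Build a ListRecords request URL for an OAI-PMH repository."""
    if resumption_token:
        # resumptionToken is an exclusive argument: it replaces all others.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            # Only ask for records added or changed since the last harvest.
            params["from"] = from_date
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (record identifiers, resumption token or None)."""
    root = ET.fromstring(xml_text)
    identifiers = [
        el.text
        for el in root.findall(".//oai:record/oai:header/oai:identifier", OAI_NS)
    ]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    token = token_el.text if token_el is not None and token_el.text else None
    return identifiers, token


# A hand-written, abbreviated response (hypothetical repository and records).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:repository.example.edu.au:1</identifier>
        <datestamp>2004-03-01</datestamp>
      </header>
    </record>
    <record>
      <header>
        <identifier>oai:repository.example.edu.au:2</identifier>
        <datestamp>2004-03-02</datestamp>
      </header>
    </record>
    <resumptionToken>batch-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

BASE_URL = "http://repository.example.edu.au/oai"  # hypothetical endpoint
first_url = list_records_url(BASE_URL, from_date="2004-03-01")
identifiers, token = parse_list_records(SAMPLE_RESPONSE)
next_url = list_records_url(BASE_URL, resumption_token=token)

print(first_url)
print(identifiers, token)
print(next_url)
```

A real harvester would fetch each URL over HTTP and repeat until no
resumption token is returned; that loop is omitted here so the sketch
runs without network access.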
4.3.2 Google
There is little need to discuss the success of Google at a Web
conference. For most students (and probably for most staff!) Google is
the resource discovery mechanism of choice. Enabling Google to access
at least the metadata (and preferably the full text) of items in ARROW
repositories was an easy choice to make. In practice, this means
providing a robots.txt file and publicly available content in a
directory location accessible to Google's spidering software.
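As an illustration only, the following sketch shows what such a
robots.txt file might contain and uses Python's standard
urllib.robotparser module to check which paths a crawler would be
permitted to fetch. The directory names (/public/, /admin/, /workflow/)
are assumptions made for the example, not ARROW's actual layout.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: expose the public area, keep internals private.
ROBOTS_TXT = """\
User-agent: *
Allow: /public/
Disallow: /admin/
Disallow: /workflow/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_crawl(path, agent="Googlebot"):
    """Report whether the given crawler may fetch the given path."""
    return parser.can_fetch(agent, path)


for path in ("/public/item/1", "/admin/", "/workflow/queue"):
    print(path, may_crawl(path))
```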
4.3.3 SRU/SRW
The third exposure layer was in some ways a less obvious choice. Both
OAI-PMH and Google are 'proxy' search services. That is, they collect
proxy records and place them in a database where they can be searched.
Such proxy systems always run the risk of being out of date (if only
slightly). We therefore wanted to make it possible for other search
services to connect directly to ARROW repositories and run interactive
searches. The standard protocol for such connections in the library
world is Z39.50 (more formally known as ISO 23950: "Information
Retrieval (Z39.50): Application Service Definition and Protocol
Specification") [HREF21].
Z39.50 has not been taken up as quickly as its proponents had hoped
(for a variety of reasons too complex to cover here). As a result the
Z39.50 Next Generation group (ZNG) have been working on more modern and
lightweight protocols to achieve much of the original Z39.50
functionality. These newer protocols are called SRU (Search/Retrieve
over URL) [HREF22] and SRW (Search/Retrieve for Web Services) [HREF23].
ARROW decided to support both SRW and SRU connections to make real-time
searching possible through things like the portlet technology being
developed by education.au [HREF24].
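To show how lightweight an SRU connection is, here is a minimal sketch
that builds a searchRetrieve request as an ordinary URL. The parameter
names (operation, version, query, startRecord, maximumRecords) are
those defined by SRU; the base URL, the choice of version 1.1 and the
dc.title index used in the CQL query are assumptions for the example.

```python
from urllib.parse import urlencode, urlparse, parse_qs


def sru_search_url(base_url, cql_query, start_record=1, maximum_records=10):
    """Build an SRU searchRetrieve URL carrying a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",            # assumed protocol version
        "query": cql_query,          # CQL query; urlencode escapes it
        "startRecord": start_record,
        "maximumRecords": maximum_records,
    }
    return base_url + "?" + urlencode(params)


# Hypothetical repository endpoint and query.
url = sru_search_url(
    "http://repository.example.edu.au/sru",
    'dc.title = "digital library"',
)
print(url)

# Decoding the URL again recovers the original query unchanged.
decoded = parse_qs(urlparse(url).query)
print(decoded["query"][0])
```

Because the whole request is a URL, any language or tool that can issue
an HTTP GET can act as a search client, which is the property that made
SRU attractive alongside the SOAP-based SRW.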
5. Building Site
5.1 Where we are now
Up to now, this paper has described activities that have already taken
place, which brings us to the time of writing. The point of all the
preceding work is, of course, to actually build something. The ARROW
project started to receive funds in late January 2004. Since that time
we have:
- appointed a Project Manager (Geoff Payne, previously with the
AARLIN project [HREF43] at La Trobe University)
- appointed a company to design an ARROW brand, marketing materials
and a website [HREF25]
- made significant progress on software selection (NOTE: see
comments under 4.1.4 above)
- turned the original briefing document into a set of technical
requirements
5.2 Plans for the rest of this year
Over the rest of this year, the ARROW project will:
- select repository software (NOTE: see comments under 4.1.4 above)
- define the requirements for the content workflow and management
modules
- create/acquire/integrate software to meet these requirements
- start work to acquire content within the partner institutions
- develop the search/exposure services required
6. Open House!
6.1 When are we going to be open for business?
We hope to have functional software available by the end of 2004. This
would be the Open House date, and from that point onwards we will be
loading content and providing a semi-production service. Initially this
service will only be available at the four project partner
institutions. We have made an allocation in the budget in year 3 (2006)
to roll out the ARROW initiative to up to 10 other institutions across
Australia. We may be able to start this phase earlier if all goes well,
but we don't want to commit to this at such an early stage.
6.2 Plans for the future
The initial round of DEST funding runs out at the end of 2006. One of
the DEST requirements was that successful projects should address the
issues of sustainability. Both DEST and ARROW are keen to see the
initiative continue beyond the end of 2006 and are thinking hard about
how to ensure long-term viability for the project (assuming it is
successful). It is far too early to say what these plans might be, but
one idea that we keep playing with can be summarised as "Embedding
ARROW into the things that universities have to do anyway".
7. Conclusions
The process of developing the architecture for ARROW has been a
constant interaction between our vision for what we wanted to do and
what the software might make possible. Sometimes the software
possibilities constrained the vision. Sometimes they expanded it. But
the end result is, we hope, a flexible architecture that will enable us
to meet the DEST requirements to make Australian research more visible.
And, who knows, perhaps ARROW will end up becoming something more. In
our less-guarded moments we (the ARROW Project Team) like to talk about
ARROW becoming part of the fundamental infrastructure of higher
education in Australia. Perhaps it will, but we have quite enough on
our plates already, and our first challenge is to succeed with the
initial (and quite daunting enough) list of deliverables. The
architectural work described in this paper is just the first step.
8. Acknowledgement
The ARROW Project is sponsored as part of the Commonwealth Government's
Backing Australia's Ability [HREF42].
References
DEST (Australian Commonwealth Department of
Education, Science and Training) (2002), Research Information
Infrastructure Framework for Australian Higher Education. The Final
Report of the Higher Education Information Infrastructure Advisory
Committee (Systemic Infrastructure Initiative). [HREF4]
DEST (2003a), Information Infrastructure - Call for Proposals 2003.
[HREF5]
DEST (2003b), Information Infrastructure - Outcomes of Selections
Process. [HREF6]
Harboe-Ree, C., Sabto, M. and Treloar, A. (2004), "The library as
digitorium: new modes of creation, distribution and access",
Proceedings of VALA 2004, Melbourne, February. [HREF1]
Harboe-Ree, C. and Treloar, A. (2004),
"Connecting the Dots Downunder: Towards An Integrated Institutional
Approach To Digital Content Management", High Energy Physics Libraries
Webzine, issue 9, March. [HREF44]
Lynch, C.A. (2003), "Institutional Repositories: Essential
Infrastructure for Scholarship in the Digital Age", ARL, no. 226,
February, 1-7. [HREF7]
Open Society Institute (2004), A Guide to Institutional Repository
Software version 2.0. [HREF8]
Payette, S. and Staples, T. (2002), "The Mellon Fedora Project: digital
library architecture meets XML and web services", Sixth European
Conference on Research and Advanced Technology for Digital Libraries,
Lecture Notes in Computer Science, vol. 2459, Springer-Verlag, Berlin
Heidelberg New York, 406-421. [HREF9]
Rogers, S.A. (2003), "Developing an institutional Knowledge Bank at
Ohio State University: from Concept to Action Plan", portal: Libraries
and the Academy, January. [HREF2]
Staples, T., Wayland, R. and Payette, S. (2003), "The Fedora Project:
an open-source digital object repository management system", D-Lib
Magazine, April. [HREF10]
Van de Sompel, H., Young, J. and Hickey, T.
(2003), "Using the OAI-PMH ... Differently", D-Lib Magazine,
July/August. [HREF20]
Hypertext References
- HREF1
- http://www.vala.org.au/vala2004/2004pdfs/21HrSaTr.pdf
- HREF2
- http://www.lib.ohio-state.edu/Lib_Info/rogersKBdoc.pdf
- HREF3
- http://www.colis.mq.edu.au/
- HREF4
- http://www.dest.gov.au/highered/otherpub/heiiac/exec_summary.htm
- HREF5
- http://www.dest.gov.au/highered/research/proposal.htm#1
- HREF6
- http://www.dest.gov.au/highered/research/outcomes2003.htm
- HREF7
- http://www.arl.org/newsltr/226/ir.html
- HREF8
- http://www.soros.org/openaccess/software/
- HREF9
- http://www.fedora.info/documents/ecdl2002final.pdf
- HREF10
- http://dlib.org/dlib/april03/staples/04staples.htm
- HREF11
- http://www.dest.gov.au/Ministers/Media/McGauran/2003/10/mcg002221003.asp
- HREF12
- http://www.eprints.org
- HREF13
- http://adt.caul.edu.au
- HREF14
- http://etd.vt.edu/
- HREF15
- http://www.rmitpublishing.com.au/
- HREF16
- http://www.dspace.org
- HREF17
- http://libraries.mit.edu/dspace-mit/
- HREF18
- http://dspace.org/federation/index.html
- HREF19
- http://dspace.org/faqs/index.html#content
- HREF20
- http://www.dlib.org/dlib/july03/young/07young.html
- HREF21
- http://lcweb.loc.gov/z3950/agency/
- HREF22
- http://www.loc.gov/z3950/agency/zing/srw/sru.html
- HREF23
- http://www.loc.gov/z3950/agency/zing/
- HREF24
- http://www.educationau.edu.au/
- HREF25
- http://arrow.edu.au/
- HREF26
- http://www.naa.gov.au/recordkeeping/preservation/digital/xml_data_formats.html
- HREF27
- http://sts.anu.edu.au/downloads/APSR.pdf
- HREF28
- http://www.melcoe.mq.edu.au/projects/MAMS/index.htm
- HREF29
- http://andrew.treloar.net/
- HREF30
- http://www.its.monash.edu.au/
- HREF31
- http://www.monash.edu.au/
- HREF32
- http://arrow.edu.au/
- HREF33
- http://lib.monash.edu.au/
- HREF34
- http://home.earthlink.net/~ritter/tiff/
- HREF35
- http://www.libpng.org/pub/png/
- HREF36
- http://www.w3.org/Graphics/SVG/
- HREF37
- http://www.w3.org/AudioVideo/
- HREF38
- http://www.state.ma.us/mgis/mrsid.htm
- HREF39
- http://www.staroffice.com
- HREF40
- http://www.openoffice.org/
- HREF41
- http://www.fedora.info/history.shtml
- HREF42
- http://backingaus.innovation.gov.au/
- HREF43
- http://aarlin.edu.au/
- HREF44
- http://library.cern.ch/HEPLW/9/papers/1/
Copyright
© Dr Andrew Treloar, 2004. The author assigns to Southern Cross
University and other educational and non-profit institutions a
non-exclusive licence to use this document for personal use and in
courses of instruction provided that the article is used in full and
this copyright statement is reproduced. The author also grants a
non-exclusive licence to Southern Cross University to publish this
document in full on the World Wide Web and on CD-ROM and in printed
form with the conference papers and for the document to be published
on mirrors on the World Wide Web.