ARROW Targets: Institutional Repositories, Open-Source, and Web Services

Dr Andrew Treloar [HREF33], Project Manager, Strategic Information Initiatives, Information Technology Services [HREF34] & ARROW [HREF1] Technical Architect & Adjunct Librarian, Monash University Library [HREF35]. Building 3A, Monash University [HREF36], Victoria, 3800. Email: Andrew.Treloar@its.monash.edu.au

Abstract

The Australian Research Repositories Online to the World (ARROW) project was funded by the Australian Commonwealth Department of Education, Science and Training in 2004 to investigate institutional research repository issues. This paper describes the software decisions made by ARROW, and some of the consequences of those decisions. It begins by providing some context for ARROW, before detailing why the project selected the Fedora repository software as its underlying storage layer and VTLS as its development partner. It also discusses how other pieces of software are being used in, or integrated into, the overall ARROW offering. The paper then turns to future software development activity, which revolves around ARROW-funded open-source web services and other possible web service opportunities. The paper concludes by encouraging other repository initiatives to work together on the collaborative development of services of shared interest.

1. ARROW Overview

1.1 FRODO Program

In June of 2003, the Australian Commonwealth Department of Education, Science and Training issued a call for proposals to "further the discovery, creation, management and dissemination of Australian research information in a digital environment" [DEST (2003a)]. This sought to "fund proposals which help promote Australian research output and help to build the Australian research information infrastructure, through the development of distributed digital repositories and common technical services that manage access and authorisation to these."

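In response to this call, 14 proposals were submitted, of which four were funded [DEST (2003b)]. The successful projects were ARROW [HREF1], the ADT Expansion and Redevelopment project [HREF2], the Australian Partnership for Sustainable Repositories (APSR) [HREF3] and MAMS [HREF4].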

These four projects were funded for a combined total of A$12 million over a period of three years, with funding commencing at the start of 2004 [HREF5]. This group of projects is now referred to by DEST as the Federated Repositories of Digital Objects (FRODO) project cluster.

The focus of this paper will be on the selection of software for ARROW and the consequences and opportunities arising from this decision.

1.2. ARROW Design Brief

The original design brief was encapsulated in the Summary section of the ARROW Bid document sent to DEST (public version of bid available at [HREF6]). This read:

"The ARROW project (ARROW) will identify and test a software solution or solutions to support best-practice institutional digital repositories comprising e-prints, digital theses and electronic publishing. A wide range of digital content types will be managed in these repositories. The NLA will develop a repository and associated metadata to support independent scholars (those not associated with institutions). A complementary activity of ARROW is the development and testing of national resource discovery services (developed by the NLA) using metadata harvested from the institutional repositories, and the exposing of metadata to provide services via protocols and toolkits. This will include a potential path for the redevelopment of the Australian Digital Theses (ADT) metadata repository incorporated into the NLA’s national resource discovery services.

Initially ARROW will be tested in the four partner institutions, prior to it being offered more widely across the higher-education sector. The solution will be open-standards based, or will support open standards, and will facilitate interoperability within and between participating institutions."

An earlier paper [Treloar 2004] describes the thinking behind this brief and the various ARROW components in more detail.

1.3 ARROW Implementation Philosophy

A core part of the design process was deciding how far to rely on open source and open standards. The first decision was an easy one: it was a condition of the funding from DEST that any software developed using project funds had to be made available as open source. This ensured that the Australian (and, ultimately, the global) research communities got the best value from the investment. The second decision also turned out to be an easy one. The core design group agreed that the best approach was to adopt open standards wherever possible when specifying software functionality, data formats or interfaces.

1.4 Obligatory Layered Architecture Diagram

The ARROW architects decided to conceive of the software required in terms of a layered architecture. The notion of a layered architecture is not particularly controversial. Such architectures have been preferred since at least the days of the International Organization for Standardization (ISO) Open Systems Interconnection (OSI) seven-layer reference model for network services. In the Digital Library field these sorts of high-level models are so common that the project group took to referring to 'obligatory' layered architecture diagrams. Figure 1 is therefore the OLAD (Obligatory Layered Architecture Diagram) for ARROW.

Figure 1: Obligatory Layered Architecture Diagram for ARROW.

2. ARROW Software

2.1 Repository foundation: Fedora

The project recognised very early on that the decision on the repository was foundational. The choice of repository technology would determine the functionality ARROW could provide and the ways it could provide it. Much of the latter half of 2003 was spent in careful analysis of available candidates, looking in particular at the list and functionality checklist contained in [Open Society Institute, 2004].

As a result of this work, the project decided to select Fedora - the Flexible Extensible Digital Object and Repository Architecture. Despite its name, Fedora [HREF7] is both a software platform and an architecture. Note that this Fedora is distinct from, and predates, the use of the name by Red Hat. The architecture came out of Digital Library work done in the computer science field in the late 1990s [Payette and Staples (2002)]. The history of the Fedora repository software is described on its website as follows:
"In the summer of 1999 ... the [University of Virginia] Library's research and development group discovered a paper about Fedora written by Sandra Payette and Carl Lagoze of Cornell's Digital Library Research Group. Fedora was designed on the principle that interoperability and extensibility is best achieved by architecting a clean and modular separation of data, interfaces, and mechanisms (i.e., executable programs). With Cornell's help, the Virginia team installed the research software version of Fedora and began experimenting with some of Virginia's digital collections. Convinced that Fedora was exactly the framework they were seeking, the Virginia team reinterpreted the implementation and developed a prototype that used a relational database backend and a Java servlet that provided the repository access functionality. The prototype provided strong evidence that the Fedora architecture could indeed be the foundation for a practical, scalable digital library system. In September of 2001 The University of Virginia received a grant of $1,000,000 from the Andrew W. Mellon Foundation to enable the Library, in collaboration with Cornell University, to build a sophisticated digital object repository system based on the Flexible Extensible Digital Object and Repository Architecture (Fedora). The Mellon grant was based on the success of the Virginia prototype, and the vision of a new open-source version of Fedora that exploits the latest web technologies. Virginia and Cornell have joined forces to build this robust implementation of the Fedora architecture with a full array of management utilities necessary to support it." [HREF8].

Increasingly, the term Fedora (which was first used more than five years ago as an acronym for the architecture) is being used to refer to this software implementation. In this latter sense, Fedora is "an open source, digital object repository system using public APIs exposed as web services." [Staples, Wayland and Payette (2003)]. Fedora can best be thought of as services-mediation infrastructure, rather than an off-the-shelf application. It can use web services to call other services as well as expose its own services using web services standards. Key to the Fedora architecture is its underlying object-based model. Fedora stores digital content objects, either as datastreams contained within the repository or as links to external resources. It also stores disseminators, which are ways to render these digital content objects. The software maintains bindings between content objects and their disseminators. Each object has a default disseminator, but may also be disseminated in other ways. This architecture is extremely flexible, and provides significant advantages as a platform on which to build other applications [Lagoze, et al. (2005)].
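
The object model described above can be illustrated with a minimal sketch. This is not Fedora's API or storage format: the class names (`Datastream`, `DigitalObject`), the two example disseminators, the identifier `demo:1` and the example URL are all assumptions made purely for illustration. The sketch shows only the idea of content held internally or by reference, bound to a default and an alternative way of rendering it.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Datastream:
    """Content held inside the repository (data) or referenced externally (url)."""
    mime_type: str
    data: Optional[bytes] = None   # content stored within the repository
    url: Optional[str] = None      # link to an external resource


@dataclass
class DigitalObject:
    """Hypothetical content object plus bindings to its disseminators."""
    pid: str
    datastreams: Dict[str, Datastream] = field(default_factory=dict)
    disseminators: Dict[str, Callable[["DigitalObject"], str]] = field(default_factory=dict)
    default: str = "default"

    def disseminate(self, name: Optional[str] = None) -> str:
        """Render the object with the named disseminator, or the default one."""
        return self.disseminators[name or self.default](self)


def describe(obj: DigitalObject) -> str:
    """Default disseminator: list the object's datastreams."""
    return f"{obj.pid}: " + ", ".join(sorted(obj.datastreams))


def locate(obj: DigitalObject) -> str:
    """Alternative disseminator: report where each datastream's content lives."""
    return "; ".join(
        f"{name} -> {ds.url or 'internal'}"
        for name, ds in sorted(obj.datastreams.items())
    )


# Illustrative object: one internal datastream and one external link.
thesis = DigitalObject(
    pid="demo:1",
    datastreams={
        "PDF": Datastream("application/pdf", data=b"%PDF-1.4"),
        "DATA": Datastream("text/csv", url="http://example.org/data.csv"),
    },
    disseminators={"default": describe, "locate": locate},
)

print(thesis.disseminate())          # uses the default disseminator
print(thesis.disseminate("locate"))  # same object, disseminated another way
```

Keeping the bindings separate from the content is what allows a new rendering to be added without touching the stored datastreams.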

Fedora was selected, after its merits were weighed against those of DSpace and several other candidates, for a number of reasons:

This is, however, an area in which a number of players, both open-source and proprietary, are moving very quickly. As a result, ARROW agreed to review its software decision every 12 months.

Fedora has been through a number of versions since the ARROW project started working with it:

Fedora is now a mature repository engine, with deep and rich functionality applicable to a range of problem domains. But it does not provide an out-of-the-box experience. When one downloads Fedora, one gets the engine only. To do something useful with Fedora, someone has to write software.

2.2 Outsourcing the software development

The other major decision to be made, therefore, was how ARROW would develop the software that would be built on top of Fedora to meet the requirements of the ARROW project. The original bid to DEST had envisaged that the project would hire its own software developers to write the necessary software. Geoff Payne, the ARROW project manager, realised that a potentially far better option would be to engage an outside developer, preferably one with experience of our preferred repository. After a good deal of exploration and negotiation, the project announced [HREF9] in July 2004 that it was partnering with VTLS [HREF10], which already had a product on the market called VITAL [HREF11] that was built on top of Fedora. ARROW has licensed VITAL and is working with VTLS to extend the functionality of Fedora by commissioning a series of open-source web services (see the section on open-source ARROW web services below).

The relationship between ARROW and VTLS is a true partnership, with significant transfers of IP in both directions. ARROW has the ability to influence the direction of future versions of VITAL, as well as getting access to pre-beta code for testing and critiquing.

This decision to partner with VTLS had a number of advantages:

2.3 VITAL

The original VITAL product was primarily aimed at digital image collections. VITAL 1.0 consisted of Fedora 1.2, a basic Web front end for searching (the VITAL Access Portal), and a Windows application for managing the repository contents (the VITAL Manager). The intention was that a small back-office staff would use the Manager to ingest and manage image files which would then be exposed to a much wider audience using the Access Portal.

The requirements of the ARROW Project were much broader than this. In particular, ARROW wanted support for:

VTLS have been gradually adding this functionality in successive releases of the software. VITAL 1.2 (codenamed Bandicoot) was released in November 2004, and VITAL 1.3 (codenamed Bettong) is going through VTLS internal testing in March 2005, prior to release in April 2005. VITAL 2.0 (codenamed Bilby), due for release in June/July 2005, will deliver all of the functionality originally specified by ARROW. VITAL 2.1 (tentatively codenamed Bobuck) has been delayed beyond its original ship date in order to draw on functionality that will only be delivered in Fedora 2.1 around mid-2005.

2.4 Additional software

The VITAL software was originally intended to meet the needs of the E-Print, E-Theses and NLA Repository for non-university research modules in the ARROW OLAD. While working with VTLS on the development of VITAL, the ARROW project has also been investigating other software options for parts of the ARROW offering.

National Research Discovery Service

One of the offerings in the Search/Exposure layer is the ARROW National Research Discovery Service. This has been written by the National Library of Australia as part of their contribution to the ARROW consortium and is hosted on their Teratext system. It uses OAI-PMH (the Open Archives Initiative Protocol for Metadata Harvesting) to harvest metadata from a number of different institutional research repositories at Australian universities. These repositories use a range of software (eprints.org software, DSpace and Fedora) but all expose their metadata for harvesting. This service is now live and available either through a link from the ARROW website or directly [HREF12].
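
As an illustration of what harvesting involves, the following minimal sketch builds an OAI-PMH `ListRecords` request and pulls identifiers and titles out of a response. It is not the National Library's harvester: the base URL `http://repo.example.edu/oai`, the function names and the sample response are assumptions, and a real harvester would also need to handle resumption tokens, deleted records and errors.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces defined by OAI-PMH 2.0 and unqualified Dublin Core.
OAI_NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH ListRecords request for the given repository."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


def extract_records(response_xml: str) -> list:
    """Return (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(response_xml)
    pairs = []
    for record in root.iterfind(".//oai:record", OAI_NS):
        identifier = record.findtext(
            "oai:header/oai:identifier", default="", namespaces=OAI_NS
        )
        title = record.findtext(".//dc:title", default="", namespaces=OAI_NS)
        pairs.append((identifier, title))
    return pairs


# Hand-written sample response (hypothetical repository and record).
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repo.example.edu:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A sample thesis</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url("http://repo.example.edu/oai"))
print(extract_records(SAMPLE))
```

Because every repository answers the same request in the same XML format, the harvester does not need to know which repository software sits behind each endpoint.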

Open Access Publishing

For the Open Access Publishing module, the ARROW project has decided to use an existing piece of software that offers good functionality and a very well designed web interface. This is Open Journal Systems (OJS) [HREF13], available from the Public Knowledge Project [HREF14] at the University of British Columbia. Version 1.1.9 of OJS (the version available at the time of writing) is restricted to one journal per server, but the next major version should remove this restriction. Swinburne is the lead site within the ARROW consortium for working with and adapting OJS. The ARROW OLAD shows the OA Publishing module sitting on top of a common repository. This was the original thinking behind the bid, but at present OJS comes with its own filesystem-based repository. The long-term plan is to integrate the workflow and exposure tools of OJS with the underlying Fedora repository engine. Ron Jantz at the Rutgers University Library Scholarly Communication Center [HREF15] has already done some work along these lines, and the ARROW project plans to partner with him on this.

ADT

Another piece of software being used by ARROW is the Australian Digital Theses (ADT) Program Content Management Software. The Australian Digital Theses Program is owned by the Council of Australian University Librarians. The main focus of the ADT is, and has always been, the ADT Theses Discovery Service. In order to provide content to discover, the ADT had to develop a local software solution for participants, which they call the ADT Content Management Software. This software is based on the original 1997 ETD software from Virginia Tech in the US, but with two local and quite significant iterations of enhancements.

The ADT Program is not dependent on any particular software, as long as the prescribed metadata is generated for harvesting into the national ADT Central Metadata Repository. The current ADT-ARIIC Expansion and Redevelopment Project (from the same FRODO funding program that is financing the ARROW project) is working on expanding this repository into a comprehensive source of metadata about Australian research theses, whether they lead to e-theses or not. The ADT Program is also set to expand by including New Zealand. This is the first step in building a regional program, and will see a consequential name change to the Australasian DTP.

ARROW's original plans in this area were to develop software that could be used as an alternative to the ADT Content Management offering. This is still the plan, but some of the functionality is dependent on the next release of Fedora. As an interim step, therefore, the ARROW project is proposing to use the workflow and front-end of the ADT software to prepare file packages for ingest into ARROW repositories. In the longer term, the valuable work of the ADT Program and the customisations they have made to their software will inform ARROW as to the functionality required for ETD management in an Australian context.

3. Future plans

This paper has described ARROW's current and planned activities around its software selection and development. What about plans further out? In particular, what are some of the possible consequences arising from the choice of Fedora and the approach ARROW is taking?

3.1 Opening up Fedora

Versions of Fedora up to and including 1.2.1 were open source, but closed development. That is, it was not possible for anyone outside the core Fedora group to contribute code, and there was no mechanism for co-ordinating additional services around the Fedora core.

Figure 2: Fedora services framework. Source - Lagoze, et al (2005)

The Fedora team have been thinking about how best to open up Fedora as part of planning the next development phase. At a recent meeting of the Fedora Development Consortium (of which ARROW is a member) they proposed the model shown in Figure 2. At the core of this new framework is the Fedora Repository Service. Other services exist around the core to provide functionality that is not considered a fundamental function of a repository.

Any number of services can be developed to collaborate with the core Fedora Repository Service. In the diagram, there are three collaborating services around the core: the Fedora OAI provider, a Fedora Search service, and a Fedora Preservation Monitoring Service. The framework approach anticipates that new services will be added over time. Outside of the boundaries of the Fedora framework are external services that can either call upon Fedora services, or that Fedora can leverage in some way... Services outside the framework are typically general-purpose services, or organization-specific services that call upon Fedora as an underlying repository for digital content. Prior to version 2.0 of Fedora, all Fedora-related functionality was built into the core Fedora Repository Service. As of version 2.0, the Fedora Service Framework was defined to move the Fedora architecture in a direction where new services can easily be developed and plugged into the Framework. This is consistent with general trends developing in web services technology and enterprise application architectures in which formerly tightly-integrated systems are broken apart into atomic, modular services that can be flexibly aggregated into different multi-service compositions. At the time of writing, Fedora is migrating to the new service framework approach. [Lagoze, et al (2005)]
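
The separation between a small core and the services that collaborate with it can be sketched as follows. This is an in-process analogy only: in the Fedora framework the services communicate as web services, and the class names used here (`RepositoryService`, `SearchService`, `Framework`) and the sample records are hypothetical.

```python
class RepositoryService:
    """Stand-in for the core: it only stores and returns objects."""

    def __init__(self):
        self._objects = {}

    def ingest(self, pid, metadata):
        self._objects[pid] = dict(metadata)

    def get(self, pid):
        return self._objects[pid]

    def pids(self):
        return sorted(self._objects)


class SearchService:
    """Collaborating service: searching lives outside the core."""

    def __init__(self, repository):
        self.repository = repository

    def find(self, term):
        # Uses only the core's public methods, never its internal storage.
        term = term.lower()
        return [
            pid
            for pid in self.repository.pids()
            if term in self.repository.get(pid).get("title", "").lower()
        ]


class Framework:
    """Registry through which new services can be plugged in over time."""

    def __init__(self, repository):
        self.repository = repository
        self.services = {}

    def register(self, name, service_class):
        self.services[name] = service_class(self.repository)
        return self.services[name]


core = RepositoryService()
core.ingest("demo:1", {"title": "Institutional repositories"})
core.ingest("demo:2", {"title": "Web services"})

framework = Framework(core)
search = framework.register("search", SearchService)
print(search.find("web"))
```

A further service (an OAI provider, say) would be registered in the same way, without any change to the core class.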

So what might some of these external web services be?

3.2 Open-source ARROW web services

It was a condition of the DEST funding for ARROW that any commissioned software be made available as open source. ARROW has commissioned VTLS to write a series of ARROW-funded modules. These modules will call Fedora using the existing APIs and will also expose themselves as a series of web services (consistent with the above model). VTLS will be able to build products on top of these new ARROW-commissioned modules if they wish, and future releases of the VITAL product will almost certainly use these modules.

The first of these web services to be made available will support SRU/SRW searching of ARROW repositories. They form part of the ARROW Search/Exposure layer. The other three functions in this layer are the National Research Discovery Service (discussed above), the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) [Van de Sompel, et al. 2003] and Web search engines. These are all 'proxy' search services. That is, they collect proxy records and place them in a database where they can be searched. Such proxy systems always run the risk of being out of date (if only slightly). We therefore wanted to make it possible for other search services to connect directly to ARROW repositories and run interactive searches. The standard protocol for such connections in the library world is Z39.50 (more formally known as ISO 23950: "Information Retrieval (Z39.50): Application Service Definition and Protocol Specification") [HREF16]. Z39.50 has not been taken up as quickly as its proponents had hoped (for a variety of reasons too complex to cover here). As a result, the Z39.50 International: Next Generation (ZING) initiative has been working on more modern and lightweight protocols to achieve much of the original Z39.50 functionality. These newer protocols are called SRU (Search/Retrieve via URL) [HREF17] and SRW (Search/Retrieve Web Service) [HREF18]. ARROW decided to support both SRU and SRW connections to enable real-time searching through things like the portlet technology being developed by education.au [HREF19].

The SRU/SRW web services will appear in VITAL 2.0, and will also be made available to the Fedora community for their use.

3.3 Other web service opportunities

There are a number of other potential web services that could be developed consistent with the above model. A number of these would be applicable to different repository initiatives and software platforms. If these can be architected and implemented sufficiently cleanly and generally, then repository software that is web-services enabled could make use of these common modules. Such an approach would mean greater critical mass coalescing around a smaller number of choices for such services. This would be more efficient, and would contribute to greater quality and functionality for these services. So, what are some candidates?

Authorisation/authentication

This might be required to perform administrative functions or access particular resources. Within an institution it will probably be possible to use existing mechanisms (such as an LDAP lookup against an enterprise directory). Across institutions, the situation is a bit more problematic. The preferred solutions worldwide seem to be coalescing around something like Shibboleth [HREF20]. The ARROW project will be working with the MAMS project (funded in the same DEST round) to document possible ARROW-MAMS interactions, validate common eXtensible Access Control Markup Language (XACML) [HREF32] profiles (once initial use-cases are agreed with MAMS), and test the MAMS solution. Across the wider institutional repository community there is an opportunity to agree on the reciprocal use of privacy-preserving authentication, and converge on a common set of XACML profiles.

Object processing on ingest

All repositories should, on ingest, be able to validate objects as the type they purport to be (for consistency's sake, if nothing else). They should also be able to extract technical and descriptive metadata where this is either contained in the object or derivable from it. Here, the preferred solution seems to be JHOVE (the JSTOR/Harvard Object Validation Environment) [HREF21]. ARROW will be working with JHOVE to make use of existing plugins, expand their functionality, and provide additional plugins for new file types. The web services opportunity here is to agree to use the same plugin framework, and put effort into enhancing the functionality. One example of this activity is the National Library of New Zealand, which is extending the JHOVE PDF plugin to extract preservation metadata.
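
The two ingest-time tasks described here, checking that an object is what it claims to be and extracting technical metadata from it, can be illustrated with a deliberately simple sketch. This is not JHOVE and does not use its plugin interface: it checks only the well-known leading "magic" bytes of a few formats, whereas a real validator examines the whole file for well-formedness and validity. The function names and the signature table are assumptions for illustration.

```python
import re

# Leading byte signatures for a few common formats (illustrative subset).
SIGNATURES = {
    "application/pdf": (b"%PDF-",),
    "image/png": (b"\x89PNG\r\n\x1a\n",),
    "image/gif": (b"GIF87a", b"GIF89a"),
    "image/jpeg": (b"\xff\xd8\xff",),
}


def validate_on_ingest(content: bytes, claimed_type: str) -> bool:
    """Return True if the leading bytes match the claimed MIME type."""
    signatures = SIGNATURES.get(claimed_type)
    if signatures is None:
        raise ValueError(f"no validator registered for {claimed_type}")
    return content.startswith(signatures)


def pdf_version(content: bytes):
    """Extract one piece of technical metadata: the version in the PDF header."""
    match = re.match(rb"%PDF-(\d\.\d)", content)
    return match.group(1).decode("ascii") if match else None


sample = b"%PDF-1.4\n% rest of file omitted"
print(validate_on_ingest(sample, "application/pdf"))
print(validate_on_ingest(b"GIF89a", "application/pdf"))
print(pdf_version(sample))
```

Keeping one validator per format behind a common lookup mirrors, in miniature, the appeal of a shared plugin framework: support for a new file type is an addition rather than a rewrite.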

Metadata consistency

In order to provide the best possible user experience, repositories need to put in place mechanisms to ensure quality metadata is attached to objects. This might involve actions like enforcing the appropriate schema for a given object type, managed lookups for things like names, or providing controlled vocabularies (thesauri, classification schemes). ARROW will initially be using the Australian Standard Research Classification (ASRC) [HREF22] for subject headings, while also investigating other ways to improve metadata consistency. For personal name consistency, it should be possible to use institutional directories to do a lookup. Metadata validation and standard metadata authority files to enhance searching using OAI-PMH will be less simple to arrange. Ultimately, it would be good to be able to query metadata schemas through maintenance agencies (for standardised schemas) and manage schema updates in some consistent way.
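
Two of the mechanisms mentioned above, enforcing a schema per object type and restricting subjects to a controlled vocabulary, can be sketched as a single check run before a record is accepted. The required-field sets, the object type names and the vocabulary terms below are placeholders invented for illustration; they are not ARROW's schemas and not actual ASRC entries.

```python
# Hypothetical per-type schemas: the fields each object type must carry.
REQUIRED_FIELDS = {
    "thesis": {"title", "creator", "date", "subject"},
    "eprint": {"title", "creator", "subject"},
}

# Placeholder controlled vocabulary (not real classification terms).
SUBJECT_VOCABULARY = {"Example Subject A", "Example Subject B"}


def check_metadata(object_type, record):
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    missing = REQUIRED_FIELDS[object_type] - set(record)
    for name in sorted(missing):
        problems.append(f"missing field: {name}")
    for subject in record.get("subject", []):
        if subject not in SUBJECT_VOCABULARY:
            problems.append(f"uncontrolled subject: {subject}")
    return problems


record = {
    "title": "A sample thesis",
    "creator": "A. Student",
    "subject": ["Example Subject A", "Folksonomy"],
}
print(check_metadata("thesis", record))
```

Returning a list of problems rather than a simple pass or fail lets a submission workflow show the depositor everything that needs fixing in one pass.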

Search exposure

In an environment where there are more repositories, and an increasing number of different types, it will become more and more important to provide a standards-based way for information gateways, as well as other repositories, to query repository contents directly. It will also be desirable to provide a standard way to harvest from repositories to support a single search gateway; this is preferable to federated search because it provides a better user experience. ARROW will be supporting OAI-PMH and has commissioned (as described above) the development of an SRU/SRW web service on top of Fedora. Google has also agreed to work with ARROW to expose repository content via Google Scholar. There are a number of possible opportunities for other work. One is to extend OAI-PMH to harvest content as well as metadata. The modOAI project is already looking at this. Another is an agreement for repository projects to use SRU/SRW for searching across their different repositories.
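
To show how lightweight a direct query can be, the sketch below builds an SRU `searchRetrieve` request, in which the whole search is carried in the URL as a CQL query. The endpoint `http://repo.example.edu/sru`, the function name and the `dc.title` index are assumptions; which indexes a given server supports is something its own `explain` response would state.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def sru_search_url(base_url, cql_query, maximum_records=10, version="1.1"):
    """Build an SRU searchRetrieve request URL carrying a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": version,
        "query": cql_query,
        "maximumRecords": maximum_records,
    }
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint and index name.
url = sru_search_url("http://repo.example.edu/sru", 'dc.title = "digital library"')
print(url)

# The query survives URL encoding and can be recovered on the server side.
print(parse_qs(urlsplit(url).query)["query"][0])
```

Because the request is an ordinary HTTP GET, a gateway or portlet can query several repositories in real time with nothing more than an HTTP client and an XML parser.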

Collaboration between repositories

Finally, there is the issue of collaboration between different repository initiatives and software development projects. There is a strong need for interoperability between these initiatives (as discussed above). It would also be good to avoid radial re-capitulation (re-inventing the wheel) and to learn from each other’s progress so far. ARROW is working with the Fedora Development Consortium, the National Science Digital Library and the Australian Partnership for Sustainable Repositories (APSR) on these issues. There are a number of possible outcomes. One is a registry of standard content models (definitions of types of content and associated object structures and metadata). Another is to influence the emergence of de facto standards in the absence of existing practice. A third is to achieve agreement on a standard framework for re-usable web services. None of these will be easy, but all would be valuable.

4. Conclusions

The decision to use Fedora and build on web services has stood the test of time. Fedora is providing ARROW with enormous flexibility, but also with many decisions that need to be made. These decisions are being encoded in software and content models, and are being implemented by the ARROW partners. Later this year we hope to be able to offer these to other potential ARROW partners. The plans for shared web service development across repository projects are a bit less concrete, but the ARROW project would be delighted to talk to other projects interested in discussing this. The institutional repository movement is just gathering momentum, and there are no doubt interesting times ahead. The ARROW project looks forward to playing its small part in this exciting field.

Acknowledgement

The ARROW Project is sponsored as part of the Commonwealth Government's Backing Australia's Ability [HREF23].

References

DEST (Australian Commonwealth Department of Education, Science and Training) (2002), Research Information Infrastructure Framework for Australian Higher Education. The Final Report of the Higher Education Information Infrastructure Advisory Committee (Systemic Infrastructure Initiative). [HREF24]

DEST (2003a), Information Infrastructure - Call for Proposals 2003. [HREF25]

DEST (2003b), Information Infrastructure - Outcomes of Selections Process. [HREF26]

Lagoze, et al. (2005). Fedora: An Architecture for Complex Objects and their Relationships. Journal of Digital Libraries Special Issue on Complex Objects. In press. Preprint available at [HREF27]

Open Society Institute (2004), A Guide to Institutional Repository Software version 2.0. [HREF28]

Payette, Sandra & Staples, Thornton, "The Mellon Fedora Project: digital library architecture meets XML and web services", Sixth European Conference on Research and Advanced Technology for Digital Libraries. Lecture notes in computer science, vol. 2459. Springer-Verlag, Berlin Heidelberg New York (2002) 406-421. [HREF29]

Staples, Thornton, Wayland, Ross & Payette, Sandra, "The Fedora Project: an open-source digital object repository management system", in D-lib Magazine, April 2003. [HREF30]

Treloar, A. (2004), "Building an Institutional Research Repository from the Ground Up: The ARROW Experience". In Proceedings of AusWeb04, the Tenth Australian World Wide Web Conference, Southern Cross University Press, Southern Cross University, July. [HREF36]

Van de Sompel, H., Young, J. and Hickey, T. (2003), "Using the OAI-PMH ... Differently", D-Lib Magazine, July/August. [HREF31]

Hypertext References

HREF1
http://arrow.edu.au/
HREF2
http://adt.caul.edu.au/
HREF3
http://www.apsr.edu.au/
HREF4
http://www.melcoe.mq.edu.au/projects/MAMS/index.htm
HREF5
http://www.dest.gov.au/Ministers/Media/McGauran/2003/10/mcg002221003.asp
HREF6
http://arrow.edu.au/docs/files/ARROW%20project.pdf
HREF7
http://www.Fedora.info/
HREF8
http://www.Fedora.info/history.shtml
HREF9
http://arrow.edu.au/docs/files/ARROW-VITAL.pdf
HREF10
http://www.vtls.com/
HREF11
http://www.vtls.com/Products/vital.shtml
HREF12
http://search.arrow.edu.au/
HREF13
http://www.pkp.ubc.ca/ojs/
HREF14
http://www.pkp.ubc.ca/
HREF15
http://www.scc.rutgers.edu/scchome/
HREF16
http://lcweb.loc.gov/z3950/agency/
HREF17
http://www.loc.gov/z3950/agency/zing/srw/sru.html
HREF18
http://www.loc.gov/z3950/agency/zing/
HREF19
http://www.educationau.edu.au/
HREF20
http://shibboleth.internet2.edu/shib-intro.html
HREF21
http://hul.harvard.edu/jhove/
HREF22
http://www.abs.gov.au/Ausstats/abs@.nsf/0/51c2bdb99eba43e8ca256889001e9d9e?OpenDocument
HREF23
http://backingaus.innovation.gov.au/
HREF24
http://www.dest.gov.au/highered/otherpub/heiiac/exec_summary.htm
HREF25
http://www.dest.gov.au/highered/research/proposal.htm#1
HREF26
http://www.dest.gov.au/highered/research/outcomes2003
HREF27
http://www.arxiv.org/abs/cs.DL/0501012
HREF28
http://www.soros.org/openaccess/software/
HREF29
http://www.Fedora.info/documents/ecdl2002final.pdf
HREF30
http://dlib.org/dlib/april03/staples/04staples.htm
HREF31
http://www.dlib.org/dlib/july03/young/07young.html
HREF32
http://xml.coverpages.org/xacml.html
HREF33
http://andrew.treloar.net/
HREF34
http://www.its.monash.edu.au/
HREF35
http://www.lib.monash.edu.au/
HREF36
http://ausweb.scu.edu.au/aw04/papers/refereed/treloar/

Copyright

© Dr Andrew Treloar, 2005. The author assigns to Southern Cross University and other educational and non-profit institutions a non-exclusive licence to use this document for personal use and in courses of instruction provided that the article is used in full and this copyright statement is reproduced. The author also grants a non-exclusive licence to Southern Cross University to publish this document in full on the World Wide Web and on CD-ROM and in printed form with the conference papers and for the document to be published on mirrors on the World Wide Web.