Andrew Treloar [HREF1], {Director Information Management and Strategic Planning | DART Project Architect | ARROW Technical Architect}, Information Technology Services [HREF2], PO Box 3A, Monash University [HREF3], Victoria, 3800. Andrew.Treloar@its.monash.edu.au
The Dataset Acquisition, Accessibility, and Annotation e-Research Technologies (DART) Project is a Commonwealth Department of Education, Science and Training (DEST) funded project to develop and assess new e-research collaboration tools and infrastructure. This paper begins by describing the context in which the DART bid was submitted. It then goes on to describe the overall architecture for DART and its various work packages. The paper then outlines some of the underlying theory on which the DART bid and project are based. Finally, the paper concludes by examining progress to date on the project.
In early 2005, the Australian Government called for proposals
for collaborative projects that brought together consortia to improve
accessibility to Australian research. The call for proposals identified
four areas of interest:
The call for proposals also identified a number of key trends that are changing the way in which research is conducted and its outputs consumed. These included new technologies (such as computer simulations, synchrotrons and sensor networks), the expanding size of the datasets on which research is based, increasing volumes of information generated through research, greater complexity, and the recognition of the need to work across traditional disciplinary, institutional and national borders. To this one might add a growth in research practices that are producing a paradigm change in the types of research that this new large-scale computing/data management environment can support. These emerging research practices are intensely collaborative (often involving trans-national teams), require high-quality network access, and are data and simulation-intensive.
These changes first became evident in high-energy physics, science and engineering (Atkins et al. 2003) but are now also becoming apparent in the social sciences and humanities (Waters, 2003). Some disciplines have good practices around, and support for, lodgement of datasets as part of publication, while other disciplines are just starting to explore this area. The role of datasets (historical, sensor-produced and simulation-derived) is also becoming increasingly important to a wide range of disciplines.
These changes in, and pressures on, research practices, are occurring at the same time as changes in the communication of research results. Communication through scholarly journals and the archiving of those journals has been the mainstay of a range of research communities for the last three centuries. The advent of the World Wide Web has made a whole range of new forms of publishing possible (Treloar, 1999), and the past decade in particular has seen a great deal of experimentation with new journal forms and new publishing models. More recently, the open access movement has been particularly vigorous in proposing solutions to a number of concerns that have become obvious in the current system of scholarly communication: the serials crisis (the increasing subscription costs of scholarly journals), the relative inaccessibility of paper-based archival journals (the need to physically examine the paper based publication), and the permissions crisis (the way in which publishers impose restrictions on the use of material published under their imprints). These solutions can be seen as a response to the potentials latent in the advent of the World Wide Web.
In May 2005 a joint CNI-JISC-SURF invitational conference [HREF4] was
held in Amsterdam with the title “Making the strategic case for
institutional repositories”. This conference emphasised the
potential for repositories to move beyond the kinds of traditional
publications that have been the concern of the open access movement to
support innovative new forms of research and research output exposure.
Some of the possibilities discussed were:
All of these new possibilities also present new challenges in lifecycle management, attribution and provenance of the full set of research outputs, not just the conventional formal publication.
Against this background, in April and May of 2005, the Dataset Acquisition, Accessibility, and Annotation e-Research Technologies (DART) bid came into existence. The DART request for funding built on the work already done in the ARROW project in establishing the basis for institutional research publication repositories, as well as antecedent activity at each of the three DART partners (James Cook University [HREF5], Monash University [HREF6] and the University of Queensland [HREF7]). It did this by extending its concerns into the areas of large datasets and sensors, as well as annotation technologies and collaborative, composite documents. In particular, the DART proposal sought to investigate the most appropriate response to the challenges inherent in:
After a competitive bid process, on 22 August 2005 the Hon Dr
Brendan Nelson MP, Minister for Education, Science and Training,
announced funding for nine projects involving over 30 Australian
universities. The DART project was one of those successfully funded,
receiving its requested $3.23 million. Many of the funded projects
comply with more than one of the areas identified and support many of
the themes and problems that are emerging through the consultations for
NCRIS and e-Research Committees. The funded projects are now
known collectively as the Managed Environments for Research Repository
Infrastructure (MERRI [HREF8])
Projects.
The DART project seeks to support the evolving new paradigm
for e-Research by addressing issues across the entire research
continuum from the creation of the original research problem through to
the pluralisation of the resulting work, its annotation by others, and
its reuse in new research. It does so, within a holistic framework, by
tackling issues related to the various new forms (of data, publication,
etc.) discussed above. It also recognises the need for data curation,
which is defined as follows by the UK Digital Curation Centre:
The specific goal of the DART project is to support and enable researchers, end-users, and appropriate computer systems to manage the creation and collection of data and to gain greater access to data and documents, by gathering, managing and archiving data and documents and managing their access, so that researchers are more easily able to perform their work and do so at a much higher level of insight and productivity than was previously possible, and so that the Australian public has greater visibility of, and access to, publicly funded research.
It is doing this by writing new software, enhancing existing software, performing systems integration, creating demonstrator implementations, and engaging with research groups.
Figure 2 shows the rationale behind the DART project. This
figure draws on the work described in Van de Sompel (2004), which
re-conceptualises the processes that take place in scholarly
communication, and extends this by adding to this model the research
process itself, as well as the process of annotation. Figure 2 shows
the current situation, the situation with the innovations that will be
delivered by the DART project, the benefits for researchers and the
benefits for the general public.

Figure 2: Improvements to Scholarly Processes arising from the DART Project
In the DART proposal, dynamic publications are one of a number of digital objects to be included in the consideration, because they represent a significant emerging form of communication among the collaborating researchers. This aspect was not considered in the original ARROW project as the focus at that time was more on the deposit of, and access to, published articles. In addition, the DART project also deals with the requirements for large-scale data collection and curation, which was completely outside the ARROW project scope. Here the DART project can also build on the work being undertaken in the UK by the eBank UK project [HREF10] which is investigating the issues surrounding the provenance, use and reuse of original data for research and learning purposes as well as DSTC’s PANIC [HREF11] and FUSION projects.
Figure 2 answers the question of why DART is important. Figure
3 shows how the project hopes to produce these benefits. In the
uppermost layer are researchers, readers and computer programs. The
middle layer shows the proposed repositories (including traditional
publications as research outputs, and raw data) and the data flows
between them and the datasets in the lowest layer. The lowest layer
shows the data sources and their associated storage. The figure has
been annotated to indicate the work packages that are involved for each
component. For instance, the process of editing dynamic collaborative
documents is described in work package AA4. Details of the work
packages can be found on the DART website [HREF12].

Figure 3: DART high-level architecture (codes indicate relevant DART work packages)
The DART project thus substantially extends and enhances the
focus of the DEST-funded FRODO [HREF13]
projects on research outputs to include the needs of
dataset creation, acquisition, management, and curation, as well as
providing support for collaborative research practices. This aspect of
research information management is an opportunity afforded by the
advent of the World Wide Web to transform in a fundamental manner the
traditional scholarly publication environment to include
pre-publication collaboration, as well as access to data sets,
software, and/or commentaries, and annotations, thus providing a much
richer research collaborative fabric than was imaginable even 10 years
ago.
The DART project has been structured as a number of inter-related
thematically-grouped sets of work packages.
In this group of work packages, DART is tackling the issues
surrounding high-rate and large-volume data streams, particularly those
generated by instruments and sensors. There are a number of
requirements that are unique to the challenges inherent in dealing with
digital objects generated by and derived from instruments and sensors:
The DART project has chosen to base this group of work
packages on the Common Instrument Middleware Architecture (CIMA) [HREF14].
This architecture emerged from work supported by the National Middleware
Initiative specifically for connecting instruments and sensors to the
Internet. CIMA allows instruments and sensors to be connected to the
Internet, and makes them discoverable and their results publishable
using web services or the Open Grid Services Architecture. In other
words, this architecture allows DART to leverage the work being done by
the National Middleware Initiative in the US and the Open Middleware
Infrastructure Institute in the UK by using their core middleware for
security, access, file transport, etc. Because CIMA is based on
international standards, middleware produced using this architecture
will be re-usable in other projects.
This group of work packages relates to the need to work with
documents, datasets, simulations, software and dynamic knowledge
representations in a secure way with controlled access. This includes
collection from a range of devices, secure transfer across networks,
storage on high-capacity devices, management and preservation in
repositories, and maintaining the integrity of the datasets.
The digital objects that DART stores need to be managed, preserved,
persistently identified, aggregated and disseminated in flexible ways
in order to deliver the improvements outlined in Figure 2. Of the
available pieces of widely used repository software (Open Society
Institute, 2004), Fedora has been found by a range of projects to be
the best match for these requirements. Fedora (Lagoze et al. 2005) is
“an open source, digital object repository system using
public APIs exposed as web services.” (Staples, Wayland and
Payette 2003). Its architecture is very flexible, and provides
significant advantages as a platform on which to build other
applications. In particular, in a DART context it provides the ability
to store and manage complex objects and the relationships within and
between complex objects. The ARROW team at Monash University have been
using Fedora for over 18 months now and are one of a small number of
projects which are collaborating as part of the Fedora Developer
Consortium. DART staff members at the University of Queensland have
also been collaborating with the Fedora developers at Cornell (Doerr
et al. 2003). DART will facilitate distributed data management using
Fedora (SI1).
The pre-eminent technology for working with large datasets is the Storage Resource Broker (SRB) [HREF15], developed at the San Diego Supercomputer Center (SDSC) (Moore 2004a, 2004b). SRB can be described as “client-server middleware that provides a uniform interface for connecting to heterogeneous data resources over a network and accessing replicated data sets”. SRB is widely deployed across the world in demanding, large-capacity and high-bit-rate environments. In conjunction with the Metadata Catalog (MCAT), it provides a way to access data sets and resources based on their attributes and/or logical names rather than their names or physical locations. The SDSC SRB system is a comprehensive distributed data management solution, with features to support the management, collaborative (and controlled) sharing, publication, and preservation of distributed data collections. The SRB also serves as middleware, both through a rich set of APIs available to higher-level applications and by providing a management layer on top of a wide variety of storage systems.
In order to extend Fedora to work with large datasets, the DART project will need to integrate SRB with Fedora, both as a replacement storage layer for Fedora itself (SI1), and as a location for content outside a Fedora repository but managed by it (SI2). In order to build more advanced knowledge mining services in the future, the MCAT also needs to be semantically augmented using the Resource Description Framework. This will support richer descriptive and preservation metadata for dataset objects to enable more effective discovery (SI3).
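The idea of addressing data by logical name and attributes, rather than by physical location, can be illustrated with a small sketch. This is not the SRB or MCAT API: the class `LogicalCatalog`, its methods `register`, `resolve` and `find`, the logical names and the storage URLs are all hypothetical, chosen only to show the principle.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One logical data set: descriptive attributes plus its physical replicas."""
    attributes: dict = field(default_factory=dict)
    replicas: list = field(default_factory=list)


class LogicalCatalog:
    """Toy stand-in for a metadata catalogue (hypothetical, not the MCAT API)."""

    def __init__(self):
        self._entries = {}

    def register(self, logical_name, physical_url, **attributes):
        # Registering the same logical name again records another replica.
        entry = self._entries.setdefault(logical_name, CatalogEntry())
        entry.attributes.update(attributes)
        entry.replicas.append(physical_url)

    def resolve(self, logical_name):
        # Every known physical location for a logical name.
        return list(self._entries[logical_name].replicas)

    def find(self, **query):
        # Logical names whose attributes match all of the query terms.
        return sorted(
            name for name, entry in self._entries.items()
            if all(entry.attributes.get(k) == v for k, v in query.items())
        )


catalog = LogicalCatalog()
# Invented example data: two replicas of one data set, and a second data set.
catalog.register("/dart/climate/run-001", "tape://site-a/vol7/0001.nc",
                 discipline="climate", source="sensor-array")
catalog.register("/dart/climate/run-001", "disk://site-b/cache/0001.nc")
catalog.register("/dart/xray/sample-42", "disk://site-c/frames/42.img",
                 discipline="crystallography")

print(catalog.find(discipline="climate"))
print(catalog.resolve("/dart/climate/run-001"))
```

A client in this sketch never needs to know which site holds the data; it asks for a logical name or a set of attributes, and the catalogue returns the replicas.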
In addition to the work with Fedora and SRB, DART will work to:
An enormous amount of research data is currently stored within
personal or private archives, either on researcher desktops or
departmental/institutional servers. In these locations it is largely
inaccessible to, and undiscoverable by, other researchers or the
public. This group of work packages will investigate methods,
incentives and technologies to motivate researchers to submit their
research data and results into institutional repositories. This will
include the development of:
Note that assessment technologies (see section 3.5) that
support qualitative and quantitative assessment of research deposited
within institutional repositories will also provide additional
incentives for researchers to deposit their results.
This group of work packages relates to tools and services that enable peers to attach reviews, opinions, comments or assessments to research data, reports, publications etc. These annotation and assessment services can serve either as an alternative, or addition, to existing peer-review mechanisms. This can be seen as a completely new certification function made possible through this new distributed networked environment. Two annotation approaches will be trialled. One builds on annotation research originally carried out at DSTC, looking at annotations that are managed and stored external to the digital objects. The other approach builds on work underway at JCU to create collaborative documents including annotations.
The first work package (AA1) will concentrate on extending and refining existing annotation tools to enable annotation of digital objects held within research repositories such as SRB, DSpace or Fedora. The second work package (AA2) will concentrate on tools to support collaborative annotations, thus enabling research communities to document shared practices and assessments. This will involve the refinement and deployment of the Vannotea software developed within DSTC’s FilmEd project. Vannotea is designed to enable real-time annotation of complex digital objects (images, video, 3D objects) by geographically distributed groups within a videoconferencing environment (Schroeter et al. 2003). A third work package (AA3) will focus on the development of secure authenticated access to annotation servers through the development of a Shibboleth-based interface to the W3C’s open source Annotea server (Barstow et al. 2001). This will allow different groups who might want to annotate resources for different purposes (such as referees, grant committees and researchers) to be given different levels of access. A fourth work package (AA4) will involve piloting the use of hosted wikis (see below) linked to research data repositories to facilitate interaction between researchers and research groups.
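The combination of externally stored annotations and group-based access can be sketched in a few lines. This is an illustrative model only, not the Annotea protocol or a Shibboleth integration: `Annotation`, `AnnotationStore`, the `audience` field and the group names are assumptions, and in a real deployment group membership would come from the authentication layer rather than being passed in directly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    target: str    # identifier of the annotated object, which is never modified
    author: str
    body: str
    audience: str  # hypothetical group that is allowed to read this annotation


class AnnotationStore:
    """Keeps annotations outside the objects they describe (illustrative only)."""

    def __init__(self):
        self._annotations = []

    def annotate(self, target, author, body, audience):
        note = Annotation(target, author, body, audience)
        self._annotations.append(note)
        return note

    def visible_to(self, target, groups):
        # Only annotations addressed to one of the reader's groups are returned.
        return [
            a for a in self._annotations
            if a.target == target and a.audience in groups
        ]


store = AnnotationStore()
# Invented identifiers and comments, for illustration.
store.annotate("repo:dataset/17", "referee-1", "Calibration unclear.", "referees")
store.annotate("repo:dataset/17", "alice", "Re-ran with new baseline.", "researchers")
store.annotate("repo:dataset/99", "bob", "Unrelated note.", "researchers")

for note in store.visible_to("repo:dataset/17", {"researchers"}):
    print(note.author, "-", note.body)
```

Because the annotations live in their own store and refer to objects only by identifier, the same dataset can carry referee comments and working notes without either group seeing the other's annotations.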
A wiki is a relatively new web collaboration tool that can be genuinely innovative and useful for research collaboration and information management. A key concept of the wiki is that it makes it very easy for any user to add content, files and other digital objects to a wiki environment without any need to understand web authoring technologies. Essentially, wikis start as empty repositories and are built dynamically by the users of the community. Because rights management (groups etc.) can be easily added to wikis, and because they are web based, they also form the basis of an effective information dissemination system.
A case study may be instructive. The predictive mineral discovery Cooperative Research Centre (pmd*CRC) [HREF18] is one of the most geographically distributed and diverse CRCs within the program. It also has a very direct industry/outcome focus. For the last 18 months the pmd*CRC has very effectively augmented its conventional face-to-face (and more recently AccessGrid facilitated) meetings with what is now a very sophisticated wiki framework. Users from any part of the CRC can read about the work/plans/outputs of any part of their program within the CRC. They can also look at the raw data, images, calculations and analyses that have been produced and uploaded. Any user (with appropriate access permission) can add content, comment or provide feedback, and generally interact and be informed of progress within the entire CRC.
JCU has a nascent program to model a series of discipline-specific wikis as group/department level tools. The key feature is that a wiki that supports a single research group can, with almost no effort, be cloned to support an entire school. Ultimately this process can continue without limit, so that almost any type or scale of collaboration can be supported. DART will build on this and integrate wiki technology with the Storage Resource Broker (AA4).
This combination will allow a wiki to act as an interface to, and commentary on, distributed file systems, data grid management systems, digital libraries and semantic webs.
At present, the final publication is seen as the only research
record worthy of capture and curation. Both the annotation and wiki
technologies described above will allow for the capture of a record of
some of the collaborative activity around the datasets and other
research outputs.
This group of work packages relates to tools and services that enable researchers and readers to search, browse and discover resources within the repository and access them, either under controlled conditions or in an unrestricted way. It will involve the development of portals that provide seamless search interfaces across distributed archives implemented in SRB and Fedora (DA2). Ontologies and the semantically-augmented MCAT RDF data store will be developed to provide semantic interoperability across heterogeneous metadata schemas. Shibboleth and PKI will provide the authentication and access control to improve repository deposit rates, sharing and reuse by allowing end-user control over who can access what (DA1).
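A portal that searches across several archives can be thought of as fanning a query out to each backend and merging the results. The sketch below assumes each archive exposes a simple `search` method; `InMemoryArchive`, `federated_search`, the record layout and the sample records are hypothetical and do not correspond to the SRB or Fedora interfaces.

```python
class InMemoryArchive:
    """Hypothetical stand-in for one repository backend."""

    def __init__(self, name, records):
        self.name = name
        self._records = records  # list of {"id": ..., "title": ...}

    def search(self, term):
        term = term.lower()
        return [r for r in self._records if term in r["title"].lower()]


def federated_search(archives, term):
    """Query every archive and merge the hits, noting where each was found."""
    merged = {}
    for archive in archives:
        for record in archive.search(term):
            # The same identifier found in two archives becomes one hit.
            hit = merged.setdefault(record["id"], {**record, "sources": []})
            hit["sources"].append(archive.name)
    return sorted(merged.values(), key=lambda h: h["id"])


# Invented sample records, for illustration.
datasets = InMemoryArchive("datasets", [
    {"id": "ds:1", "title": "Reef temperature sensor readings"},
    {"id": "ds:2", "title": "Protein crystal diffraction frames"},
])
publications = InMemoryArchive("publications", [
    {"id": "pub:9", "title": "Trends in reef temperature"},
    {"id": "ds:1", "title": "Reef temperature sensor readings"},
])

hits = federated_search([datasets, publications], "reef")
for hit in hits:
    print(hit["id"], hit["sources"])
```

Merging on a shared identifier is one simple design choice; it presumes that the archives identify objects persistently and consistently, which is why persistent identification matters for this kind of portal.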
In addition, one work package will develop and provide access to a centralized repository/registry of metadata schemas and ontologies (DA3). Metadata schema registries enable the publication, navigation and sharing of information about metadata. This registry will act as the primary source for authoritative information about recommended metadata schemas. It will enable the sharing and re-use of existing metadata schemas and application profiles – thus enhancing interoperability and reducing costs and effort. This work package will build on the open source software tools being developed within the JISC IE Metadata Schema Registry Project (IEMSR) [HREF19] by UKOLN and ILRT. DART will also work with other related projects towards ensuring that metadata in other repositories is managed and exposed in standards-compliant ways. This will enable later federation through work outside the scope of this bid.
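The role of a schema registry in supporting re-use can be illustrated with a minimal sketch in which an application profile borrows terms from more than one registered schema. This is not the IEMSR software: `SchemaRegistry`, `publish`, `expand` and `check_profile` are hypothetical names, and the `dart` prefix with its `example.org` namespace and elements is invented for illustration. The Dublin Core namespace shown is the standard one for the element set.

```python
class SchemaRegistry:
    """Toy registry mapping schema prefixes to namespaces and known elements."""

    def __init__(self):
        self._schemas = {}

    def publish(self, prefix, namespace, elements):
        self._schemas[prefix] = {"namespace": namespace, "elements": set(elements)}

    def expand(self, qualified_name):
        # Turn "prefix:element" into a full identifier, if the registry knows it.
        prefix, _, element = qualified_name.partition(":")
        schema = self._schemas.get(prefix)
        if schema is None or element not in schema["elements"]:
            raise KeyError(qualified_name)
        return schema["namespace"] + element

    def check_profile(self, profile):
        # Return the terms of an application profile that are not registered.
        unknown = []
        for term in profile:
            try:
                self.expand(term)
            except KeyError:
                unknown.append(term)
        return unknown


registry = SchemaRegistry()
registry.publish("dc", "http://purl.org/dc/elements/1.1/",
                 ["title", "creator", "date"])
# Hypothetical project-specific schema.
registry.publish("dart", "http://example.org/dart/terms/",
                 ["instrument", "sampleId"])

profile = ["dc:title", "dc:creator", "dart:instrument", "dart:wavelength"]
print(registry.expand("dc:title"))
print(registry.check_profile(profile))
```

Checking a profile against the registry before it is deployed is one way in which an authoritative source of schema information could reduce duplicated effort across repositories.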
The first body of research arises out of nearly 15 years of examination of the current system of scholarly communication. Much of this activity has been focussed on the scholarly article as the main form of output. A recent article by Herbert van de Sompel from Los Alamos and others (Van de Sompel, 2004) argues that much current experimentation within the publishing world is both limited and backwards-looking. They argue for two radical changes.
The first is to deliberately engineer a new scholarly communication system that is intertwined with the process of generating new knowledge. Roosendaal and Geurts (1997) have distinguished the following basic functions required from any system of scholarly communication:
The researchers in Van de Sompel (2004) argue that we should decompose the current system of scholarly communication into a network of scholarly value chains. A repository (serving the registration and possibly certification functions) would be one hub on such a chain. Similar chains are also developing for datasets in the Grid domains, through network-based services for data sharing and information storage.
The second change is to redefine what constitutes a unit of communication. Instead of just focussing on journal publications, they suggest that:
They also identify the need to develop “information models, process models, and related protocols to enable interoperability among existing repositories, information stores and services”.
The second body of research draws on systems theory and sees scholarly communication as an ecology (Kaufer and Carley, 1993). In this ecology, the communicative transaction is a cyclic process of interaction, communication and adaptation between actors and entities. As in any ecology, each member affects the others, and complex behaviours emerge in unpredictable ways. Kaufer and Carley applied this model to the published research output. What happens if one considers the ecology around the entire process of research? Instead of a fairly linear model leading from idea -> research -> publication output -> reader, one ends up with an ecology consisting of Actors (researchers and readers, who may well be the same person) and Entities (Ideas/Problems, Experiments/Research Activities, Results, Outputs). These actors and entities are all co-evolving, co-adapting and influencing one another. As in real ecologies, these influences occur in a very non-linear way, and changes emerge in ways that are difficult to predict in advance.
The third body of work has been developed by a group of researchers in the School of Information Management and Systems at Monash University. They have developed the notion of an information continuum, based on a multiple-axis analysis of the various characteristics of information in organisations (Schauder et al. 2004). This research team is building on the information continuum work as part of a project called “Memories, Communities and Technologies”. This project is already looking at how knowledge is created and communicated in research communities. The memories are what the research community knows, encoded in tangible publications and intangible practices. The technologies themselves support the storage and transmission of these memories, as well as the creation, use and re-use of knowledge. As the communities develop and transform themselves along the memory axis of the information continuum model, technology can be applied to enable this transformation.

Figure 1: Inter-related bodies of theory
The relationship between these three bodies of theory (Re-engineering scholarly communication, scholarship as ecology, and the information continuum) is shown in Figure 1. The DART project is integrating and drawing on this rich theoretical context.
The project formally received its funding and commenced
operations in early December. Technically, it is a condition of the SII
funding that MERRI projects complete by the end of 2006. While it is
possible that DEST will permit the rollover of unspent funds into 2007,
the project has decided to work towards meeting the original deadline.
As of the time of writing, DART is fully staffed and consists of a
full-time Project Director (Dr Jeff McDonell), a part-time Project
Architect (Dr Andrew Treloar), 40 project staff (some part-time) and 7
Chief Investigators:
DART has decided to model its project governance on the very successful
ARROW [HREF20]
project. In addition to the main DART office (based at Monash
University), the project has two committees.
The Board of Management is chaired by Professor Ah Chung Tsoi, the
inaugural Director of the e-Research Centre at Monash University, and
contains two senior representatives from each of the three DART
consortium members. The Board of Management meets bi-monthly to monitor
project progress and consider longer-term strategic issues.
The Technical Committee is chaired by the Project Architect. It meets in person bi-monthly (to coincide with the Board of Management schedule) and virtually by tele-conference every other month. In addition, there is a weekly tele-conference, held over Skype every Friday. Because of the large number of work packages (27), it is critical to ensure effective co-ordination.
One of the deficiencies of the DART bid as submitted was that it did not discuss exactly how the end-to-end benefits of the DART approach to e-research infrastructure would be tested and demonstrated. Accordingly, late last year the Board of Management requested the Technical Committee to create a number of demonstrator applications. These demonstrators will serve as testbeds for DART technologies and will also serve to raise some of the system integration and security challenges early in the project. The model being used is to create sample end-to-end sequences of activities in a number of different disciplines and gradually fill in the steps in these sequences. Early in 2006 many of the steps will be statements of intent rather than actual code. Over the course of the project, these steps will be progressively elaborated upon and integrated together.
Initially, two demonstrators have been chosen: climate research and X-ray crystallography. Both have researchers at each of the three DART consortium partners, who in many cases are already collaborating. They can also apply DART technologies all the way from the collection of data from sensors/instruments through to the annotation of resulting publications. A third demonstrator, to be developed later, will be in the area of the social sciences and humanities.
The DART project is a very ambitious, time-critical and high-risk project, working with leading-edge technologies in the rapidly changing domain of e-Research. It is unlikely to produce production-grade services by the end of 2006. It should, however, be able to provide proofs of concept for a number of innovative services, contribute code back to existing open-source infrastructure technologies (Fedora, SRB, CIMA) and demonstrate the benefits of the integrated DART approach to e-Research infrastructure. It is to be hoped that the National Collaborative Research Infrastructure Strategy (NCRIS) [HREF21] will be able to build on top of DART’s pioneering efforts.
The DART project has been funded by the Australian Commonwealth Department of Education, Science and Training through to the end of 2006. The funding has been provided through the Systemic Infrastructure Initiative as part of the Commonwealth Government's Backing Australia's Ability - An Innovation Action Plan for the Future.
The author also wishes to thank the two anonymous reviewers
for their comments, which have improved the final paper.
Abramson, D. 2005. “Software Development for the Computational Grid”, Remote Access and Automation Workshop, Sydney, 2005. Abstract online at [HREF22].
Atkins, D. et al. 2003. National Science Foundation Blue-Ribbon Advisory Panel on Cyberinfrastructure, Revolutionizing Science and Engineering through Cyber-infrastructure. Available online at [HREF23].
Barstow, A., Kahan, J., Koivunen, M-R. and Swick, R. 2001. “Annotea: A Generic Annotation Environment using RDF/XML”, WWW10 Developers Day, Hong Kong, May.
Clapham, N. T., Green, D. G. and Kirley, M. (2001). Emergent information systems - the role of adaptive agents. Australian Journal of Intelligent Information Processing Systems 7 (3/4), 96-101.
The Joint Information Systems Committee, 2004. The Data Deluge: Preparing for the explosion in data, Available online at [HREF24].
Kaufer, D. S. and Carley, K. M. 1993. Communication at a Distance: the influence of print on sociocultural organization and change. Lawrence Erlbaum Associates, Hillsdale, New Jersey.
Lagoze, C., Payette, S., Shin, E. and Wilper, C. 2005. “Fedora: An Architecture for Complex Objects and their Relationships”. International Journal of Digital Libraries: Special Issue on Complex Objects, Springer 2005. Available online at [HREF25].
Lyon, L. “eBank UK: Building the links between research data, scholarly communication and learning”, Ariadne, Issue 36. Available online at [HREF26].
Moore, R. 2004a “Integrating Data and Information Management”, International Supercomputer Conference, June. Available online at [HREF27].
Moore, R. 2004b. “Evolution of Data Grid Concepts”, Global Grid Forum Data Area Workshop, January. Available online at [HREF28].
Open Society Institute, 2004. A Guide to Institutional Repository Software v 3.0. Available online at [HREF29].
Roosendaal, H., and Geurts, P. 1997. Forces and functions in scientific communication: an analysis of their interplay. Cooperative Research Information Systems in Physics, August 31—September 4 1997, Oldenburg, Germany. Available online at [HREF30].
Schauder, D., Stillman, L., & Johanson, G. 2004. Sustaining and transforming a community network. The Information Continuum Model and the Case of VICNET. Paper presented at CIRN 2004: Sustainability and Community Technology, Monash University, Prato, Tuscany, Italy. Available online at [HREF31].
Schroeter, R., Hunter, J., Kosovic, D. "Vannotea - A Collaborative Video Indexing, Annotation and Discussion System for Broadband Networks", Knowledge Markup and Semantic Annotation Workshop, K-CAP 2003, Sanibel, Florida Oct 2003. Available online at [HREF32].
Staples, T., Wayland, R. and Payette, S. 2003. “The Fedora Project: an open-source digital object repository management system”, D-Lib Magazine, April. Available online at [HREF33].
Treloar, A, 1999. Hypermedia Scholarly Publishing: the Transformation of the Scholarly Journal. PhD Thesis, Monash University 1999. Available online at [HREF34].
Treloar, A. (2005). "ARROW Targets: Institutional Repositories, Open-Source, and Web Services". In Proceedings of AusWeb05, the Eleventh Australian World Wide Web Conference, Southern Cross University Press, Southern Cross University, July. Available online at [HREF35].
Treloar, A. (2004). "Building an Institutional Research Repository from the Ground Up: The ARROW Experience". In Proceedings of AusWeb04, the Tenth Australian World Wide Web Conference, Southern Cross University Press, Southern Cross University, July. Available online at [HREF36].
Van de Sompel, H. et al. 2004. Rethinking Scholarly Communication: Building the System that Scholars Deserve. D-Lib Magazine, September. doi:10.1045/september2004-vandesompel. Available online at [HREF37].
Waters, D. 2003. Cyberinfrastructure and the Humanities. Fall Task Force Meeting of the Coalition for Networked Information. Available online at [HREF38].