The Australian Plant Pest Database: A national resource for an expert community

Ian Naumann, Emma Lumb. Australian Government Department of Agriculture, Fisheries and Forestry, Canberra

Kerry Taylor, Robert Power, David Ratcliffe, Michael Kearney. CSIRO ICT Centre, Canberra.
Email: Kerry.Taylor@csiro.au

Abstract

The Australian Plant Pest Database (APPD) is a Web-based virtual database of vital importance to the Australian agricultural industry. The APPD contains specimen-based information on over a million plant pests and diseases that occur in Australia. The APPD owes its existence to a major collaboration between industry, State and Federal Governments, research organisations and universities. The APPD program is supported and managed by Plant Health Australia and the Australian Government Department of Agriculture, Fisheries and Forestry (DAFF).

This paper reports on the approach to the successful development of the APPD for the period prior to its launch in April 2002 and up to the present day.  The development of the collaboration was reflected in the development of the software design, with an emphasis on querying over web-enabled original source databases: the data remains under the local management of the respective custodian.

Introduction

Under the "SPS Agreement", an international trading rule of the World Trade Organisation (WTO), it is recognized that member countries are able to provide the level of health protection that they deem appropriate to reduce unacceptable quarantine or biosecurity risks to their country from imported plant and animal based products [HREF1]. However, the measures they apply to protect human, animal and plant life must be based on scientifically justifiable decisions. Access of Australian agricultural produce to international markets, and corresponding bids from other countries for access to Australian markets, depend critically on this scientific proof.

The process used to assess the environmental, social and economic risk of the introduction of new pests and diseases through new imports is known as Import Risk Analysis (IRA), and forms a fundamental component of any new trade negotiations between two WTO member countries. The analysis is based on scientific evidence that demonstrates freedom from certain pests and diseases, contributing to a regional and national view of Australia’s plant and animal health status. Each year, Biosecurity Australia, within the Australian Government Department of Agriculture, Fisheries and Forestry, receives more than a dozen requests for access to Australia’s agricultural and livestock markets, and in each case an IRA must be performed in conjunction with the counterpart country’s list of pests known to occur on the given commodity.

An independent review of Australian quarantine in 1996 recommended that databases and other information systems containing data on native and/or exotic pests and diseases be maintained, and that this information system present a “whole of Australia” view [1].  The Review recognized that an up-to-date, national view of Australia’s plant, animal and human health status is essential to make informed quarantine, human, animal and plant health decisions, risk analysis and policy development consistent with SPS Agreement obligations. Consultant input into the review advised that "currently there is no national coordination of information generated by the various States on…plant pests and diseases", and the review went on to recommend that “the decisions of Australia’s plant industries and plant health and quarantine staff would be easier and more targeted, if, for example, a national database of all major pests and diseases present on each crop in Australia were available electronically".

For those Import Risk Analyses based on plant or plant derived products, pest and disease reference collections are the only sources of verifiable data on plant pest and disease distribution in a given region or country. Only permanently preserved voucher specimens can be relied upon in published lists of disease or pests, because only these specimens can be re-examined to prove their veracity. Published reports which are not supported by voucher specimens in curated collections are misleading in that the quality of their identification is not verifiable and furthermore, once published cannot easily be disproved.  Incorrect reports of pest status in a country can be damaging to a country’s ability to participate in international trade.

The Australian Government’s response noted that “…risk analysis is the foundation stone on which all quarantine policy and action must be built”, and in 2001, a 3 year Commonwealth Grant was given towards the development of the “Australian Plant Pest Database” (APPD), a national, integrated system that would allow an Australia-wide view of plant health status. The APPD was recognized as being able to assist Australia in the development of efficient, credible State and national quarantine systems, assist in the protection of its industries and the environment from the threats posed by exotic pests and diseases, facilitate industry to gain and retain access to interstate and overseas markets, provide improved diagnostic, advisory and research capabilities Australia-wide, and enable the linking of detailed information on the occurrence of diseases and pests to other plant health databases maintained by government agencies and industry stakeholders.

The APPD makes use of existing State, university and industry owned collections. Many of these collections have been developed to support State-level responsibilities for domestic agricultural protection and trade. The Commonwealth’s need for this information is derived from its responsibilities for quarantine and international trade [2].

An early decision was taken to develop the APPD as a federated database for the following reasons:

A federated database architecture has the most flexibility for satisfying the heterogeneous needs of the participants, offering a great many configuration alternatives for each participant, according to their capacity and willingness to contribute as well as their agency policies and technical environment. The development of the APPD occurred concurrently with the early stages of the development of IT outsourcing plans in many government agencies, and with agency organisational restructures. While participants were able to commit to the principles of data sharing through the APPD, very few were able to commit to any particular local technical infrastructure, so a design with maximal support for a variety of configurations was essential.

The Software Architecture

Overview

The APPD system architecture allows the linkage of heterogeneous data sources with minimum impact on existing systems by using an integration framework developed by CSIRO Mathematical and Information Sciences [3,4]. This framework provides a common view of the participating systems allowing a targeted application to be developed, the information broker. The broker provides an aggregate interface to the collection of data sources as a single logical or virtual view. To the user communicating with the broker, the information is presented as a centralized data source (although, according to the requirements of the application, information may be associated with the supplying source).

Figure 1: APPD Systems Architecture.

The APPD system architecture is depicted in Figure 1. The client interface is a web browser displaying an HTML page allowing a user to define a query, submit it and display the results. The middle tier broker offers a user interface for Web clients, and distributes queries to the gateways. The broker receives a request from the client; implements security mechanisms for access control; queries the gateways; receives results; integrates them; and sends a response to the client. In the APPD, the data sources are accessed only by the broker via the gateway.

A gateway translates incoming requests into a format specific to the local database system; performs the native query; translates the results into XML and transfers these back to the broker. Gateways may implement local security mechanisms such as filtering out records that have not been formally verified.
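The gateway's filtering and translation steps can be illustrated with a minimal Java sketch. This is not the APPD gateway code: the class name `GatewayXmlSketch`, the `verified` field and the XML element names are assumptions made purely for illustration, and the native query is replaced by rows already held in memory.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical gateway step: filter native rows, then serialise them as XML. */
class GatewayXmlSketch {

    /** Escape the five XML special characters in a text value. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }

    /** Local security filter: keep only rows whose (assumed) "verified" field is "true". */
    static List<Map<String, String>> verifiedOnly(List<Map<String, String>> rows) {
        List<Map<String, String>> kept = new ArrayList<>();
        for (Map<String, String> row : rows) {
            if ("true".equals(row.get("verified"))) {
                kept.add(row);
            }
        }
        return kept;
    }

    /** Render rows as a flat XML document; field names are assumed to be valid element names. */
    static String toXml(List<Map<String, String>> rows) {
        StringBuilder xml = new StringBuilder("<records>");
        for (Map<String, String> row : rows) {
            xml.append("<record>");
            for (Map.Entry<String, String> field : row.entrySet()) {
                xml.append('<').append(field.getKey()).append('>')
                   .append(escape(field.getValue()))
                   .append("</").append(field.getKey()).append('>');
            }
            xml.append("</record>");
        }
        return xml.append("</records>").toString();
    }
}
```

In this sketch a gateway would call `toXml(verifiedOnly(rows))` on the rows returned by its native query, so unverified records never leave the custodian's site.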

The arrows in Figure 1 show an example of the flow of communication between the various components. Note that not all data sources need to be involved when satisfying a client request for information.

Gateways present the broker with a uniform interface to the different data sources. Gateways must:

These restrictions allow gateways to be added or removed from the system easily, simply by changing network information in the broker’s configuration.

In providing a uniform interface, the gateways must hide differences in the data sources: different operating systems and hardware platforms, different data repositories (typically a database management system), which are accessed using different query languages.

The system is query only. There is no facility for updating data source information through the gateway. Normal data source update mechanisms can be used independently of the APPD, so long as the data source allows queries to be performed while updates are in progress. Any change to the structure of data accessed by the gateway requires corresponding changes to the gateway. The modular gateway design allows this to be easily achieved.

Functional distribution of integration tasks

In the taxonomy of Sheth and Larson [5], APPD is a loosely coupled heterogeneous federated database. This means that component database managers may control the access to local schemas offered to the federation through the gateway installed at the component database site. Gateways are customised to a particular database management system (DBMS), taking account of its query language, and exporting a common data model.  Mappings between the local and export schemas are implemented within the gateway through configuration information, not embedded in the gateway code.
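As a hedged illustration of keeping the local-to-export mapping in configuration rather than in code, the following Java sketch reads a properties-style mapping and builds a native SQL query from it. The field names (`pestName`, `collectionDate`), the column and table names, and the class `SchemaMappingSketch` are all hypothetical; the actual APPD configuration format is not described here.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Hypothetical mapping from export-schema fields to local column names. */
class SchemaMappingSketch {

    /** Parse "exportField=LOCAL_COLUMN" lines, as a gateway might read from a config file. */
    static Properties load(String text) {
        Properties mapping = new Properties();
        try {
            mapping.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return mapping;
    }

    /** Build a SELECT that renames local columns to the common export names. */
    static String selectFor(Properties mapping, String table, List<String> exportFields) {
        List<String> columns = new ArrayList<>();
        for (String field : exportFields) {
            String local = mapping.getProperty(field);
            if (local == null) {
                throw new IllegalArgumentException("No mapping for export field " + field);
            }
            columns.add(local + " AS " + field);
        }
        return "SELECT " + String.join(", ", columns) + " FROM " + table;
    }
}
```

With a mapping of `pestName=TAXON_NAME` and `collectionDate=COLL_DT`, the sketch produces `SELECT TAXON_NAME AS pestName, COLL_DT AS collectionDate FROM SPECIMENS`; supporting a differently structured database then means editing the mapping text, not the Java code.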

In practice, the broker component only supports the use of a fixed schema in the federation, so the architecture could be described as tightly coupled. Component database managers could vary the schemas offered to the federation. However, re-programming of the broker would be required to make use of the varied schemas.

In the component model of [5], command and data transforming processors in each DBMS-specific gateway offer data model transparency. Filtering processors in each gateway include a syntactic constraint checker and an access controller. The role of accessing processor is delegated to the native DBMS.  A constructing processor embedded in the broker supports location and distribution transparency by performing schema integration and query decomposition, although these are simple tasks in this environment as the component export schemas are assumed to be uniform and the mapping from a query to a location relies on a straightforward relationship between biological taxa and component database coverage.
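The mapping from a query to a location can be sketched as a simple lookup from taxon group to the gateways that cover it. The sketch below assumes the broker holds coverage metadata as a map; the gateway names, the taxon group labels and the class `TaxonRoutingSketch` are invented for the example.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical query decomposition: choose gateways by taxon coverage. */
class TaxonRoutingSketch {

    /**
     * coverage maps a gateway name to the taxon groups its database holds.
     * Returns the gateways that should receive a query about taxonGroup,
     * in the iteration order of the coverage map.
     */
    static Set<String> gatewaysFor(Map<String, List<String>> coverage, String taxonGroup) {
        Set<String> selected = new LinkedHashSet<>();
        for (Map.Entry<String, List<String>> entry : coverage.entrySet()) {
            if (entry.getValue().contains(taxonGroup)) {
                selected.add(entry.getKey());
            }
        }
        return selected;
    }
}
```

A query about fungi would then be sent only to gateways whose coverage lists fungi, which is why not every data source needs to take part in every request.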

Autonomy was considered essential to the feasibility and long term viability of the APPD, and this system leaves a great deal of it with the component databases: full design autonomy, execution autonomy, communication autonomy and association autonomy.

Pre-existing component databases comprise various DBMSs (MS Access, MS Excel, Oracle, Texpress and specialist biological database systems such as BioLink and KE Emu) and it was infeasible to change these to a standardised model in all cases. In some cases, responsibility for selection and acquisition of database software was outside the control of participants. In some cases, the cost of change was considered too high. Although purchase of a specialized host DBMS to support the local contribution to the APPD - and then specialist transform and load procedures to keep it up to date - remains an option, it is not viable in many cases, usually due to the cost and the lack of IT skills to support the dedicated environment.

In some cases the working, transactional database is used directly - this minimizes maintenance tasks for database custodians and ensures their contribution to the APPD is always current. In other cases a mirror of the working database is managed by the custodian on the custodian’s site: this is a suitable approach for heavy-workload databases or to meet some agencies’ security policies. In a few cases the custodians’ access to networking facilities was so constrained, or plans for long-term data management at their own agency were so fluid, that the approach taken was to mirror the database at an alternative host site, with an ad-hoc update mechanism.

Nevertheless, by distributing the responsibility for dynamically serving APPD requests to the custodians, the responsibility for data availability and quality is also appropriately distributed, carrying benefits for the system as a whole (improved quality) and for the participants (retaining the role of expert authority on the data; justifying the scientific work of the collection; autonomy to deny access to any party at any time).

The data integration is facilitated by the level of uniformity of data modelling already existing amongst collection databases. This is because of the long established and relatively well defined science of taxonomy (nomenclature) and systematics (associated relationships) which underpin the data. Some data heterogeneity exists between individual databases (field definitions), pest organism classification (e.g. insects versus microfungi), and host plant/pest relationships.  Differences were addressed by asking data custodians to agree on data standards and by providing flexibility in the search capacity to account for various pest organism types (viruses, viroids, invertebrates, pathogenic fungi and bacteria).

Implementation technology

The client interface is shown in Figure 2. The interface is HTML with embedded JavaScript providing validation of user input.

Figure 2: APPD main query page, showing collection selection, taxa- and location-based query parameters and nomenclature links to the Australian Faunal Directory.

The broker and gateway software is implemented in Java and deployed as servlets in Apache and IIS web servers. The broker may be configured to list and describe the gateways available. More significantly, the gateways themselves are configurable software modules using configuration files and deployment descriptors. These files encode the logical mapping from the broker’s view of the data source to the actual data structures. In some cases this mapping cannot be expressed through configuration files alone; the specific gateway’s Java code can then be extended to provide the necessary data mappings. A depiction of these components is shown in Figure 3.

Figure 3: APPD Implementation.

The client query is sent as a CGI request and can include query constraints based on a combination of one or more of: pest name; date and location of specimen collection; and details of the host on which the specimen was collected. The broker examines the request and uses data source metadata to determine which data sources need to be queried. The query is then forwarded to these data sources in parallel. The results returned are collated and returned to the client as an HTML web page.
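The parallel forwarding and collation step can be sketched with the standard Java concurrency utilities. This is an assumption-laden illustration, not the broker's actual implementation: each gateway is modelled as a hypothetical in-memory function from a query string to a list of result records, where the real system would issue an HTTP request to a gateway servlet.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical broker fan-out: query every selected gateway in parallel, then collate. */
class ParallelBrokerSketch {

    static List<String> queryAll(List<Function<String, List<String>>> gateways, String query) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, gateways.size()));
        try {
            // One task per gateway; each task runs the same query.
            List<Callable<List<String>>> tasks = new ArrayList<>();
            for (Function<String, List<String>> gateway : gateways) {
                tasks.add(() -> gateway.apply(query));
            }
            // invokeAll waits for all tasks and returns futures in task order.
            List<String> collated = new ArrayList<>();
            for (Future<List<String>> result : pool.invokeAll(tasks)) {
                collated.addAll(result.get());
            }
            return collated;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while querying gateways", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A gateway query failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

The sketch fails the whole request if any gateway fails; a production broker might instead return partial results and report which sources were unavailable, a design choice the text above does not specify.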

Most data sources can be accessed from the gateway software using JDBC. Currently this is used to connect to Oracle databases, MS SQL Server, the specialist biological database tool BioLink (an MS SQL Server application), MS Access and Excel. The Texpress system [HREF2] has a C API, and the Java Native Interface (JNI) was used to connect to Texpress from Java.

The APPD uses HTTP Basic Authentication to validate users. Communication from the broker to data source gateways may be configured to use the HTTPS protocol, adding a level of security protection to the local systems with no inconvenience to APPD users. The broker has been issued with a certificate signed by a self-signed APPD root certificate. The gateways may optionally be issued with a certificate, also signed by the same APPD root certificate. By installing a certificate containing the APPD root certificate's public key, a broker and gateway may exchange certificates to mutually authenticate each other. Certificates are managed through a combination of the Java keytool and openssl tools.
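For readers unfamiliar with HTTP Basic Authentication, the following Java sketch shows how the `Authorization` header value is formed and parsed: the user name and password are joined with a colon and Base64-encoded. The class and method names are hypothetical, and the sketch says nothing about how the APPD stores or checks credentials. Because Base64 is an encoding rather than encryption, the scheme only protects credentials when carried over HTTPS.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical helper showing the wire format of HTTP Basic Authentication. */
class BasicAuthSketch {

    /** Build the value of the Authorization request header. */
    static String header(String user, String password) {
        String pair = user + ":" + password;
        return "Basic " + Base64.getEncoder().encodeToString(pair.getBytes(StandardCharsets.UTF_8));
    }

    /** Parse an Authorization header value; returns {user, password}, or null if malformed. */
    static String[] parse(String headerValue) {
        if (headerValue == null || !headerValue.startsWith("Basic ")) {
            return null;
        }
        String decoded;
        try {
            byte[] raw = Base64.getDecoder().decode(headerValue.substring(6).trim());
            decoded = new String(raw, StandardCharsets.UTF_8);
        } catch (IllegalArgumentException e) {
            return null; // not valid Base64
        }
        // The user name cannot contain a colon, so split at the first one.
        int colon = decoded.indexOf(':');
        if (colon < 0) {
            return null;
        }
        return new String[] { decoded.substring(0, colon), decoded.substring(colon + 1) };
    }
}
```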

Mapping functions

The federated data schema includes latitude and longitude information for each record, specifying the location where a particular plant pest was discovered. Using this information where available, the broker offers users an interactive map with a plot of the locations of pests for each of the records retrieved in their query. This feature is implemented using Mapserver [HREF3], an open source web-based browser for GIS data developed at the University of Minnesota. Mapserver is implemented as a CGI application which reads in GIS layer data generated by the broker in response to a user query. Multiple GIS data layers may be generated for each individual query, where each layer corresponds to the records from a particular source database so that they can be rendered in a distinctive manner. Mapserver renders all GIS layer data as an image and serves it to the user in an interactive HTML page, allowing the user to issue requests back to the application such as zoom, pan and queries to interrogate details of plant pests at a particular geographic location, as illustrated in Figure 4.

Figure 4: Species distribution map.
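The grouping of query results into one layer per source database can be sketched as follows. The sketch deliberately stops short of writing any Mapserver input files, whose format is not described here; the `Specimen` type, its field names and the class `MapLayerSketch` are assumptions for illustration only. Records without coordinates are skipped, matching the "where available" qualification above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical preparation of per-source point layers for a map renderer. */
class MapLayerSketch {

    /** Minimal stand-in for a federated record; coordinates may be missing (null). */
    static final class Specimen {
        final String source;
        final Double latitude;
        final Double longitude;

        Specimen(String source, Double latitude, Double longitude) {
            this.source = source;
            this.latitude = latitude;
            this.longitude = longitude;
        }
    }

    /** Group georeferenced records by source database; each point is {longitude, latitude}. */
    static Map<String, List<double[]>> layersBySource(List<Specimen> records) {
        Map<String, List<double[]>> layers = new LinkedHashMap<>();
        for (Specimen s : records) {
            if (s.latitude == null || s.longitude == null) {
                continue; // cannot be plotted
            }
            layers.computeIfAbsent(s.source, key -> new ArrayList<>())
                  .add(new double[] { s.longitude, s.latitude });
        }
        return layers;
    }
}
```

Each entry of the returned map corresponds to one layer, so a renderer can assign a distinct symbol or colour per source database.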

Deployment

Gateways and their databases are located all over Australia, including Perth WA, Knoxfield Vic, Orange NSW, Indooroopilly Qld and Hobart Tas. In some cases the gateway is hosted outside the custodian’s firewall; in other cases specific router configurations allow access to the working database. Three small databases are currently hosted on behalf of their custodians at CSIRO in Canberra because the technical infrastructure of the custodians is not able to permit local access. An occasional email exchange mechanism is used to keep them up to date. It is a testament to the flexibility of the APPD system that it permits and manages such a range of configuration options to match the widely differing requirements and capabilities of participants.

The ease with which a gateway is deployed and/or maintained depends on the availability of, and access to, IT personnel in the host organisation. The custodians of the data typically do not have well developed relationships with the IT support function. It takes time to develop sufficient understanding and knowledge of an APPD gateway's technical needs: Java servlet, Tomcat webserver, Java runtime environment and external access to the host's network. Authorisation for remote network access to the gateway host allows CSIRO staff to carry out most of the required functions without additional support from the local IT staff. This has become our preferred approach.

Since 2004, the broker has been deployed at a commercial Internet Service Provider in Canberra. Remote access is used to maintain this system.

The Development experience

One interesting aspect of the APPD is its achievement as a collaborative development of national and local significance between 15 university groups, State government agencies, Commonwealth agencies and an industry body, based on exchange of sensitive information. Factors that were, in hindsight, critical to the success of the APPD fall into three classes:  funding; software architecture; and governance structure and process.

We discuss these briefly below as we hope that they may provide a useful starting point for similar collaborative projects. Figure 5 summarises these key factors together with some additional issues that arise due to the nature of the biological information content.

Funding
The APPD’s cash funding was provided by the Commonwealth through Plant Health Australia, a not-for-profit agricultural industry body. This funding encompassed establishment of a working group and governance structure, software design and development, and data-capture funding for agencies. The data-capture funding was matched by in-kind agency contributions, and was directed simultaneously towards benefiting the agencies’ own data management goals.
Software
The software architecture was designed to reflect the organisational structures of the participants with maximum flexibility of configuration; this has been discussed in depth earlier in the paper. While data standards were developed and adopted at the common interface level, promoting reliable interpretation, these standards were not imposed on data-providing participants but provided through software translation as far as possible. This enabled a much more rapid development than is possible otherwise; and created the information, experience and environment for a movement towards appropriate data collection standards.
Governance
As a first step in the development, a Steering Committee was established with representation including funding bodies, users and data providers; and chaired by a representative of the Office of the Chief Plant Protection Officer of DAFF, with a mandate to implement the recommendation of the Nairn review. Rules of operation, including access and usage were established by the Steering Committee in parallel with the design and development of the system. The Steering Committee jointly set priorities for data capture activities for the participants, and still meets regularly to deal with requests for access or occasional data integrity problems.  The Steering Committee is served by a secretariat established at the Office of the Chief Plant Protection Officer. All formal contracts are established with Plant Health Australia as the funding body and having the ultimate responsibility for the system.
| Challenge | APPD Solution |
|---|---|
| Heterogeneity: databases and taxa type / different fields | Wrapping software designed, well defined data standards set |
| Synonyms | Established data standards, reference to master name lists |
| Data quality | Ongoing data validation |
| Life stages / morphospecies and nomenclature | Consistent approach |
| Data security / Intellectual property | Restricted access, Memorandum of Understanding, Disclaimer |
| Resources | Federal funding, collections provide via "in-kind", APPD viewed as mutual benefit |
| Management | Steering committee of curators, plant health scientists & industry; explicit Rules of Operation |
| Where to start? | Well known and important pest groups, funding for master names |

Figure 5: Challenges in the development of APPD

The APPD today

The APPD has been in operational use since April 2002. In December 2005 there were 207 registered users. In the previous 3 months there were an average of 1800 queries per month, and an average of 88000 total records returned. Approximately 50% of queries were made by Biosecurity Australia for its preparation of IRAs. The APPD is used to support existing market access arrangements with the countries to which Australia exports, and as a reference base for front-line quarantine officials identifying possible exotic pests at the point of interception. It is also used by State agencies and the National Plant Protection Organisation (Office of the Chief Plant Protection Officer) in the initial stages of suspected exotic plant pest incursions, and to fill gaps in the scientific knowledge base on given pests. National data allows better predictions of host and geographic range and assists with other important pest biology research conducted within the participant agencies.

Recent enhancements, building on the general features described earlier, provide the ability to search for alternate names of pests and their hosts through the Australian Faunal Directory [HREF4], and a batch processing mode specifically to support the efficient preparation of IRAs.

In April 2006, an external review of the APPD was carried out. The review sought stakeholders' views on both scientific issues (data quality, data coverage) and technical issues (system availability, performance etc). The review report is not yet available.

In future years, the APPD may be linked into the Australian Biodiversity Information Facility [HREF5], a data portal that will provide access to checklists of species names and allow for searching of those specimens and observations contained in the biological collections of ABIF participants.  If the APPD is linked to the ABIF, it will be possible to query APPD data along with that from other data sharers and use various biological tools available through the site to conduct preliminary analysis.

Similar systems

Other communities of interest, both in Australia and internationally, have developed web-based systems to provide access to distributed collections. Some examples from Australia and overseas are:

The Australian Virtual Herbarium (AVH)
AVH [HREF6] is an on-line botanical information resource accessible via the web. It provides access to the data associated with scientific plant specimens in each Australian herbarium. The approach and design of AVH is similar to the APPD: databases remain with the custodians, with a gateway providing network access to the data.
Online Zoological Collections of Australian Museums (Ozcam)
Ozcam [HREF7] is an online distributed network of databases that contain information about the faunal (animal) collections held in Australian museums and other institutions, such as CSIRO. It provides two text-based query interfaces, and can also be accessed using the DiGIR protocol [HREF8].

Internationally, the Global Biodiversity Information Facility (GBIF) [HREF9] seeks to digitise and disseminate primary biodiversity data on a global scale. Australia contributes a node to this network [HREF3].

A system developed in the US with similar objectives to the APPD is SALVIAS [HREF10]. It provides single-source searching of multiple databases through a mixture of web-based distributed queries, direct server links, local caching, and standardization of the information returned. SALVIAS also provides tools for correction and standardization of spelling of plant species names (TaxonScrubber), and tools for correction, standardization, and estimation of geo-coordinates from herbarium specimen records (GeoScrubber).

Conclusion

The Australian Plant Pest Database is a case study in the development of a valuable national information resource based on Web technology. Its design marries naturally with the design of the community it serves; it may not have been possible to achieve it at all before Web and internet technology became commonplace. Today, it is a vibrant and well-used information resource supporting Australian international trade.

A second allocation of Commonwealth funding has been used to enhance and review the system in 2005/06, in particular with a view to making some data publicly available, which trade sensitivity has prevented thus far. It may be made available through the Australian node of the Global Biodiversity Information Facility [8]. The potential structure of this Australian node will allow primary databases of biological information from all GBIF databases to be collectively searched, the results saved on the Australian node and then analysed using other tools available through that portal.

Currently, the CSIRO team that developed the APPD software is developing distributed systems software that supports much more flexibility in the application to which it is applied. The goal of this work is to enable domain experts or end-users to assemble their own distributed applications out of databases and software components that are made available as WSDL and SOAP-compliant Web Services. While a distributed querying and coordination approach is now well developed [6], the key research emphasis at present is on the development and use of rich service descriptions based on the Web Ontology Language, OWL. By leveraging both the human interpretation of these descriptions as well as the machine-processable semantics, it is hoped that a domain expert can locate and compose the services they need into larger domain-tailored applications. These concepts are being tested in problems of population health research, coalition information sharing and water resource management [7].

Acknowledgements

The successful development and deployment of the APPD has involved many people and organisations that have made major contributions to its development. We thank the representatives of the APPD participant organisations: Adelaide University/ South Australian Research and Development Institute, BSES Ltd, Commonwealth Dept. Agriculture, Fisheries and Forestry, CSIRO Entomology, Dept. Conservation and Land Management (WA), Dept. Primary Industries (NSW), Dept. Primary Industry, Fisheries and Mines (NT), Dept. Primary Industries and Fisheries (Qld), Dept. Primary Industries, Water and Environment (Tas), Dept. Primary Industries (Victoria), Dept. Agriculture (WA), Forestry Tasmania, University of Queensland; the members of the steering committee; and others in CSIRO (Leslie Zhang and James Murty) and Office of the Chief Plant Protection Officer, DAFF (Paul Pheloung).

References

  1. Nairn, M.E., Allen, P.G., Inglis, A.R. and Tanner, C. (1996). "Australian Quarantine: A shared responsibility". Department of Primary Industries and Energy, Canberra.
  2. Lumb, E., Naumann, I.D. and Pheloung, P. "The Australian Plant Pest Database". In Global Taxonomy Initiative in Asia. Research Report from the National Institute for Environmental Studies, Japan. No. 175, pp. 273-280.
  3. Power, R. (2002). "Australian Plant Pest Database: Gateway Implementation". CSIRO Mathematical and Information Sciences Technical Report 02/62.
  4. Power, R. (2002). "Australian Plant Pest Database: System Documentation". CSIRO Mathematical and Information Sciences Technical Report 02/61.
  5. Sheth, A. and Larson, J. (1990). "Federated database systems for managing distributed, heterogeneous, and autonomous databases". ACM Computing Surveys, 22 (3), pp. 183-236.
  6. Cameron, M. and Taylor, K. (2005). "First order patterns for information integration". In Proceedings International Conference on Web Engineering ICWE2005, Sydney, Australia. Springer LNCS 3579.
  7. Ackland, R., Taylor, K., Lefort, L., Cameron, M. and Rahman, J. (2005). "Semantic Service Integration for Water Resource Management". In Proceedings of International Conference on the Semantic Web ICSW05, Springer LNCS 3729, pp. 816-828.

Hypertext References

HREF1
https://www.ippc.int/servlet/CDSServlet?status=ND0xMzQyMiY2PWVuJjMzPSomMzc9a29z
HREF2
http://www.mel.kesoftware.com/texpress/index.html
HREF3
http://mapserver.gis.umn.edu
HREF4
http://www.deh.gov.au/biodiversity/abrs/online-resources/fauna/afd/index.html
HREF5
http://www.abif.org/index.htm
HREF6
http://www.anbg.gov.au/avh/
HREF7
http://www.ozcam.gov.au/
HREF8
http://digir.sourceforge.net/
HREF9
http://www.gbif.org/
HREF10
http://salvias.net/pages/index.html

Copyright

Naumann I, Lumb E, Taylor K, Power R, Ratcliffe D, Kearney M, © 2006. The authors assign to Southern Cross University and other educational and non-profit institutions a non-exclusive licence to use this document for personal use and in courses of instruction provided that the article is used in full and this copyright statement is reproduced. The authors also grant a non-exclusive licence to Southern Cross University to publish this document in full on the World Wide Web and on CD-ROM and in printed form with the conference papers and for the document to be published on mirrors on the World Wide Web.