Yanbo Deng, Walt Abell, Clare Churcher, Applied Computing Group,
ESDD, P.O. Box 84, Lincoln University [HREF1],
Canterbury, New Zealand
{dengy6, abell, churcher}@lincoln.ac.nz
John McCallum, Crop and Food Research [HREF2], Private Bag 4704, Christchurch, New Zealand, McCallumJ@crop.cri.nz
Databases for scientific analysis often need to be integrated with a variety of different applications. While there are many ways to retrieve and analyse data already in a database, it can be more difficult to load data that may be stored in a variety of formats and software (e.g. spreadsheets, other databases, statistical packages). This project investigates how to support customised data entry and loading for Germinate, a database for plant genetic information. The approach uses web services as a platform- and language-independent way to provide centralised data loading and maintenance functions. These functions can be called from the various applications that provide source data for Germinate. The web service allows developers and competent end users to customise data entry, with minimal effort, to fit the way they have structured their data.
Databases are increasingly common in organisations large and small. The advent of open-source databases has allowed small groups or communities of users to develop their own systems, for example communities of scientific or research users. Many of these databases are developed and maintained as side-lines to the main activities of the organisations to which the users belong. The perceived usefulness of these systems and comparatively low cost of downloading and installation means that they may be taken up by users at different organisations who carry out similar activities.
Systems developed by small user communities may not have all the support and features found in commercial systems. In particular, users in these communities are often more focused on the ways of using the data rather than on how to load and manage the data in the first place. This means that data loading and management may be haphazard or difficult for end-users and may in fact be left to the “power users” within an organisation. The level of expertise required and work time-constraints can lead to delays and frustrations in making the most use of these systems.
Data integration has been a long-standing problem for business and research organisations (see for example [HREF3] and [HREF4]). Much of the research in data integration revolves around the efficient aggregation and querying of data across multiple heterogeneous databases. However, the issues involved in how data is initially loaded into a database and subsequently maintained are often overlooked.
Data loading can be the most time-consuming and costly part of data management [3]. The standard facilities for inputting data range from simple data entry forms to “screen scrapers” to complex ETL (Extract Transform Load) tools. There are also specialised tools for designing scientific workflows (SWFs) [HREF5] which allow extensive calculations as well as combining data from multiple sources. Indeed, ETL and SWF tools provide extensive facilities for cleaning, transforming and rearranging data from a variety of sources and allow the loading process to be automated. As a central part of data warehouse systems, the ETL process can consume a substantial amount of staff and computing resources. In addition, many ETL tools are proprietary and their costs prohibitive for small communities of users.
Some scientific communities may not have large volumes of data to record (as compared to commercial organisations) thus reducing the imperative for automated procedures. In many cases, the source data may be stored as part of spreadsheets or small databases set up to support particular projects. There may well be a need to incorporate data from multiple projects, each with a potentially unique organisation of data. Other sources/formats may arise if data is obtained from outside the organisation or is generated by other software or specialised lab equipment.
Small scale scientific databases may not have particularly user-friendly ways for loading new data. A simple import facility, using a standard format such as text, XML or a spreadsheet template may be all that is provided. Typically data must be extracted from the original source and put into the specified format, then loaded into the database via scripts and/or SQL commands, sometimes on remote systems. A novice end-user may well find the process difficult to perform and any error messages hard to interpret.
There may be little in the way of support for data maintenance (to update or delete data) and the only way to do this may be by using standard database tools (table editors, SQL commands). Even minor changes can require a good knowledge both of the tools and the database structure. The result is that database loading and management is often left to power users (or the database administrator if there is one) as normal end users may not have the skills or understanding to perform the tasks involved.
Even if a user-friendly method for loading data is provided (e.g. a web form), it may not cater for the variety of data sources and formats used throughout the user community. And as new projects are undertaken, new data arrangements may arise. Thus, users may need to create procedures to extract data from each new data source and transform it into the format that the database loading process expects. This can be time-consuming and error prone and must be repeated whenever new data is acquired as well as potentially reworked for each different data source.
In addition, changes to the database implementation over time (e.g. table redesign, migration to another database system) may well necessitate changes to the data loading application. This could mean updating all the procedures for loading data from each data source.
The preceding problems are an example of a common maintenance issue where applications have to be customised in some way to interact with a central database. One approach to deal with this is to implement suitable middleware between the applications and the database. The middleware acts to simplify the programming interface by hiding the details of the database structure. It can also implement the logic and business rules to be applied across all applications that interact with the database. Middleware also prevents purely internal changes to the database affecting end-user applications. In this case, only the middleware code would need to be updated.
This paper looks at the use of web services to provide the middleware for supporting customised data loading and maintenance applications. A web service is an API hosted on a web server that provides a set of well-defined services to client programs. Web services provide a remote procedure call mechanism with requests and responses encoded in XML. Web services are increasingly used for data integration [HREF6] as they enable platform and language neutral data access, especially to scientific databases (e.g. weather data [HREF7] and genetic sequence data [HREF8]). However, the use of web services to support data input and management is still comparatively rare, although there are some notable exceptions, mostly in the commercial arena. The Google AdWords web service [HREF9] allows client applications to manage online advertising accounts, and Amazon offers eCommerce and related functions via web services [HREF10]. A web service architecture has also been proposed to provide a uniform interface for data sources used in an ETL system [HREF11].
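To make the request/response style concrete, the sketch below builds a SOAP-style XML envelope for a hypothetical insert operation. The operation and field names (`insertAccession`, `accession_id`, `species`) are invented for illustration; they are not taken from any of the services cited above.

```python
import xml.etree.ElementTree as ET

# Build a SOAP-style request envelope (illustrative operation/field names).
SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"

def build_insert_request(accession_id: str, species: str) -> str:
    """Encode a hypothetical accession-insert call as a SOAP envelope."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, "insertAccession")  # hypothetical operation
    ET.SubElement(call, "accession_id").text = accession_id
    ET.SubElement(call, "species").text = species
    return ET.tostring(envelope, encoding="unicode")

request = build_insert_request("ACC-001", "Allium cepa")
```

The response travels back the same way: an XML document that the client-side library decodes into native data structures, which is what makes the mechanism language and platform neutral.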
The use of web services has advantages over more traditional approaches, e.g.
DCOM or Java RMI data access components. Web services are standards-based, and
interoperable between clients and services on different platforms or in
different languages. Indeed, many development systems now include tools to
automatically generate code to provide or access web services (e.g. for Java,
VBA and PHP). Web services also use standard TCP/IP protocols for communication
so are easily hosted locally or at remote sites.
There are some potential issues with web services, the main ones being
additional processing overheads and security considerations (especially for data
updating). However, these issues exist with other middleware solutions and do
not seem to have slowed the adoption of web services for data access.
Germinate [HREF12] is one example of an open source, non-commercial database system with an international community of users. Germinate is a data integration system for storing plant genetic resources and related information. The fundamental data in Germinate is “Passport data”, which includes plant accession information (a description of plant samples). This is associated with phenotype and genotype data as well as other plant data. The Germinate system was developed by the Scottish Crop Research Institute and made freely available for use by the research community. The current system is implemented in PostgreSQL but there are plans to allow use of other databases to store the data.
There are already a number of open-source tools available for analysing the data stored in Germinate, including some that use web services to retrieve the data, e.g. the GDPC browser for viewing genetic data [HREF13]. However, the facilities for loading and managing the data are fairly primitive. The data model for Germinate is complex (72 tables) with many rules and referential integrity constraints. The current data loading system requires users to put new data into an Excel spreadsheet or Access database template. This must be transferred to the server where Perl/SQL scripts are used to load and check the data (see Figure 1). Any problems detected (e.g. duplicate ids or incorrect data formats) are noted in a log file. This file is just a list of the PostgreSQL error messages, which may be unintelligible to a novice user.

Figure 1 – Current data loading procedure for Germinate database
In addition, Germinate does not provide a user-friendly facility for updating the data once it is loaded. This must be done using the standard PostgreSQL management tools (table editor and/or SQL), navigating the complex design and dealing with the many referential integrity constraints. Again, this task is beyond the ability of most end users.
The New Zealand Institute for Crop and Food Research (CFR) obtained a copy of Germinate for recording and analysis of data about their research crops (mainly onions and oats). The research data at CFR is stored in a variety of sources which correspond to various on-going research projects. Some of the data are stored in Excel, Access, text files and other specialised applications (see Figure 2).

Figure 2 – Example of various plant data sources at CFR
This approach creates numerous difficulties. Each different source of data has to be exported and converted into the appropriate format for loading. This can be a multi-stage process, with each user getting the data into a format suitable for their work, then exporting this to Excel or Access, and then transforming the data so that it conforms to the loading format (this last step is often carried out by the database administrator). At CFR, this sometimes means data traverses several systems and formats before being ready to import into Germinate.
At each stage in the exporting process there is potential for data errors to be introduced. The process is also not automated, which means that it must be manually repeated for each new set of data to be imported. Any data validation problems are only detected when the scripts on the server run. The error messages are not easily understandable by end users, so they have to be interpreted by the database administrator. Some of the errors detected may need to be corrected in the source data, and this may result in having to repeat the conversion process.
In addition, new research projects create potential new sources of data for Germinate. The loading procedures may have to be adapted if the data for a new project is stored in a different application or uses a different format. Developing and carrying out the importing processes are both time-consuming and error-prone and there are only one or two users capable of performing them. There is also no end-user interface for editing or correcting data once it has been imported. This means that the database administrator must also make any corrections to existing data using the PostgreSQL table editor.
The web service we have developed sits as middleware between
client applications and the Germinate database. It is accessible from any
web-service-enabled language on any platform via the network. For example, client
applications using various tools (e.g. Excel or web clients) can be easily
adapted to use the web service functions to interact with the database. This
eliminates many of the export and transformation processes and significantly
reduces the effort for users to get their data into Germinate.
The system developed (see Figure 3) includes four parts, of which we provide
parts two and three. We have also created sample user applications (part one)
for testing purposes.

Figure 3 – Parts of the web service data management system
A key goal was to minimize the effort of adding the web service functionality to individual applications. To do this, we provide client-side toolkits for a number of languages and platforms which handle communication with the web service and provide a simple set of functions to the programmer for inserting, deleting, etc. The toolkits also accept application specific formats (e.g. Excel dates) and automatically convert these to the formats required by the web service. Only a few lines of code are needed to pass the relevant data from an application to the toolkit in order to invoke a web service function.
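The “few lines of code” claim can be illustrated with a minimal toolkit façade. The sketch below is in Python purely for illustration (the actual toolkits are in VBA and PHP), and the names `GerminateToolkit` and `acc_insert`, along with the stubbed transport, are invented; the real transport would be a SOAP call to the web service.

```python
# Illustrative sketch of a client toolkit facade (hypothetical names).
# The transport is stubbed here so the example is self-contained; the
# real toolkits forward the call to the web service over SOAP.
class GerminateToolkit:
    def __init__(self, transport):
        self.transport = transport  # callable performing the web service call

    def acc_insert(self, record: dict):
        """Normalise application-specific values, then invoke the service."""
        clean = dict(record)
        # Application-specific conversions (e.g. Excel date serials) happen
        # here, so the calling application does not need to deal with them.
        if "collection_date" in clean:
            clean["collection_date"] = str(clean["collection_date"])
        return self.transport("Acc_Insert", clean)

def fake_transport(operation, payload):
    # Stand-in for the SOAP round trip.
    return {"operation": operation, "status": "ok", "payload": payload}

# From the application's point of view, loading a record is a few lines:
toolkit = GerminateToolkit(fake_transport)
result = toolkit.acc_insert({"accession_name": "CFR-001",
                             "collection_date": 39203})
```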
Germinate can be thought of as storing several logical groups of data, e.g. accession data, genetic data and trait data. These may be entered or updated at different times, e.g. when samples are recorded (accession data) and when laboratory test results become available (genetic data). The web service provides the client application with a simplified view of the main groups of data. It hides the detail of the database schema and eliminates the need to program complex SQL operations. The web service provides the essential functionality to load and manage logical groups of data as well as providing understandable error reporting.
Inserting data is the most complex operation. E.g., a full accession record includes 34 pieces of data, some of which need to be validated and/or added to 22 different tables. The web service inserting function accepts a logical accession record from a client application, validates the data, and directs it to the correct database tables via SQL. If invalid data is detected, the web service returns an appropriate error message to the client toolkit to be passed back to the application. The web service also provides select and delete functions to allow specific records to be retrieved and/or updated. The select function could also be used to extract data from Germinate for further analysis (e.g. for use with a data visualisation tool).
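The validate-then-dispatch pattern described above can be shown in miniature. The sketch below uses an in-memory SQLite database with two drastically simplified tables; the table and column names are invented, and the real Germinate insert touches up to 22 tables with many more checks.

```python
import sqlite3

# Miniature sketch of the insert pattern: validate a logical record,
# then split it across normalised tables. Names are illustrative only.
def insert_accession(conn, record):
    errors = []
    if not record.get("accession_name"):
        errors.append("accession_name is required")
    if not record.get("institute_code"):
        errors.append("institute_code is required")
    if errors:
        return errors  # readable messages instead of raw SQL errors

    cur = conn.cursor()
    # Resolve (or create) the referenced institute row first.
    cur.execute("SELECT id FROM institutes WHERE code = ?",
                (record["institute_code"],))
    row = cur.fetchone()
    if row is None:
        cur.execute("INSERT INTO institutes(code) VALUES (?)",
                    (record["institute_code"],))
        institute_id = cur.lastrowid
    else:
        institute_id = row[0]
    # Then insert the dependent accession row.
    cur.execute("INSERT INTO accessions(name, institute_id) VALUES (?, ?)",
                (record["accession_name"], institute_id))
    conn.commit()
    return []

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE institutes (id INTEGER PRIMARY KEY, code TEXT UNIQUE);
    CREATE TABLE accessions (id INTEGER PRIMARY KEY, name TEXT,
                             institute_id INTEGER REFERENCES institutes(id));
""")
errs = insert_accession(conn, {"accession_name": "CFR-001",
                               "institute_code": "NZCFR"})
```

The key design point is that the client only ever sees the logical record and a list of human-readable errors; the foreign-key resolution and table dispatch stay inside the service.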
The web service component was developed in PHP version 5, which contains extensive support for web services. The PHP PEAR DB_DataObject package was used for database connectivity and the NuSOAP library for encapsulating the web service communication in XML and for producing the WSDL document for the web service. The web service runs on an Apache web server on a Linux system which also hosts the Germinate database running under PostgreSQL version 8.1.5.
At CFR, the main data loading applications use Excel or Access, so a client toolkit was implemented in Visual Basic for Applications (VBA). Microsoft provides support for generating VBA function libraries from the WSDL for a web service [HREF14]. This was used to assist creation of our client toolkit. A client toolkit for PHP was also developed to allow creation of a web interface for updating records in the database. The location of the client applications, web service and toolkits are shown in Figure 4.

Figure 4 - Architecture of the web service and toolkits
Each client toolkit contains language specific functions for invoking the corresponding web functions (e.g. TK_Acc_Insert calls the web service to insert accession data). In addition, the toolkits contain an initialisation function that prepares the client for calling the web service and checks that the database is operational. Some data conversions are also carried out by the toolkits (e.g. from Excel to PostgreSQL date formats). At this stage, the web service only provides loading and management of accession data but it is planned to support the other Germinate data as well.
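The Excel-to-PostgreSQL date conversion the toolkits perform can be sketched as follows. Excel stores dates as serial day numbers; using 1899-12-30 as the epoch compensates for Excel's fictitious 29 February 1900, so the conversion is exact for dates from 1 March 1900 onward.

```python
from datetime import date, timedelta

# Excel stores dates as serial day numbers. The 1899-12-30 epoch absorbs
# Excel's fictitious 29 Feb 1900, so serials from 61 upward convert exactly.
EXCEL_EPOCH = date(1899, 12, 30)

def excel_serial_to_iso(serial: int) -> str:
    """Convert an Excel date serial to an ISO-8601 string suitable
    for a PostgreSQL DATE column."""
    return (EXCEL_EPOCH + timedelta(days=serial)).isoformat()
```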
The web service and toolkits for VBA and PHP were developed and tested for functionality by creating a simple Excel data template and a web form client application. Documentation was also developed for use of the toolkits and the Excel template. Preliminary user trials were conducted with both end users and developers to assess the usefulness of the system.
The Excel template was trialled by typical Germinate end users familiar with accession data. The trials consisted of two tasks: loading a small data set that included some invalid records, and loading a larger record set.
The users were able to successfully load data into Germinate with minimal assistance. They were also able to correctly interpret the error messages for invalid data and take corrective actions. The users were also able to load the large record set, but one comment was that there was no visual indication of progress (e.g. a record count or status bar).
The trial for developers involved getting them to work through the documentation to create a simple Excel template. The developers found it easy to follow the directions to create a simple application. One developer went further, creating a customised application of his own design. Several suggestions were made for improving the documentation. The overall feedback was that the web service approach seemed a useful and efficient way to create data loading applications.
The main advantage is the ability to create custom data maintenance applications relatively simply. For example, CFR end users can load data directly from their project spreadsheets and databases without the need to export or significantly transform the data. A spreadsheet with the client toolkit included can, for instance, be programmed to find the required data in specific columns and/or rows. There is also a simple web interface to update the data which can be further developed as the need arises.
More importantly, the database functionality is easily included in these applications because the bulk of the code and logic is in the web service and toolkits. For example, the VBA code to insert data via the web service is considerably less than what would be needed using standard Microsoft Data Access Components to directly access the database. Hence the barrier for developers to create custom data entry applications is much lower.
Since the web service hides the detail of the database implementation, it will
be easier to handle changes to the Germinate schema and/or the underlying
database technology. If a change is purely internal, only the web service will
have to be updated; the client applications will continue to operate unchanged.
If new functionality is added to Germinate, then there will be changes to the
toolkits and hence the applications. However, the changes to the applications
themselves will be minimal.
A limitation of the web service approach is that it requires some users to be able to do basic programming in order to create applications using the client toolkits. While this may be beyond the skills of novice users, there are often “end user developers” in an organisation who know how to write programs or macros. In addition, once a sample application (e.g. an Excel spreadsheet) is created, it can be copied and adapted by other users for their needs.
Another issue is customisation of the data access web service. Not all organisations use Germinate in the same way. For instance, CFR currently has no use for the Accession data that records the geographical location of a plant sample. Entry of this data needs to be optional (or disallowed) for CFR but this will not be the case for all organisations. We intend to allow the web service to be configured so that data elements can be marked as required or given default values.
The impact on performance must be considered when adding middleware layers to a system. Testing with a simple PHP script shows that inserting an accession record with the web service is approximately three times slower than directly accessing the database. This appears to be mainly due to the processing overhead for XML encoding and decoding; however, the validity checks performed on the data may also affect the performance. We intend to explore ways to improve performance by compression of the XML stream and by providing bulk loading facilities, e.g. the ability to submit multiple records for loading in one web service call.
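The proposed bulk-loading facility amounts to amortising the per-call XML overhead over many records. A minimal batching helper might look like the following sketch (the batch size of 50 is an arbitrary illustrative choice, not a measured optimum):

```python
def batches(records, batch_size=50):
    """Group records so that each web service call carries one batch,
    amortising the XML encode/decode cost over many records."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]

# e.g. 120 records in batches of 50 means 3 calls instead of 120
calls = list(batches(list(range(120)), batch_size=50))
```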
It should be noted that web services are now routinely used to retrieve data, and performance has not been cited as a strongly negative factor in the process. For high volume data loading (e.g. in a commercial transaction system), a simple web service could well be too slow. In the case of Germinate, the amount of data loaded is relatively low and very small in comparison to the amount that will be retrieved for various analyses. For small volumes of data, the overhead of a web service appears a reasonable trade-off for the increased flexibility of data loading and maintenance.
Using the web service approach, it becomes relatively easy to add other data entry options. One option requested by CFR is direct data entry from mobile devices, e.g. a cell phone. This would be very useful as accession data is often collected in the field. A prototype for data entry from a cell phone has been developed in Java and simulated. However security issues (see next section) will need to be dealt with before a trial of this option can be conducted.
The trials of the Germinate web service have not needed to use security as the system has contained only test data and has been hosted on an intranet-only server. However the web service and toolkits support user names and passwords and various authentication methods (e.g. basic, digest). SSL support could be added to the web service and toolkits to provide encrypted data transmission. Alternatively, a VPN connection could be used to provide secure access for off-site CFR users.
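As an illustration of the basic authentication mentioned above, an HTTP Basic Authorization header is simply the base64 encoding of "username:password". The sketch shows the header construction only; since base64 is an encoding, not encryption, in practice the header must travel over SSL/TLS (or a VPN), as noted above.

```python
import base64

def basic_auth_header(username: str, password: str) -> str:
    """Build an HTTP Basic Authorization header value.
    Note: base64 is encoding, not encryption - pair with SSL/TLS."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")

header = basic_auth_header("Aladdin", "open sesame")
```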
At present, there is no intention that the database be made available outside CFR. However, there are international repositories of Germinate data which could benefit from secure data access. This is particularly important if updating the database is to be allowed. Appropriate safeguards must be in place to prevent accidental or intentional pollution of the data.
Germinate already includes some facilities for recording and auditing user access. It may be desirable to extend these through to the web service to provide for a set of security roles, enforced with appropriate authentication. Example roles could include:
This project investigated the potential benefits of using web services to enable data maintenance from familiar user applications. A web service and client toolkits were developed as middleware to the Germinate database. This provided a simple programming interface for interacting with the database and proved easy to add to existing applications. More importantly, there was a reduction in the amount of data exporting and reformatting required for users to get data into Germinate. This means that less effort is spent on data maintenance and more time can be spent on analysis of the data. There are several extensions to the system proposed, including adding security roles and improving web service performance and functionality.