Defining a Web Services Interface for a Software Package Repository

Steve Cassidy [HREF1], Senior Lecturer, Centre for Language Technology, Department of Computing, Macquarie University. Email: Steve.Cassidy@mq.edu.au

Abstract

This paper describes work defining and implementing a web based interface to a software package repository for extensions to the Tcl scripting language. A design goal is to allow access via a web browser or programmatically; this provides an interesting case study in the current REST/SOAP debate on the best form for web services interfaces.

Background

Tcl is a scripting language first developed by John Ousterhout and now maintained by a group of open source developers [HREF2]. Significant users of Tcl are in web content management (AOL), test automation (Cisco, GN Nettest) and EDA/CAD (Synopsys Inc, Model Technologies) as well as a large range of others [HREF3]. Tcl is often embedded within applications to provide scripting facilities (eg. the Emu Speech Database System (Cassidy & Harrington, 2001)).

As with many other programming languages, Tcl functionality is divided between a fixed set of core functions and a wide range of loadable extension packages or libraries. In recent years, the Tcl package system has matured and the means of building and installing these packages has become standardised. However, there is not yet a canonical repository of Tcl packages analogous to CPAN [HREF4] for Perl and CTAN [HREF5] for TeX. This paper describes some aspects of the design of such a repository for Tcl.

The core functionality provided by a software repository is to act as a store of software packages accessible over the network via FTP or a web interface. This in itself provides some interesting design issues. However, it should now be expected that any such repository would provide some kind of programmatic interface to non-browser clients; for example, to enable automatic download of packages and queries for new package versions. All of these are design goals of the project described here.

One of the interesting things about designing a software repository at the present time is the ongoing debate between the proponents of SOAP [HREF6] and REST (Representational State Transfer [HREF7]). Briefly, this discussion centres around the benefits of a procedure-call based web services interface vs. one based on retrieving, querying and updating resources using the existing HTTP protocol. In the early stages of this project, SOAP seemed like the obvious solution to providing programmatic access but in the end the project has provided an interesting proving ground for some of the issues discussed in that debate.

This paper will first outline the structure of Tcl packages and the provision of package meta-data, then the design of a web based interface to the repository is discussed. Finally some issues relating to security and to mirroring of the repository are covered.

Existing Repositories

File repositories have existed on the Internet since before the days of the web, when many FTP sites carried downloadable software, shareware and information. Later, the Archie service (Deutsch and Emtage, 1992) was developed to index and provide search facilities for many FTP sites based on filenames only. The Gopher service (RFC1436) provided a menu based interface to file browsing on the Internet, allowing downloadable files to be organised into hierarchies and accompanied by help documents. The advent of the Web has obsoleted many of these services, although the core FTP service remains an important part of the infrastructure of the Net. Internet search is now subsumed by web search engines like Google [HREF8] and Teoma [HREF9] which can index any site accessible via an HTTP or FTP URL. Many file repositories are now run on the web, the majority of which provide shareware or freeware applications for download. These are by and large designed as a web front end to a file archive and provide for navigation through the archive and simple HTTP or FTP download of files.

With the advent of Free operating systems (eg. GNU/Linux and FreeBSD) and the proliferation of open source infrastructure there has emerged a need to provide repositories of both applications and libraries, and to record the interdependencies between these. One of the first such systems was RPM, developed by Red Hat [HREF10] for their GNU/Linux distribution. RPM files are packaged software containing metadata which describes the application or library and its dependencies. This metadata allows the RPM system to verify that a package will work if installed and to notify the user of any required dependencies that are not met. RPM packages were distributed on CDROM in the Red Hat Linux product and were made available for download via various repositories. If a user is installing a package and an unmet dependency is reported (eg. a missing version of a system library) then the user can search a web repository such as RPMFind [HREF11], locate and download the missing package and install it before finishing the original install.

The later Debian [HREF12] system encoded similar information in its deb package format and automated the download of dependent packages so that a simple apt-get install mozilla command will install the mozilla application and any missing dependent libraries. Recent versions of Red Hat Linux have also included such a feature for the RPM system.

In order to provide these services, the package installer must be able firstly to determine which dependent packages are needed and then to locate them in the remote repository. To accomplish this, metadata must be associated with each package in the repository which describes what it provides and what other packages it requires for proper use. This metadata can also be used to store more descriptive information about each package and so help to provide a more useful search interface for the repository.

Tcl Packages: Designing Meta-Data

In Tcl, a loadable package may consist of Tcl code files, shared library (DLL) files and associated documentation and script examples. There is no fixed layout for packages and package loading is controlled by a single file pkgIndex.tcl which can contain arbitrary Tcl code to load the package. Typically, packages are installed as subdirectories of a standard location (eg. C:\Program Files\Tcl\lib on Windows); the standard Tcl package discovery mechanism will find the pkgIndex.tcl file here and execute its contents to load the package.

In order to provide an automated install procedure for Tcl packages, some metadata must be associated with each package to allow management of package dependencies and to provide essential information to users browsing any package repository. Since no such metadata has ever been in use in the Tcl environment, we have a clean slate to develop an appropriate metadata standard.

The obvious choice for a metadata model is RDF [HREF13], the metadata standard developed by W3C which is at the core of the Semantic Web effort. RDF stores descriptions of resources as triples of resource-attribute-value where the value itself can be a resource. For example, we might say that the package stemmer has an author of Steve.Cassidy, and that Steve.Cassidy has a name of Steve Cassidy. Here both the package and the person are resources being described.
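The triple structure can be illustrated with a minimal Python sketch. The package: and person: prefixes and the two helper functions are assumptions made for illustration only; they are not part of RDF or of the repository design.

```python
# Each statement is a (resource, attribute, value) triple.
# The identifiers follow the stemmer example in the text; the
# "package:" and "person:" prefixes are illustrative assumptions.
triples = [
    ("package:stemmer", "author", "person:Steve.Cassidy"),
    ("person:Steve.Cassidy", "name", "Steve Cassidy"),
]

def values(subject, attribute):
    """Return every value recorded for the given resource and attribute."""
    return [v for (s, a, v) in triples if s == subject and a == attribute]

def author_names(package):
    """Follow the author link, then look up the name of each author resource."""
    return [name
            for person in values(package, "author")
            for name in values(person, "name")]

print(author_names("package:stemmer"))
```

Because the value of the author attribute is itself a resource, the name of the author is found by following one triple to the next.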

For this application, we need to list various attributes of software packages. Two important aspects of the design of this metadata are the metadata fields or attributes and their meaning, and the file format used to store and transmit it.

The requirements for a metadata file format are that it be easily composed by developers and easily parsed by package handling software. The default choice might be XML since RDF has a well defined XML syntax [HREF14]. However, while XML is easy to parse it can be somewhat obtuse to write by hand; this is particularly true of the RDF/XML syntax. Since there is nothing about RDF that requires XML, we are free to use a different file format. The format chosen is that used in email message headers as defined by RFC822 [HREF15] where attribute-value pairs are written as follows:

Identifier: installer
Version: 0.6
Title: Tools for building installation applications with Tcl
Creator: Steve Cassidy <steve.cassidy@mq.edu.au>
Description: Provides a set of procedures to download 
 and unpack archives, display progress to users, manage
 Tcl extension packages and perform other installation 
 related work.

This data, when stored in a file associated with each package, provides all of the attribute-value pairs for that package. It can be stored as RDF and trivially serialised as XML if needed for export to other software. This is the same metadata file format as used by the Debian project [HREF12] and CRAN [HREF16] for describing software packages.
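Since the format is that of email headers, existing header parsers can read it. The following is a minimal Python sketch using the standard email package; the function names and the choice of identifier plus version as the triple subject are assumptions for illustration, not part of the repository software.

```python
from email.parser import Parser

SAMPLE = """\
Identifier: installer
Version: 0.6
Title: Tools for building installation applications with Tcl
Creator: Steve Cassidy <steve.cassidy@mq.edu.au>
Description: Provides a set of procedures to download
 and unpack archives, display progress to users, manage
 Tcl extension packages and perform other installation
 related work.
"""

def parse_description(text):
    """Parse RFC 822 style metadata into a dict of field name -> value.

    A dict keeps only one value per field name, which is enough here.
    """
    message = Parser().parsestr(text, headersonly=True)
    # Continuation lines start with whitespace; collapse them into one line.
    return {name: " ".join(value.split()) for name, value in message.items()}

def as_triples(fields):
    """Express the fields as (resource, attribute, value) triples.

    Using identifier-version as the subject is an assumption.
    """
    subject = fields["Identifier"] + "-" + fields["Version"]
    return [(subject, name, value) for name, value in fields.items()]

meta = parse_description(SAMPLE)
print(meta["Identifier"], meta["Version"])
print(meta["Description"])
```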

The metadata fields required to describe packages are more difficult to define. The general experience in this field is that standardised vocabularies should be used wherever possible to aid in the interoperability of metadata sources. As an example, the Dublin Core Metadata Initiative [HREF17] defines the Creator field to hold the name of "An entity primarily responsible for making the content of the resource"; if the same name is used in this application, package metadata will be interoperable with that for many other applications. The use of common vocabularies in this way is at the core of the W3C Semantic Web [HREF18] initiative.

The names of fields used to describe packages have been taken from those defined by the Dublin Core initiative where the DC definition is appropriate. In other cases new names were used which clearly denote the contents of the field. The main fields included are briefly:

The full set of field names and definitions is given in Cassidy (2001).

The repository requires that this metadata is associated with each package and in the structure proposed here, each package will contain a file called DESCRIPTION.txt in this format containing the required data.

Repository Requirements

The core requirement for a software repository is to be able to upload, locate and download packages. These three requirements are expanded upon below. Overarching these requirements is the need to provide both a Web/HTML based interface for manual browsing and download and an interface suitable for automated clients.

File Download

While this would seem to be the simplest requirement, there are a number of interesting extensions to the basic idea of file download supported, for example, by most FTP or HTTP based file archives. The archive should support delivery of packages in a number of different package formats: for example, ZIP for Windows, TAR for Unix and BINHEX for Macintosh. Since the same Tcl package can in many cases be used on any platform, the repository should know how to generate any of these output formats from whatever version is stored. Tcl also has a directly executable archive format called Starkit [HREF19] (Landers & Wippler, 2002) which can include more than one library package and application code; the repository should be able to construct these automatically from the individual packages and deliver them to the client.

As packages are developed different versions of the same package will be stored in the repository. A user may request either the latest version of the package or refer to it by a specific version number. For example, a user requesting installer.zip might get the latest version of the installer package while installer0.4.zip might get the specific version 0.4.

These requirements together mean that the specific file requested for download might not exist on the server and that the repository must either convert the file format or find the file matching the request.
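A minimal Python sketch of this resolution step follows. The store contents, the naming rule (a package name without digits, an optional dotted version, then a format extension) and the function names are all assumptions made for illustration; the repository itself may resolve requests differently.

```python
import re

# Hypothetical store contents: package name -> versions held.
STORED = {"installer": ["0.4", "0.5", "0.6"], "tcllib": ["1.0"]}

# Assumed naming rule: <name><optional dotted version>.<format extension>
REQUEST = re.compile(
    r"^(?P<name>[A-Za-z_]+?)(?P<version>\d+(?:\.\d+)*)?\.(?P<format>[a-z]+)$"
)

def version_key(version):
    """Compare versions numerically, so that 0.10 sorts after 0.9."""
    return tuple(int(part) for part in version.split("."))

def resolve(filename):
    """Map a requested file name to (name, version, format), or None."""
    match = REQUEST.match(filename)
    if match is None or match["name"] not in STORED:
        return None
    versions = STORED[match["name"]]
    # No version in the request means the latest stored version.
    version = match["version"] or max(versions, key=version_key)
    if version not in versions:
        return None
    return match["name"], version, match["format"]

print(resolve("installer.zip"))
print(resolve("installer0.4.zip"))
```

The format returned here is only the one requested; converting the stored archive into that format would be a separate step.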

Repository Search

Clients need to be able to locate packages not only by name but according to other criteria; this is one of the reasons for associating a rich metadata set with each package. Many existing repositories offer a web based search which uses package names or metadata to locate files in the archive; however, none of these extend this interface to automated clients. For example, the Debian package repository can be searched via a web form [HREF20] but in order to search the archive from a program, the client must download a file containing the metadata for the entire archive and search it locally.

File Upload

For a repository to be successful it must encourage upload of packages by software authors. However, the archive must be able to engender the trust of both package authors and package users that the files uploaded are not tampered with.

A URL Based Repository Interface

We come now to designing the interface to the repository that will be used by people via the web and by automated clients. As was mentioned earlier, an early design envisioned a SOAP based interface to query and upload packages to the repository. More recently the advantages of a simpler HTTP GET/POST based interface have become apparent.

SOAP is a remote procedure call standard which commonly uses HTTP as its transport layer and XML as the message encoding format. A SOAP client opens an HTTP connection to a server and sends an XML document describing the procedure call being made as a POST request, including any arguments being passed to the call. The server responds by calling the named procedure and returning the result again encoded in XML. SOAP has the advantage over some other RPC interfaces of being relatively lightweight and platform neutral and SOAP toolkits have been implemented for most popular languages, including Tcl.

Using SOAP, one might define a repository interface consisting of procedures like get_package, query and upload. All of these would be made available via a single URI which defined the canonical access point to the repository.

Recently, a very lively discussion has been taking place between the advocates of SOAP and those of an alternative web services architecture named REST (Representational State Transfer, Fielding, 2001). One of the primary arguments for REST is that HTTP is a very capable protocol designed for providing read, write and query access to resources and that many web services could be framed in terms of HTTP instead of the additional complexity of the SOAP protocol. Using HTTP in this way makes better use of the infrastructure that has been built up for the Web over the past years. One important example of this is web caching; if a request for a resource is sent via SOAP and the resource returned encoded in a SOAP return message, then a web cache will not be able to store a copy to use in later requests. A web cache will only store the results of an HTTP GET request and SOAP requests are made via the POST method. The REST philosophy says that if you are requesting a fixed resource (in our case a package) then you should use a GET request and ensure that any intervening web caches can cache the resource safely. Since the interface being designed here is one providing access to a set of fixed resources, it seems like a good candidate for a REST style interface. The remainder of this paper describes the interface that has been designed in this manner.

File Download

File retrieval is achieved via a standard HTTP GET request for a URI which includes the package name and optionally a version string; the package format delivered is determined by the file extension in the URL. All package download URIs are logically in the /package/ subdirectory of the repository. For example, the URI:

http://purl.org/tcl/cantcl/package/installer0.4.zip

would retrieve version 0.4 of the installer package in zip format from the repository at http://purl.org/tcl/cantcl. The latest version of the same package in Starkit format can be retrieved at the URI:

http://purl.org/tcl/cantcl/package/installer.kit

One consequence of this choice is that a fixed version of a package has a fixed URI on each repository. This enables web cache software to treat them just like any other web document and cache them where appropriate. The actual contents of the 'newest version' URI will change with time but the HTTP Last-Modified header will reflect the time of the last change to the package and so web cache software will operate as normal.
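The interaction with caches can be sketched as the decision a server makes when a client or cache revalidates a package URI. This is a minimal Python illustration using the standard If-Modified-Since request header; the respond function and its arguments are hypothetical and do not describe the actual repository code.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

def respond(last_modified, if_modified_since=None):
    """Return (status, headers) for a GET on a package URI.

    last_modified is an aware UTC datetime for the newest stored version;
    if_modified_since is the raw request header value, if one was sent.
    """
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable date: ignore the header
        if since is not None and since.tzinfo is None:
            since = since.replace(tzinfo=timezone.utc)
        # HTTP dates have one-second resolution, so drop microseconds.
        if since is not None and last_modified.replace(microsecond=0) <= since:
            return 304, headers  # the cached copy is still current
    return 200, headers

changed = datetime(2003, 3, 1, 12, 0, 0, tzinfo=timezone.utc)
status, headers = respond(changed)
print(status, headers["Last-Modified"])
print(respond(changed, headers["Last-Modified"])[0])
```

For the 'newest version' URI, last_modified would change whenever a new version is stored, so a revalidating cache would then receive the new package rather than a 304 response.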

Note also that the use of a Permanent URL [HREF21] which can be redirected to any live repository ensures that each software package has a permanent location through which it can be accessed.

One additional feature of HTTP that could be utilised here is content-type negotiation. One of the fields sent in most HTTP requests is an Accept field which names the media types that the client will accept from the server. This field could be used in this application to denote the format of the package required. For example, Accept: application/zip would result in a zip package being returned while Accept: application/starkit would return a starkit. A theoretical advantage here is that the URI requested identifies the resource while the HTTP header adds additional information about how the resource should be returned. While this might be (and I'm not sure that it is) a 'purer' form of a REST interface, it would make writing clients more difficult and so detracts from the overall usability of the interface.
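For comparison, the negotiation alternative that was not adopted might look like the following on the server side. This is a simplified Python sketch: it ignores wildcard media types, the mapping table and function name are assumptions, and application/starkit is simply the name used above rather than a registered media type.

```python
# Media type -> package format. application/starkit is the name used in the
# text; it is not a registered media type.
FORMATS = {"application/zip": "zip", "application/starkit": "kit"}

def choose_format(accept, default="zip"):
    """Pick a package format from an Accept header value, honouring q values."""
    candidates = []
    for position, item in enumerate(accept.split(",")):
        media_type, *params = [part.strip() for part in item.split(";")]
        quality = 1.0
        for param in params:
            key, _, value = param.partition("=")
            if key.strip() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        if media_type.lower() in FORMATS and quality > 0:
            # Negative position keeps earlier entries ahead on equal quality.
            candidates.append((quality, -position, FORMATS[media_type.lower()]))
    return max(candidates)[2] if candidates else default

print(choose_format("application/starkit"))
print(choose_format("application/zip;q=0.5, application/starkit"))
```

A client of the extension-based scheme needs none of this: it only has to append .zip or .kit to the package name.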

In addition to the whole package retrieval URI, additional URIs will provide access to package metadata and to individual components inside the archive. Package metadata can be retrieved in XML/RDF format via the URL:

http://purl.org/tcl/cantcl/package/installer0.4.xml

or in a text format (as described above) using the .txt extension. The contents of the package can be examined or downloaded individually by treating the package name (with optional version) as a directory on the server; so:

http://purl.org/tcl/cantcl/package/installer/tcl/cantcl.tcl

would return one Tcl source file from within the latest version of the package. It is intended that in a future implementation, this interface will conform to the WebDAV standard [HREF22] which allows remote resources to be mounted as remote disk drives (eg. Web Folders in Windows or iDisk in MacOS). With the appropriate software tools, which already exist in Tcl [HREF23], this will allow loading of packages from the repository without an explicit download/install.

Browse/Search

The browse/search interface uses the metadata associated with packages to allow complex search criteria and browsing according to package name, platform, category, etc. The interface to these operations is implemented as an HTTP GET request; search parameters are encoded in the URI as a query string. The result of a query is a document containing URIs which either refer to individual packages or which represent more specific queries on the repository. The document format can be HTML, XML or text and is controlled by elements of the request URI.

Queries can take the form of patterns to match against any of the metadata fields defined in the repository metadata set. These patterns can be either keywords or use glob style patterns (ie. containing *, ? etc.).

For example, to locate packages for the Linux platform the URI would be:

http://purl.org/tcl/cantcl/browse?architecture=Linux

This would return, in the default HTML format, a document listing package URIs matching the query:

<html>
  <head>
    <title>CANTCL Search Results</title>
  </head>
  <body>
    ...
    <ul>
      <li><a href="http://purl.org/tcl/cantcl/browse/installer0.4"></a></li>
      <li><a href="http://purl.org/tcl/cantcl/browse/installer0.5"></a></li>
      <li><a href="http://purl.org/tcl/cantcl/browse/tcllib1.0"></a></li>
      ...
    </ul>
  </body>
</html>
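An automated client can read the same HTML result document by collecting its links. The following minimal Python sketch uses the standard html.parser module; the class and function names are illustrative only.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href of every <a> element in a result document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs
                              if name == "href" and value)

def result_uris(document):
    """Return the URIs linked from a browse result document, in order."""
    collector = LinkCollector()
    collector.feed(document)
    collector.close()
    return collector.links

page = """<html><body><ul>
<li><a href="http://purl.org/tcl/cantcl/browse/installer0.4"></a></li>
<li><a href="http://purl.org/tcl/cantcl/browse/tcllib1.0"></a></li>
</ul></body></html>"""
print(result_uris(page))
```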
      

The individual browse URIs referred to in this document would retrieve an additional result document with links to specific formats available for download (eg. a link to the zip version and the starkit version). The returned results can be made more specific by including more query terms in the original URI. For example, to get a set of links to zip versions of packages which mention HTTP in their descriptions:

http://purl.org/tcl/cantcl/browse?description=HTTP&format=zip
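How such a query might be evaluated against stored metadata can be sketched in Python as follows. The catalogue entries, the case-insensitive matching, the whole-word treatment of keywords and the handling of format as a reserved parameter are all assumptions for illustration; the repository's own matching rules may differ.

```python
from fnmatch import fnmatchcase
from urllib.parse import parse_qsl, urlsplit

# Hypothetical catalogue entries; field names are lower-cased for lookup.
CATALOGUE = [
    {"identifier": "installer", "version": "0.4", "architecture": "Linux",
     "description": "Download and unpack archives over HTTP"},
    {"identifier": "tcllib", "version": "1.0", "architecture": "Windows",
     "description": "Standard Tcl library"},
]

RESERVED = {"format"}  # assumed to select the output format, not a field

def field_matches(value, pattern):
    """Match one metadata value against a keyword or a glob pattern."""
    value, pattern = value.lower(), pattern.lower()
    if any(ch in pattern for ch in "*?["):
        return fnmatchcase(value, pattern)   # glob pattern over the whole value
    return pattern in value.split()          # keyword: must appear as a word

def search(uri, catalogue=CATALOGUE):
    """Return the entries that satisfy every query term in the URI."""
    terms = [(key.lower(), value)
             for key, value in parse_qsl(urlsplit(uri).query)
             if key.lower() not in RESERVED]
    return [entry for entry in catalogue
            if all(field_matches(entry.get(field, ""), pattern)
                   for field, pattern in terms)]

hits = search("http://purl.org/tcl/cantcl/browse?description=HTTP&format=zip")
print([entry["identifier"] for entry in hits])
```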

The use of HTTP GET for this interface is entirely consistent with its original specification and meshes well with the existing machinery of the Web. Browse requests are both safe (in that there are no state changes induced by each request) and idempotent (repeated requests for the same URI will retrieve the same result, until the contents of the repository change). These browse URIs can be generated via an HTML search form or from a client program. The results of browse URIs may be cached where appropriate, that is, if HTTP headers such as Last-Modified are set appropriately.

The main alternative to the use of GET here is to use a POST request and encode the query not in the URI but in the body of the HTTP request. From the HTTP specification, POST should be used where the additional information being transmitted (in this case the query) is subordinate to the request URI; this is clearly not the case here. POST was intended as a way of updating resources, for example, adding messages to a bulletin board. The common use of POST for web queries is largely due to the desire to hide the query terms from the browser location bar.

File Upload

Two HTTP operations, PUT and POST, can be used to modify or create new resources and so are candidates for use for the file upload operation. PUT is intended for the creation of new resources and is commonly used to upload documents to a website; the URI specified in the PUT request is intended to be the URI from which the document can be later downloaded. A POST request is intended for creating a subordinate resource to the request URI and the common use of POST for file upload from web forms matches well with this interpretation. Hence, for this interface, POST is chosen as the operation invoked for file upload.

An important advantage of choosing POST is that file upload can be invoked from a traditional HTML form as well as programmatically.

All repository uploads are sent to a single URI:

http://purl.org/tcl/cantcl/upload

The person uploading a file to the repository must provide a valid username and password (that is, they must be a registered user of the repository). The identity of the user is then associated with the package metadata. Uploads can be made over a secure (SSL) link where the server supports this.

The package uploaded is checked for consistency, including the presence of the required metadata and is stored in a holding area for approval. Once approved, the upload is moved into the main repository and is available for download.
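The consistency check might be sketched as below for an upload in zip format. This is a minimal Python illustration: the function name, the choice of Identifier and Version as the required fields, and the assumption that DESCRIPTION.txt may sit in any directory of the archive are not taken from the repository implementation.

```python
import io
import zipfile
from email.parser import Parser

REQUIRED = ("Identifier", "Version")  # assumed minimum set of fields

def check_upload(data):
    """Return a list of problems found in an uploaded zip archive (bytes)."""
    try:
        archive = zipfile.ZipFile(io.BytesIO(data))
    except zipfile.BadZipFile:
        return ["not a zip archive"]
    with archive:
        names = [name for name in archive.namelist()
                 if name.rsplit("/", 1)[-1] == "DESCRIPTION.txt"]
        if not names:
            return ["DESCRIPTION.txt is missing"]
        text = archive.read(names[0]).decode("utf-8", errors="replace")
    fields = Parser().parsestr(text, headersonly=True)
    return ["missing field: " + name for name in REQUIRED if not fields[name]]

# Build a small archive in memory to exercise the check.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as upload:
    upload.writestr("installer/DESCRIPTION.txt",
                    "Identifier: installer\nVersion: 0.6\n")
print(check_upload(buffer.getvalue()))
```

An upload that passes such a check would still go to the holding area for approval rather than directly into the main repository.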

Security

In any modern Internet application, security and trust must be primary considerations. The primary threat to a file repository is corruption/modification of existing packages and upload of malicious packages. No specific measures have been taken to prevent compromise of the website; we rely on the usual practices of maintaining an up to date, secure server with controlled access.

In the initial implementation of the CANTCL repository, file uploads are audited manually to ensure against malicious packages. Clearly this method will not scale as the repository gains more packages and a more distributed means of ensuring trust must be adopted.

In the future, it is envisaged that the repository will make use of digital signatures to ensure that the person uploading a package is who they claim to be. This gives us some level of trust in package contents and would, for example, allow us to trace back a malicious package to an individual user and hence identify all other packages that might be suspect. In addition to this, the repository itself may digitally sign packages that it delivers to enable end-users to be assured of the authenticity of the package. The expectation is that with these or similar tools, the community could maintain the credibility of the repository.

Summary

This paper has described the design of a web based repository of software packages which uses an HTTP based URI interface to allow both browser based and program based access to package information. This repository is currently in use by the Tcl community and can be accessed at the URI: http://purl.org/tcl/cantcl/.

The design described shows that there are specific advantages to using the base HTTP protocol to provide access to a web based service compared to the use of the SOAP remote procedure call interface. Specifically, the interface described allows for caching of resources in the same way as normal web pages and provides for both browser based and program based access to the repository via the same URIs.

References

Cassidy, S. and Harrington, J. (2001) Multi-level annotation in the Emu speech database management system. Speech Communication, 33, 61-77, January 2001.

Deutsch, P. and Emtage, A. (1992) Archie - An Electronic Directory Service for the Internet. In USENIX Association Winter Conference Proceedings, pages 93-110, San Francisco, January 1992.

Anklesaria, F., McCahill, M., Lindner, P., Johnson, D., Torrey, D. and Alberti, B. (1993) The Internet Gopher Protocol. Internet Engineering Task Force RFC 1436. http://www.ietf.org/rfc/rfc1436.txt

Cassidy, S. (2001) Package Format for Tcl Extensions. Tcl Improvement Proposal #55. http://purl.org/tcl/tip/55.html

Landers, S. (2002) Beyond TclKit - Starkits, Starpacks and other *stuff. 9th Tcl/Tk Conference, Vancouver 2002. http://www.digital-smarties.com/Tcl2002/tclkit.pdf

Fielding, R. T. (2001) Architectural Styles and the Design of Network-based Software Architectures. Doctoral dissertation, University of California, Irvine.

Hypertext References

HREF1
http://www.ics.mq.edu.au/~cassidy/
HREF2
http://www.tcl.tk/community/coreteam/index.html
HREF3
http://www.tcl.tk/customers/success/
HREF4
http://www.cpan.org
HREF5
http://www.ctan.org
HREF6
http://www.w3.org/TR/SOAP/
HREF7
http://internet.conveyor.com/RESTwiki/moin.cgi/
HREF8
http://www.google.com/
HREF9
http://www.teoma.com
HREF10
http://www.redhat.com
HREF11
http://rpmfind.net/
HREF12
http://www.debian.org/
HREF13
http://www.w3.org/rdf/
HREF14
http://www.w3.org/TR/rdf-syntax-grammar
HREF15
http://www.w3.org/Protocols/rfc822/rfc822.txt
HREF16
http://www.cran.org/
HREF17
http://dublincore.org/documents/dces/
HREF18
http://www.w3.org/2001/sw/
HREF19
http://equi4.com/starkit
HREF20
http://packages.debian.org/
HREF21
http://purl.org/
HREF22
http://www.ietf.org/rfc/rfc2518.txt
HREF23
http://mini.net/tcl/vfs

Copyright

Steve Cassidy, © 2003. The author assigns to Southern Cross University and other educational and non-profit institutions a non-exclusive licence to use this document for personal use and in courses of instruction provided that the article is used in full and this copyright statement is reproduced. The authors also grant a non-exclusive licence to Southern Cross University to publish this document in full on the World Wide Web and on CD-ROM and in printed form with the conference papers and for the document to be published on mirrors on the World Wide Web.