Taking RDF and Topic Maps seriously

- what happens when you drink the Kool Aid

Kent Fitch, AustLit Project [HREF4], Academy Library [HREF20], UNSW@ADFA [HREF21], Australian Defence Force Academy, Canberra, ACT, 2600. k.fitch@adfa.edu.au

Abstract

A great deal of attention has been focussed on the concept of the "Semantic Web". One of the core ideas behind the Semantic Web is the creation of machine-processable relationships between resource identifiers (URIs). Two often-discussed ways of representing those relationships are RDF and Topic Maps.

This paper describes how the concepts and goals of Resource Description Framework (RDF) and Topic Maps influenced the design of the Australian Literature Gateway (AustLit) project.

Introduction

When Tim Berners-Lee states that his vision of the Semantic Web [HREF19] rests on the shoulders of RDF [HREF22], and Charles Goldfarb (one of the originators and prime movers of SGML) describes Topic Maps as the "GPS of the web" [HREF23], it is hard not to pay attention.

In 1998, whilst developing a web content management and delivery system to be driven by basic Dublin-Core style metadata, I started to appreciate the power of what was promised by RDF and Topic Maps: the ability to describe relationships between resources, and even the ability to describe those relationships as objects themselves in a way that was quite independent of the base resources (that is, "outside of the plane" of the resources). This idea, the possibility of creating and manipulating a meta database which could evolve on its own with an existence and purpose separate from the resources it was ostensibly created to describe, seemed remarkable. I'd drunk the Kool Aid [HREF1].

This paper describes how the philosophies of RDF [HREF2] and Topic Maps [HREF3] influenced the design of the Australian Literature Gateway (AustLit) project [HREF4].

AustLit is a bibliographic and biographic system that represents its core application data using a new model from the International Federation of Library Associations (IFLA) known as the Functional Requirements for Bibliographic Records (FRBR) model [HREF5].

A brief introduction to RDF

A binary, directed relationship between 2 specified entities is an RDF "statement". This statement is often referred to as a "3-tuple" or "triple": the 2 "nodes" and the relationship (or "predicate") connecting them.

The statement is also often represented as a directed graph consisting of the 2 nodes and the arc (which is the predicate):

                   Predicate
     Subject  ------------------>  Object

Subject and predicate are always "resources" identified by a URI. The object can be either a "resource" or a literal. The predicate resource is also referred to as the "property", leading to the interpretation that the "subject" has a "property" whose value is represented by the "object".
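One way to make the triple structure concrete is to hold statements as plain tuples. The sketch below is an illustration only: it reuses the AustLit URLs and Dublin Core properties from the examples in this section, while the Python names (Statement, objects_of, creator_descriptions) are assumptions introduced here and do not come from any RDF toolkit.

```python
from typing import NamedTuple


class Statement(NamedTuple):
    """One RDF statement: subject and predicate are resources, the object may be a literal."""
    subject: str
    predicate: str
    obj: str
    obj_is_literal: bool


HOME = "http://www.austlit.edu.au"
AGENT = "http://www.austlit.edu.au/run?ex=ShowAgent&agentId=A$y="

STATEMENTS = [
    Statement(HOME, "dc:Title", "AustLit Home Page", True),
    Statement(HOME, "dc:Creator", AGENT, False),
    Statement(AGENT, "dc:Description", "AustLit Agent - Kerry Kilner", True),
]


def objects_of(subject, predicate):
    """Return the objects of every statement with this subject and predicate."""
    return [s.obj for s in STATEMENTS
            if s.subject == subject and s.predicate == predicate]


def creator_descriptions(resource):
    """Follow dc:Creator arcs from a resource, then read each creator's dc:Description."""
    return [description
            for creator in objects_of(resource, "dc:Creator")
            for description in objects_of(creator, "dc:Description")]
```

Because the object of one statement can be the subject of another, following two arcs in a row is just two lookups over the same list.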

A simple example using Dublin Core property references:

The URL "http://www.austlit.edu.au" has a title of the literal "AustLit Home Page":

  {http://www.austlit.edu.au, dc:Title, "AustLit Home Page"}

The URL "http://www.austlit.edu.au" was authored by the resource indicated by "http://www.austlit.edu.au/run?ex=ShowAgent&agentId=A$y=":

  {http://www.austlit.edu.au, dc:Creator,
          http://www.austlit.edu.au/run?ex=ShowAgent&agentId=A$y=}

The URL "http://www.austlit.edu.au/run?ex=ShowAgent&agentId=A$y=" has a description of "AustLit Agent - Kerry Kilner":

  {http://www.austlit.edu.au/run?ex=ShowAgent&agentId=A$y=, dc:Description,
          "AustLit Agent - Kerry Kilner"}

Graphically:

The RDF Schema specification [HREF6] describes how to define allowable combinations of particular subjects, predicates and objects.

For example, it may be important that an object meant to be representing a "publication date" in an RDF statement does in fact refer to a date, and that an object representing an "author of a book" does provide a reference to something which could author a book (such as a person, and not a date!). Further, there are probably sensible constraints to be defined about which predicates can be applied to which subjects - it is unlikely that you'd want to describe the "publication date" of a "person".

RDF Schema also allows the definition of class hierarchies, to specify, for example, that an "author" is a type of "person", but RDF Schema does not attempt to address more complicated yet typical validation requirements. For example, RDF Schema cannot support rules that specify that an author of a book must have a birthday, or that the author of a book must be more than 3 years old. A well known extension to RDF Schema that allows much richer expressions of relationships is the DAML+OIL language [HREF7].
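The kind of checking described above can be sketched in a few lines. This is a toy illustration, not an RDF Schema processor: the class names, property names and the two lookup tables are hypothetical, and domain and range are treated simply as constraints on which subject and object classes a property may connect.

```python
# Hypothetical class hierarchy: each class maps to its parent class.
SUBCLASS_OF = {"Author": "Person", "Person": "Agent"}

# Hypothetical property constraints: property -> (allowed subject class, allowed object class).
PROPERTY_CONSTRAINTS = {
    "publicationDate": ("Book", "Date"),
    "authorOf": ("Person", "Book"),
}


def is_a(cls, ancestor):
    """True if cls is the ancestor class or one of its descendants."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False


def statement_allowed(subject_class, prop, object_class):
    """Check a statement's subject and object classes against the property's constraints."""
    domain, range_ = PROPERTY_CONSTRAINTS[prop]
    return is_a(subject_class, domain) and is_a(object_class, range_)
```

With these tables an "Author" may be the subject of "authorOf" because an author is a type of person, while a "Person" with a "publicationDate" is rejected. A rule such as "an author must be more than 3 years old" has no place in either table, which is the limitation noted above.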

A brief introduction to Topic Maps

Topic Maps are built from 3 different types of nodes: topics, associations and scopes.

A topic is simply a representation of any subject or concept of interest; it is the "proxy" of that subject in the topic map.

Topics may be resolvable to an object in the real world (such as the W3C home page, or the Sydney Opera House), or they may not be resolvable to an object, such as the concept "faith".

Topics have an identity, which can hopefully be used to unambiguously refer to the topic within and across topic maps, and maybe "link" it with the "real world" thing that it is representing.

Topics have characteristics. Characteristics are of three types: names, occurrences, and roles played in associations.

Associations define relationships between topics.

Topic characteristics (names, roles and occurrences) can be asserted as being valid within a "scope", which acts as a context for assertions.
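As a rough sketch of how these pieces fit together, the following models topics, scoped names, occurrences and associations as small Python classes. All class, field and role names are assumptions made for illustration, as are the "informal" scope and the example.org occurrence URL; nothing here follows a particular Topic Map serialisation.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Name:
    value: str
    scope: frozenset = frozenset()   # empty scope: valid in every context


@dataclass
class Topic:
    identity: str
    names: list = field(default_factory=list)
    occurrences: list = field(default_factory=list)   # e.g. URLs of resources about the subject


@dataclass
class Association:
    type: str
    roles: dict   # role name -> identity of the topic playing that role


def names_in_scope(topic, context):
    """Names asserted without a scope, plus those whose scope includes the context."""
    return [n.value for n in topic.names if not n.scope or context in n.scope]


opera_house = Topic(
    "sydney-opera-house",
    names=[Name("Sydney Opera House"),
           Name("The Opera House", frozenset({"informal"}))],
    occurrences=["http://example.org/opera-house"],   # hypothetical occurrence
)
sydney = Topic("sydney", names=[Name("Sydney")])
located_in = Association(
    "located-in",
    {"container": sydney.identity, "containee": opera_house.identity},
)
```

Asking for the names of the Opera House topic in the "informal" context returns both names, while any other context returns only the unscoped one.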

Although Topic Maps were standardized as ISO 13250 [HREF24], the Topic Map community has been challenged by syntactic and semantic issues over the past few years. Good starting places for Topic Map information are Robin Cover's Topic Map page [HREF3] and Steve Pepper's The TAO of Topic Maps [HREF25].

RDF and Topic Maps - similar yet different

Both RDF and Topic Maps attempt to convey information by creating associations between identifiable objects. The main points of difference are:

AustLit - topics, topics everywhere

The Australian Literature Gateway (AustLit) maintains an extremely broad and deep bibliography of Australian literature and database of Australian authors and publishers. It is managed as a collaborative project by UNSW@ADFA, UQ, Monash, Deakin, Sydney University, Flinders, UWA, UC and is currently hosted at ADFA.

The database records information on over 385,000 works and 60,000 literary "agents" (typically authors and publishers). The project has combined the separate databases of the collaborating institutions into a unified web based system. The funding for the project has been supplied by 3 successive Research Infrastructure Equipment and Facilities (RIEF) grants awarded by the Australian Research Council [HREF11], and the system is now operating as a non-profit subscription-based service with over 50 subscribing institutions plus limited free public access.

The AustLit data model is no different to that of a typical IT application - entities with attributes and relationships between those entities.

But once you've drunk the Kool Aid, you stop seeing entities and attributes; the distinction disappears and you just see "topics" instead. And you stop seeing direct entity-to-entity relationships - instead you see topics playing roles in associations.

So, for example, rather than seeing an author "record" like this (in some pseudo DDL)

you instead start thinking about clusters of interacting "topics":

Topic A - of type "Agent"
Topic N1 - of type "Name", with name "Lawson, Henry"
Topic N2 - of type "Name", with name "Joe Swallow"
Topic N3 - of type "Name", with name "An Australian Exile"
Topic G - of type "Gender" with name "Male"

Topic D - of type "Date" with name "1867/06/17"
Topic P1 - of type "Place" with name "Grenfell"
Topic P2 - of type "Place" with name "NSW"
Topic P3 - of type "Place" with name "Australia"

Topic R1 - of type "association" name "nameAssociation"
Topic R2 - of type "association" name "genderAssociation"
Topic R3 - of type "association" name "birthEvent"
Topic R4 - of type "association" name "containment"

and "asserting" associations between these topics:

etc

Graphically, this looks like:

It is important to note that the associations are "first class objects". That is, they exist as topics in their own right, and can be involved in associations themselves.

Hence, it is possible to add relationships to this graph annotating the existing relationships, for example to say: "the probability of this birth event being accurate is estimated at 90% by Bill Bloggs, and he produces this note in the support of this assertion: '... '".
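A minimal sketch of this idea follows. It treats an association as just another topic that happens to carry a table of roles, so a second association can point at the first. The topic names come from the cluster above; the role names ("born", "about", "assessor" and so on) and the "accuracyAssessment" association are hypothetical and are not taken from the AustLit schema.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)   # topics are compared by identity, not by field values
class Topic:
    type: str
    name: str = ""
    roles: dict = field(default_factory=dict)   # role name -> Topic; filled only for associations


agent = Topic("Agent")
date = Topic("Date", "1867/06/17")
place = Topic("Place", "Grenfell")

# The birth event is an association, and also a topic in its own right.
birth = Topic("association", "birthEvent",
              {"born": agent, "date": date, "place": place})

# Because it is a topic, it can itself play a role in another association.
assessment = Topic("association", "accuracyAssessment", {
    "about": birth,
    "assessor": Topic("Agent", "Bill Bloggs"),
    "probability": Topic("Probability", "90%"),
    "note": Topic("Note", "..."),
})

ASSOCIATIONS = [birth, assessment]


def associations_involving(target):
    """Return every association in which the target topic plays some role."""
    return [a for a in ASSOCIATIONS
            if any(player is target for player in a.roles.values())]
```

Looking up the associations involving the birth event returns the assessment, without the birth event itself having been changed in any way.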

Adding this information to the graph could change it to look like this:

The AustLit data model and FRBR, INDECS, Harmony

The AustLit system is essentially a bibliography and repository of information about "agents" involved in the production of literature - authors, editors, illustrators, translators, publishers and printers.

When casting around for a model we could use as a basis for our design we were greatly influenced by the IFLA's FRBR work [HREF5], INDECS [HREF12] and the Harmony project [HREF13].

FRBR

In 1997 the International Federation of Library Associations and Institutions (IFLA) released a needs analysis of bibliographic systems, the "Functional Requirements for Bibliographic Records", abbreviated as FRBR.

The FRBR doesn't define a replacement for the long standing bibliographic transfer format known as MARC, but defines an approach to organising bibliographic material. The FRBR "teases apart" the concepts of Work, Expression, Manifestation and Item and in so doing helps model and understand the relationships between titles. For example, it can clearly identify two manifestations as embodiments of the same expression, or two expressions as realisations of the same work, although one may be a language translation of another.

The INDECS project

The INDECS project (INteroperability of Data in E-Commerce Systems) was established to develop a metadata framework for representing intellectual property and the transactions involving it. Their Schema and Model documents have a strong bent towards enabling e-commerce rights management related transactions, but necessarily require a precise modeling of the intellectual works and the agents who contribute to them. They also use the basic Work, Expression, Manifestation and Item representations of FRBR, but introduce the concept of the "Event" that describes how these products came about - who did what, the context, the inputs to the process.

The Harmony ABC strawman proposal

Many entities and relationships which different systems in different application areas attempt to model are pretty much the same. Rather than force each implementation to develop its own representations, and hence waste effort and complicate interoperability, the Harmony ABC proposal attempts to define a common framework which diverse systems will be able to use. The ABC proposal acknowledges that FRBR and INDECS were major sources of inspiration for its work.

Combining the models

From FRBR we took the deconstruction of a "bibliographic title" into work, expression and manifestation components. From INDECS and Harmony we took the fundamental importance of events; the idea, for example, that publication attributes such as publisher and date are not attributes of a manifestation, but are attributes of an embodiment event which takes as input an expression of a work and outputs a manifestation. Similarly, a "translator", through a "realisation" event, takes a work or an expression of a work in one language as input and outputs a new expression in a different language.
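The event-centric view can be sketched as follows. This is an illustration under assumptions, not the AustLit implementation: the class names, the attribute keys and all of the example labels ("Example Work", "Example Press" and so on) are invented for the purpose.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Entity:
    kind: str    # "work", "expression" or "manifestation"
    label: str


@dataclass(eq=False)
class Event:
    kind: str    # "realisation" or "embodiment"
    inputs: list
    outputs: list
    attributes: dict = field(default_factory=dict)


work = Entity("work", "Example Work")
english = Entity("expression", "English text")
french = Entity("expression", "French translation")
edition = Entity("manifestation", "First French edition")

EVENTS = [
    Event("realisation", [work], [english]),
    # The translator belongs to the realisation event, not to either expression.
    Event("realisation", [english], [french], {"translator": "Example Translator"}),
    # Publisher and date belong to the embodiment event, not to the manifestation.
    Event("embodiment", [french], [edition],
          {"publisher": "Example Press", "date": "1985"}),
]


def producing_event(entity):
    """Find the event that output this entity, or None if there is none."""
    for event in EVENTS:
        if any(out is entity for out in event.outputs):
            return event
    return None


def publication_details(manifestation):
    """Read publication attributes from the embodiment event behind a manifestation."""
    event = producing_event(manifestation)
    return dict(event.attributes) if event and event.kind == "embodiment" else {}


def lineage(entity):
    """Walk back through producing events from an entity to its originating work."""
    chain = [entity.label]
    event = producing_event(entity)
    while event is not None:
        entity = event.inputs[0]
        chain.append(entity.label)
        event = producing_event(entity)
    return chain
```

Placing attributes on events means that a question such as "who published this edition" becomes a walk from the manifestation to the event that produced it.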

From RDF and Topic Maps we took the idea of expressing all our information as triples; as relationships between two topics. So, for example, rather than the "conventional" author or work tables with a primary key and many dependent columns described above we essentially have 2 tables:

The basic name is used for short simple topic names such as the gender topic "male". However, this Topic table is essentially "sub-classed" to store richer topic names such as human names where we want to separate surname, forenames and title. Although not strictly necessary (we could create a separate topic for each of these and create three relationship types "hasSurname", "hasForename", "hasTitle") there comes a point at which reality and performance issues intervene and further deconstruction becomes counter-intuitive.

The flip side of the simple database structure is greater complexity, or at least verbosity, in query formulation. Each "attribute" is represented as a separate topic and relationship, "events" such as birth, death and work creation are separate topics, and a typical book involves the instantiation of creation, work, realisation, expression, embodiment and manifestation topics, each with many relationships. As a result, queries built to resolve relatively simple questions such as "which novels were published by people with a Hungarian heritage in the 1980s" require 10 relational joins to retrieve the works. It is a testament to the query optimizer that such a query takes only a few hundred milliseconds to execute.
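The verbosity is easy to see even in a toy version. The sketch below uses Python's sqlite3 module with a hypothetical two-table layout (the table names, column names and relationship types are assumptions and are not the AustLit schema), loaded with the Henry Lawson topics shown earlier. Finding the names of agents born in a given place already takes four joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE topic (id INTEGER PRIMARY KEY, type TEXT, name TEXT);
    CREATE TABLE relationship (
        id         INTEGER PRIMARY KEY,
        from_topic INTEGER REFERENCES topic(id),
        rel_type   TEXT,
        to_topic   INTEGER REFERENCES topic(id)
    );
""")
conn.executemany("INSERT INTO topic (id, type, name) VALUES (?, ?, ?)", [
    (1, "Agent", ""),
    (2, "Name", "Lawson, Henry"),
    (3, "Event", "birthEvent"),
    (4, "Place", "Grenfell"),
    (5, "Date", "1867/06/17"),
])
conn.executemany(
    "INSERT INTO relationship (from_topic, rel_type, to_topic) VALUES (?, ?, ?)", [
        (1, "hasName", 2),      # the agent has a name topic
        (3, "hasSubject", 1),   # the birth event is about the agent
        (3, "atPlace", 4),      # the birth event happened at a place topic
        (3, "onDate", 5),       # the birth event happened on a date topic
    ])

# place -> birth event -> agent -> name: one join per hop, plus the name topic itself.
BORN_IN_QUERY = """
    SELECT n.name
    FROM topic place
    JOIN relationship at_place
      ON at_place.to_topic = place.id AND at_place.rel_type = 'atPlace'
    JOIN relationship subj
      ON subj.from_topic = at_place.from_topic AND subj.rel_type = 'hasSubject'
    JOIN relationship has_name
      ON has_name.from_topic = subj.to_topic AND has_name.rel_type = 'hasName'
    JOIN topic n
      ON n.id = has_name.to_topic
    WHERE place.type = 'Place' AND place.name = ?
"""


def agents_born_in(place_name):
    """Return the names of agents whose birth event is at the named place."""
    return [row[0] for row in conn.execute(BORN_IN_QUERY, (place_name,))]
```

Every additional condition (a date range, a form, a heritage) adds further hops of the same kind, which is how a question that sounds simple reaches ten joins.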

By abandoning the static storage of "topic type" and instead dynamically inferring "type" from the roles played by topics and their partners in those associations, an even simpler and more flexible structure is achievable as argued by Parsons and Yang in Emancipating Instances from the Tyranny of Classes in Information Modeling [HREF8]. Indeed, the AustLit system rarely uses the "topic type" attribute. Instead, for example, to find authors born between 1930 and 1940 who have written novels, the system looks for topics which

However, typing of topics is very handy during the data entry part of the AustLit system to constrain assignments; for example, the topic which plays the "is form of" role in an association with a work is restricted to those topics with a type of "form" - novel, poetry, drama etc, so strong typing for "simple" topics which typically only partake in a very small range of association types is necessary. The Parsons and Yang Kool Aid is strong stuff indeed!
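The data-entry constraint just described amounts to a lookup from role to required topic type. A minimal sketch, assuming a hypothetical in-memory table of topic types (the "place" type label and the function names are illustrative only):

```python
# Role -> topic type required of whatever plays that role (from the example above).
ROLE_TYPE_CONSTRAINTS = {"is form of": "form"}

# Hypothetical stored types for a handful of "simple" topics.
TOPIC_TYPES = {
    "novel": "form",
    "poetry": "form",
    "drama": "form",
    "Katoomba": "place",
}


def can_play(topic_name, role):
    """True if the topic's stored type satisfies the role's constraint, if there is one."""
    required = ROLE_TYPE_CONSTRAINTS.get(role)
    return required is None or TOPIC_TYPES.get(topic_name) == required


def candidates_for(role):
    """Topics a data-entry screen could offer for the given role, in sorted order."""
    return sorted(name for name in TOPIC_TYPES if can_play(name, role))
```

Roles with no entry in the constraint table accept any topic, which leaves the heavier-weight topics free to be characterised by the roles they play.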

Having retrieved the works, the full representation of those works (all their direct associated relationships, and the relationships of related topics such as creation, authors, expressions etc and so on recursively) is compiled and represented by the system as an XML document which is passed through a nominated XSLT stylesheet for the required formatting. Currently supported formats are HTML, plain text, tagged text for importing into a citation manager and native XML.

Detailed design information is available from the AustLit Project Documentation site [HREF14], [HREF15].

Experiences

As mentioned above, the AustLit datamodel is an amalgam of the FRBR's deconstruction of the bibliographic "title" into work, expression and manifestation, the INDECS and Harmony event-centric approach and the RDF/Topic Map model of using simple named relationships between resources to build rich information structures.

The FRBR model has proven to be very effective and easy to use. Whilst we and others have encountered issues regarding the relationships between works, expressions and manifestations and the most appropriate position for some attributes, the guiding principles of the model have served us very well (see AustLit: A Gateway on Steroids, Dr Marie-Louise Ayres, Digital Resources for Research in the Humanities Conference, Sydney 2001 [HREF16], and FRBR and the revision of the Italian Author Cataloguing Rules (RICA), Isa de Pinedo and Alberto Petrucciani, ELAG 2002 Conference, Rome [HREF26]).

The emphasis on unique identification of entities and the event based approach of INDECS and Harmony provided valuable insights during our modeling phase, and meshed well with the idea of associations as "reified relationships" as promoted by the Topic Map model.

The technical challenges of the project have been in implementing as "pure" a model as possible whilst recognising technical realities, and the resulting compromise has been generally successful. We've found that most queries, even those requiring 20 or more relational joins, execute quite quickly (under 5 seconds), and that recomposing an entity such as a "title" from its creation, work, realisation, expression, embodiment and manifestation components, each with its own set of properties and attributes, is quite practical even with modest hardware [HREF17].

Just thinking about simple "attributes" as "topics", and being forced to invent property "topics" to associate them with heavier-weight topics, helps to clarify the nature of the relationships and promotes flexibility in design. As technology improves and software designs evolve, techniques such as storing most if not all attribute-value pairs as triples become possible.

As an example of one of the clear advantages of such an approach, consider the simple case of recording the gender of an author.

Typically, such an attribute would be a clear-cut candidate as being a dependent column on the "author" table. The possibility of indicating that the gender was uncertain could perhaps be addressed by a kludge of inventing a new gender code (eg "M?"), or adding another attribute: an uncertain gender flag. Only if multiple genders were considered would the relationship between gender and author be "normalised out" (no doubt under sufferance) into another table. Even if it was so "normalised out", it would be unusual to be able to annotate each gender relationship, perhaps with a date range, or one or more notes each with an identified source. But by "reifying" the author-gender relationship as a "topic" it can itself play such roles in other relationships with other topics.

As a final example, consider the usefulness of the very simple concept of transitive relationships. The AustLit topic Katoomba has a "is narrower term" relationship with the AustLit topic Blue Mountains which in turn is narrower than Sydney (!) and New South Wales and Australia and so on. A query for agents born in New South Wales automatically follows the transitive relationships in which New South Wales participates, and hence will find authors born in Katoomba. When entered into the system, agents are often associated with a creation topic that is also associated with a spatial topic, such as Katoomba. That is, they are not associated with the string "Katoomba", but with a topic which quite independently partakes in a range of relationship types with other topics, and some of these relationship types are transitive. If the AustLit indexers wanted to add a new geographical topic in the future such as The Great Dividing Range, they could make it a narrower term of Australia, and a broader term for Blue Mountains, Southern Highlands, Snowy Mountains, Atherton Tablelands. A query for authors born on the Great Dividing Range would then include those born in Katoomba, Cooma, Malanda, etc.
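The effect of a transitive "is narrower term" relationship can be sketched with a small graph walk. The place names and their ordering are those given above, except that Cooma is assumed here to be a narrower term of Snowy Mountains; the two agents and the function names are hypothetical.

```python
from collections import defaultdict

NARROWER = defaultdict(set)   # broader term -> its directly narrower terms


def add_narrower(narrow, broad):
    """Record that `narrow` is a narrower term of `broad`."""
    NARROWER[broad].add(narrow)


def with_descendants(term):
    """The term itself plus everything reachable through narrower-term links."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(NARROWER.get(current, ()))
    return seen


add_narrower("Katoomba", "Blue Mountains")
add_narrower("Blue Mountains", "Sydney")
add_narrower("Sydney", "New South Wales")
add_narrower("New South Wales", "Australia")
add_narrower("Cooma", "Snowy Mountains")   # assumption: not stated explicitly above

# Hypothetical agents, each linked to a place topic rather than to a string.
BIRTHPLACES = {"Agent A": "Katoomba", "Agent B": "Cooma"}


def born_in(place):
    """Agents whose birthplace is the place or any narrower term of it."""
    places = with_descendants(place)
    return {agent for agent, birthplace in BIRTHPLACES.items() if birthplace in places}


# Adding a new broader topic later requires no change to any agent record.
add_narrower("Great Dividing Range", "Australia")
add_narrower("Blue Mountains", "Great Dividing Range")
add_narrower("Snowy Mountains", "Great Dividing Range")
```

After the last three lines, a query for the Great Dividing Range finds both agents, while a query for New South Wales still finds only the one born in Katoomba, because nothing in this toy data places the Snowy Mountains under New South Wales.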

The system builds an XML representation of "user level" objects (agents, works, thesaurus topics) which is transformed by XSLT stylesheets into the output formats of HTML, plain and tagged text. Updates are imported in XML format, and users can also view the "raw" XML that reflects the "FRBR plus event" structure of the data model. The XML/XSLT combination has proven to be very effective at providing flexible, efficient and easily maintainable formatting.

Future Work

Common, immutable, identifiers

The Topic Map community are strong advocates of identifying topics using subject indicators that can then be used to merge topics from different topic maps. The question of identity is a big one for bibliographic systems (and digital rights management systems) - which authors with the name "J Smith" refer to the same person? Which books with the title "A voyage down the Murray" refer to the same works, expressions or manifestations?
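Merging on subject indicators can be sketched as grouping topics by a shared identifier. The example below is an illustration under assumptions: the dictionary layout, the example.org indicator URIs and the topic ids are all hypothetical.

```python
def merge_topic_maps(*topic_maps):
    """Merge topics that share a subject indicator, pooling names and source ids."""
    merged = {}
    for topic_map in topic_maps:
        for topic in topic_map:
            entry = merged.setdefault(topic["subject_indicator"],
                                      {"names": set(), "source_ids": set()})
            entry["names"].update(topic["names"])
            entry["source_ids"].add(topic["id"])
    return merged


# Two hypothetical topic maps that each know an author called "J Smith".
map_one = [
    {"id": "m1-17", "subject_indicator": "http://example.org/id/smith-john",
     "names": ["J Smith", "Smith, John"]},
]
map_two = [
    {"id": "m2-03", "subject_indicator": "http://example.org/id/smith-john",
     "names": ["J Smith"]},
    {"id": "m2-04", "subject_indicator": "http://example.org/id/smith-jane",
     "names": ["J Smith"]},
]

merged = merge_topic_maps(map_one, map_two)
```

The two topics that share an indicator collapse into one entry, while the third "J Smith" stays separate despite having an identical name. Matching on names alone could not make that distinction.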

AustLit assigns unique topic identifiers to everything - even to dates and textual notes, but when it comes to necessary interfaces such as to the NLA's Kinetica system to discover holdings of a work/expression/manifestation, we are forced to fall back on author/title matches. We are investigating using one of Kinetica's title level "immutable numbers" but there are complications caused by the semantic differences in our data models - the MARC "title" view of Kinetica versus the FRBR view of AustLit. We also need to investigate common, immutable identifiers for agents across AustLit, NLA's systems, manuscript archives and other systems.

Playing our part in the semantic web

Although AustLit makes its data available in an XML format, this is a far cry from making it available in a standard, reusable format. The problem we face is that we are unsure of what standard to use. If RDF or DAML, which vocabulary? If Topic Maps, which serialisation format and which topic types and standard identifiers? If as plain XML, which schema?

Transforming our data into any specific format should be very easy - the hard part is developing common syntax and semantics with interested parties. Hopefully with a concrete implementation now available this process can be assisted.

Enhancing our vocabulary of relationships

There are many relationships between AustLit topics which specialist groups are interested in pursuing, such as deeper relationships between agents, organisations and movements.

Future technology

As RDF and Topic Map standards develop and perhaps coalesce, techniques for representing data natively as RDF/Topic Map "graphs" rather than in relational databases may well develop and offer more natural solutions for the physical representation of the AustLit database. It is hoped that techniques for visualising and browsing huge collections of relationships will be developed that will help researchers find previously undiscovered relationships and patterns within the bibliography.

Conclusions

AustLit was developed as a research tool to assist the literary research community. It already holds an extensive collection of information about Australian Literature - over 3.6 million topics and 5.4 million relationships between topics.

By adopting an approach based on topics and their relationships we have implemented a flexible architecture which we are confident can be extended to meet our own future needs as well as helping to build the Semantic Web, one node at a time.

Acknowledgements

The AustLit project was an ambitious undertaking, combining as it did some new and largely untested models within an uncertain funding environment provided by annual research grants and a physically distributed project team across 8 universities and the National Library of Australia. The achievement is a tribute to the enthusiasm and skills of all those involved [HREF18], and especially to the leadership and vision of the original Project Manager, Dr Marie-Louise Ayres.

The Project also owes a debt of gratitude to the many thinkers in the library and information worlds and especially those whose work has provided models and tools which underpin the AustLit data model:

Hypertext References

HREF1
http://www.userland.com/whatIsKoolAid

HREF2
http://www.w3.org/TR/1999/REC-rdf-syntax-19990222/

HREF3
http://xml.coverpages.org/topicMaps.html

HREF4
http://www.austlit.edu.au/

HREF5
http://www.ifla.org/VII/s13/frbr/frbr.pdf (pdf)

HREF6
http://www.w3.org/TR/2000/CR-rdf-schema-20000327/

HREF7
http://www.daml.org/2001/03/daml+oil-index

HREF8
http://citeseer.nj.nec.com/parsons00emancipating.html

HREF9
http://xml.coverpages.org/RDF-TopicMaps-LateLazyVersusEarlyPreemptiveReification.html

HREF10
http://www.ontopia.net/topicmaps/materials/rdf.htm

HREF11
http://www.arc.gov.au/default.htm

HREF12
http://www.indecs.org/

HREF13
http://www.ilrt.bris.ac.uk/discovery/harmony/

HREF14
http://www.austlit.edu.au:7777/DataModel/index.html

HREF15
http://www.austlit.edu.au:7777/presentations/nla-presmay2001.ppt (PowerPoint)

HREF16
http://www.austlit.edu.au:7777/presentations/DRRHGateway2001.htm

HREF17
http://www.austlit.edu.au/about/technicalPlatform

HREF18
http://www.austlit.edu.au/about/contributors

HREF19
http://www.w3.org/2001/sw/

HREF20
http://www.lib.adfa.edu.au:85

HREF21
http://www.unsw.adfa.edu.au

HREF22
http://www.w3.org/DesignIssues/Semantic.html/

HREF23
http://www.oasis-open.org/news/oasis_news_10_02_01.shtml

HREF24
http://www.y12.doe.gov/sgml/sc34/document/0129.pdf

HREF25
http://www.ontopia.net/topicmaps/materials/tao.html

HREF26
http://www.ifnet.it/elag2002/papers/pap5.html

Copyright

Kent Fitch, © 2002. The author assigns to Southern Cross University and other educational and non-profit institutions a non-exclusive licence to use this document for personal use and in courses of instruction provided that the article is used in full and this copyright statement is reproduced. The author also grants a non-exclusive licence to Southern Cross University to publish this document in full on the World Wide Web and on CD-ROM and in printed form with the conference papers and for the document to be published on mirrors on the World Wide Web.