Managing Literature References with Topic Maps

Dr. Dipl.-Ing. Robert Barta, Assoc. Prof., IT School, Bond University [HREF17], QLD, 4226. rho@bond.edu.au

Abstract

This use case shows how Topic Map (TM) technologies can be used consistently for knowledge engineering in the field of literature references. For demonstration we introduce a family of notations, AsTMa*, which allows us not only to capture factual information with TMs but also to define an appropriate ontology. To benchmark a TM query language we consider a simple application which converts literature references stored in TMs into the BIBTEX format.

1 Introduction

Management of literature references has for many years been the domain of librarians and academics. In both cases documents are classified according to an agreed global or localized taxonomy and catalogued to allow for an efficient retrieval process later [HREF1]. As the domain is well understood, this has led to huge library systems but also to local reference repositories for individuals [10].

In this context the first XTM Topic Map standard [HREF4] was defined to provide an XML notation for storing meta data along with references to external objects. One of the objectives was that users can maintain highly localized concepts while also using globally defined taxonomies to simplify the exchange and reconciliation of multiple reference catalogs.

Topic Maps also compete in a relatively new arena, the Semantic Web [HREF16]. Their most obvious difference to RDF [HREF14] is that they use a two-level approach [9, HREF12, 13, 8]: the more lexicographical part of a topic map consists of topics which represent (reify) real-world objects as well as abstract concepts. Here the main focus is on naming issues for different contexts and on references to objects external to the topic map.

The semantic aspect of TMs is covered by associations. Unlike RDF statements they are not (subject, predicate, object) triples; instead they bind an arbitrary number of topics together, each of which plays a specific role in the association.

While quickly gaining popularity in Europe, TM-related technologies still struggle for recognition1. In part this may reflect that it is still unclear what a constraint language and, related to that, a query language will look like.

In the following we introduce a family of languages designed for authoring, constraining and querying TM information (section 2). While a full language specification is beyond the scope of this paper, we demonstrate some language features using a literature reference database as a running example. Section 3 covers the authoring part and section 4 the underlying ontology. Once such an ontology is in place, many conversion problems from TM data into other forms can be transformed into the problem of mediating between ontologies. Section 5 first explains the problem of exporting TM data into BIBTEX before section 6 reformulates it as an application of an ontology transformation language.


2 AsTMa* Family

Managing TM information can be broken up into different phases, each covered by its own language:

- authoring factual content with AsTMa= [5],
- defining and constraining an ontology with AsTMa! [6], and
- querying and transforming topic maps with AsTMa? [7].

These languages build upon each other: AsTMa! uses AsTMa= for building the taxonomy of an ontology and employs patterns which extend AsTMa= with quantifiers and variables. AsTMa?, in turn, uses AsTMa! patterns to identify relevant information in a topic map (Fig. 1) and uses AsTMa= as a template to generate new content.

Figure 1: AsTMa* family


3 TM authoring

An AsTMa= document can contain any number of topic and association definitions. Taken together, these definitions make up a topic map; the order of the definitions is irrelevant.

AsTMa= is a line-oriented format which uses as few special symbols as possible in order to speed up the authoring process.

3.1 Topics

The following text fragment defines two topics (overlong lines can be wrapped using \ as the last character on the line):

ltm-spec (l-specification) reifies \
   http://www.ontopia.net/.../ltm.html
bn: LTM, The Linear Topic Map Notation
bn @ latex: {LTM}, The Linear Topic \
   Map Notation
oc (cite-code) @latex: urn:bibtex:lmg01
in: This technical report defines ...

p-lars-marius-garshol (l-person)
bn @ latex : Garshol, Lars Marius
bn: Lars Marius Garshol
sin: http://www.garshol.priv.no/

The first, ltm-spec, is an instance of the class l-specification, the other of l-person. The classes themselves are again topics which may be defined elsewhere. Otherwise, an AsTMa= processor might also autogenerate them.

Both topics have a base name (indicated by a line starting with bn:) which may be used for display. A special feature of TMs are scopes, which limit the validity of any topic characteristic. In the example above we see that one base name is only valid in the scope latex. We will make use of this later.

The first topic reifies the online document at the given URL. In contrast, the second topic is only about the person; here a subject indicator (sin:) is a weaker way of indicating identity. It may be used by a TM processor to merge this topic with another one having the same subject indicator.

Other characteristics a topic may contain are inline data (in:), which may contain descriptive text, and occurrences (oc:), which hold URIs pointing to external resources. Inline data and occurrences can be typed and scoped, as is the case with the cite-code occurrence above.
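
The classes used above, which we said may be defined elsewhere, can themselves be declared as ordinary topics. The following is only a minimal sketch; the typing topic class is our assumption and not something prescribed by AsTMa=:

l-specification (class)
bn: Specification

l-person (class)
bn: Person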

3.2 Associations

Associations are statements about the relationships of various topics. The following text

(is-author-of)
opus   : ltm-spec
author : p-lars-marius-garshol

states that the topic p-lars-marius-garshol plays the role of the author in an is-author-of association, whereby ltm-spec plays the role of an opus. We assume that all these topics are declared separately.

If we have further information, say that a document covers a particular theme, we can encode it with another association:

(covers-theme)
document : ltm-spec
theme    : topic-maps

Associations not only bring together many topics, they also operate on different semantic planes: the topics which play the roles, the topics which serve as the roles, and the one topic which is the nature, i.e. the type, of the association. Fig. 2 shows a complete association with three players, each of them with a role. It also shows that associations can be scoped.

Figure 2: Association
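
The exact AsTMa= syntax for scoping an association is not shown in this paper; one plausible form, assuming the same @ notation used for names and occurrences above and a hypothetical scope topic review-2003, might be:

(covers-theme) @ review-2003
document : ltm-spec
theme    : topic-maps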

It is worth noting that associations themselves cannot take part directly in other associations. To make statements about statements we first have to reify a statement with a new topic, which can then play an association role:

(covers-theme) is-reified-by statement-1
document : ltm-spec
theme    : topic-maps

(claim)
contender : p-lars-marius-garshol
claim     : statement-1

Using this technique any number of levels of meta information can be built up.
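
For illustration, the claim association above could itself be reified and then referred to by yet another statement. The association type disputes, its roles, and the topic p-reviewer are hypothetical and only introduced for this sketch:

(claim) is-reified-by statement-2
contender : p-lars-marius-garshol
claim     : statement-1

(disputes)
critic    : p-reviewer
disputed  : statement-2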


4 TM Literature Ontology

As in any authoring environment, a particular topic map instance can be required to follow a particular structure. The prescribed vocabulary, the taxonomy [15], local constraints on topics and associations, and finally global constraints on the whole map are all captured in a TM ontology.

Ontologies, once defined, provide formal and informal rules on how to create TM documents. In an integrated TM authoring environment ontologies can be used to guide the authoring process in the same way as XML schemas are used for XML authoring. Other uses include filtering of TM documents according to their ontology conformance or reconciliation of heterogeneous data sources [11]. As we do in the following, ontologies can also be used to create a projection, i.e. a particular view into an existing topic map. In this sense constraints are specialized queries which select only those parts of a map which conform to the constraints.

4.1 Taxonomy

To define an ontology $\mathcal{L}$ for literature references we first have to set up some basic vocabulary and taxonomy. This can be done solely with AsTMa= as it involves only simple associations expressing an is-subclass-of relationship:

(is-subclass-of)
subclass: l-book
superclass: l-document

(is-subclass-of)
subclass: l-article
superclass: l-document

# comment: similar for reports, etc.
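
The class l-specification, used for the ltm-spec topic in section 3.1, fits into the same taxonomy; its placement directly under l-document is our assumption:

(is-subclass-of)
subclass: l-specification
superclass: l-document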

4.2 Structure

Apart from the vocabulary we have to set up rules on individual literature references. We need constraints that either prescribe, suggest or forbid particular information for a literature reference.

One rule would state that every document, be it a book or anything else, must contain a title. Assuming that the title is stored in the topic base name, we write

forall $d [ * (l-document) ]
   => exists $d [ bn: * ]

Similar to logic-oriented languages, a validator will first try to identify all information within a map which matches the pattern [ * (l-document) ]. Generalizing the AsTMa= syntax, the pattern uses wildcards (and regular expressions) to identify topics which are direct or indirect instances of l-document. The wildcard * indicates that we are not interested in the actual topic id matched during this process.

The matched topic may contain more information than provided in the pattern, such as additional base names or occurrences; the [] pattern operator signals that a match will already succeed if only the given pattern is satisfied.

Once a particular part of a map is matched, this submap will be bound to the variable $d. For every such value the second part of the rule will be evaluated: it will be checked whether this part of the map also matches [ bn: * ], i.e. whether the topic contains a base name or not.

As we later want to use this title information in BIBTEX we want to encourage the topic map author to add a LATEX variant of the title:

forall $d [ * (l-document) ]
   => suggested 
        exists $d [ bn @ latex: * ]

The keyword suggested signals to a validator that it may ignore this rule, whereas authoring environments may solicit user input.

Rules can also have a more global range, as is the case with a constraint which prescribes that every document must also have an author:

forall [ $t (l-document) ]
  => exists ] (is-author-of)
              opus   : $t
              author : $a [
     AND
     exists [ $a (person) ]

As above, we first single out all topics which are instances of l-document; this time we bind the topic id to the variable $t. For all those documents we then prescribe the existence of an appropriate association. In contrast to the [] pattern operator above, the ][ around the association pattern has to be read as exactly so. Consequently, the association must contain precisely two members, one playing the role of opus and the other that of author. The use of $t in the association ensures that the currently investigated document is playing the opus role. The topic id matched and bound to $a is then used to find a person with that id.

The example also shows the use of boolean operators to combine clauses.

4.3 Application Specific Constraints

The final set of constraints deals more with the application domain. One example of such a constraint is: every person in the map must be either an author or an editor of some document.

forall [ $p (person) ]
   => exists [ (is-author-of)
               author : $p ]
      OR
      exists [ (is-editor-of)
               editor : $p ]

This time we bind all instances of person to the variable $p and check whether this id plays a role in either of the two association patterns. The use of [] again indicates only our minimal expectations. The matched associations in the topic map may contain more members.

Application-specific rules can also help to add implicit knowledge to a domain where explicit encoding would be too cumbersome: while we could declare every person to be an instance of author or editor, it is clear that a person should only be an author if that person is involved in an is-author-of association:

forall [ (is-author-of)
         author : $p ]
   => derived exists [ $p (author) ]

With derived we signal to the validator to ignore this rule; it is directed to a query processor which may make use of this implicit knowledge and may consider facts like

p-lars-marius-garshol (author)

although they are not explicitly stated.
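
An analogous derived rule could be stated for editors; this is only a sketch following the same pattern and assumes an editor class topic alongside author:

forall [ (is-editor-of)
         editor : $p ]
   => derived exists [ $p (editor) ]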


5 Target Ontology

In this section we discuss one particular application where we use our literature ontology $\mathcal{L}$ to filter only specific aspects out of an existing topic map. In this process we have to define a mapping between literature reference information in Topic Map form and a conventional database system which follows the entity-attribute paradigm.

Our database format will be BIBTEX[10] which prescribes a set of document classes (books, reports, etc.). Each of these classes consists of specific mandatory or optional attributes (author, title, etc.).

The topic ltm-spec defined in the listing in section 3.1 would be represented in BIBTEX as follows:

@misc{urn:bibtex:lmg01,
      author = {{Garshol, Lars Marius}},
      title = {{LTM}, The Linear Topic
              Map Notation},
      year = {2001},
      url = {http://www.ontopia.net/...}
}

As such, BIBTEX follows its own schema definition, which there serves the role of an ontology. To allow for a formalized mapping within one single framework, we have captured this BIBTEX schema in another set of AsTMa! constraints $\mathcal{B}$.

Notably, BIBTEX is a schema based on document classes which may or must contain specific attributes. This means that we first have to introduce classes and attributes as topics for our target vocabulary:

b-book (class)

b-report (class)
# ... and all other classes follow

b-publisher (attribute)
# ... all other attributes here

Given that, we can now make explicit the rules specific to BIBTEX, such as that a book must have a title (we conveniently store that in a base name):

forall $b [ * (b-book) ]
    => exists $b [ bn: * ]

All other attributes we plan to represent via a generic is-attribute-of association. We only have to define that all these associations must have a particular structure:

forall $a [ (is-attribute-of) ]
   => exists $a ] (is-attribute-of) 
                   object    : *
                   attribute : *
                   value     : * [

A more application-specific rule would be that every book must also have a publisher attribute, while we do not care about its value:

forall [ $b (b-book) ]
  => exists [ (is-attribute-of)
              object    : $b
              attribute : b-publisher ]
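
A further constraint of the same shape could demand a journal for every article. The class b-article and the attribute topic b-journal are assumed to be declared like the class and attribute topics above:

forall [ $a (b-article) ]
  => exists [ (is-attribute-of)
              object    : $a
              attribute : b-journal ]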

In a similar way we proceed with all other attributes and all other BIBTEX classes. Some complications arise, though.


6 Ontology Mapping

To mediate between $\mathcal{L}$ and the BIBTEX ontology $\mathcal{B}$ (Fig. 3) we can hardcode the mapping directly into an application. This was actually done, not only to get working code but also to understand the practicalities involved.

Figure 3: Translating between ontologies

In a first step the relevant topics of the literature ontology have to be identified. In our case we are interested in all topics which are a direct or indirect subclass of l-document. For all of these (l-book, l-article, etc.) we have to define their respective counterparts in $\mathcal{B}$. While an l-book in $\mathcal{L}$ obviously corresponds to b-book in $\mathcal{B}$, for other document types in $\mathcal{L}$ this choice is less obvious.

According to $\mathcal{B}$ the class will then define which attributes are mandatory and which are optional for this object. For a book the attributes title, publisher, year and author or editor have to be defined, whereas volume and other attributes are optional.

For all the above attributes, values have to be identified in the source map. For the example specification document provided in the listing in section 3.1, the application has to follow all is-author-of associations for that particular document to identify the topics playing the author role. The base names of the respective topics are then used as author names.

With these specifications a dedicated application can now perform the mapping. One of the promises of a uniform formalism, though, is that such a mapping between two ontologies can be defined within that formalism. For this purpose we make use of an experimental TM query language, AsTMa?.

In a similar way as SQL operates on tables to return a table and XQuery[HREF2] operates on XML documents to return XML documents, AsTMa? queries analyze topic maps following the source ontology and return maps conforming to a target ontology.

As an example let us consider the conversion of books together with their titles:

in "literature.tm"
  where
    exists [ $b (l-book)
             bn @ latex : $t ]
  return
    {$b} (b-book)
    bn: {$t}

Here we iterate over a topic map sourced from the given address (the details thereof are not relevant here) and look for all submaps which conform to the condition provided by the where clause. We are selecting all submaps which contain an l-book topic with a LATEX-scoped title. According to the AsTMa? language semantics only those submaps will be considered which are minimal in that they do not contain other, unnecessary topics or associations (no junk). In our case the submaps will only consist of individual l-book topics.

For all these submaps the return clause is evaluated. It contains AsTMa= code, this time for constructing a new map. We reuse the topic id of a particular l-book topic as the id for the corresponding topic in the target map. As these ids are bound to the variable $b, the value can be referred to as {$b}2 in the return clause.

A more sophisticated query would capture more information about a book:

in "literature.tm"
  where
    exists [ $b (l-book)
             bn @ latex : $t ]
    and
    exists [ (is-author-of)
             opus   : $b
             author : $p ]
    and
    exists{1,}
           [ $p (person)
             bn @ latex : $n ]
  return
    {$b} (b-book)
    bn: {$t}

    (is-attribute-of)
    object    : {$b}
    attribute : author
    value     : {join(", ", @n)}

Again we first use a pattern to identify those topics which are an l-book. In the second exists clause we identify the association which links the author to the book $b. In the third clause we have modified the exists quantifier to exists{1,}, borrowing the notation of extended regular expressions. With this we signal greedy matching to the TM processor, i.e. that the matched submap should have at least one, but then as many instances of this pattern as possible. This then results in a list of matches for this clause.

As before, the processor will only pass through those minimal maps which do not violate the where clause (no junk).

In the construction part within the return clause we again refer to a single book, adding the matched title as base name. In accordance with $\mathcal{B}$ we also generate an is-attribute-of association to add the author information which we have matched before. As we may have matched multiple such names due to greedy matching, the processor will have captured the individual names in a list @n which contains all individually matched names indicated by $n above. We concatenate the strings in this list using a textual join and use the result as the attribute value.
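
To illustrate the result, assume a hypothetical book topic some-book with the LATEX-scoped title Some Title and a single author whose LATEX-scoped name is Doe, John. The return clause would then generate a target map fragment roughly like the following, each such fragment corresponding directly to one BIBTEX entry:

some-book (b-book)
bn: Some Title

(is-attribute-of)
object    : some-book
attribute : author
value     : Doe, John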

Once the target map has been built it is trivial to serialize this into the final BIBTEX text format.

7 Conclusion

We have demonstrated how TM engineering can be used to manage meta data content and how ontologies can be used to constrain instances of this content. The declarative nature of ontologies allows us to combine them freely. Thus a document which satisfies two ontologies can be said to satisfy an AND-combination of the two.

We then formalized a simple entity-attribute model within our ontology framework using rather generic associations. This was the basis for formalizing the mapping between the two ontologies, which enables a query processor to translate a topic map from one ontology into another.

Use cases like these have helped to shape the AsTMa* languages, specifically to find out about the issues involved in practical situations.

8 Future Work

As mentioned before, we have implemented a dedicated translator based on an existing Topic Map package in Perl. While this serves as a proof of concept, the final goal is to establish a generic translation based on AsTMa?. For this purpose we have already started to implement AsTMa! by converting constraints into Prolog (e.g. [3]). From there we plan to implement the query language itself.

On a more formal track, we have started to sketch a rigorously formal framework for ontology engineering based on category theory. This $\tau$ algebra may serve as a foundation in the same way as the relational algebra does for relational databases.

References

1
US MARC Standards, [HREF1], US Library of Congress, Network Development and Marc standards Office.
2
XQuery, W3C Working Draft, 16 August 2002. [HREF2], W3C.
3
F. Kluzniak, S. Szpakowicz. Prolog for Programmers. Academic Press, 1985.
4
XML Topic Maps (XTM) 1.0 Specification. [HREF4] TopicMaps.Org, 2001.
5
Barta, R., AsTMa= language definition, technical report. Bond University, 2001. [HREF5]
6
Barta, R., AsTMa! language definition, technical report. Bond University, 2002. [HREF6].
7
Barta, R. AsTMa? (Asymptotic Topic Map Notation, Querying), language definition, technical report. Bond University, 2003.
[HREF7].
8
Freese, E. Topic Maps and RDF. Int. SGML/XML Users' Group, 2001.
9
Graham D. Moore. RDF and Topic Maps - an exercise in convergence. XML Europe 2001, 2001.
10
H. Kopka, P. W. Daly. A Guide to LaTeX. Addison Wesley, 1999.
11
H. Wache, T. Voegele, U. Visser, H. Stuckenschmidt, G. Schuster, H. Neumann and S. Huebner. Ontology-based integration of information - a survey of existing approaches [HREF11]. Proceedings of the IJCAI-01 Workshop on Ontologies and Information Sharing, Seattle, WA, pages 108-117.
12
Jonathan Robie. The syntactic web - syntax and semantics on the web [HREF12]. XML 2001, 2001.
13
M. S. Lacher, S. Decker. RDF, Topic Maps, and the Semantic Web. Markup Languages: Theory & Practice, 3(3): 313-331, Summer 2001.
14
O. Lassila and R. Swick. Resource Description Framework (RDF) model and syntax specification, technical report, W3C [HREF14], 1999.
15
H. H. Rath. Topic maps: templates, topology, and type hierarchies. Markup Languages: Theory & Practice, 2(1): 45-64, Winter 2000.
16
T. Berners-Lee. Feature article: The semantic web, [HREF16]. Scientific American, 2001.

Hypertext References

HREF1
http://www.loc.gov/marc/marc.html
HREF2
http://www.w3.org/TR/xquery/
HREF4
http://www.topicmaps.org/xtm/1.0
HREF5
http://www.it.bond.edu.au/publications/02TR/02-14.pdf
HREF6
http://www.it.bond.edu.au/publications/03TR/03-05.pdf
HREF7
http://astma.it.bond.edu.au/astma%3F-spec.dbk
HREF11
http://www.tzi.de/buster/papers/SURVEY.pdf
HREF12
http://www.idealliance.org/papers/xml2001/papers/pdf/03-01-04.pdf
HREF14
http://www.w3.org/TR/1999/REC-rdf-syntax-19990222.html
HREF16
http://www.scientificamerican.com/2001/0501issue/0501berners-lee.html
HREF17
http://www.bond.edu.au/



Footnotes

... recognition1
Industrial adoption is higher than adoption in academia.
... {$b}2
Here we borrow from the XQuery syntax.