Dr. Dipl.-Ing. Robert Barta, Assoc. Prof., IT School, Bond University [HREF17], QLD, 4226. rho@bond.edu.au
This use case shows how Topic Map (TM) technologies can be used consistently for knowledge engineering in the field of literature references. For demonstration we introduce a family of notations, AsTMa*, which allows us not only to capture factual information with TMs but also to define an appropriate ontology. To benchmark a TM query language we consider a simple application which converts literature references stored in TMs into the BIBTEX format.
Management of literature references has for many years been a domain of librarians and academics. In both cases documents are classified according to an agreed global or localized taxonomy and catalogued to allow for an efficient retrieval process later [HREF1]. As the domain is well understood, this has led to huge library systems but also to local reference repositories for individuals [10].
In this context the first XTM Topic Map standard [HREF4] was defined to provide an XML notation for storing metadata along with references to external objects. One of the objectives was that users can have highly localized concepts but can also use globally defined taxonomies to simplify the exchange and reconciliation of multiple reference catalogs.
Topic Maps also compete in a relatively new arena, the Semantic Web [HREF16]. Their most obvious difference from RDF [HREF14] is that they use a two-level approach [9, HREF12, 13, 8]: the more lexicographical part of a topic map consists of topics which represent (reify) real-world objects but also abstract concepts. Here the main focus is on naming issues for different contexts and also on references to objects external to the topic map.
The semantic aspect of TMs is covered by associations. Unlike RDF statements they are not (subject, predicate, object) triples, but coerce an arbitrary number of topics together; each of these topics then plays a specific role in the association.
While quickly gaining popularity in Europe, TM-related technologies still struggle for recognition. In part this may reflect that it is still unclear what a constraint language and--related with that--a query language will look like.
In the following we introduce a family of languages designed for authoring, constraining and querying TM information (section 2). While a full language specification is beyond the scope of this paper, we demonstrate some language features using a literature reference database as a running example. Section 3 covers the authoring part and section 4 the underlying ontology. Once such an ontology is in place, many conversion problems from TM data into other forms can be transformed into the problem of mediating between ontologies. Section 5 first explains the problem of exporting TM data into BIBTEX before section 6 reformulates it as an application for an ontology transformation language.
Managing TM information can be broken up into different phases:
While XTM is such a notation, due to the self-describing nature of XML, XTM documents tend to be rather verbose and are thus difficult to maintain. AsTMa= [HREF5], the first sublanguage of AsTMa*, addresses this problem by providing a shorthand notation for the most important aspects of TM authoring.
Such ontologies can be used for validating topic maps, for supporting the authoring process but also for filtering out those aspects of a map which conform to the set of given constraints.
More abstractly, a query will analyze a topic map and search for particular patterns. From the data in these patterns the output is then generated. If--for symmetry reasons--we choose the result to be again a topic map, then a query language becomes a topic map transformation language, mediating between two--mostly different--ontologies: that of the original topic map and that of the result map. AsTMa? [HREF7] covers this part.
All these languages build upon each other: AsTMa! uses AsTMa= for building a taxonomy for the ontology. It uses patterns which extend AsTMa= by quantifiers and variables. AsTMa?, in turn, uses AsTMa! patterns to identify relevant information in a topic map (Fig. 1). It then uses AsTMa= as a template to generate new content.
An AsTMa= document can contain any number of topic and association definitions. In their completeness these definitions make up a topic map; the order of the definitions itself is irrelevant.
AsTMa= is a line-oriented format, making use of special symbols as little as possible to speed up the authoring process.
The following text fragment defines two topics (overlong lines can be wrapped using \ as the last character on the line):
ltm-spec (l-specification) reifies \
    http://www.ontopia.net/.../ltm.html
bn: LTM, The Linear Topic Map Notation
bn @ latex: {LTM}, The Linear Topic \
    Map Notation
oc (cite-code) @ latex: urn:bibtex:lmg01
in: This technical report defines ...

p-lars-marius-garshol (l-person)
bn @ latex: Garshol, Lars Marius
bn: Lars Marius Garshol
sin: http://www.garshol.priv.no/
The first, ltm-spec, is an instance of the class l-specification, the other of l-person. The classes themselves are again topics which may be defined elsewhere. Otherwise, an AsTMa= processor might also autogenerate them.
Both topics have a base name (indicated by a line starting with bn:) which may be used for display. A speciality of TMs are scopes which limit the validity of any topic characteristic. In the example above we see that one base name is only valid in the scope latex. We will later make use of this.
The first topic reifies the online document at the given URL. In contrast, the second topic is only about the person; here a subject indicator (sin:) is a weaker form to indicate identity. It may be used by a TM processor to merge this topic with another having the same subject indicator.
Other characteristics a topic may contain are inline data (in:) which may contain descriptive text, and occurrences (oc:) which hold URIs pointing to external resources. Inline data and occurrences can be typed and scoped, as it is the case with the cite-code occurrence above.
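To make these notions concrete, the ltm-spec topic above could be modelled as a small in-memory structure. The following Python sketch (names and layout are our own illustration, not part of any AsTMa= processor) shows scoped base names and the typed, scoped occurrence:

```python
# Hypothetical in-memory model of the ltm-spec topic from the example.
# Base names and occurrences carry an optional scope (None = unscoped).
ltm_spec = {
    "id": "ltm-spec",
    "types": ["l-specification"],
    "reifies": "http://www.ontopia.net/.../ltm.html",
    "basenames": [
        {"scope": None, "value": "LTM, The Linear Topic Map Notation"},
        {"scope": "latex", "value": "{LTM}, The Linear Topic Map Notation"},
    ],
    "occurrences": [
        {"scope": "latex", "type": "cite-code", "value": "urn:bibtex:lmg01"},
    ],
}

def basename(topic, scope=None):
    """Return the first base name matching the given scope."""
    for bn in topic["basenames"]:
        if bn["scope"] == scope:
            return bn["value"]
    return None

print(basename(ltm_spec, "latex"))  # the LaTeX-scoped variant
```

The scope acts here as a simple lookup key: asking for the latex scope yields the LaTeX variant, asking without a scope yields the plain base name.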
Associations are statements about the relationships of various topics. The following text
(is-author-of)
opus : ltm-spec
author : p-lars-marius-garshol
states that the topic p-lars-marius-garshol plays the role of the author in an is-author-of association, whereby ltm-spec plays the role of an opus. We assume that all these topics are declared separately.
If we have further information, say that a document covers a particular theme, we can encode it with another association:
(covers-theme)
document : ltm-spec
theme : topic-maps
Associations not only bring together many topics, they also operate on different semantic planes: the level of the topics which play the roles, the topics which are the roles, and the one topic which is the nature, i.e. the type, of the association. Fig. 2 shows a complete association with three players, each of them with a role. It also shows that associations can be scoped.
It is worth noting that associations themselves cannot directly take part in other associations. To make statements about statements, we first have to reify a statement with a new topic; that topic can then play an association role:
(covers-theme) is-reified-by statement-1
document : ltm-spec
theme : topic-maps

(claim)
contender : p-lars-marius-garshol
claim : statement-1
Using this technique, any number of levels of meta information can be built.
As in any authoring environment, a particular topic map instance can be required to follow a particular structure. The prescribed vocabulary, the taxonomy [15], local constraints on topics and associations, and finally global constraints on the whole map are all captured in a TM ontology.
Ontologies--once defined--provide formal and informal rules on how to create TM documents. In an integrated TM authoring environment ontologies can be used to guide the authoring process in the same way as XML schemas are used for XML authoring. Other uses include filtering of TM documents according to their ontology conformance or reconciliation of heterogeneous data sources [11]. As shown in the following, we can also use ontologies to create a projection, i.e. a particular view into an existing topic map. In this sense constraints are specialized queries which select only those parts of a map which conform to the constraints.
To define an ontology for literature references we first have to set up some basic vocabulary and taxonomy. This can be done solely with AsTMa= as it involves only simple associations expressing an is-subclass-of relationship:
(is-subclass-of)
subclass: l-book
superclass: l-document

(is-subclass-of)
subclass: l-article
superclass: l-document
# comment: similar for reports, etc.
Apart from the vocabulary we have to set up rules on individual literature references. We need constraints that either prescribe, suggest or forbid particular information for a literature reference.
One rule would state that every document, be it now a book or any other, must contain a title. Assuming that the title is stored in the topic base name, we write
forall $d [ * (l-document) ] => exists $d [ bn: * ]
Similar to logic-oriented languages, a validator will first try to identify all information within a map which matches the pattern [ * (l-document) ]. Generalizing the AsTMa= syntax, the pattern uses wildcards (and regular expressions) to identify topics which are a direct or indirect instance of l-document. The wildcard * indicates that we are not interested in the actual topic id matched during this process.
The matched topic may contain more information than provided in the pattern, like additional base names or occurrences; the [] pattern operator signals that a match will already succeed if only the given pattern is satisfied.
Once a particular part of a map is matched, this submap will be bound to the variable $d. For every such value the second part of the rule will be evaluated: It will be checked whether this part of the map also matches [ bn: * ] , i.e. whether the topic contains a basename or not.
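A validator for such a rule could proceed roughly as follows. This Python sketch uses a hypothetical topic representation and subclass table of our own devising, not the actual AsTMa! machinery:

```python
# Minimal sketch: topics as dicts with "types" and "basenames" keys.
topics = [
    {"id": "ltm-spec", "types": ["l-specification"],
     "basenames": ["LTM, The Linear Topic Map Notation"]},
    {"id": "untitled-doc", "types": ["l-book"], "basenames": []},
]

# Assumed subclass table: both classes are (indirect) l-documents.
subclass_of = {"l-specification": "l-document", "l-book": "l-document"}

def is_instance_of(topic, cls):
    """Follow the subclass chain to test direct or indirect instance-of."""
    for t in topic["types"]:
        while t is not None:
            if t == cls:
                return True
            t = subclass_of.get(t)
    return False

# forall $d [ * (l-document) ] => exists $d [ bn: * ]
violations = [t["id"] for t in topics
              if is_instance_of(t, "l-document") and not t["basenames"]]
print(violations)  # topics missing a base name
```

The forall clause corresponds to the filter over all (direct or indirect) l-document instances; the exists clause corresponds to the base-name test applied to each match.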
As we later want to use this title information in BIBTEX we want to encourage the topic map author to add a LATEX variant of the title:
forall $d [ * (l-document) ]
=> suggested
exists $d [ bn @ latex: * ]
The keyword suggested signals to a validator to ignore this rule, whereas authoring environments may solicit user input.
Rules can also have a more global range as is the case with a constraint which prescribes that every document also must have an author:
forall [ $t (l-document) ]
=> exists ] (is-author-of)
opus : $t
author : $a [
AND
exists [ $a (person) ]
Like above, we first single out all topics which are instances of l-document. This time we bind the topic id to the variable $t. For all those documents we then prescribe the existence of an appropriate association. In contrast to the [] pattern operator above, the ][ around the association pattern has to be read as exactly so: the association must contain precisely two members, one playing the role of opus and the other the role of author. The use of $t in the association ensures that the currently investigated document plays the opus role. The topic id matched and bound to $a will then be used to find a person with that id.
The example also shows the use of boolean operators to combine clauses.
The final set of constraints deals more with the application domain. One example of such a constraint is every person in the map must be either an author or an editor of some document.
forall [ $p (person) ]
=> exists [ (is-author-of)
author : $p ]
OR
exists [ (is-editor-of)
editor : $p ]
This time we bind all instances of person to the variable $p and check whether this id plays a role in either of the two association patterns. The use of [] again indicates only our minimal expectations. The matched associations in the topic map may contain more members.
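Checked procedurally, the rule amounts to a scan over all person topics. A minimal Python sketch follows; the association representation and sample data are our own assumptions:

```python
# Sketch: associations as dicts mapping role names to player topic ids.
associations = [
    {"type": "is-author-of", "opus": "ltm-spec",
     "author": "p-lars-marius-garshol"},
]
persons = ["p-lars-marius-garshol", "p-john-doe"]  # sample person topics

def plays_role(person, assoc_type, role):
    """True if the person plays the given role in any such association."""
    return any(a["type"] == assoc_type and a.get(role) == person
               for a in associations)

# forall [ $p (person) ] => is-author-of OR is-editor-of
violations = [p for p in persons
              if not (plays_role(p, "is-author-of", "author")
                      or plays_role(p, "is-editor-of", "editor"))]
print(violations)  # persons which are neither author nor editor
```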
Application specific rules can also help to add implicit knowledge to a domain where explicit encoding would be too cumbersome: While we could code every person to be an instance of author or editor it is clear that a person should only be an author if the person is involved in an is-author-of association:
forall [ (is-author-of)
author : $p ]
=> derived exists [ $p (author) ]
With derived we signal to the validator to ignore this rule; it is directed to a query processor which may make use of this implicit knowledge and may consider facts like
p-lars-marius-garshol (author)
although they are not explicitly stated.
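Materializing such derived facts could, in a query processor, look roughly like this (a sketch over our own hypothetical association model):

```python
# Sketch: associations as role -> player dicts, types as a topic -> set map.
associations = [
    {"type": "is-author-of", "opus": "ltm-spec",
     "author": "p-lars-marius-garshol"},
]
topic_types = {"p-lars-marius-garshol": {"l-person"}}

# derived: forall [ (is-author-of) author : $p ] => exists [ $p (author) ]
for a in associations:
    if a["type"] == "is-author-of":
        topic_types.setdefault(a["author"], set()).add("author")

print(topic_types["p-lars-marius-garshol"])
```

After this pass the processor can answer queries about the class author without that class ever having been asserted in the source map.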
In this section we discuss one particular application where we use our literature ontology to filter only specific aspects out of an existing topic map. In this process we have to define a mapping between literature reference information in Topic Map form and a conventional database system which follows the entity-attribute paradigm.
Our database format will be BIBTEX[10] which prescribes a set of document classes (books, reports, etc.). Each of these classes consists of specific mandatory or optional attributes (author, title, etc.).
The topic ltm-spec defined in listing 3.1 would be represented in BIBTEX as follows:
@misc{urn:bibtex:lmg01,
author = {{Garshol, Lars Marius}},
title = {{LTM}, The Linear Topic
Map Notation},
year = {2001},
url = {http://www.ontopia.net/...}
}
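The rendering of such a record can be sketched in a few lines of Python. This is a simplification; real BIBTEX tooling handles escaping and field conventions far more carefully:

```python
def to_bibtex(entry_class, cite_key, fields):
    """Render one BibTeX record from a dict of field name -> value."""
    body = ",\n".join(f"  {k} = {{{v}}}" for k, v in fields.items())
    return f"@{entry_class}{{{cite_key},\n{body}\n}}"

# Values mirror the example record above; braces around the author name
# preserve BibTeX's "Surname, Firstname" grouping.
record = to_bibtex("misc", "urn:bibtex:lmg01", {
    "author": "{Garshol, Lars Marius}",
    "title": "{LTM}, The Linear Topic Map Notation",
    "year": "2001",
    "url": "http://www.ontopia.net/...",
})
print(record)
```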
As such, BIBTEX follows its own schema definition which serves there the role of an ontology. To allow for a formalized mapping within one single framework, we have captured this BIBTEX schema in another set of AsTMa! constraints.
Notably, BIBTEX is a schema based on document classes which may or must contain specific attributes. This means that we first have to introduce classes and attributes as topics for our target vocabulary:
b-book (class)
b-report (class)
# ... and all other classes follow

b-publisher (attribute)
# ... all other attributes here
Given that, we can now make explicit the rules specific for BIBTEX, such as that a book must have a title (we conveniently store that in a base name):
forall $b [ * (b-book) ]
=> exists $b [ bn: * ]
All other attributes we plan to represent via a generic is-attribute-of association. We only have to define that all these associations must have a particular structure:
forall $a [ (is-attribute-of) ]
=> exists $a ] (is-attribute-of)
object : *
attribute : *
value : * [
A more application specific rule would be that every book also must have a publisher attribute while we do not care about its value:
forall [ $b (b-book) ]
=> exists [ (is-attribute-of)
object : $b
attribute : b-publisher ]
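Procedurally, this check is again a scan, this time over the generic attribute associations. A Python sketch with made-up sample values:

```python
# Sketch: is-attribute-of associations as (object, attribute, value) triples.
# Sample data is invented for illustration only.
attributes = [
    ("some-book", "b-publisher", "Some Publisher"),
    ("some-book", "b-year", "2001"),
]
books = ["some-book", "other-book"]  # topic ids of b-book instances

# forall [ $b (b-book) ] => exists is-attribute-of with b-publisher
missing = [b for b in books
           if not any(obj == b and attr == "b-publisher"
                      for obj, attr, _ in attributes)]
print(missing)  # books without a publisher attribute
```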
In a similar way we proceed with all other attributes and all other BIBTEX classes. Some complications arise, though:
forall [ $t (l-document) ]
=> suggested exists
[ $t
oc @ latex (cite-code): * ]
As the code is only useful in a particular context, we have added a scope latex to restrict its validity to that scope. If such a cite code does not exist for a document then it must be generated, the topic id being a good starting point.
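One conceivable generation scheme, starting from the topic id as suggested, is sketched below. The urn:bibtex: prefix merely follows the earlier example; the concrete scheme is our own assumption:

```python
import re

def make_cite_code(topic_id):
    """Derive a cite code from a topic id (hypothetical scheme)."""
    # Collapse anything outside [A-Za-z0-9] into single hyphens.
    key = re.sub(r"[^A-Za-z0-9]+", "-", topic_id).strip("-")
    return f"urn:bibtex:{key}"

print(make_cite_code("ltm-spec"))  # urn:bibtex:ltm-spec
```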
We also have to ensure that cite codes are unique within one map:
forall [ $t1
oc (cite-code): $code ]
=> not exists
[ $t2
oc (cite-code): $code ]
This rule first singles out all topics having a cite code. A single topic id will be bound to the variable $t1 and the corresponding value of the cite code is bound to the variable $code in the first clause. Then it will be checked whether there is a topic which contains an identical cite code in the second clause. The -- somewhat unorthodox, but convenient -- language semantics of AsTMa! enforces that two differently named variables cannot be bound to the same values, so the topic ids must be different.
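Outside AsTMa!, the same uniqueness condition is a simple duplicate scan. A Python sketch with sample cite codes of our own invention:

```python
from collections import Counter

# Sketch: cite codes per topic id; the clash is deliberate sample data.
cite_codes = {
    "ltm-spec": "urn:bibtex:lmg01",
    "xtm-spec": "urn:bibtex:lmg01",   # duplicate on purpose
    "tmql-draft": "urn:bibtex:bar03",
}

counts = Counter(cite_codes.values())
duplicates = sorted(code for code, n in counts.items() if n > 1)
print(duplicates)  # cite codes used by more than one topic
```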
As it is difficult to formalize such rules, we will have to burden the author of a conforming map to provide appropriate input:
forall $t [ * (l-document) ]
=> suggested exists $t
[ bn @ latex : * ]
forall [ (is-author-of)
author: $a ]
=> suggested exists
[ $a
bn @ latex : * ]
If the mapping application then finds a LATEX variant of the title, that should take preference over an unscoped title. In a similar way, author naming can be tailored for LATEX.
To mediate between the literature ontology and the BIBTEX ontology (Fig. 3) we can hardcode the mapping directly into an application. This was actually done, not only to get working code but also to understand the practicalities involved.
In a first step the relevant topics of the literature ontology have to be identified. In our case we are interested in all topics which are a direct or indirect subclass of l-document. For all these (l-book, l-article, etc.) we have to define their respective counterparts in the BIBTEX ontology.
While obviously an l-book in the literature ontology will correspond to b-book in BIBTEX, for other document types this choice is less obvious.
According to the BIBTEX schema, the class then defines which attributes are mandatory and which are optional for this object. For a book the attributes title, publisher, year and author or editor have to be defined, whereas volume and other attributes are optional.
For all the above attributes values have to be identified in the source map. For the example specification document provided in listing 3.1, the application will have to follow all is-author-of associations for that particular document to identify the topics playing the author role there. The base names of the respective topics will then be used as author names.
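This traversal can be sketched as follows, again over our own hypothetical association and base-name representation:

```python
# Sketch: associations as role -> player dicts; base names per topic id.
associations = [
    {"type": "is-author-of", "opus": "ltm-spec",
     "author": "p-lars-marius-garshol"},
]
basenames = {"p-lars-marius-garshol": "Garshol, Lars Marius"}

def authors_of(opus):
    """Follow all is-author-of associations for the given opus and
    return the base names of the topics playing the author role."""
    return [basenames[a["author"]] for a in associations
            if a["type"] == "is-author-of" and a["opus"] == opus]

print(authors_of("ltm-spec"))
```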
With these specifications a dedicated application can now perform the mapping. One of the promises of a uniform formalism, though, is that such a mapping between two ontologies can be defined within that formalism. For this purpose we make use of an experimental TM query language, AsTMa? .
In a similar way as SQL operates on tables to return a table and XQuery[HREF2] operates on XML documents to return XML documents, AsTMa? queries analyze topic maps following the source ontology and return maps conforming to a target ontology.
As an example let us consider the conversion of books together with their titles:
in "literature.tm"
where
exists [ $b (l-book)
bn @ latex : $t ]
return
{$b} (b-book)
bn: {$t}
Here we iterate over a topic map sourced from the given address (the details thereof are not relevant here) and look for all submaps which satisfy the condition provided by the where clause. We are selecting all submaps which contain an l-book topic with a LATEX-scoped title. According to the AsTMa? language semantics, only those submaps will be considered which are minimal in that they do not unnecessarily contain other topics or associations (no junk). In our case the submaps will consist only of individual l-book topics.
For all these submaps the return clause is evaluated. It contains AsTMa= code, this time for constructing a new map. We reuse the topic id of a particular l-book topic as the id for the corresponding topic in the target map. As these ids are bound to the variable $b, the value can be referred to as {$b} in the result section.
A more sophisticated query would capture more information about a book:
in "literature.tm"
where
exists [ $b (l-book)
bn @ latex : $t ]
and
exists [ (is-author-of)
opus : $b
author : $p ]
and
exists{1,}
[ $p (person)
bn @ latex : $n ]
return
{$b} (b-book)
bn: {$t}
(is-attribute-of)
class : {$b}
attribute : author
value : {join(", ", @n)}
Again we first use a pattern to identify those topics being an l-book. In the second exists clause we identify the association which links the author to the book $b. In the third clause we have modified the exists quantifier to exists{1,}, borrowing the notation of extended regular expressions. With this we signal greedy matching to the TM processor, i.e. that the matched submap should have at least one, but then as many instances of this pattern as possible. This then results in a list of matches for this clause.
As before, the processor will only pass through those minimal maps which do not violate the where clause (no junk).
In the construction part within the return clause we again refer to a single book, adding the matched title as base name. In accordance with the BIBTEX ontology we also generate an is-attribute-of association to add the author information which we have matched before. As we have matched multiple such names due to greedy matching, the processor will have captured the individual names in a list @n which contains all individually matched names indicated by $n above. We concatenate the strings in this list using a textual join and use the result as the attribute value.
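The textual join behaves like an ordinary string join; in Python terms (an analogy only, not AsTMa? code):

```python
# @n holds all individually matched author names (greedy matching);
# the names here are sample values mirroring the running example.
n = ["Garshol, Lars Marius", "Barta, Robert"]

# join(", ", @n) concatenates them into a single attribute value.
value = ", ".join(n)
print(value)
```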
Once the target map has been built it is trivial to serialize this into the final BIBTEX text format.
We have demonstrated how TM engineering can be used to manage metadata content and how ontologies can be used to constrain instances of this content. The declarative nature of ontologies allows us to combine them freely. Thus a document which satisfies two ontologies can be said to satisfy an AND-combination of the two.
Then we formalized a simple entity-attribute model into our ontology framework using rather generic associations. This was the basis to formalize the mapping between the two ontologies. That enables a query processor to translate a topic map from one ontology into another.
Use cases like these have helped to shape the AsTMa* languages, specifically to find out about the issues involved in practical situations.
As mentioned before, we have implemented a dedicated translator based on an existing Topic Map package in Perl. While this serves as a proof of concept, the final goal is to establish a generic translation based on AsTMa?. For this purpose we have already started to implement AsTMa! by converting constraints into Prolog (e.g. [3]). From there we plan to implement the query language itself.
On a more formal track, we have started to sketch a rigorously formal framework for ontology engineering based on category theory. This algebra may serve as a fundament in the same way as the relational algebra does for relational databases.