A path language for Topic Maps

2009-09-23 11:01

Highrises, Oslo

I sketched a little path-based query language for Topic Maps this summer, mostly to explore what such a language might look like. My TMQL co-editor, Rani Pinchuk, asked me to write up a more detailed description of it, and that's what this blog posting is.

Please note that the intention is NOT that this is to replace the existing TMQL draft. This is just a little experiment to explore other forms a path language (as opposed to a full-blown TMQL) might take. It is not particularly complete. What happens to it is at the moment unknown.

Basics

Simply put, it works like this:

expression / axis::typefilter [ predicate ]

The initial expression can be any kind of expression that produces Topic Maps data, such as a variable reference, or a reference to a topic. The slash signals that here comes a path navigation step. The axis then says in what direction you are navigating from where you are, the typefilter refers to a type used to filter the objects returned, and the predicate is a boolean expression used for more detailed filtering. Of course, the predicate is optional, and can be left out. Every Topic Maps item type has a default axis, and so the axis:: part can also be left out. And, finally, * is used to indicate that any type is acceptable.
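To make the shape of a single step concrete, here is a minimal Python sketch that splits one step into its axis, type filter, and predicate. The Step tuple, the regular expression, and parse_step are names made up for this illustration and are not part of the language sketch itself; nested predicates are not handled, and the predicate is kept as raw text.

```python
import re
from typing import NamedTuple, Optional


class Step(NamedTuple):
    axis: Optional[str]       # None means "use the default axis"
    typefilter: str           # "*" means any type is acceptable
    predicate: Optional[str]  # raw predicate text, left unparsed here


# Hypothetical grammar for one step: [axis::] typefilter [ "[" predicate "]" ]
STEP_RE = re.compile(
    r"""^\s*
        (?:(?P<axis>[\w-]+)\s*::\s*)?      # optional axis name and double colon
        (?P<type>\*|[\w:-]+)               # a type name, or the wildcard
        \s*(?:\[\s*(?P<pred>.*?)\s*\])?    # optional bracketed predicate
        \s*$""",
    re.VERBOSE,
)


def parse_step(text: str) -> Step:
    """Split a single navigation step into its three parts."""
    m = STEP_RE.match(text)
    if m is None:
        raise ValueError(f"not a valid step: {text!r}")
    return Step(m.group("axis"), m.group("type"), m.group("pred"))
```

A step with everything optional left out, such as a bare type name, comes back with axis and predicate both set to None, which is where the default axis would be applied.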

If we wanted my email address, we could find it as follows:

lmg / occurrence::email

The default axis for topics is the union of the name, occurrence, and role axes, so we could also do it like this:

lmg / email

If there is no initial expression this is interpreted to mean the topic map itself. The default axis for topic maps is the union of the topic and association axes. Thus, if we wanted to find the email address of every person in the topic map, we could do it like this:

/ person / email

If we instead wanted every person who has at least one email address, we could use a predicate. Inside the predicate, a period (.) refers to what XPath calls the context node. Thus, it would come out like this:

/ person [ . / email ]
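How these two queries might be evaluated can be sketched in Python. Everything here is an assumption made for illustration: the tiny data model, the step function, and the sample data are invented, only the topic and occurrence axes are modelled, and a predicate is taken to be true when its path yields at least one item.

```python
from dataclasses import dataclass, field


@dataclass
class Occurrence:
    type: str
    value: str


@dataclass
class Topic:
    id: str
    type: str
    occurrences: list = field(default_factory=list)


@dataclass
class TopicMap:
    topics: list = field(default_factory=list)


# Axes are modelled as plain functions from one item to a list of items.
def topic_axis(topic_map):
    return topic_map.topics


def occurrence_axis(topic):
    return topic.occurrences


def step(items, axis, typefilter="*", predicate=None):
    """Follow the axis from each item, keep targets of the wanted type,
    then keep those for which the predicate (if any) is true."""
    result = []
    for item in items:
        for target in axis(item):
            if typefilter != "*" and target.type != typefilter:
                continue
            if predicate is None or predicate(target):
                result.append(target)
    return result


# Made-up data: one person with an email address, one without.
tm = TopicMap([
    Topic("lmg", "person", [Occurrence("email", "lmg@example.org")]),
    Topic("anon", "person"),
])

# / person / email
emails = step(step([tm], topic_axis, "person"), occurrence_axis, "email")


# / person [ . / email ]
def has_email(topic):
    return bool(step([topic], occurrence_axis, "email"))


with_email = step([tm], topic_axis, "person", predicate=has_email)
```

The first query ends on the occurrences, while the second stops at the topics and only uses the email step as a test.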

Associations

To traverse the employed-by association from me the long-winded way, one would do as follows:

lmg / employee / parent::employed-by / employer / *

Since the default axis for topics includes the roles, the first step gives us all roles of type employee. The second step follows the parent axis to the association, which we filter by type for good measure. The third step follows the default axis from associations (which is the role axis) and filters it so that we only get employer roles. Of course, this gives us the role, so a final step along the default axis from roles (the player axis) is needed to find the player.

Obviously, this is not very pleasant, so a shorthand has been devised, which expands to the expression above:

lmg / employed-by(employee -> employer)
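Both the long form and the shorthand describe the same four hops: topic to role, role to association, association to the other role, and role to player. A rough Python sketch of that traversal follows; the classes, the associate helper, and the employer topic "acme" are all made up for the example and say nothing about how a real engine would store a topic map.

```python
from dataclasses import dataclass, field


# eq=False keeps identity comparison, which avoids recursing through the
# circular references between topics, roles, and associations.
@dataclass(eq=False)
class Topic:
    id: str
    roles: list = field(default_factory=list)  # roles this topic plays


@dataclass(eq=False)
class Association:
    type: str
    roles: list = field(default_factory=list)


@dataclass(eq=False)
class Role:
    type: str
    player: Topic
    parent: Association


def associate(assoc_type, **players):
    """Hypothetical helper: build an association from role_type=player pairs."""
    assoc = Association(assoc_type)
    for role_type, player in players.items():
        role = Role(role_type, player, assoc)
        assoc.roles.append(role)
        player.roles.append(role)
    return assoc


def traverse(start, from_role, assoc_type, to_role):
    """Evaluate: start / from_role / parent::assoc_type / to_role / *"""
    result = []
    for role in start.roles:                 # step 1: roles of type from_role
        if role.type != from_role:
            continue
        assoc = role.parent                  # step 2: parent axis, filtered by type
        if assoc.type != assoc_type:
            continue
        for other in assoc.roles:            # step 3: roles of type to_role
            if other.type == to_role:
                result.append(other.player)  # step 4: the player of that role
    return result


lmg = Topic("lmg")
acme = Topic("acme")
associate("employed-by", employee=lmg, employer=acme)

employers = traverse(lmg, "employee", "employed-by", "employer")
```

Seen this way, the shorthand simply names the three things the traversal needs: the association type and the two role types.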

Some more examples

There is also a scope shorthand:

lmg / employee / parent::employed-by @past
  / employer / *

This would only follow the employed-by associations in the past scope. The shorthand is equivalent to:

lmg / employee / parent::employed-by [ . / scope::* == past ]
  / employer / *
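Since the scope shorthand is a purely textual abbreviation, the expansion can be sketched as a string rewrite. The function name expand_scope and the regular expression are assumptions for this example; the rewrite is deliberately naive and would also touch an @ inside a string literal.

```python
import re

# An @ followed by a topic reference, with any whitespace before it.
SCOPE_RE = re.compile(r"\s*@([\w:-]+)")


def expand_scope(path: str) -> str:
    """Rewrite every '@theme' into the predicate '[ . / scope::* == theme ]'."""
    return SCOPE_RE.sub(
        lambda m: f" [ . / scope::* == {m.group(1)} ]",
        path,
    )


expanded = expand_scope("lmg / employee / parent::employed-by @past")
```

Running the expansion before parsing would mean the rest of a processor only ever has to deal with ordinary predicates.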

Some more examples:

$person / subject-identifier::*
  [ starts-with(., "http://psi.ontopedia.net") ]

$person / description @english

$person / tmdm:topic-name  # all names of the default type

Use case solutions

To give a bit more of the flavour of the language, let's try solving some of the TMQL use cases. These were never meant to be use cases for a path language, and so they're not the sort of query you're meant to do with a path language, but let's do it anyway.

5.2.2.1: "Retrieve all author names, i.e. the name of a topic which plays the role author in an is-author-of association."

# first doing it literally
/ * [ . / author / parent::is-author-of ] / name::*

# more natural
/ person [ . / is-author-of(author -> opus) ]

5.2.2.8: "Retrieve all topic identifiers of documents which have a download URL."

/ document [ . / download ] / item-identifier::*

5.2.2.14: "Retrieve all documents which have a title in german (i.e. a basename in the scope de)."

/ document [ . / name::* @de ]

5.2.2.16: "Retrieve the identifiers of all topics which represent information resources on the ontopia.net server(s)."

/ * [ . / subject-locator::* [ contains(., "ontopia.net") ]]
  / item-identifier::*

5.2.2.23: "Retrieve a list of all occurrence items being of type email."

/ * / email

The axes

The axes on topic map items are:

default -> topic + association
topic
association
reifier

The axes on topic items are:

default -> name + occurrence + role
name
occurrence
role
subject-identifier
subject-locator
item-identifier
parent
reified
type

The axes on association items are:

default -> role
role
reifier
scope
parent
type

The axes on role items are:

default -> player
player
reifier
parent
type

The axes on name items are:

default -> variant
variant
reifier
scope
value
parent
type

The axes on occurrence items are:

default -> nothing
reifier
scope
value
datatype
parent
type

The axes on variant items are:

default -> nothing
scope
reifier
value
datatype
parent

Formally, the axes are defined as functions. For example, the role axis is defined as: role(in : topic/association, out : role).
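The axis lists above can be written down as two lookup tables, which makes it easy to check whether an axis exists for a given kind of item and what an omitted axis falls back to. This is only a sketch: the dictionary keys such as "topic-map" and the function resolve_axes are labels invented for the example.

```python
# Axes available on each kind of item, as listed above.
AXES = {
    "topic-map": ["topic", "association", "reifier"],
    "topic": ["name", "occurrence", "role", "subject-identifier",
              "subject-locator", "item-identifier", "parent", "reified", "type"],
    "association": ["role", "reifier", "scope", "parent", "type"],
    "role": ["player", "reifier", "parent", "type"],
    "name": ["variant", "reifier", "scope", "value", "parent", "type"],
    "occurrence": ["reifier", "scope", "value", "datatype", "parent", "type"],
    "variant": ["scope", "reifier", "value", "datatype", "parent"],
}

# What a step without an explicit axis expands to.
DEFAULT_AXES = {
    "topic-map": ["topic", "association"],
    "topic": ["name", "occurrence", "role"],
    "association": ["role"],
    "role": ["player"],
    "name": ["variant"],
    "occurrence": [],
    "variant": [],
}


def resolve_axes(item_type, axis=None):
    """Return the axes to follow from an item of the given kind."""
    if axis is None:
        return DEFAULT_AXES[item_type]
    if axis not in AXES[item_type]:
        raise ValueError(f"{item_type} items have no {axis!r} axis")
    return [axis]
```

With these tables, asking for the name axis on the topic map itself is rejected, while leaving the axis out on a topic yields the union of the name, occurrence, and role axes.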

Comments

David Damen - 2009-09-23 07:51:05

I got a question on the first of your Use Case Solutions (5.2.2.1).

Wouldn't the more natural solution have to be: / name::person [ . / is-author-of(author -> opus) ]

Omitting the "name::"-part would trigger the default behaviour and also return occurrences and roles.

Lars Marius - 2009-09-23 07:57:18

I'm afraid you seem to have misunderstood parts of the language. I'll try to clear up the confusion.

The following query

/ name::person

means: start from the topic map and follow the "name" axis (but topic maps don't have a name axis), and find only names of type "person" (but of course it's topics that are of type "person", and not names).

The reason we start from the topic map is that there's nothing in front of the first slash.

I hope this helps.

Thomas Neidhart - 2009-09-24 04:42:54

Nice work, a very elegant way of querying topic maps, though I have some critical words about it:

1) One of the strengths of topic maps is that it formalizes specific information elements in a way that lets humans immediately grasp the knowledge they are interested in (e.g. Topics, Names, Occurrences). In your approach you introduce ambiguity (as was done, and criticised, in the current TMQL draft), which makes it difficult to interpret the result of a query without knowing the data itself.

2) To me, an approach like this does not seem capable of processing every query someone might think of, but then we are in the same situation as we are now with TMQL: we would need different expression styles for different types of queries, which increases the complexity of a standard and decreases the chance that it will be implemented in full detail, which in turn decreases the chances of interoperability and thus undermines the whole idea of a standard.

In my personal opinion, the TMQL standard should be able to express all kinds of queries that you can think of, in a complete and formal way. This means it will probably be difficult or laborious to write such queries, but it should serve as the foundation for querying Topic Maps. For special use cases, where a more elegant or easier way to express certain queries is needed, one could create a language on top of this coming TMQL that just translates such queries into TMQL syntax, which should be an easy task btw.

As I have been working with TOMA and TOLOG for some time, I appreciate certain features of the two languages, and would like to merge them in an upcoming standard: formal path expression from TOMA (with reworked association expressions) together with the predicate logic of TOLOG.

Robert Barta - 2009-09-24 06:06:03

ad Thomas Neidhart:

> but then we are in the same situation as we are now with TMQL: we would need different expression styles for different types of queries ....

That's just wrong. If it has been said/written 10000 times already that the expressiveness of FLWR/path/SQL is identical, and that the only differences are that you (a) can spit out complex content only with FLWR and (b) cannot have explicit variables with path expressions (PEs), then I have now written this 10001 times.

Muehselig. That's why I have given up on it.

Lars Marius - 2009-09-24 07:01:03

Thomas, regarding your point #1: this was deliberate. Personally, I want to see queries formulated in terms of the domain model, and not in terms of the TMDM. However, it is possible to write all queries in a more explicit style by always including the axis names.

I'd be interested in seeing proposals for approaches that make the name/occ/assoc distinction explicit. tolog distinguishes between name/occ and assoc, but, well, I think it's hard to do this without making queries unacceptably ugly.

Re your #2: If we expand my proposal to include all the operators that the TMQL draft path expressions have (such as producing tuples, ordering etc), then the path language will be as expressive as the other languages. Whether we do this or not remains to be seen, obviously.

Re merging path expressions and predicates: this is, in a sense, what the existing TMQL draft was meant to do, and is to some degree also what it does. Personally, I'm relatively happy with the compromise that's in the current draft, but proposals are always welcome.

Thomas Neidhart - 2009-09-24 09:02:23

hmm, probably I was overwhelmed by the complexity of the TMQL standard. I looked into it and realized that I can also use path expressions in the select style projection, but it still looks a bit awkward to me. Think about the following query:

Get me the name of all operas and the name of their composer, if the composer is known, otherwise return "unknown author".

In the path expression style I would express it in this way:

// opera (. / name, . <- work [ ^ is-composed-by ] -> composer / name || "unknown author")

For the select style, my best guess so far:

select $opera / name, // opera (. <- work [ ^ is-composed-by ] -> composer / name || "unknown author") where $opera isa opera

but this will not produce correct results, as it will not match opera names correctly with composer names. Can somebody enlighten me in this regard?

@lars: it's obviously a design choice, but I would prefer a distinguishable grammar, in order to be able to get predictable results.

Regarding comment two, I totally agree with you, and also that the current TMQL draft basically reflects this. I will try to come back to you with a proposal for how to tackle the name/occ/assoc distinction.

Thomas Neidhart - 2009-09-24 09:25:14

ah I think I got it:

select $opera / name, ($opera <- work [ ^ is-composed-by ] -> composer / name || "unknown author") where $opera isa opera

Robert Barta - 2009-09-24 09:27:56

@Thomas: An SQLish form would be (reformatting because of the tiny input window):

select $opera,
       $opera <- work [^ is-composed-by] -> composer / name
           || "unknown author"
where
   $opera isa opera

You see that they are COMPLETELY identical. Why shouldn't they?

You can even move more of the PE into the WHERE part, if you are inclined to do so:

select $opera, $composer
where
   $opera isa opera
 & $opera . <- ........ -> composer == $composer

It is exactly the same thing. Everything else would be awkward.

(Gosh that window is small.)

Robert Barta - 2009-09-24 09:29:51

@Thomas: 2nd attempt

Right :-) It is so simple that it even hurts.

Thomas Neidhart - 2009-09-24 09:46:07

ok thanks Robert, but wouldn't the second version break the query, as it would not return operas which do not have a composer? How would you write the || in a where clause? Probably something like:

$opera . <- ... -> composer || undef == $composer ?

But I had obviously had a wall in front of my head, as I did not see that I could use all types of path expressions in the select style. Just had the predicate style in mind (as there are also not many examples in this regard ;-).

Robert Barta - 2009-09-24 10:32:47

> How would you write the || in a where clause? Probably something like: ...

You would not need the dot, as you use a variable $opera to refer to something. But otherwise you are spot on.

> But I had obviously had a wall in front of my head, ....

:-)

grove - 2009-09-24 18:18:14

> Get me the name of all operas and the name of their composer, if the composer is known, otherwise return "unknown author".

In tolog syntax this would be:

select $NO, $NC from
instance-of($O, opera), name($O, $NO), 
{composed-by($O : work : $C : composer), 
name($C, $NC) || $NC = "unknown author" }

or, if you got really implicit, it would look like this:

opera-composed-by($O, $C), name($O, $NO),
{name($C, $NC) || $NC = "unknown author"}?

David Dudek - 2009-12-12 13:31:33

Great picture.


Larsblog
