Extracting a subset from a topic map
I've been thinking for a long time about how to easily extract meaningful subsets from a topic map without having to do any custom programming. This is quite a general problem, and it recurs again and again in real use of Topic Maps.
One real-life example comes from my photo topic map, which is a topic map of all my digital photos. It shows for each picture who is in it, where it was taken, and during what event (if any) it was taken. Each picture also has a caption, a time/date when it was taken, and possibly a description.
Now for the problem: let's say I've been to a conference, for example, TMRA'05, and someone asks me for my pictures from that conference. In that case I should be able to quite easily remove all photos that are not from TMRA'05. However, this doesn't solve the problem completely, because it will leave lots of places in the topic map where no photos were taken, and lots of people of whom there are no photos.
So, what to do? This has been at the back of my mind for a long time, and I've thought of several possible alternatives. One is to write a set of queries to extract only the wanted topics, but this is awkward as it requires one query per topic type. Another was to attach some kind of containment annotation to some association types, so that deletes could ripple up the containment tree, but I couldn't find any way to start that process.
Only tonight did it occur to me that there is an easy way to solve this. Dmitry has been advocating that TMCL should have a standard representation for schema violations, and in the current draft something called conflict items does this. This idea connected with the need to remove "bad" topics, and presented a solution. The solution is simply to use the schema, and automatically delete all topics that are not valid.
The idea was to do this as follows:
- Create a copy of the original topic map,
- Delete unwanted topics with a query (basically one that finds all photos not taken during TMRA'05), and
- Validate the topic map using the normal schema, deleting invalid topics, then repeating until the topic map is valid.
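The three steps above can be sketched in a few lines of Python. This is only a toy model, not the OSL/OKS implementation: the topic map is reduced to a dict of topic ids to types plus a set of (photo, other topic) links, and the "schema" is the single assumed rule that every person and event must be linked from at least one photo. All names here (`remove`, `invalid_topics`, `extract_subset`, the sample ids) are made up for illustration.

```python
def remove(topics, links, doomed):
    """Return copies of topics and links with the doomed topics removed."""
    topics = {t: ty for t, ty in topics.items() if t not in doomed}
    # A link only survives if both of its ends survive.
    links = {(a, b) for a, b in links if a in topics and b in topics}
    return topics, links


def invalid_topics(topics, links, required_types):
    """Toy schema check: topics of the given types need at least one photo link."""
    linked = {other for _, other in links}
    return {t for t, ty in topics.items()
            if ty in required_types and t not in linked}


def extract_subset(topics, links, unwanted, required_types):
    # Steps 1 and 2: remove() builds new containers, so the originals
    # stay untouched while the unwanted topics are dropped.
    topics, links = remove(topics, links, unwanted)
    # Step 3: validate, delete invalid topics, repeat until valid.
    while True:
        bad = invalid_topics(topics, links, required_types)
        if not bad:
            return topics, links
        topics, links = remove(topics, links, bad)


# Hypothetical sample data: two photos, two people, two events.
topics = {"p1": "photo", "p2": "photo",
          "alice": "person", "bob": "person",
          "tmra05": "event", "other-event": "event"}
links = {("p1", "alice"), ("p1", "tmra05"),
         ("p2", "bob"), ("p2", "other-event")}

# "p2" stands in for the result of the query for photos not from TMRA'05.
subset_topics, subset_links = extract_subset(
    topics, links, {"p2"}, {"person", "event"})
```

With this data a single validation pass is enough; the loop matters once deleting one invalid topic makes another topic invalid in turn.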
This would delete person topics with no photos, because these would be invalid. It would also delete events from which there are no pictures, as those would be invalid, too. I also naïvely thought that I could delete locations, but this turns out not to work with OSL. The reason is that the rules for locations are more complex. If a location contains other locations, it's not necessary for any photos to have been taken there, and OSL can't express this. Given that ability, however, even locations would be handled.
The general idea, though, seems to work. It also seems to raise the question of whether AND and OR rules should be supported by TMCL. TMCL will have a combination of OSL-like capability (called TMCL-Schema) and Schematron-like capability (called TMCL-Rule), and so should be able to do what I want regardless.
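To make the location rule concrete, here is a small Python sketch of the OR condition that would be needed: a location is kept if photos were taken there or if it still contains another location. The data model (`photos_at`, `parent_of`) and the sample ids are assumptions for illustration only; this is not OSL or TMCL syntax.

```python
def prune_locations(locations, photos_at, parent_of):
    """Repeatedly drop locations that have no photos AND contain no locations.

    locations: set of location ids
    photos_at: {location id: number of photos taken there}
    parent_of: {child location id: containing location id}
    """
    remaining = set(locations)
    while True:
        # Locations that still contain at least one surviving location.
        has_children = {parent_of[c] for c in remaining if c in parent_of}
        bad = {loc for loc in remaining
               if photos_at.get(loc, 0) == 0 and loc not in has_children}
        if not bad:
            return remaining
        remaining -= bad


# Hypothetical sample data: only city-a has photos.
locations = {"country-a", "city-a", "country-b", "city-b"}
photos_at = {"city-a": 12}
parent_of = {"city-a": "country-a", "city-b": "country-b"}

kept = prune_locations(locations, photos_at, parent_of)
```

Here `city-b` goes in the first pass, which leaves `country-b` empty so that it goes in the second; `country-a` survives only because it contains `city-a`.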
One question about this method, however, is whether it is efficient enough to work. It works for extracting a subset out of a topic map of 12 kTAOs, but would it work for really big topic maps? Could some other method that is efficient enough to work be applied instead?
Update 2005-11-20: I've now done a prototype of this in Jython, based on the OKS, and 68 lines is all that's required. I had to write a new validation handler that collects the invalid topics, and that was the only real innovation in the script. I've tried it, and it works as expected. In-memory it's very fast, but I doubt this would work on a huge topic map; one would have to find some way to reduce the fragment being worked on before starting.
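The idea of a validation handler that collects invalid topics, rather than just reporting them, might look roughly like this in plain Python. This is not the OKS API and not the actual 68-line script: the `violation` callback, the `validate` function, and the data model are all hypothetical stand-ins.

```python
class CollectingHandler:
    """Hypothetical handler: remembers each topic that violates the schema."""

    def __init__(self):
        self.invalid = set()

    def violation(self, message, topic_id):
        # Record the offending topic; the message itself is not needed here.
        self.invalid.add(topic_id)


def validate(topics, photo_links, handler):
    """Toy validator: every non-photo topic must be linked from some photo."""
    linked = {other for _, other in photo_links}
    for topic_id, topic_type in topics.items():
        if topic_type != "photo" and topic_id not in linked:
            handler.violation("no photo refers to this topic", topic_id)


# Hypothetical sample data: nobody photographed bob.
handler = CollectingHandler()
validate({"p1": "photo", "alice": "person", "bob": "person"},
         {("p1", "alice")},
         handler)
```

After a validation run, `handler.invalid` holds the topics to delete before validating again.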