NULL in Topic Maps
Posted in Technology on 2006-07-08 19:42
One point on which Topic Maps differ from most other information representations is the handling of unknown or missing information. In relational databases these are represented using the special NULL value, and the same is the case in object-oriented programming. In XML there are different ways to approach this issue, one of which is the xsi:nil attribute.
So how would we represent this in Topic Maps? Let's start with the case of missing information. By this I do not mean information that exists but which the person building the topic map does not have; I mean information that the ontology says might apply to topics of a particular type, but which for some of the topics just does not exist. (If this isn't clear, the example below will probably clear it up.)
The Italian Opera Topic Map has two occurrence types, date-of-birth and date-of-death, for person topics. How would we handle the case of a person who has been born but has not (yet) died? One could think of several possibilities, but the only one I've ever seen is to simply leave the date-of-death occurrence out. This corresponds neatly to the common-sense notion that the person quite simply does not have a date of death. There is thus no need for a NULL value in Topic Maps.
The same applies with associations. In the Italian Opera Topic Map you also find place-of-birth and place-of-death associations. Obviously, if the person has not died yet there is no place of death, and consequently no association of that type.
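To make the "just leave it out" idea concrete, here is a minimal sketch in Python. It is not a real Topic Maps API: the `Topic` class and its `occurrence` method are made up for illustration, and the living person is a hypothetical example topic.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Topic:
    """Toy topic (hypothetical class): a name plus occurrences keyed by type."""
    name: str
    occurrences: dict[str, str] = field(default_factory=dict)

    def occurrence(self, occ_type: str) -> Optional[str]:
        # None here is only the result of a failed lookup;
        # no NULL-like value is ever stored in the topic itself.
        return self.occurrences.get(occ_type)


# A person who has died carries both occurrences.
puccini = Topic(
    "Giacomo Puccini",
    {"date-of-birth": "1858-12-22", "date-of-death": "1924-11-29"},
)

# A hypothetical living person: the date-of-death occurrence is simply absent.
living = Topic("Example Person", {"date-of-birth": "1950-01-01"})
```

Note that `living.occurrences` has no date-of-death key at all, which is different from having the key with an empty value.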
Unknown information is where the information exists, but the person creating the topic map does not have it. As an example, imagine a person whose date of birth we don't know. Since the person exists there must be a date on which the person was born, but we don't know what it is. This case is actually harder to handle cleanly than the other one, since here leaving out the occurrence suggests that there is no date of birth, which is wrong. So in this case there might actually be a use for a kind of NULL or UNKNOWN value, but no such standardized value exists in Topic Maps today.
For associations the problem is even worse, since here we must have a topic to represent the unknown place of birth. There are several different possible ways to handle this:
1. Define a separate "Unknown" topic for each such case. This is cumbersome, but provides a way to state what is known about each unknown topic, and to distinguish which unknowns are known to be the same, and which are thought to be different.
2. Define a PSI for the unknown topic. This will always be used whenever the associated topic is unknown. The trouble with this is that it will have to be an instance of all types in order to avoid validation problems, and this is unlikely to be allowed. So for this to be workable an exception will have to be made for this particular topic from the ordinary validation rules.
3. Define one unknown topic for each topic type. This seems like an awkward compromise, especially as it may sometimes be hard to decide which types in a type hierarchy to make unknown topics for. And imagine that you only know that the person was born in a building of some type, but the building topic type is abstract, because all instances of it are actually instances of some more specific subtype. Clearly, this approach, too, leads to problems.
Another problem common to approaches 2 and 3 is that, again unless special rules are defined for these topics, they imply that everyone whose place of birth we don't know was born in the same place. So overall approach 1 seems best.
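Approach 1 could look roughly like the following Python sketch. Everything here is an assumption made for illustration (the `Topic` and `TopicMap` classes, the `new_unknown` helper, the "city" type); it is not how any particular Topic Maps engine works.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # compare by identity, so two unknowns are never equal by accident
class Topic:
    name: str
    topic_type: str
    unknown: bool = False


@dataclass
class TopicMap:
    # Each association is stored as (association type, person, place).
    associations: list = field(default_factory=list)
    unknown_count: int = 0

    def new_unknown(self, topic_type: str) -> Topic:
        """Create a fresh placeholder topic for one specific unknown subject."""
        self.unknown_count += 1
        name = f"unknown-{topic_type}-{self.unknown_count}"
        return Topic(name, topic_type, unknown=True)

    def add_place_of_birth(self, person: Topic, place: Topic) -> None:
        self.associations.append(("place-of-birth", person, place))

    def place_of_birth(self, person: Topic):
        for assoc_type, who, place in self.associations:
            if assoc_type == "place-of-birth" and who is person:
                return place
        return None


tm = TopicMap()
a = Topic("Person A", "person")
b = Topic("Person B", "person")
c = Topic("Person C", "person")

# A and B are known to share a birthplace, even though we do not know which one.
shared = tm.new_unknown("city")
tm.add_place_of_birth(a, shared)
tm.add_place_of_birth(b, shared)

# C's birthplace is a separate unknown: nothing claims it is the same place.
tm.add_place_of_birth(c, tm.new_unknown("city"))
```

Because every unknown gets its own topic, `tm.place_of_birth(a) is tm.place_of_birth(b)` holds while the same check for A and C does not, which avoids the "everyone was born in the same place" implication of the other two approaches.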
Why this difference?
I find the question of why Topic Maps differ from the relational and object-oriented models on this point to be interesting, although the answer is actually kind of obvious. In the relational and object-oriented models the physical storage of data follows the schema, which means that if you define a place-of-birth column there is going to be a space set aside in each record for this information, and so you are forced to put something there, even if that something may turn out to be NULL.
In Topic Maps, as in XML, the schema is something that exists independently of how the data is stored, but against which the data may be validated at need. This means that the same problem with the predefined storage position for each data item does not exist, and so one is not forced to come up with a NULL value.
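The contrast can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: `PersonRow`, `SCHEMA` and `validate` are invented names, and real schema languages check far more than which occurrence types are allowed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PersonRow:
    """Relational-style record: every column has a slot in every row."""
    name: str
    date_of_birth: Optional[str]
    date_of_death: Optional[str]  # the slot must hold something, so None plays NULL


# Topic-map style: the schema lives apart from the data and is consulted on demand.
SCHEMA = {"person": {"date-of-birth", "date-of-death"}}


def validate(topic_type: str, occurrences: dict) -> list:
    """Return the occurrence types present in the data but not allowed by the schema."""
    return sorted(set(occurrences) - SCHEMA[topic_type])


# The row is forced to say something about date_of_death.
row = PersonRow("Example Person", "1950-01-01", None)

# The topic just omits the occurrence and still validates.
topic_occurrences = {"date-of-birth": "1950-01-01"}
problems = validate("person", topic_occurrences)
```

Here `problems` is an empty list: the schema says date-of-death may apply to persons, but nothing forces the data to carry a slot for it.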
Robert Cerny - 2006-07-11 13:46:53
I must slightly disagree. I think one big advantage of Topic Maps over relational databases is that the ontology is not carved in stone, providing a lot of flexibility when it comes to change. In the current situation (using RDBMS) software development cycles are slow, because the definition of the "core model" impacts everything else. If the database design cannot keep up with the requirements anymore, for whatever reason, changing it takes a long time ("We have to change the database"), and sometimes it is impossible to adjust ("We cannot do that, because abc needs it that way!"). Another reason for this is that the consumers of the data (e.g. a web application, a rich client or some reports) tend to have very strong "ontological requirements" and cannot do their work if part of the information is missing, many times even if they could do it (but other things do not compile). Coming back to NULL, I think that Topic Maps are doing the right thing. If the date-of-birth occurrence is missing, it is not there. The consumer has to deal with that. Going back to your example of unknown information: the person who must have a date-of-birth also had a color-of-hair and a length-of-left-pinky. I think it is not the job of the data holder (the topic map) to fulfill a certain consumer's need.
But still, I do see a need for a NULL value :-) If one learns that Person A and Person B were born on the same day, without being told the specific date, that's when it would come in very handy.
Kind regards, Robert
Lars Marius - 2006-07-11 15:12:29
We agree that flexibility is a big advantage of Topic Maps over RDBMSs. This blog entry was actually more about exploring NULL handling in Topic Maps than about evangelization, so I tried not to go into the discussion of advantages at all. I just tried to formulate my thoughts on how to handle this and make sure they were consistent and right.
I think your "A and B born on the same unknown date" use case can be handled with associations (my pattern #1), but not with occurrences or patterns #2-3.
rho - 2006-07-16 12:06:49
Hmmm, there are two things here
- TMs should probably be interpreted with an 'open world' assumption. So what is not explicitly said in a map is not necessarily wrong. This is compatible with merging maps, as one would expect that the extracted knowledge always increases monotonically with more statements.
The above has its obvious problems with query languages which offer negation.
- I shy away from using NULL or the 'undefined-thing'. My approach is:
  - If there is a statement which I do not know, I simply do not add it to a map. Easy :-)
  - If there are parts of a statement which I do not know, I rather use something like 'a-woman' or 'a-person', where I can state somewhere
charles is-married-to a-woman
a-woman isa woman
dmitryv - 2006-07-21 05:26:29
I agree with rho, "sort-of skolemization" works well in many cases. I also sometimes use "logical functions" to identify some subjects: "http://..../left_hand(John)".
I also think that in some applications it is possible to use "meta-statements" such as
unknown(john : topic , age : occurrence)
unknown(john : topic , person : plays_role, born-in : association)
If we have a schema for a type "person", then explicit statements like these can prevent validation errors from being generated (in some modes).
I think it helps to create "strong" schemas that reflect what should be known about topics, not what we can know sometimes
Gabriel Hopmans - 2006-08-09 21:05:11
This whole discussion revolves a lot around 'when creating the topic map', but in many of our projects we develop topic maps in distributed environments, where we upconvert legacy data, merge the maps, and then browse them to see the overlaps, the errors, the duplicates, etc. So the data is already there, and often you can't say afterwards: "The consumer has to deal with that."
In the field of administration a lot of details about persons are not known: administrators make mistakes, or the information is not available when people identify themselves at the time of registration, etc. Thus it is not only a matter for the topic map developer; often the NULL value is already there, and one cannot say that the 'consumer has to deal with that' DIRECTLY.
It is not really strange to imagine the following business cases:
- Persons are not born with several ids, thus these are often missing when the persons are added to a system
- persons are misusing identities, or don't fill in the required details, to make things fuzzy
- persons have multiple id cards or multiple nationalities, different notations (and misuse them), etc.
Therefore it is also interesting to make queries for things like:
- 2 different topics with the same name (last name, first name)
- persons with the same name but with different ids or with a different birth-date
- persons with different names but with some very interesting parallels.
(Although when showing all the information on a person in an application, one again often has to reckon with non-failing clauses in the queries :)
Being able to run these kinds of queries is why Topic Maps are magic when we apply them to these cases (often in a distributed environment).
Therefore I also agree with Lars that option 1 seems to be best: to make statements about unknown topics. After the first administration tasks, users of the systems discover duplicates, and they often want to clean up the mess. For that, one needs to make a statement that often also needs to be checked again by others, so that one can correct the possible error, make a counter-statement, etc.