Posted in Technology on 2006-05-10 23:17
We've known for a long time that sooner or later we'd have to start supporting data types (numbers, dates, ...) in tolog, but so far we haven't done it. At the moment, the only way you can produce numeric data, for example, is through the count() aggregate function (almost true). All information retrieved from the topic map is either topic map objects or strings. The ironic thing is that the language (and implementation) actually supports typed data; it's the underlying Topic Maps implementation that doesn't support it.
At the moment we have a customer who wants to be able to do numeric comparisons. To take a simple example, they want to be able to do queries like this one, which finds all people older than 30 years:
select $PERSON from instance-of($PERSON, person), age($PERSON, $AGE), $AGE > 30?
The trouble at the moment is that $AGE would be a string, so you'd have to compare with "30", at which point you run into trouble if you have someone older than 100 years in your topic map.
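To make the failure concrete, here is a small Python illustration. Python only stands in for character-by-character string comparison here; the names and ages are made up, and tolog's own string ordering rules are not shown in this post, so treat it as an analogy rather than a description of the engine.

```python
# Ages as they would come out of the topic map: plain strings.
# (Hypothetical sample data.)
ages = {"alice": "25", "bob": "45", "carol": "101"}

# String comparison goes character by character, so "101" sorts before "30".
older_as_strings = sorted(name for name, age in ages.items() if age > "30")

# Converting to numbers first gives the answer the user actually wants.
older_as_numbers = sorted(name for name, age in ages.items() if int(age) > 30)

print(older_as_strings)  # carol is missing
print(older_as_numbers)  # carol is included
```

The 101-year-old drops out of the string-based result because "1" sorts before "3".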
One way to handle this is to implement support for typed data in the underlying Topic Maps engine, but this is a non-trivial project, and not really on the cards right now. The alternative is to do type conversion in the query itself. That's easier said than done, however, as it's far from obvious what design to choose.
Simple type conversion predicates
One approach is to simply define a new predicate (perhaps in a module, perhaps not), say numeric, that converts a value of any type to a number. With this approach the query above would become the following:
select $PERSON from instance-of($PERSON, person), age($PERSON, $AGESTR), numeric($AGE, $AGESTR), $AGE > 30?
The downside with this approach is that if the type of the second argument isn't defined you get what logic programming people call an "unsafe" predicate. That is, you couldn't use it to turn a number into a value of some other datatype, because you don't know what datatype to produce. You could produce all possible types, of course, but since there is no defined list of types, you would effectively have an infinite result, which is what's unsafe about it. Result: a predicate that doesn't give results in all situations. The table below summarizes what happens.
This loses one of the nicest things about predicates: that they work in all "directions". With numeric, numeric($N, "12") would bind $N to 12, but numeric(12, $S) would give an error rather than binding $S to "12".
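The one-directional behaviour can be sketched in Python. This is a toy model under our own assumptions, not the tolog implementation: Var, UnsafePredicateError and the list-of-bindings return value are all hypothetical names chosen for illustration.

```python
class Var:
    """An unbound query variable such as $N (hypothetical helper)."""
    def __init__(self, name):
        self.name = name


class UnsafePredicateError(Exception):
    """Raised when a predicate is asked for an unbounded set of results."""


def numeric(number, value):
    """Toy model of numeric(number, value); returns a list of bindings."""
    if isinstance(value, Var):
        # We know the first argument is a number, but not which datatype
        # the second argument should have, so no result can be produced.
        raise UnsafePredicateError("second argument of numeric must be bound")
    converted = float(value)
    if isinstance(number, Var):
        return [{number.name: converted}]          # bind the variable
    return [{}] if number == converted else []     # both bound: just check


print(numeric(Var("N"), "12"))   # [{'N': 12.0}]
try:
    numeric(12, Var("S"))
except UnsafePredicateError as error:
    print("error:", error)
```

Only the direction "something to number" yields bindings; the reverse direction is rejected outright.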
Bidirectional type conversion predicates
Of course, one way to solve this is simply to define the type of the second parameter, too. The predicate number-string would do this, since the second argument would always be a string. Our example query would look the same, but we could now also use the predicate to turn numbers into strings. This would give us a table as follows.
Unfortunately, a problem still remains in the case where neither variable is bound. If you were to write number-string($N, $S) the result would be to ask for a table of all numbers and strings which correspond with one another according to the predicate. This is an infinite list, of course, and so definitely not safe to ask for.
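A rough Python sketch of how number-string could behave in each binding pattern follows. It is a toy model under our own assumptions, not the tolog implementation: Var, UnsafePredicateError and the list-of-bindings return value are hypothetical, and numbers are modelled as Python floats and ints.

```python
class Var:
    """An unbound query variable such as $N (hypothetical helper)."""
    def __init__(self, name):
        self.name = name


class UnsafePredicateError(Exception):
    """Raised when a predicate is asked for an unbounded set of results."""


def number_string(number, string):
    """Toy model of number-string(number, string); returns bindings."""
    if isinstance(number, Var) and isinstance(string, Var):
        # Every number paired with its string form: an infinite table.
        raise UnsafePredicateError("at least one argument must be bound")
    if isinstance(number, Var):
        return [{number.name: float(string)}]     # string -> number
    if isinstance(string, Var):
        return [{string.name: str(number)}]       # number -> string
    return [{}] if float(string) == number else []  # both bound: check


print(number_string(Var("N"), "12"))   # [{'N': 12.0}]
print(number_string(12, Var("S")))     # [{'S': '12'}]
```

Both directions now work, and the only rejected case is the one where neither argument is bound.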
Another problem, of course, is that with the numeric approach we only need one predicate for each datatype, while for the number-string approach we need one for each pair of datatypes. In other words, if we have n datatypes the former approach gives us n conversion predicates, while the latter gives us the number of ways to choose 2 datatypes out of n, which increases quite rapidly. At 5 datatypes we get 10 predicates, at 6 we get 15, at 7 we get 21, and so on. (It's easy to show that the actual formula is (n² - n) / 2.)
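The growth is easy to check by enumerating the pairs. In the sketch below, the datatype names beyond number and string are made up purely for illustration; no such list is defined for tolog.

```python
from itertools import combinations


def pairwise_predicates(datatypes):
    """One conversion predicate name per unordered pair of datatypes."""
    return [f"{a}-{b}" for a, b in combinations(datatypes, 2)]


# Hypothetical datatype list, for counting purposes only.
types = ["number", "string", "date", "datetime", "boolean", "uri", "duration"]

for n in range(2, len(types) + 1):
    pairs = pairwise_predicates(types[:n])
    # The enumeration agrees with the closed form (n^2 - n) / 2.
    assert len(pairs) == (n * n - n) // 2
    print(n, "datatypes:", n, "numeric-style vs", len(pairs), "pairwise")
```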
Magic type conversion
There is another approach that could be taken altogether. For each variable in a query the query processor determines its type before the query runs. This means that in the first example we are aware that the user is comparing a variable containing a string to a numeric literal, and we could just quietly do a type conversion to a number before comparing. Unfortunately, there are some obstacles to taking this route.
It's not clear how you would generalize this to work when two variables are being compared. Let's say we do a query where we want people whose IQ is greater than their shoe size (both pieces of data obviously coming from occurrences in the topic map). In this case we're comparing one string variable with another, and the query processor will have to conclude that this is OK.
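A small Python sketch shows why the two-variable case is awkward. The function magic_greater_than is a hypothetical stand-in for a query processor that converts only when a numeric operand reveals the intent; the IQ and shoe size values are invented.

```python
def magic_greater_than(left, right):
    """Hypothetical comparison that converts only when one side is numeric."""
    if isinstance(left, (int, float)) or isinstance(right, (int, float)):
        # A numeric operand signals the intended type: convert both sides.
        return float(left) > float(right)
    # Two strings: nothing signals a numeric comparison, so compare as text.
    return left > right


iq, shoe_size = "120", "43"           # both come from occurrences: strings

print(magic_greater_than(iq, 30))          # True: the literal 30 forces numbers
print(magic_greater_than(iq, shoe_size))   # False: "1" sorts before "4"
```

With a literal in the comparison the conversion happens and the result is sensible; with two string variables the same rule silently falls back to text comparison.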
It's also not clear how to know which type to prefer when two values of different types are being compared. We might say that literals always determine the data type, but, again, what if we have two variables? This might be worked around by having a defined precedence order for data types, but what if that gives you the wrong conversion?
In short, this approach does not appear to be workable.
So, what to do?
On balance, the original numeric approach seems to be the best. While it is not bidirectional we will still have to define one predicate for each datatype, and if we assume that the user will always want to convert from one datatype to another, the user will always have one predicate to hand that does exactly what's desired. The number-string approach might actually be harder to use.
Steve Pepper - 2006-05-15 01:41:03
A variant on the magic type conversion approach might be to restrict the comparators < and > to numeric values and force type conversion to numbers whatever the datatype of the values being compared. You wouldn't be able to do greater-than and less-than comparisons on strings, but how often is that really needed?
I would happily sacrifice this for the much greater user-friendliness of being able to write "$IQ > $shoe-size" and be done with it...
Lars Marius - 2006-05-15 10:07:38
I'm definitely sympathetic with your desire to get rid of the conversion predicates altogether. They definitely don't make things easier. I'm afraid there definitely is a need to compare values of other types than just numbers, however. For strings the need is probably limited, but I have seen people use it to get only part of an alphabetically sorted sequence. The most common alternative is comparing dates and datetime values. (Number of sales before May 1st, etc.)
We'll see if we can come up with a way to reduce the need for conversion predicates.