Datatype validation with TMCL

2009-07-20 14:37

Railway bridge, Prague

It's long been generally assumed that TMCL (the Topic Maps Constraint Language) should be able to validate datatyped values, but very little thought has so far been devoted to exactly how. It may look like a trivial issue, but in fact datatypes are an enormous tangle of complex problems. To pick one example at random, consider the ordering of time durations in XML Schema. This posting is an attempt to consider what TMCL should and, equally important, should not do.

Basics

At the moment TMCL lets you associate a datatype with an occurrence, stating that all occurrences of that type must have only values from that particular datatype. This is fine, as far as it goes, but what does it actually mean? Let's consider a few cases to get a feel for the problem.

The user says:

shoesize isa tmcl:occurrence-type;
  - "Shoesize";
  has-datatype(xsd:integer).

If we then encounter an occurrence with value "42" and datatype xsd:integer it's an open-and-shut case. The value is obviously valid. Similarly, if we see ("george", xsd:string), it's obvious that this is not valid.

There are some corner cases, however. What if we encounter ("george", xsd:integer)? The datatype is the declared one, but the literal is not a valid lexical representation of an integer. It seems reasonable to suppose that this occurrence should be rejected as invalid. This requires TMCL implementations to know the lexical space of the datatype, but that seems fair.
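As a rough illustration, the checks so far can be sketched in Python. The pattern table and function names are made up for this example and are not part of TMCL; the patterns are simplified (whitespace normalization is ignored, and xsd:string is treated as "any string").

```python
import re

# Hypothetical table of lexical-space patterns for two datatypes.
# Simplified: no whitespace normalization, and xsd:string accepts anything.
LEXICAL_PATTERNS = {
    "xsd:integer": re.compile(r"[+-]?[0-9]+"),
    "xsd:string": re.compile(r".*", re.DOTALL),
}

def is_lexically_valid(literal, datatype):
    """Return True if the literal lies in the lexical space of the datatype."""
    return LEXICAL_PATTERNS[datatype].fullmatch(literal) is not None

def check_occurrence(literal, instance_datatype, declared_datatype):
    """Accept only if the datatype is the declared one and the literal is well-formed."""
    if instance_datatype != declared_datatype:
        return False
    return is_lexically_valid(literal, instance_datatype)
```

With shoesize declared as xsd:integer, ("42", xsd:integer) is accepted, while ("george", xsd:string) and ("george", xsd:integer) are both rejected, the first on the datatype and the second on the literal.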

Derivation

Suppose we find an occurrence with ("43", xsd:int), then what? Now, xsd:int is derived from xsd:long, which is again derived from xsd:integer. "43" is a valid xsd:int, and we know that every xsd:int is actually also an xsd:integer. So perhaps this is acceptable? It seems reasonable to suppose that it is, but then implementations must know the derivation relationships between the different datatypes.

Unfortunately, derivation relationships between XML Schema datatypes are not a matter of simple subtyping, where every value of the derived type is necessarily a valid value of the base type. For example, a type can be derived by union, so that one could derive a new type stringOrInteger that is the union of xsd:string and xsd:integer, thus neatly inverting the subsetting relationship. There is also derivation by list, which I'm ignoring for now as irrelevant.

In any case, if this is to be supported, TMCL implementations must somehow know the subsetting relationships between datatypes.
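For the built-in numeric types, where derivation is by restriction only, this knowledge could be as simple as a table of base types. The sketch below is one assumption about how an implementation might store it; the names are invented for illustration, and union and list types are deliberately left out.

```python
# Base type of each built-in integer type, following the XML Schema hierarchy.
# Only derivation by restriction is modelled here.
BASE_TYPE = {
    "xsd:byte": "xsd:short",
    "xsd:short": "xsd:int",
    "xsd:int": "xsd:long",
    "xsd:long": "xsd:integer",
    "xsd:integer": "xsd:decimal",
}

def is_derived_from(datatype, ancestor):
    """Return True if datatype equals ancestor or restricts it, directly or indirectly."""
    current = datatype
    while current is not None:
        if current == ancestor:
            return True
        current = BASE_TYPE.get(current)  # None once we run out of known base types
    return False
```

Under this table, is_derived_from("xsd:int", "xsd:integer") holds, so ("43", xsd:int) could be accepted where xsd:integer is declared, while the reverse direction does not hold.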

Subsetting is also more complex than it may seem at first sight, because XML Schema datatypes actually have two sets associated with them: the value space and the lexical space, and these do not necessarily have a one-to-one mapping. This gets confusing in several ways. For example, it's possible for one type to be a true lexical subset of another without their value spaces having any overlap whatsoever (think of xsd:string and xsd:anyURI). Not only that, but it's possible for one type to be a true value subset of another without being a lexical subset.

Let me give some examples to make this more concrete. If you ponder the various relationships here you'll see that there are all kinds of weird combinations.

Datatype	Lexical space	Values
xsd:boolean	0, 1, true, false	true, false
lmg:strict-bool	true, false	true, false
lmg:bad-bool	1, true	true
lmg:zero-or-one	0, 1	0, 1
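
To make the relationships in the table easier to explore, here is a small Python sketch that models each datatype as a mapping from literals to values. The lmg: types are the made-up ones from the table. Values are tagged with a kind so that boolean true and the number 1 stay distinct (in plain Python, True == 1 would blur exactly the distinction we care about).

```python
# Values are (kind, name) pairs so that booleans and numbers never compare equal.
TRUE, FALSE = ("boolean", "true"), ("boolean", "false")
ZERO, ONE = ("number", "0"), ("number", "1")

# Lexical mapping of each datatype: literal -> value.
LEXICAL_MAPPING = {
    "xsd:boolean": {"0": FALSE, "1": TRUE, "true": TRUE, "false": FALSE},
    "lmg:strict-bool": {"true": TRUE, "false": FALSE},
    "lmg:bad-bool": {"1": TRUE, "true": TRUE},
    "lmg:zero-or-one": {"0": ZERO, "1": ONE},
}

def lexical_space(datatype):
    return set(LEXICAL_MAPPING[datatype])

def value_space(datatype):
    return set(LEXICAL_MAPPING[datatype].values())

def is_lexical_subset(a, b):
    return lexical_space(a) <= lexical_space(b)

def is_value_subset(a, b):
    return value_space(a) <= value_space(b)
```

With these definitions, lmg:zero-or-one is a lexical subset of xsd:boolean although their value spaces are disjoint, and lmg:bad-bool is a value subset of lmg:strict-bool without being a lexical subset of it.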

So what do we really want to know? It's hard to say at this point. Here are some questions that are at least relevant, given a pair (literal, datatype1) and a declared datatype2:

Is literal actually a valid datatype1 value?
Is every datatype1 actually a datatype2 (both lexically and by value)?
Is literal actually a valid datatype2 value?

It seems obvious that the last question is the most important. If the value given is not valid according to the declared datatype then that's the end of it: the value is not valid. Unfortunately, this is more complicated than it seems. Let's try an example. The declared datatype is lmg:zero-or-one, the literal is "0", and the given datatype is xsd:boolean. Now, this would seem to imply that the intended value is in fact false, which is not in the value space of lmg:zero-or-one. So lexically, the occurrence is OK, but the value is wrong.

Let's try another case: ("43", xsd:decimal), with declared datatype xsd:integer. Now, the given literal is lexically within the declared type, as is the value. So this should probably be considered OK. Let's try a slight twist: ("43.0", xsd:decimal). The value is still within the declared type, but we are outside the lexical space. So this should probably not be considered OK.

If we turn to XML Schema itself, the notion of a string being datatype valid is defined as it being within both the lexical and value spaces of the declared datatype. However, this considers only validation of strings, and not pairs of (string, datatype), which is what we face.

Here is a possible way to approach datatype validation in TMCL, given a literal, an instance datatype, and a declared datatype:

The literal is assumed to represent a value in the value space of the instance datatype.
It must therefore be XML Schema datatype valid according to the instance datatype.
It must also be XML Schema datatype valid according to the declared datatype.
Finally, the value spaces of the instance datatype and the declared datatype must be known to overlap.
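
The four steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not a proposal for how TMCL must be implemented: the tables and function names are hypothetical, only a handful of datatypes are covered, and "datatype valid" is reduced to a lexical pattern match, which is enough for types without further facets.

```python
import re

# Hypothetical lexical-space patterns (whitespace normalization ignored).
LEXICAL_PATTERNS = {
    "xsd:integer": re.compile(r"[+-]?[0-9]+"),
    "xsd:decimal": re.compile(r"[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)"),
    "xsd:boolean": re.compile(r"true|false|0|1"),
    "lmg:zero-or-one": re.compile(r"0|1"),
}

# Pairs of distinct datatypes whose value spaces are known to overlap.
KNOWN_VALUE_OVERLAPS = {
    frozenset({"xsd:integer", "xsd:decimal"}),
}

def is_datatype_valid(literal, datatype):
    return LEXICAL_PATTERNS[datatype].fullmatch(literal) is not None

def value_spaces_overlap(a, b):
    return a == b or frozenset({a, b}) in KNOWN_VALUE_OVERLAPS

def validate(literal, instance_datatype, declared_datatype):
    # Steps 1 and 2: the literal must be valid for the instance datatype.
    if not is_datatype_valid(literal, instance_datatype):
        return False
    # Step 3: it must also be valid for the declared datatype.
    if not is_datatype_valid(literal, declared_datatype):
        return False
    # Step 4: the two value spaces must be known to overlap.
    return value_spaces_overlap(instance_datatype, declared_datatype)
```

Run against the earlier examples with xsd:integer declared, ("43", xsd:decimal) passes and ("43.0", xsd:decimal) fails at step 3. With lmg:zero-or-one declared, ("0", xsd:boolean) passes the lexical checks but fails at step 4.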

It follows from this that we need to know the value space relationships between datatypes, but that knowing the lexical space relationships is unnecessary, because this can be tested directly.

Minimal requirements

In other words, the minimal requirements seem to be that TMCL implementations must know the lexical space of each supported datatype, and also value subset relationships between supported datatypes.

Inside St. Nicholas's church, Prague

Which datatypes?

Another question is which datatypes TMCL should support. It seems obvious that at the very least the datatypes supported by CTM must be supported: xsd:anyURI, xsd:decimal, xsd:integer, xsd:date, xsd:dateTime, xsd:string, and ctm:integer.

The value-space relationships between these types are as follows (in CTM syntax):

value-subset-of(subset: xsd:integer, superset: ctm:integer)
value-subset-of(subset: xsd:integer, superset: xsd:decimal)
value-overlap(overlaps: xsd:decimal, overlaps: ctm:integer)

Note that the last association is in fact implied by the preceding two: they share xsd:integer as a common subset, but extend it in different ways: xsd:decimal with fractional numbers and ctm:integer with *.
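
To show how the overlap can be inferred rather than stated, here is a small Python sketch. It assumes that every value space is non-empty, so that two types overlap whenever some type is a value subset of both. The table and function names are illustrative only.

```python
# (subset, superset) pairs, taken from the first two associations above.
VALUE_SUBSET_OF = {
    ("xsd:integer", "ctm:integer"),
    ("xsd:integer", "xsd:decimal"),
}

def is_value_subset(sub, sup):
    """Reflexive, transitive closure of the VALUE_SUBSET_OF table."""
    if sub == sup:
        return True
    seen = {sub}
    stack = [sub]
    while stack:
        current = stack.pop()
        for smaller, larger in VALUE_SUBSET_OF:
            if smaller == current and larger not in seen:
                if larger == sup:
                    return True
                seen.add(larger)
                stack.append(larger)
    return False

def value_spaces_overlap(a, b):
    """Two types overlap if some known type is a value subset of both."""
    candidates = {t for pair in VALUE_SUBSET_OF for t in pair} | {a, b}
    return any(is_value_subset(c, a) and is_value_subset(c, b) for c in candidates)
```

Here value_spaces_overlap("xsd:decimal", "ctm:integer") holds because xsd:integer is a subset of both, while xsd:string and xsd:decimal are not found to overlap.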

But what about the rest of the XML Schema datatypes? Should they be supported? There are quite a few of them, but on the other hand implementations don't need to know all that much about them. So implementation need not be that hard, especially given that TMCL validators must already support regular expressions.

And what about other datatypes? Well, which ones would those be? It's not clear that there are any. XPath 2.0 defines some, but are they needed? Probably not.

And what about user-defined datatypes? That's a tough call. TMDM does not limit what datatypes can be used, but if they cannot be validated beyond a simple matching of datatype URIs, to see that the right URI appears in the right place, that may not be very useful. Full support for this would allow datatypes to be defined for more restricted ranges of numbers and dates, for example, resulting in tighter validation. But is it worth the effort? And what would the effort be? This is not clear.
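
To give a feel for what the simplest kind of user-defined datatype might involve, here is a hypothetical Python sketch of an integer type restricted to a range. Nothing like this exists in TMCL; the class, the ex:shoesize name, and the bounds are all invented for illustration.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class RestrictedInteger:
    """A hypothetical user-defined datatype: xsd:integer limited to a range."""
    name: str
    min_inclusive: int
    max_inclusive: int

    def is_valid(self, literal):
        # First the lexical check inherited from xsd:integer ...
        if re.fullmatch(r"[+-]?[0-9]+", literal) is None:
            return False
        # ... then the range check on the value itself.
        return self.min_inclusive <= int(literal) <= self.max_inclusive

# Illustrative bounds only.
SHOESIZE = RestrictedInteger("ex:shoesize", 15, 60)
```

With this definition "42" is a valid shoe size, while "420" and "george" are not. Even this small case needs both a lexical check and a value check, which hints at where the effort would go.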

Comments

C. M. Sperberg-McQueen - 2009-09-03 13:09:20

Nice essay. A couple points about XSD may be worth mentioning, as they touch upon pain points or problems you identify.

It's true that in general the lexical-mapping relation (between literal and value) is not 1:1. In the primitive datatypes and restrictions of the primitive datatypes, however, the mapping is always a function: each literal in the lexical space maps to exactly one value. In unions and in the special types (anySimpleType, and in XSD 1.1 also anyAtomicType), the mapping is not necessarily functional.

The three questions you identify as interesting are in fact interesting; I don't think I have others to add. With regard to the second question ("Is every datatype1 actually a datatype2 (both lexically and by value)?"), there may be a useful analogue in the XSD spec. The situation where you have a declared type, a literal, and a type associated with a literal is similar in at least some ways to the situation in which an element has a declared (simple) type, and an element instance in the document being validated has an xsi:type attribute specifying a different type.

XSD 1.0 says that the type named by the xsi:type attribute must be "validly derived from" the declared type of the element. This is an unfortunate choice of words, since if the declared type is a union of xsd:integer and xsd:string, then xsi:type is allowed to name either of the member types, which are "validly derived" from their union only in the topsy-turvy terminology of XSD 1.0. In XSD 1.1, the substantive rule is the same, but the terminology is changed: the one type must be "substitutable for" the other.

It would seem natural to me that in the situations you describe, the type associated with the value should be substitutable for the declared type. That would rule out accepting ("43", xsd:decimal) when the declared type is xsd:int, which seems to me likely to be the right call. Refusing to accept values labeled with the name of an ancestor type is more important when user-defined types are supported, and probably less important if only specific datatypes are built-in.

At the risk of trying your patience, let me explain why I don't think ancestor types should be accepted.

If I specify that hat size is a subtype of integer, it may well be precisely because I want to draw a conceptual distinction between the two. Calculating an integer by dividing a street address by the height of the family's eldest child, and rounding, may yield an integer, but it does not yield a hat size. If the declared type is msm:hatsize, an arbitrary integer should NOT be accepted, even if it's in range. If a hat size is calculated in some way that the user can see is plausible, then the user should coerce the value to msm:hatsize and take responsibility for the claim that it's plausible as a hat size.

Lars Marius - 2009-09-11 06:57:03

"It would seem natural to me that in the situations you describe, the type associated with the value should be substitutable for the declared type."

Yes, this sounds reasonable, and your rationale for why ancestor types should not be accepted also makes sense to me. This also fits how most static type systems (in OOP, for example) work. We'll discuss this and see if the committee agrees.

However, I can't seem to find a definition of substitutability in the 2009-04-30 draft. Did I miss something?

XML Schema also does not seem to define anything called subtyping, but I guess A subtype-of B iff A derived-from B?

C. M. Sperberg-McQueen - 2009-09-11 21:53:28

Sorry about the difficulty finding the term. Strictly speaking, the term used is "validly substitutable"; the definition is in section 3.3.4.2 of the Structures spec (http://www.w3.org/TR/xmlschema11-1/structures.html#key-val-sub-type).

And yes, the term "sub-typing" is carefully avoided; the community (or communities) involved in preparing XSD 1.0 turned out to have no consensus on what that term means, or should mean. So the XSD spec speaks in terms of types being derived from other types, either by extension or by restriction; those who want the rule that all instances of type A are also of type B will derive A from B by restriction, and those who want only to ensure that any instance of type A will have children corresponding to all of the required children of B, and possibly more besides, will derive A from B by extension.

That's for complex types; for simple types, if A is derived from B at all, then A is a restriction of B and thus a subtype of B, period. Apparent counter-examples use carefully chosen terminology: List types are constructed from their item type, but derived from anySimpleType. And similarly union types are constructed from their members, but derived either from anySimpleType or by restriction of another union type.

So I think your rough equivalence of terminology is likely to hold, for simple types, unless someone has an unusual or eccentric idea of what subtype-of ought to mean.
