The web's identity crisis and httpRange-14
Posted in Technology on 2007-10-08 08:54
URIs are used to refer to both information resources (which are downloadable over the net) and abstract concepts and physical objects (which are not). In many contexts there is no way of knowing whether a given URI identifies an information resource or something else, and this has become known as the web's identity crisis. This problem has received most attention in the context of RDF, where it definitely does exist, but it also exists more generally wherever URIs are used for identification (and not just simple addressing of information resources).
The W3C did not initially see this as a problem, despite quite a bit of pushing from a minority of Semantic Web people. Eventually, however, the Technical Architecture Group (TAG) accepted this as an issue in 2002, with the name httpRange-14. In 2005 they came up with a "solution", as follows:
The TAG provides advice to the community that they may mint "http" URIs for any resource provided that they follow this simple rule for the sake of removing ambiguity:
- If an "http" resource responds to a GET request with a 2xx response, then the resource identified by that URI is an information resource;
- If an "http" resource responds to a GET request with a 303 (See Other) response, then the resource identified by that URI could be any resource;
- If an "http" resource responds to a GET request with a 4xx (error) response, then the nature of the resource is unknown.
Basically, what this says is that for any given URI you can find out whether or not it represents an information resource by resolving it. If the response you get has a response code of 303 (as opposed to the normal 200, which means OK) you know that it doesn't represent an information resource at all.
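As a rough sketch, the three cases of the quoted rule can be written as a function of the status code. The function name and the returned labels are made up for illustration. Note that the rule as quoted says a 303 means the resource "could be any resource", and that it is silent on every other status code.

```python
def classify_by_status(status: int) -> str:
    """Apply the three cases of the quoted TAG rule to an HTTP status code."""
    if 200 <= status <= 299:
        return "information resource"
    if status == 303:
        # The rule's own wording: "could be any resource".
        return "any resource"
    if 400 <= status <= 499:
        return "unknown"
    # The rule as quoted does not mention other codes (301, 302, 5xx, ...).
    return "not covered by the rule"


for code in (200, 303, 404, 301):
    print(code, "->", classify_by_status(code))
```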
So that should be fine, then, right? Well. No. Not really. In fact, this is not a solution at all. I haven't bothered writing up a critique of this before, because this proposed solution seemed less like a solution and more like the TAG saying they didn't consider this to be a real problem. So there didn't seem to be any point in complaining. The discussion after Steve Pepper's talk at Extreme Markup 2007 showed that people actually do take this "solution" seriously, and so it seems like it may be worth pointing out its weaknesses, just so people know.
1. It won't scale
Let's say you have a data set with 10 million resources, which is not unreasonable. How long does it take to resolve 10 million URIs? And what do you do about servers which no longer exist, and those which exist but never respond? Let's say you multi-thread and do 10 in parallel, and that on average it takes one second to get a reply. At this rate it will take more than 11 days to resolve the resolvable URIs, but some will of course not be resolvable at all and will thus remain ambiguous. Clearly, this is not a very scalable solution.
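The estimate can be checked with a few lines of arithmetic. The numbers are the ones assumed in the paragraph above (10 million URIs, 10 requests in parallel, one second per reply); the function name is just for illustration.

```python
from datetime import timedelta


def resolution_time(uris: int, parallel: int, seconds_per_reply: float) -> timedelta:
    """Wall-clock time to resolve every URI once at a fixed level of parallelism."""
    return timedelta(seconds=uris / parallel * seconds_per_reply)


elapsed = resolution_time(10_000_000, 10, 1.0)
print(elapsed)                           # 11 days, 13:46:40
print(elapsed.total_seconds() / 86_400)  # a little over 11.5 days
```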
2. Offline systems
Then there's the question of what happens if you're not on the open internet, or if you're not connected to the internet at all. Obviously these systems will not be able to resolve all URIs. So this solution only works for some systems. How can that be considered acceptable?
3. Ambiguous 303 responses
The 303 response code already exists, with a specific meaning. RFC 2616 defines it as follows:
The response to the request can be found under a different URI and SHOULD be retrieved using a GET method on that resource. This method exists primarily to allow the output of a POST-activated script to redirect the user agent to a selected resource. The new URI is not a substitute reference for the originally requested resource. The 303 response MUST NOT be cached, but the response to the second (redirected) request might be cacheable.
So if you get a 303 response code you know... well... what do you know? The TAG resolution says "the resource identified by that URI could be any resource". What's that supposed to mean? But let's assume it means the resource is not an information resource. We don't know that for certain. It could be a redirect in the old style defined by RFC 2616.
The only unambiguous response we can get is a 200 response code, which would tell us that we are definitely looking at an information resource.
4. Doesn't support all URIs
Another obvious limitation is that this only works for HTTP URIs. These are the most common URIs in use today, but even so this is quite a big limitation.
Even if we accept that, many URIs used as identifiers today are of this form: http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement. That is, they have a URI fragment at the end. Now, the trouble is that in these cases you don't actually send the fragment over HTTP; instead, you send only the part before the # character. This means that you can only test the URI without the fragment, and not the full URI including the fragment. If we assume that the fragment-less URI either identifies an information resource or does not, and that every fragment under it shares that status, this might still be workable.
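To make the point about fragments concrete, here is a small sketch using Python's standard urllib.parse.urldefrag, which splits a URI into the part a client would request over HTTP and the fragment that stays on the client side. The variable names are only illustrative.

```python
from urllib.parse import urldefrag

uri = "http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement"

# urldefrag returns the URI without the fragment, plus the fragment itself.
request_uri, fragment = urldefrag(uri)

print(request_uri)  # http://www.w3.org/1999/02/22-rdf-syntax-ns
print(fragment)     # Statement
```

Only request_uri can be tested with a GET, so any status code you get back is a statement about that URI, not about the one with #Statement on the end.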
5. Not used
So let's try it, shall we? The URI http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement identifies the concept of an RDF statement. That is, the class of all RDF statements. This is not something that could reasonably be described as an information resource. It's an abstract concept, more abstract than the concept of "love".
So what happens if we try it? Well:
[larsga@os289 Desktop]$ telnet www.w3.org 80
Trying 22.214.171.124...
Connected to www.w3.org.
Escape character is '^]'.
GET /1999/02/22-rdf-syntax-ns HTTP/1.0

HTTP/1.1 200 OK
Date: Sun, 30 Sep 2007 16:56:25 GMT
Server: Apache/2
Last-Modified: Fri, 30 Jan 2004 14:05:58 GMT
ETag: "3d222bc2e3d80"
Accept-Ranges: bytes
Content-Length: 6588
Cache-Control: max-age=21600
Expires: Sun, 30 Sep 2007 22:56:25 GMT
P3P: policyref="http://www.w3.org/2001/05/P3P/p3p.xml"
Connection: close
Content-Type: application/rdf+xml

<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
  xmlns:owl="http://www.w3.org/2002/07/owl#"
  xmlns:dc="http://purl.org/dc/elements/1.1/">
...
Whoooops. We get a 200, meaning that it must be an information resource. But that's wrong. So the W3C doesn't follow the TAG finding even for the most basic concepts in RDF.
6. Doesn't support PSIs
Finally, it's not possible to use the same URI with two meanings, where one is the information resource identified by the URI, and the other is some abstract concept described in that resource. This is how PSIs work in Topic Maps, so that's kind of awkward from a Topic Maps/RDF coexistence point of view. Under this solution all URIs must either be the URIs of information resources, or the URIs of abstract concepts. Topic Maps, however, allow a single URI to be used for both purposes without confusion.
It is possible to support PSIs under this solution, provided the URI used as the published subject identifier gives a 303 response when resolved, and redirects to a human-readable indicator. This is not how currently established PSIs work, however. Given the other problems with this solution there doesn't really seem to be much reason to change how they work, either.
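A minimal sketch of what that would look like on the server side, using Python's standard http.server. The two paths are hypothetical, as are the names route and PSIHandler: a GET on the identifier answers 303 and points at the indicator page, and a GET on the indicator page answers 200.

```python
from http.server import BaseHTTPRequestHandler

# Hypothetical paths: the published subject identifier and its indicator page.
CONCEPT_PATH = "/psi/love"
INDICATOR_PATH = "/psi/love.html"


def route(path):
    """Return (status, headers) for a GET on the given path."""
    if path == CONCEPT_PATH:
        # The identifier names a concept, so redirect rather than answer 200.
        return 303, {"Location": INDICATOR_PATH}
    if path == INDICATOR_PATH:
        # The indicator is an ordinary document, served normally.
        return 200, {"Content-Type": "text/html"}
    return 404, {}


class PSIHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, headers = route(self.path)
        self.send_response(status)
        for name, value in headers.items():
            self.send_header(name, value)
        self.end_headers()
        if status == 200:
            self.wfile.write(b"<p>Human-readable description of the subject.</p>")
```

To try it locally, something like HTTPServer(("localhost", 8000), PSIHandler).serve_forever() from the same module should do.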
Like I said, I can't really see that this is a workable solution. If anyone can tell me I'm wrong I'd be happy and interested to hear the explanation.
Danny - 2007-10-09 16:24:08
Hmm, sorry Lars but without further explanation I don't really see any of these points undermining the TAG choice.
Under what circumstances might you want to know if a resource is an information resource or not? On the Web one reason would be to see if there was any other information available - which would usually be retrieved via a GET. The 303 is consistent with that.
I wouldn't have been surprised by something solid re. PSIs, but "Doesn't Support PSIs" here doesn't hold much water. An identified thing either is downloadable over the Internet or it isn't. "it's not possible to use the same URI with two meanings" - that's a feature, not a bug. URIs are opaque identifiers.
If you need to say whether or not a resource is an information resource offline, it's easiest enough to make an explicit statement to that effect.
The rdf:Statement demo is nice, but the GET isn't actually applied to the URI, the # etc is stripped. This situation as a whole is far from perfect, but generally seems good enough to allow Cool URIs for the Semantic Web.
Conal Tuohy - 2007-10-09 19:09:00
In your example about the nature of the referent of the URI http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement you actually receive an HTTP 200 response for the URI http://www.w3.org/1999/02/22-rdf-syntax-ns which means that's an information resource, but the URI http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement is a different URI. You can't therefore conclude that http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement also refers to an information resource.
That's my understanding of how it's supposed to work, anyway. If this interpretation is correct then I think it does mean you can't use fragment identifiers to refer to information resources.
I completely agree that the "solution" is broken. The meaning of a URI should depend on the context in which it's used (as they do in the Topic Maps paradigm).
Lars Marius - 2007-10-10 10:10:42
Danny, you wrote: "Under what circumstances might you want to know if a resource is an information resource or not?" Well, that question is the one the TAG finding is there to help you answer. So the TAG assumes that this is an interesting question. If it's not then the TAG finding is irrelevant.
You also wrote: "If you need to say whether or not a resource is an information resource offline, it's easiest enough to make an explicit statement to that effect." How do you do that? RDF doesn't tell us, and the TAG finding tells us to do it another way. So what should we do?
Conal: It may be true that you can't conclude that rdf:Statement is an information resource by testing the URI without the fragment. But if that is the case then there is a huge class of HTTP URIs which the TAG finding does not apply to, including the most basic URIs in RDF, RDFS, and OWL. So either way the finding is broken. (I don't think you are right, BTW, but even if you are the TAG is no better off.)
You write: "If this interpretation is correct then I think it does mean you can't use fragment identifiers to refer to information resources." The trouble is that people do that all the time. The interpretation of fragment identifiers given in the RFCs is that they refer to parts of information resources (and the parts are themselves information resources). So I don't think this is correct.
Marc de Graauw - 2007-10-11 09:08:54
Unfortunately you misinterpret the TAG ruling on several critical points. I did the same, once, and thought this a non-solution, but now I find it much better than I used to.
First, the TAG httpRange-14 decision isn't intended as a mechanism to "find out whether or not [a URI] represents an information resource" as you say. There are other mechanisms in most contexts for that: an RDF statement such as:
http://www.marcdegraauw.com/person/larsmariusgarshol a foaf:person
or in Topic Maps:
instanceOf(lmg, person)
What the TAG decision does is say what you should serve from a URI which does not denote an IR, and it says in those cases one should serve a 303 to a page describing the non-IR thing. This post by Pat Hayes (a former critic of the TAG decision) explains it excellently: http://lists.w3.org/Archives/Public/www-tag/2007Sep/0017.html
The solution you describe for PSI's is perfectly acceptable, in fact DC has done this. They used to serve 200 responses with pages describing their concepts, now they serve 303 redirecting to those pages. If DC, and many others, can do this, PSI authors can too.
Your second mistake is in the #. A hash-URI can identify anything [*]. Since there is no problem, and the hash is never sent to the server, the TAG decision doesn't say anything about it. So: http://www.marcdegraauw.com/person#larsmariusgarshol can identify you, and http://www.marcdegraauw.com/person/ can identify an IR, maybe a list of person names. I think Dan Connolly and Tim Berners-Lee prefer this solution to 303-redirects.
[*] Technically, what a hash URI identifies is dependent on the media type. For HTML it typically is a point in a document, but Dan Connolly considers changing this, see: http://lists.w3.org/Archives/Public/www-tag/2007Sep/0205.html I sure think this needs changing, but right now plenty of people use hash URIs for non-IRs in practice without much trouble. If I don't serve anything (404) from http://www.marcdegraauw.com/person/ the problem vanishes altogether, though that's inconvenient.
Alaric Snell-Pym - 2008-05-29 05:30:58
As I see it, the root problem stems from the fact that people started using HTTP URIs for non-information resources in the first place. URNs would clearly have been a better choice - the whole URL/URN distinction revolves around whether the resource has a resolvable network presence (like a file or a service) or not (like an abstract concept).
It's not hard to get URNs; you can get an OID from IANA or from your national registry and use urn:oid:..., and there are other schemes about (or you can be naughty, like a lot of organisations, and just unilaterally claim urn:<orgname>:<whatever you want>... although I wouldn't recommend it when RFC4198 exists).
So why people use URLs for non-locateable resources, when URNs are easy to obtain and unambiguously avoid the whole problem, beats me!
Lars Marius - 2008-05-29 11:30:16
Alaric, I definitely agree with you on where this started to go wrong. The trouble is that now the W3C is stuck with a number of HTTP URIs referring to abstract concepts, and so they have to deal with that somehow. URNs might have been an alternative, given a suitable registered scheme, but that's too late now.