Architectural problems with archive web services
Posted in Technology on 2013-12-15 11:19
I've already been through the problems with the NOARK standard, and hinted at issues with the way the web services to these systems have been designed. What I describe here applies not just to the semi-standardized NOARK web services, but also to the proprietary interfaces offered by the archive products.
Let's go through the issues one by one.
Using the proprietary interfaces
Organizations using the proprietary interfaces will find that as the number of archive integrations grows, switching to a different archive system gets more and more expensive, because all of the archive integrations have to be rewritten from scratch.
In addition, they quickly find that since the client applications have been integrated directly against the archive, all integrations have to be retested every time the archive is upgraded. Once you go beyond 2-3 integrations this starts getting really painful.
(Quite a few organizations have "solved" this last issue by doing so many customizations to their archive system that upgrades are no longer possible.)
These web service interfaces, and the clients that use them, are generally synchronous. That is, they've been designed so that everything hangs until the archive has completed processing the request and sent a response. Since NOARK implementations are generally neither fast nor particularly stable, it follows that users wind up spending a good deal of time waiting for the archive. And if the archive is down, substantial chunks of the functionality in client systems may not work at all.
Again, as the number of integrations grows, the problem gets steadily worse.
Many people think that web services by their very nature are loosely coupled, but everything is relative. These interfaces are generally designed so that client systems must first either create or find and reuse a case, and only afterwards can they place a document in the case. Further, the clients need to fill in metadata according to the structure used by the archive. This means filling in taxonomy categories and the many fixed, required fields that NOARK archives love.
What happens when the interface is designed this way is that the internal metadata structure of the archive becomes hard-wired into the code of all client applications. Every single client application must know what defines a case, which fields are used, and what values go where, in exquisite detail. As integrations accumulate, the structure becomes wired into more and more clients.
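To make the coupling concrete, here is a minimal sketch of what client code tends to look like under such an interface. Every class, field name, and value below is invented for illustration; none of it is taken from NOARK or from any archive product.

```python
from dataclasses import dataclass


@dataclass
class ArchiveCase:
    """A case as the (hypothetical) archive models it internally."""
    title: str
    classification_code: str  # position in the archive's taxonomy
    case_type: str
    responsible_unit: str


@dataclass
class ArchiveDocument:
    """A document, which may only exist inside a case."""
    case: ArchiveCase
    title: str
    document_type: str
    access_code: str


def archive_invoice(invoice_number: str, customer_name: str) -> ArchiveDocument:
    """Client code living in, say, an invoicing system."""
    # Every literal below is knowledge about the archive's internal
    # structure that now lives in the client. Reorganizing the archive
    # means editing and retesting this function in every client.
    case = ArchiveCase(
        title=f"Invoices for {customer_name}",
        classification_code="123.4",
        case_type="FINANCE",
        responsible_unit="ACCOUNTING",
    )
    return ArchiveDocument(
        case=case,
        title=f"Invoice {invoice_number}",
        document_type="OUTGOING",
        access_code="OPEN",
    )
```

Multiply the hard-coded values in `archive_invoice` by the number of client systems, and it becomes clear why changing the taxonomy is so expensive.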
At one organization I visited, the archivists had wanted to reorganize the archive for seven years, but had been unable to, because that meant rewriting the single archive integration they had. While I was there they were finally able to carry out the reorganization, because the integration had to be ported to a new web service interface anyway.
Imagine the situation once you get up to five or ten integrations.
Generally, the quality of the metadata produced by these integrations is very, very poor. Even something as simple as who is responsible for the document very often gets lost. In many cases, a system user representing the external system gets registered as the responsible user for all documents coming from that system.
Nearly all documents in the archive have an external contact either as the sender or the recipient, and this is a key piece of metadata. A very important user requirement is to be able to see all correspondence with a single external contact. Most integrations, however, do not include the identity of the contact in the metadata, but simply repeat the name, address, etc. for each document. And since these may change or be mistyped, the identity is effectively lost.
Most NOARK systems have an internal register of contacts that could be used, but since external systems generally have their own registers this becomes too cumbersome to support, and so data quality goes out the window. And even if the register were used, typically one would import the client contacts into the archive, duplicating those that are already there. And different clients generally have different contact databases, further compounding duplication.
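A small sketch shows why a repeated name is no substitute for an identity. The documents, the `contact_id` field, and the names are all made up for the example.

```python
from collections import defaultdict

# Three documents from the same external contact. The contact_id is a
# hypothetical stable identifier; the name is retyped for each document.
documents = [
    {"title": "Application", "contact_id": "c-17", "contact_name": "Kari Nordmann"},
    {"title": "Reply", "contact_id": "c-17", "contact_name": "Kari Normann"},  # mistyped
    {"title": "Appeal", "contact_id": "c-17", "contact_name": "Kari Hansen"},  # name changed
]


def group_by(docs, key):
    """Group document titles by the value of the given metadata field."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[key]].append(doc["title"])
    return dict(groups)


by_name = group_by(documents, "contact_name")  # three separate "contacts"
by_id = group_by(documents, "contact_id")      # one contact, three documents
```

Grouping by name splits the correspondence into three unrelated contacts, while grouping by identifier keeps it together, which is what the "all correspondence with this contact" requirement needs.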
So how can this be solved? Actually, it's not that hard. What you need is a web service interface where clients can hand over documents with the metadata they have, using the client's internal metadata vocabulary. The server queues these, then responds. Once it's ready, the server translates the metadata into its internal model, adds additional metadata, and archives the document. I should explain how, but this blog post is already too long, so that will have to wait for the next blog post.
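In the meantime, the basic flow can be sketched in a few lines. This is only an illustration under stated assumptions, not the actual design: the in-memory queue, the `FIELD_MAPPINGS` table, and all field names are hypothetical stand-ins.

```python
import queue
from dataclasses import dataclass


@dataclass
class Submission:
    source_system: str
    content: bytes
    metadata: dict  # in the client's own vocabulary


@dataclass
class ArchiveRecord:
    content: bytes
    metadata: dict  # in the archive's internal model


# Hypothetical per-client mappings from client field names to archive
# field names. These live on the server, not in the clients.
FIELD_MAPPINGS = {
    "invoicing": {
        "invoice_no": "title",
        "customer_id": "contact_id",
        "clerk": "responsible",
    },
}

inbox = queue.Queue()  # stand-in for a durable queue
archive = []           # stand-in for the archive itself


def submit(submission):
    """Web service entry point: queue the document and respond at once."""
    inbox.put(submission)
    return "accepted"


def translate(submission):
    """Map client metadata to the archive model and add server-side metadata."""
    mapping = FIELD_MAPPINGS[submission.source_system]
    metadata = {
        mapping[key]: value
        for key, value in submission.metadata.items()
        if key in mapping
    }
    metadata["source_system"] = submission.source_system
    return ArchiveRecord(submission.content, metadata)


def process_inbox():
    """Run by the server whenever it is ready; returns the number archived."""
    count = 0
    while not inbox.empty():
        archive.append(translate(inbox.get()))
        count += 1
    return count
```

The client only ever calls `submit`, so it neither waits for the archive nor knows anything about cases, taxonomies, or required fields. Reorganizing the archive means changing the mapping on the server, not rewriting the clients.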