Topic Maps and Semantic Search
Posted in Technology on 2007-01-30 16:14
Tonight was another of the monthly Topic Maps users' group meetings, and this time the subject was Topic Maps and semantic search. There were two presentations: one on Compass, a Topic Maps-based search tool, and a case study of a Norwegian web site.
Eszter Horvati from Ovitas spoke first about Compass (which she also presented at TMRA 2006 in Leipzig). Compass is a commercial Topic Maps-based search tool built on the open source Lucene search engine. It requires you to have a "semantic model of the domain" (i.e., a real topic map), names for your topics (used as synonyms), and associations. The association types are annotated with a number between 0 and 1 indicating how strongly associations of that type bind the topics on either side together. Compass uses the multiple names of a topic to let different search strings hit the same topic.
You have to build the topic map yourself before loading it into Compass, and this includes annotating the association types with their weights. They use an Excel-based editor for this, and load it into the TMCore engine from NetworkedPlanet. It sounds as though you have to update the topic map using Excel and then reload it into Compass.
The basic search procedure uses the topic map to expand the search. The search first finds relevant topics, and then their related topics. The whole set is handed off to fulltext search, and Compass then uses all of this to produce a combined result. It seems pretty clear that the weighting is used here to combine the relevance of topics you've hit directly with that of their related topics, although Eszter did not say this explicitly. If you don't hit any topics, the result is basically a normal fulltext search. If you enter several words it does a separate search for each word; from the demo it appears to understand compound terms (like "Topic Maps").
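As a rough illustration, the expansion step described above might look something like the following sketch. The data structures, names, and weights here are my own assumptions for illustration, not Compass internals:

```python
# Hypothetical sketch of Compass-style query expansion.
# Topic map: topic -> list of (related topic, association-type weight)
ASSOCIATIONS = {
    "topic maps": [("xtm", 0.9), ("semantic web", 0.4)],
    "lucene": [("fulltext search", 0.8)],
}

# Synonyms: search string -> canonical topic (multiple names per topic)
NAMES = {
    "topic maps": "topic maps",
    "tm": "topic maps",
    "lucene": "lucene",
}

def expand_query(words):
    """Map query words to topics, then pull in related topics
    weighted by their association type."""
    hits = {}
    for word in words:
        topic = NAMES.get(word.lower())
        if topic is None:
            continue                 # no topic hit: plain fulltext only
        hits[topic] = 1.0            # direct hits get full weight
        for related, weight in ASSOCIATIONS.get(topic, []):
            hits[related] = max(hits.get(related, 0.0), weight)
    return hits

print(expand_query(["tm"]))
# -> {'topic maps': 1.0, 'xtm': 0.9, 'semantic web': 0.4}
```

The expanded topic set, with its weights, would then be combined with the ordinary fulltext relevance scores to rank the final result list.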
The search result is represented in XML and can be presented in any way you want. Hits appear in a normal Google-like list, together with a list of hits grouped by the topic that's been hit, and a small set of relations to other topics. The last bit is the relations of all the topics used in your search, merged together. Eszter seemed to indicate that you could display them separately, if desired. Clicking a related topic does a search for that topic as though you'd typed it in.
It's not too clear how the resources enter the picture, but it sounds as though they simply index the web site, and that the resources are not directly connected to the topic map at all. That is, it's the fulltext search that brings up the related resources. You can configure which pages are indexed, but I'm not sure how.
Eszter said it was easy to integrate with content repositories. The repositories can send requests to the indexer via a web service interface (REST style) to have it add, reindex, or delete a resource. Resources are handled by passing in HTTP URIs, which are then used by the indexer to both identify and retrieve the resource.
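To make the REST-style interface concrete, here is a sketch of what such requests might look like from the repository side. The endpoint, path, and parameter names are my guesses for illustration, not the actual Compass API:

```python
# Hypothetical sketch of a REST-style indexing interface: the
# repository identifies a resource by its HTTP URI, and the indexer
# uses that same URI to retrieve the content.
import urllib.parse
import urllib.request

BASE = "http://indexer.example.com/index"   # invented endpoint

def index_request(resource_uri):
    """Build a request asking the indexer to add or reindex
    the resource identified by this HTTP URI."""
    query = urllib.parse.urlencode({"uri": resource_uri})
    return urllib.request.Request(f"{BASE}?{query}", method="PUT")

def delete_request(resource_uri):
    """Build a request asking the indexer to drop the resource."""
    query = urllib.parse.urlencode({"uri": resource_uri})
    return urllib.request.Request(f"{BASE}?{query}", method="DELETE")

req = index_request("http://example.com/page.html")
print(req.get_method(), req.full_url)
```

The point of the design is that the URI does double duty: it both identifies the resource and tells the indexer where to fetch it.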
Compass is .NET-based, and as far as I can tell they use the .NET port of Lucene. The first project it is going to be used on is Felleskatalogen, a catalogue of drugs based on drug descriptions from the Norwegian Medicines Agency.
Stian and Petter speaking
The next presentation was by Stian Danenbarger (Bouvet) and Petter Thorsrud (the customer) about how search is done in the new Topic Maps-based site regjeringen.no (i.e., TheGovernment.no). The site will go live on February 12th if everything goes as planned. It is the government's main communications channel to the citizens, which includes both the man in the street and professional users (such as journalists). The content seems to be mainly documents from the various departments.
The existing site primarily targets professional users, with navigation based on document types and the organizational structure of the government. This worked well for professional users, but the site owners considered the metadata insufficient for their goals. The new site is based on the EPiServer web CMS and the TMCore Topic Maps engine. EPiServer handles the actual content, while TMCore takes care of the metadata and relations. They also wanted a good search engine, and chose to continue their use of FAST.
The original metadata on documents was the typical administrative stuff, such as document type, owner, language, time, and status. In addition they had titles, a few hundred keywords, and so on. In the new system they defined a set of "navigational axes", and a navigation structure for each. Examples were things like organizational structure, document types, and subject structure. They created a hierarchical subject structure of about 200 keywords with some cross connections, and a set of about 3,500 keywords below that. There is also a bit of more semantic information, like which governmental agencies are responsible for a subject.
The editors have to add metadata for this to work, and with 85,000 existing documents this was a bit of a challenge. They wanted automated or semi-automated classification to ease the strain on the editors, both for existing content and when publishing new content. They are using the automatic classification tools in the FAST toolkit. You need to train these with examples for each category (or topic, in Topic Maps terms), and new documents are then proposed for the categories whose existing documents they are sufficiently similar to. FAST uses a vector-based system, which is what provides the similarity measure. So far they have not been able to get this to work well enough that they dare use it in production.
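As a toy illustration of vector-based classification of this general kind: documents become term vectors, and a new document is proposed for a category when it is similar enough (here, by cosine similarity) to that category's training examples. The data, threshold, and term weighting below are invented; FAST's actual model is certainly more sophisticated:

```python
# Toy vector-space classifier: term-frequency vectors plus cosine
# similarity against per-category training examples.
import math
from collections import Counter

def vector(text):
    """Naive term-frequency vector (no stemming or stopwords)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Invented training examples per category
TRAINING = {
    "health": ["drug prices and prescriptions", "hospital drug policy"],
    "transport": ["road tolls and railway funding"],
}

def propose_category(text, threshold=0.3):
    """Propose the most similar category, or None if nothing is
    similar enough -- mirroring 'proposed' rather than auto-assigned."""
    doc = vector(text)
    scores = {cat: max(cosine(doc, vector(ex)) for ex in examples)
              for cat, examples in TRAINING.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None

print(propose_category("new rules for drug prescriptions"))  # -> health
```

The threshold is what makes this semi-automatic: below it, the document goes back to an editor instead of being classified, which matches the caution about not trusting the classifier in production yet.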
They ran out of time towards the end, but showed an example of the search, which was basically a fulltext search of the content, where the topic map could be used to filter the results. This was standard faceted filtering (which is becoming quite common on Topic Maps-based sites), where you could filter your search by document type, subject, and publishing organization. One interesting addition is that you can also search within the longer documents, where the results are sections instead of complete documents. You can also do searches within parts of the web site.
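The faceted-filtering pattern is simple enough to sketch. The field names and hit data below are invented for illustration; in the real site the facet values would come from the topic map:

```python
# Minimal faceted filtering over a list of search hits.
HITS = [
    {"title": "Budget 2007", "doctype": "report",
     "org": "Ministry of Finance"},
    {"title": "Press release on tolls", "doctype": "press release",
     "org": "Ministry of Transport"},
    {"title": "Health report", "doctype": "report",
     "org": "Ministry of Health"},
]

def facet_counts(hits, field):
    """Count hits per facet value, for display next to the results."""
    counts = {}
    for hit in hits:
        counts[hit[field]] = counts.get(hit[field], 0) + 1
    return counts

def filter_hits(hits, **facets):
    """Keep only hits matching every selected facet value."""
    return [h for h in hits
            if all(h[f] == v for f, v in facets.items())]

print(facet_counts(HITS, "doctype"))
# -> {'report': 2, 'press release': 1}
print(filter_hits(HITS, doctype="report", org="Ministry of Health"))
```

Each selected facet narrows the hit list, and the counts for the remaining facets are recomputed from the narrowed list, which is what gives the familiar drill-down feel.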
After this we headed off to Myrens Kjøkken for refreshments...
Are Gulbrandsen - 2007-02-02 12:28:19
The good news for people who missed the meeting (or don't understand Norwegian) is that both presentations will be on the program of Topic Maps 2007 in a slightly different format (30 minutes and in English):
regjeringen.no – where Topic Maps and Search govern the user experience
Government Administration Services, Norway
The new website for the Norwegian Government and the Ministries combines commercially available components – Web CMS, Search engine and Topic Maps engine – to create an enhanced user experience for both editors and end users. The presentation describes how the technologies mutually enhance each other's capabilities when creating and presenting new content and 280,000 web pages of legacy content.
Improved findability through semantic search
Ovitas has developed a solution that improves result relevancy of fulltext search by adding semantics. By making use of a knowledge model that represents objects, concepts and relationships in the domain of discourse, we enhance a traditional fulltext search in a way that ensures that the user will never step out of his selected semantic space. We will show an example of the solution, Compass, in use through the websites of the Norwegian Pharmaceutical Formulary, “Felleskatalogen”.