Documentation: Catalog support in xmlproc

What are catalog files?

What they do

Catalog files are a means of telling a parser how to map public identifiers to system identifiers. One simple example of this would be to use a catalog file to tell an SGML parser that the DTD with the public identifier "-//W3C//DTD HTML 4.0 Transitional//EN" can be found at the location "file:///usr/pub/sgml/dtds/html40.dtd".

In other words: a public identifier is a well-known, site-independent name for something, while a system identifier tells applications how to find that thing on the local system. Given a public identifier, a catalog file lets an application look up where the corresponding resource is located at a particular site.

In addition to this, catalog files can affect the parsing of documents in other ways as well.

Where they come from

Catalog files come from the SGML community, but are not part of the SGML standard itself. The catalog file format and semantics are defined in SGML Open Technical Resolution TR9401:1997, and have since been implemented in the SP SGML parser, the DXP XML parser and xmlproc.

The format used by SP (which extends the original format somewhat) has become the de facto standard for catalog files. xmlproc supports a subset of this format.

The catalog file format

Catalog files consist of entries, each of which starts with a keyword followed by arguments separated by whitespace. Arguments that contain spaces must be quoted. Entries are separated by whitespace, and comments (which start with "--" and end with "--") can appear anywhere whitespace can appear.

An example catalog file:

-- DSSSL --

PUBLIC "-//James Clark//DTD DSSSL Flow Object Tree//EN" "c:\programfiler\apps\jade\fot.dtd"
PUBLIC "ISO/IEC 10179:1996//DTD DSSSL Architecture//EN" "c:\programfiler\apps\jade\dsssl.dtd"
PUBLIC "-//James Clark//DTD DSSSL Style Sheet//EN" "c:\programfiler\apps\jade\style-sheet.dtd"

-- HTML 2 --

PUBLIC  "-//IETF//DTD HTML//EN"                           html2.dtd
PUBLIC  "-//IETF//DTD HTML 2.0//EN"                       html2.dtd
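
The syntax rules above can be illustrated with a small tokenizer. This is only a sketch of the rules as stated on this page, not the xmlproc parser and not a full TR9401 implementation; the names tokenize, entries and ARG_COUNTS are made up for the example, and the argument counts are taken from the keyword list given later on this page.

```python
import re

# A token is a "--comment--", a quoted argument, or a bare word.
_TOKEN = re.compile(r'''
    --(?P<comment>.*?)--      # comment, skipped by tokenize()
  | "(?P<dq>[^"]*)"           # double-quoted argument
  | '(?P<sq>[^']*)'           # single-quoted argument
  | (?P<bare>\S+)             # unquoted keyword or argument
''', re.VERBOSE | re.DOTALL)

# Number of arguments per keyword (hypothetical table for this sketch).
ARG_COUNTS = {"PUBLIC": 2, "SYSTEM": 2, "DELEGATE": 2,
              "DOCUMENT": 1, "CATALOG": 1, "BASE": 1}


def tokenize(text):
    """Return the non-comment tokens of a catalog file as strings."""
    tokens = []
    for match in _TOKEN.finditer(text):
        if match.group("comment") is not None:
            continue
        for name in ("dq", "sq", "bare"):
            if match.group(name) is not None:
                tokens.append(match.group(name))
                break
    return tokens


def entries(text):
    """Group tokens into (keyword, [arguments]) tuples."""
    tokens = tokenize(text)
    result = []
    i = 0
    while i < len(tokens):
        count = ARG_COUNTS.get(tokens[i])
        if count is None:
            i += 1  # unknown keyword: simply skipped in this sketch
            continue
        result.append((tokens[i], tokens[i + 1:i + 1 + count]))
        i += 1 + count
    return result


sample = '''
-- HTML 2 --
PUBLIC  "-//IETF//DTD HTML//EN"       html2.dtd
PUBLIC  "-//IETF//DTD HTML 2.0//EN"   html2.dtd
'''
for keyword, args in entries(sample):
    print(keyword, args)
```

The sketch does no error reporting; a real parser would also have to deal with unterminated quotes and comments.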

Catalog file support in xmlproc

Level of support

The support for catalog files has not been thoroughly tested and xmlproc probably will not handle the cases where there are conflicts between entries correctly. This part of xmlproc should be considered to be of demonstration quality.

xmlproc supports the following keywords:

PUBLIC pubid sysid
Specifies that the pubid should be mapped to sysid whenever it occurs.
SYSTEM sysid1 sysid2
Specifies that whenever sysid1 appears as the explicit system identifier, sysid2 should be used instead.
DOCUMENT sysid
Specifies that if no document entity is supplied to the parser, this document should be parsed.
CATALOG sysid
Includes the catalog file at sysid.
BASE sysid
Uses sysid as the base system identifier against which relative system identifiers below this point are resolved.
DELEGATE pubid-prefix sysid
Resolves public identifiers that begin with pubid-prefix with the catalog file at sysid.
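
To make the meaning of these keywords concrete, here is a toy in-memory model of PUBLIC, SYSTEM, BASE and DELEGATE. It is a hypothetical illustration only: the ToyCatalog class and its method names are not part of xmlproc, it deliberately says nothing about which entry wins when several apply, and using urljoin for BASE is an assumption about how relative identifiers are resolved.

```python
from urllib.parse import urljoin


class ToyCatalog:
    """Hypothetical in-memory catalog; not the xmlproc implementation."""

    def __init__(self):
        self.base = ""        # current BASE, empty until one is seen
        self.public = {}      # pubid -> sysid
        self.system = {}      # sysid1 -> sysid2
        self.delegates = []   # (pubid prefix, catalog sysid) pairs

    def add(self, keyword, *args):
        """Record one entry, resolving its sysid against the current BASE."""
        if keyword == "BASE":
            self.base = args[0]
        elif keyword == "PUBLIC":
            self.public[args[0]] = urljoin(self.base, args[1])
        elif keyword == "SYSTEM":
            self.system[args[0]] = urljoin(self.base, args[1])
        elif keyword == "DELEGATE":
            self.delegates.append((args[0], urljoin(self.base, args[1])))

    def lookup_public(self, pubid):
        """Return the sysid mapped to pubid, or None if there is none."""
        return self.public.get(pubid)

    def remap_system(self, sysid):
        """Return the replacement for sysid, or sysid itself if unmapped."""
        return self.system.get(sysid, sysid)

    def delegate_for(self, pubid):
        """Return the catalog responsible for pubid's prefix, if any."""
        for prefix, catalog_sysid in self.delegates:
            if pubid.startswith(prefix):
                return catalog_sysid
        return None


cat = ToyCatalog()
cat.add("BASE", "file:///usr/pub/sgml/")
cat.add("PUBLIC", "-//W3C//DTD HTML 4.0 Transitional//EN", "dtds/html40.dtd")
cat.add("DELEGATE", "-//James Clark//", "jade.cat")
print(cat.lookup_public("-//W3C//DTD HTML 4.0 Transitional//EN"))
print(cat.delegate_for("-//James Clark//DTD DSSSL Style Sheet//EN"))
```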

How to make xmlproc use a catalog file

This is easily done. Here is some code that parses the catalog file referred to by the XMLSOCATALOG environment variable:


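import os
from xml.parsers.xmlproc import xmlval, catalog

# Create a validating parser.
p = xmlval.XMLValidator()

# Load the catalog file named by the XMLSOCATALOG environment variable.
cat = catalog.xmlproc_catalog(os.environ["XMLSOCATALOG"],
                              catalog.CatParserFactory())
p.set_pubid_resolver(cat)

# sysid is the system identifier of the document to parse.
p.parse_resource(sysid)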

Using the catalog file parser

The xmlproc implementation contains both a general catalog file parser and a general catalog file implementation, to which the xmlproc PubIdResolver is just one of many possible clients. This means that you can use this catalog file parser in your own applications.

If you just want to make xmlproc use a catalog file you should look at the xmlproc_catalog class.

The catalog module has the following classes and interfaces:

The CatalogParser class

The CatalogParser class is mainly useful if you want to develop your own catalog file support completely from scratch. It only parses the file and passes information to you, without doing anything with it. If you just want to query the parsed information you should probably look at the catalog manager below.

The CatalogParser class has these methods:

def __init__(self,error_lang=None):
This creates a parser ready for parsing. The error language can be set if desired, and accepts the same values as xmlproc itself.
def set_application(self,app):
This tells the parser where to send parse events. The application object must conform to the CatalogApp interface.
def set_error_handler(self,err):
This tells the parser where to send error events. The error handler must conform to the usual ErrorHandler interface.
def parse_resource(self,sysid):
Parses the catalog file with the given system identifier, passing error and data events.

The CatalogApp interface

This is the definition of the interface used by applications that wish to receive catalog file parsing events. No attempt is made to interpret the entries or their parameters in any way. These methods are required:

def handle_public(self,pubid,sysid):
This notifies the application of a PUBLIC entry in the catalog file.
def handle_delegate(self,prefix,sysid):
This notifies the application of a DELEGATE entry in the catalog file.
def handle_document(self,sysid):
This notifies the application of a DOCUMENT entry in the catalog file.
def handle_system(self,sysid1,sysid2):
This notifies the application of a SYSTEM entry in the catalog file.
def handle_base(self,sysid):
This notifies the application of a BASE entry in the catalog file.
def handle_catalog(self,sysid):
This notifies the application of a CATALOG entry in the catalog file.
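
As an illustration, a class that satisfies this interface can be as simple as one that records every entry it is told about. This is a sketch: RecordingCatalogApp is a made-up name, and the handler calls at the bottom are made by hand to stand in for a parser.

```python
class RecordingCatalogApp:
    """Collects catalog entries in the order they are reported."""

    def __init__(self):
        self.entries = []

    def handle_public(self, pubid, sysid):
        self.entries.append(("PUBLIC", pubid, sysid))

    def handle_delegate(self, prefix, sysid):
        self.entries.append(("DELEGATE", prefix, sysid))

    def handle_document(self, sysid):
        self.entries.append(("DOCUMENT", sysid))

    def handle_system(self, sysid1, sysid2):
        self.entries.append(("SYSTEM", sysid1, sysid2))

    def handle_base(self, sysid):
        self.entries.append(("BASE", sysid))

    def handle_catalog(self, sysid):
        self.entries.append(("CATALOG", sysid))


app = RecordingCatalogApp()

# With xmlproc installed, the wiring described above would presumably be:
#   parser = catalog.CatalogParser()
#   parser.set_application(app)
#   parser.parse_resource("catalog.soc")   # hypothetical file name
# Here the events are simulated by calling the handlers directly.
app.handle_base("file:///usr/pub/sgml/")
app.handle_public("-//IETF//DTD HTML//EN", "html2.dtd")

for entry in app.entries:
    print(entry)
```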

The CatalogManager class

The CatalogManager is the central class in the catalog implementation. Users who want to work with catalog files should instantiate a CatalogManager, let it parse and keep track of the catalog information for them, and query it only when information is needed.

The CatalogManager class has these methods:

def __init__(self):
This creates an empty CatalogManager, ready for use.
def set_error_handler(self,err):
This tells the CatalogManager where to send error messages from parsing.
def set_parser_factory(self,parser_fact):
This gives the CatalogManager an object it can use to create catalog parsers. The parser_fact object must conform to the CatParserFactory interface.
def parse_catalog(self,sysid):
Makes the CatalogManager parse the given catalog file and store the information in it internally.
def report(self,out=sys.stdout):
Makes the CatalogManager write a badly formatted report of its internal information to the out file object.
def get_document_sysid(self):
Returns the contents of the DOCUMENT entry in the catalog file.
def remap_sysid(self,sysid):
Returns the system identifier after remapping it according to the SYSTEM entries in the catalog file. (This should only be used for system identifiers that occur alone, without an accompanying public identifier.)
def resolve_sysid(self,pubid,sysid):
Returns the correct system identifier for this combination of system and public identifiers. If there was no public identifier the pubid parameter should be None.
def get_public_ids(self):
Returns a list of all public identifiers declared in this catalog and its delegates.

The CatParserFactory interface

This class is used by the CatalogManager to create catalog parsers for parsing catalog files. It is mainly interesting if you want to control which parser the CatalogManager uses for parsing its catalog files, such as if you want to use your own subclass of CatalogParser instead of the usual class.

The CatParserFactory has these methods:

def make_parser(self,sysid):
This method must return an object conforming to the CatalogParser interface.

The xmlproc_catalog class

This class is a client to the CatalogManager that conforms to the PubIdResolver interface, and so can be used to make xmlproc use a catalog file. The xmlproc_catalog class has these methods:

def __init__(self,sysid,pf,error_handler=None):

Creates an xmlproc_catalog object, ready to be given to the xmlproc parser with the set_pubid_resolver method. The sysid parameter holds the system identifier of the catalog file to use and the pf parameter holds the CatParserFactory used to create catalog file parsers.

The error_handler can be a reference to an error handler which can receive notification of errors.

The SAX_catalog class

This class is a client to the CatalogManager that conforms to the SAX EntityResolver interface, and so can be used to make a SAX parser use a catalog file for resolving entity public identifiers. The SAX_catalog class has these methods:

def __init__(self,sysid,pf):
Creates a SAX_catalog object, ready to be given to the SAX parser with the setEntityResolver method. The sysid parameter holds the system identifier of the catalog file to use and the pf parameter holds the CatParserFactory used to create catalog file parsers.
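
For readers unfamiliar with the EntityResolver interface, the sketch below shows its general shape using the xml.sax package from the Python standard library. It does not use SAX_catalog: DictEntityResolver is a hypothetical stand-in that looks public identifiers up in a plain dictionary, and the SAX version SAX_catalog was written against may differ in detail from the standard library one.

```python
import xml.sax
from xml.sax.handler import EntityResolver


class DictEntityResolver(EntityResolver):
    """Hypothetical resolver backed by a dict of pubid -> sysid."""

    def __init__(self, public_map):
        self.public_map = public_map

    def resolveEntity(self, publicId, systemId):
        # Use the mapped system identifier when the public identifier is
        # known; otherwise fall back to the one given in the document.
        return self.public_map.get(publicId, systemId)


resolver = DictEntityResolver({"-//IETF//DTD HTML//EN": "html2.dtd"})

# The resolver is handed to the parser with setEntityResolver.
parser = xml.sax.make_parser()
parser.setEntityResolver(resolver)

print(resolver.resolveEntity("-//IETF//DTD HTML//EN", "http://example.org/x.dtd"))
```

A catalog-backed resolver would replace the dictionary lookup with a query to a catalog object.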

Support for XCatalog 0.1

Just before xmlproc 0.50 was released John Cowan proposed the XCatalog 0.1 standard for catalog files in XML format. This proposal has an XML DTD which can be used to mark up catalog files instead of the special syntax used by SGML Open Catalogs. The XCatalog DTD only has a subset of the catalog file functionality implemented by xmlproc for SGML Open Catalogs.

The xmlproc XCatalog implementation is found in the xcatalog module and consists of three classes.

The support for XCatalog should be considered an experimental feature.


Last update 2000-05-11 14:20, by Lars Marius Garshol.