Cheshire3 Object Model
Abstract Base Class
Abstract Base Class for all configurable objects within the Cheshire3 framework. It is not the base class for Data Objects.
See C3Object for details and API
- class cheshire3.baseObjects.Session(user=None, logger=None, task='', database='', environment='terminal')
An object to be passed around amongst the processing objects to maintain a session. It stores, for example, the current environment, user and identifier for the database.
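The idea of a session that travels with each call can be sketched as follows. `SketchSession` and `describe` are hypothetical stand-ins written for this illustration, not the Cheshire3 classes; only the constructor parameters mirror the signature shown above.

```python
class SketchSession:
    """Hypothetical stand-in that mirrors the constructor signature above."""

    def __init__(self, user=None, logger=None, task='', database='',
                 environment='terminal'):
        self.user = user
        self.logger = logger
        self.task = task
        self.database = database
        self.environment = environment


def describe(session):
    """Hypothetical processing step that reads its context from the session."""
    who = session.user if session.user is not None else 'anonymous'
    return '%s on %s via %s' % (who, session.database, session.environment)


# 'db_example' is an invented database identifier.
session = SketchSession(database='db_example')
summary = describe(session)
```

Passing one object around in this way keeps the signatures of processing steps stable when new contextual information is needed.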
Objects that summarize and provide persistent storage for other objects and their metadata.
A Server is a collection point for other objects and an initial entry into the system for requests from a ProtocolHandler. A Server might know about several Databases, RecordStores and so forth, but its main function is to check whether the request should be accepted or not and create an environment in which the request can be processed.
It will likely have access to a UserStore database which maintains authentication and authorization information. The exact nature of this information is not defined, allowing many possible backend implementations.
Servers are the top level of configuration for the system, and hence their constructor requires the path to a local XML configuration file. From then on, however, configuration information may be retrieved from other locations, such as a remote datastore, so that distributed environments can stay synchronized.
It is responsible for maintaining and allowing access to its components, as well as metadata associated with the collections. It must be able to interpret a request, splitting it amongst its known resources and then recombining the values into a single response.
A persistent storage mechanism for Records.
A persistent storage mechanism for ResultSet objects.
Objects representing data to be stored, indexed, discovered or manipulated.
A Document is a wrapper for raw data and its metadata.
A Document is the raw data which will become a Record. It may be processed into a Record by a Parser, or into another Document type by a PreParser. Documents might be stored in a DocumentStore, if necessary, but can generally be discarded. Documents may be anything from a JPG file, to an unparsed XML file, to a string containing a URL. This allows for future compatibility with new formats, as they may be incorporated into the system by implementing a Document type and a PreParser.
A Record is a wrapper for parsed data and its metadata.
Records in the system are commonly stored in an XML form. Attached to the Record is various configurable metadata, such as the time it was inserted into the Database and by which User. Records are stored in a RecordStore and retrieved via a persistent and unique identifier. The Record data may be retrieved as a list of SAX events, as serialized XML, as a DOM tree or ElementTree (depending on which implementation is used).
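The three views of the same parsed data mentioned above can be sketched with the Python standard library alone. This does not use the Cheshire3 Record API; `EventCollector` and the event tuples it builds are assumptions made for the illustration.

```python
import xml.sax
import xml.etree.ElementTree as ET

xml_text = '<record><title>Example</title></record>'

# ElementTree view of the data.
root = ET.fromstring(xml_text)
title = root.findtext('title')

# Serialized XML view, regenerated from the tree.
serialized = ET.tostring(root, encoding='unicode')


class EventCollector(xml.sax.ContentHandler):
    """Collects SAX callbacks as a list of tuples (an illustrative format only)."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(('start', name))

    def characters(self, content):
        self.events.append(('text', content))

    def endElement(self, name):
        self.events.append(('end', name))


# SAX event view: the parser may deliver text in more than one chunk.
collector = EventCollector()
xml.sax.parseString(xml_text.encode('utf-8'), collector)
sax_events = collector.events
```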
A collection of results, commonly pointers to Records.
Typically created in response to a search on a Database. ResultSets are also the return value when searching an IndexStore or Index and are merged internally to combine results when searching multiple Indexes combined with boolean operators.
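Merging results under boolean operators can be sketched as below. This is a simplified illustration that tracks record identifiers only; `combine` and the identifiers are hypothetical and are not the Cheshire3 ResultSet API.

```python
def combine(operator, left, right):
    """Combine two lists of record identifiers with a boolean operator.

    The order of the left-hand list is preserved in the output.
    """
    left_ids = set(left)
    right_ids = set(right)
    if operator == 'and':
        return [rid for rid in left if rid in right_ids]
    if operator == 'or':
        return left + [rid for rid in right if rid not in left_ids]
    if operator == 'not':
        return [rid for rid in left if rid not in right_ids]
    raise ValueError('unknown operator: %r' % operator)


# Invented results from two separate index searches.
title_hits = ['rec1', 'rec2', 'rec3']
author_hits = ['rec2', 'rec3', 'rec4']

both = combine('and', title_hits, author_hits)
either = combine('or', title_hits, author_hits)
only_title = combine('not', title_hits, author_hits)
```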
A Workflow defines a series of processing steps.
A Workflow is similar to the process chain concept of an index, but acts at a more global level. It allows a sequence of Cheshire3 objects and simple code to be defined in configuration and then executed for input objects.
For example, one might define a common Workflow pattern of PreParsers, a Parser and then indexing routines in the XML configuration, and then run each Document in a DocumentFactory through it. This allows users who are not familiar with Python, but who are familiar with XML and the available Cheshire3 processing objects, to implement tasks as required by changing only configuration files. It thus also allows a user to configure personal workflows in a Cheshire3 system whose code they don’t have permission to modify.
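The pattern of running each input through an ordered series of steps can be sketched in plain Python. In Cheshire3 the steps are declared in XML configuration; here they are ordinary functions, and every name below is hypothetical.

```python
import xml.etree.ElementTree as ET


def run_workflow(steps, item):
    """Apply each step in order, feeding each output into the next step."""
    for step in steps:
        item = step(item)
    return item


def strip_whitespace(text):
    """Stands in for a PreParser: raw data in, cleaned raw data out."""
    return text.strip()


def parse(text):
    """Stands in for a Parser: raw data in, parsed structure out."""
    return ET.fromstring(text)


def index_title(record):
    """Stands in for an indexing routine."""
    return {'title': record.findtext('title')}


workflow = [strip_whitespace, parse, index_title]
documents = ['  <doc><title>One</title></doc>\n',
             '\n<doc><title>Two</title></doc>  ']
results = [run_workflow(workflow, doc) for doc in documents]
```

Because the workflow is just an ordered list, changing the processing means changing the list rather than the code of the individual steps.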
For example, the input document might consist of SGML data. The output would be a Document containing XML data.
This functionality allows for Workflow chains to be strung together in many ways, and perhaps in ways which the original implementation had not foreseen.
Often a simple wrapper around an XML parser, although implementations also exist for various types of RDF data.
The entry point can be defined using one or more Selectors (e.g. an XPath expression), and the extraction process can be defined using a Workflow chain of standard objects. These chains must start with an Extractor, but from there might then include Tokenizers, PreParsers, Parsers, Transformers, Normalizers, even other Indexes. A processing chain usually finishes with a TokenMerger to merge identical tokens into the appropriate data structure (a dictionary/hash/associative array).
A Selector is a simple wrapper around a means of selecting data.
This could be an XPath or some other means of selecting data from the parsed structure in a Record.
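As a rough illustration of selecting data from a parsed structure, the sketch below uses the limited XPath subset supported by Python's `xml.etree.ElementTree`. The `select` function and the sample XML are hypothetical and do not reflect the Cheshire3 Selector API.

```python
import xml.etree.ElementTree as ET

record_xml = ('<record><title>First</title>'
              '<section><title>Second</title></section></record>')
tree = ET.fromstring(record_xml)


def select(root, path):
    """Hypothetical selector: return the elements matching a path."""
    return root.findall(path)


# Direct children only, then all descendants in document order.
top_titles = [el.text for el in select(tree, 'title')]
all_titles = [el.text for el in select(tree, './/title')]
```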
An Extractor takes selected data and returns extracted values.
Example Extractors might extract all text from within a DOM node / etree Element, or select all text that occurs between a pair of selected DOM nodes / etree Elements.
Extractors must also be used on the query terms to apply the same keyword processing rules, for example.
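Extracting all text from within an etree Element might look like the following sketch. `extract_text` is a hypothetical helper, not a Cheshire3 Extractor.

```python
import xml.etree.ElementTree as ET

element = ET.fromstring('<p>Open <b>source</b> search</p>')


def extract_text(el):
    """Hypothetical extractor: all text within an element, children included."""
    return ''.join(el.itertext())


extracted = extract_text(element)
```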
A Tokenizer takes a string and returns an ordered list of tokens.
A Tokenizer takes a string of language and processes it to produce an ordered list of tokens.
Example Tokenizers might extract keywords by splitting on whitespace, or by identifying common word forms using a regular expression.
The incoming string is often in a data structure (dictionary / hash / associative array), as per output from Extractor.
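The two tokenizing strategies mentioned above can be compared in a short sketch. The function names and the regular expression are assumptions chosen for the illustration, and the input here is a bare string rather than a dictionary.

```python
import re

# Letters and digits, optionally followed by an apostrophe and more letters.
WORD = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?")


def tokenize_whitespace(text):
    """Split on runs of whitespace; punctuation stays attached to tokens."""
    return text.split()


def tokenize_regex(text):
    """Keep only the substrings that look like word forms."""
    return WORD.findall(text)


text = "Don't panic: search, then retrieve."
by_whitespace = tokenize_whitespace(text)
by_regex = tokenize_regex(text)
```

Both functions preserve the order of the tokens, which later steps can use for proximity information.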
A Normalizer modifies terms to allow effective comparison.
Example Normalizers might standardize the case, perform stemming or transform a date into ISO8601 format.
Normalizers are also needed to transform the terms in a request into the same format as the term stored in the Index. For example a date index might be searched using a free text date and that would need to be parsed into the normalized form in order to compare it with the stored data.
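A minimal sketch of why the same normalization must run at index time and at query time follows. The function names and the list of accepted date formats are assumptions, and month names are parsed under an English locale.

```python
from datetime import datetime


def normalize_case(term):
    """Standardize case so that 'Search' and 'search' compare equal."""
    return term.lower()


def normalize_date(text, formats=('%d %B %Y', '%B %d, %Y', '%Y-%m-%d')):
    """Try each known format in turn and return an ISO 8601 date string."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError('unrecognized date: %r' % text)


# The stored term and the query term are written differently
# but compare equal once both have been normalized.
stored = normalize_date('2 March 2009')
queried = normalize_date('March 2, 2009')
```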
A TokenMerger merges identical tokens and returns a hash.
A TokenMerger takes an ordered list of tokens (i.e. as produced by a Tokenizer) and merges them into a hash. This might involve merging multiple tokens per key, while maintaining frequency, proximity information etc.
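Merging an ordered token list into a hash while keeping frequency and position information can be sketched as follows. The dictionary layout (`occurrences`, `positions`) is an assumption made for this example, not the structure Cheshire3 uses.

```python
def merge_tokens(tokens):
    """Merge identical tokens, recording how often and where each occurs."""
    merged = {}
    for position, token in enumerate(tokens):
        entry = merged.setdefault(token, {'occurrences': 0, 'positions': []})
        entry['occurrences'] += 1
        entry['positions'].append(position)
    return merged


merged = merge_tokens(['to', 'be', 'or', 'not', 'to', 'be'])
```

Keeping the positions alongside the counts is what makes proximity comparisons possible after the tokens have been merged.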
A Transformer may be seen as the opposite of a Parser. It takes a Record and produces a Document. In many cases this can be handled by an XSLT stylesheet, but other instances might include one that returns a binary file based on the information in the Record.
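The direction of a Transformer, from parsed data back to an output document, can be sketched without XSLT. `transform_to_text` and the sample record are hypothetical; the Python standard library has no XSLT processor, so this sketch walks the tree directly.

```python
import xml.etree.ElementTree as ET

record = ET.fromstring(
    '<record><title>Example</title><year>2009</year></record>')


def transform_to_text(rec):
    """Hypothetical transformer: render a parsed record as plain text."""
    lines = ['%s: %s' % (child.tag, child.text) for child in rec]
    return '\n'.join(lines)


document = transform_to_text(record)
```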