Cheshire3 Object Model - DocumentFactory


class cheshire3.baseObjects.DocumentFactory(session, config, parent=None)

A DocumentFactory takes raw data and returns one or more Documents.

A DocumentFactory can be used to return Documents from, e.g., a file, a directory containing many files, archive files, a URL, or a web-based API.

get_document(session, n=-1)

Return the Document at index n.

load(session, data, cache=None, format=None, tagName=None, codec='')

Load documents into the document factory from data.

Returns the DocumentFactory itself, which acts as an iterator. DocumentFactory’s load function takes session, plus:

data := the data to load. Could be a filename, a directory name, the data as a string, a URL to the data, etc.
cache := setting for how to cache documents in memory when reading them in.
format := format of the data parameter. Many options, most common:
  • xml – XML file. May contain multiple records
  • dir – a directory containing files to load
  • tar – a tar file containing files to load
  • zip – a zip file containing files to load
  • marc – a file with MARC records (library catalogue data)
  • http – a base HTTP URL to retrieve

tagName := name of the tag which starts (and ends!) a Record.

codec := name of the codec in which the data is encoded.
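The call pattern described above (call load(), then iterate over the returned factory) can be sketched as follows. This is a minimal illustration, not code from the Cheshire3 distribution: load_and_collect and StubFactory are hypothetical names, and StubFactory is only a stand-in that mimics the documented signatures so that the sketch runs without Cheshire3 installed. With a real DocumentFactory, session would be a Cheshire3 session object.

```python
def load_and_collect(factory, session, data, format=None, tagName=None):
    """Load `data` into `factory`, then gather the Documents it yields."""
    # load() is documented to return the factory itself, which acts as an
    # iterator, so the return value can be looped over directly.
    loaded = factory.load(session, data, cache=None,
                          format=format, tagName=tagName)
    return [doc for doc in loaded]


class StubFactory:
    """Hypothetical stand-in that mimics the documented method signatures."""

    def __init__(self):
        self._docs = []

    def load(self, session, data, cache=None, format=None,
             tagName=None, codec=''):
        # Assumption for this sketch only: each non-empty line of `data`
        # is treated as one document.
        self._docs = [line for line in data.splitlines() if line.strip()]
        return self

    def __iter__(self):
        return iter(self._docs)


collected = load_and_collect(StubFactory(), None, "doc one\ndoc two\n",
                             format='xml', tagName='record')
```

Because load() returns the factory, the load call and the loop can also be chained in a single expression if preferred.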

classmethod register_stream(session, format, cls)

Register a new format, handled by given DocumentStream (cls).

Class method that registers a DocumentStream implementation (cls) under a format name (format), so that future calls to load() can select it through the format parameter.
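The registration mechanism can be pictured as a class-level lookup table from format names to stream classes. The sketch below is a simplified model of that idea, not Cheshire3's actual implementation: MiniFactory, LineStream, the _streams dictionary, and the 'lines' format name are all hypothetical, and the stream constructor takes only the data for brevity.

```python
class MiniFactory:
    """Toy factory illustrating a format-name -> stream-class registry."""

    # Hypothetical class-level registry shared by all instances.
    _streams = {}

    @classmethod
    def register_stream(cls, session, format, stream_cls):
        # Record which stream class handles the given format name.
        cls._streams[format] = stream_cls

    def load(self, session, data, format=None):
        # Look up the stream class registered for this format.
        try:
            stream_cls = self._streams[format]
        except KeyError:
            raise ValueError("no stream registered for format %r" % (format,))
        self._stream = stream_cls(data)
        return self

    def __iter__(self):
        return iter(self._stream)


class LineStream:
    """Hypothetical stream: yields one document per non-empty line."""

    def __init__(self, data):
        self.data = data

    def __iter__(self):
        return (line for line in self.data.splitlines() if line.strip())


# Register the new format once, then use its name in later load() calls.
MiniFactory.register_stream(None, 'lines', LineStream)
docs = list(MiniFactory().load(None, "first\n\nsecond\n", format='lines'))
```

Keeping the registry on the class rather than on an instance means a format registered once is available to every factory created afterwards.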


The following implementations are included in the distribution by default:

class cheshire3.documentFactory.SimpleDocumentFactory(session, config, parent)
class cheshire3.documentFactory.ClusterExtractionDocumentFactory(session, config, parent)

Load lots of records, cluster and return the cluster documents.