Cheshire3 Tutorials - Configuring Indexes
Indexes are the primary means of locating Records in the system, and hence need to be well thought out and specified in advance. Each Index consists of one or more XPaths to nodes in the Record, together with instructions for how to process the data once it has been located.
Example index configurations:
<subConfig id="xtitle-idx">
  <objectType>index.SimpleIndex</objectType>
  <paths>
    <object type="indexStore" ref="indexStore"/>
  </paths>
  <source>
    <xpath>/ead/eadheader/filedesc/titlestmt/titleproper</xpath>
    <process>
      <object type="extractor" ref="SimpleExtractor"/>
      <object type="normalizer" ref="SpaceNormalizer"/>
      <object type="normalizer" ref="CaseNormalizer"/>
    </process>
  </source>
  <options>
    <setting type="sortStore">true</setting>
  </options>
</subConfig>

<subConfig id="stemtitleword-idx">
  <objectType>index.ProximityIndex</objectType>
  <paths>
    <object type="indexStore" ref="indexStore"/>
  </paths>
  <source>
    <xpath>titleproper</xpath>
    <process>
      <object type="extractor" ref="ProxExtractor"/>
      <object type="tokenizer" ref="RegexpFindOffsetTokenizer"/>
      <object type="tokenMerger" ref="OffsetProxTokenMerger"/>
      <object type="normalizer" ref="CaseNormalizer"/>
      <object type="normalizer" ref="PossessiveNormalizer"/>
      <object type="normalizer" ref="EnglishStemNormalizer"/>
    </process>
  </source>
</subConfig>
This brings us to the <source> section, starting on line 6. It must contain one or more <xpath> elements. These XPaths will be evaluated against the Record to find a node, nodeSet or attribute value. The result is the base data that will be indexed after some processing. The first Index gives the full path to titleproper; the second gives only the final element name.
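As a sketch of a source with more than one xpath element, the fragment below indexes titles found in two places in an EAD record. The index id anytitle-idx and the second path /ead/archdesc/did/unittitle are assumptions chosen for illustration and do not appear in the configurations above; everything else mirrors the first example.

```xml
<!-- Hypothetical index: the id and the second XPath are illustrative only -->
<subConfig id="anytitle-idx">
  <objectType>index.SimpleIndex</objectType>
  <paths>
    <object type="indexStore" ref="indexStore"/>
  </paths>
  <source>
    <!-- Two xpath elements in one source -->
    <xpath>/ead/eadheader/filedesc/titlestmt/titleproper</xpath>
    <xpath>/ead/archdesc/did/unittitle</xpath>
    <!-- A single process chain follows the XPaths -->
    <process>
      <object type="extractor" ref="SimpleExtractor"/>
      <object type="normalizer" ref="SpaceNormalizer"/>
      <object type="normalizer" ref="CaseNormalizer"/>
    </process>
  </source>
</subConfig>
```

Because both XPaths sit in the same source, they share the one process chain that follows them.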
If the records contain XML Namespaces, then there are two approaches available. If the element names are unique across all the namespaces in the document, you can simply omit the namespace prefixes. For example, /srw:record/dc:title could be written as just /record/title. The alternative is to define the meanings of ‘srw’ and ‘dc’ on the xpath element in the normal xmlns fashion.
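The second approach might look like the following sketch, which declares both prefixes directly on the xpath element. The namespace URIs shown are the ones commonly used for SRW and Dublin Core; they are an assumption here and should be replaced with whatever URIs your records actually declare.

```xml
<source>
  <!-- Prefixes are bound with ordinary xmlns declarations on the xpath element.
       The URIs are assumed; use the ones declared in your own records. -->
  <xpath xmlns:srw="http://www.loc.gov/zing/srw/"
         xmlns:dc="http://purl.org/dc/elements/1.1/">/srw:record/dc:title</xpath>
  <process>
    <object type="extractor" ref="SimpleExtractor"/>
  </process>
</source>
```

With the first approach, the same xpath element would carry no declarations and simply contain /record/title.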
After the XPath(s), we need to tell the system how to process the data that gets pulled out. This happens in the process section, which is a list of objects through which the data is fed sequentially. The first object must be an Extractor. This may be followed by a Tokenizer and a TokenMerger. These are used to split the extracted data into tokens of a particular type, and then merge them into discrete index entries. If a Tokenizer is used, a TokenMerger must also be used. Generally any further processing objects in the chain are Normalizers.
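The ordering rules above can be summarised in an annotated sketch of a process section. It reuses object references from the second example configuration; the comments are editorial and only restate the rules.

```xml
<process>
  <!-- 1. Required first step: an extractor -->
  <object type="extractor" ref="ProxExtractor"/>
  <!-- 2. Optional pair: if a tokenizer is used, a tokenMerger must be used too -->
  <object type="tokenizer" ref="RegexpFindOffsetTokenizer"/>
  <object type="tokenMerger" ref="OffsetProxTokenMerger"/>
  <!-- 3. Any further steps are generally normalizers, applied in the order listed -->
  <object type="normalizer" ref="CaseNormalizer"/>
</process>
```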
The first Index uses the SimpleExtractor to pull out the text exactly as it appears, as a single term. This is followed by a SpaceNormalizer on line 10, which removes leading and trailing whitespace and collapses runs of adjacent whitespace characters (e.g. newlines followed by tabs or spaces) into a single space. The second Index uses the ProxExtractor; this is a special instance of SimpleExtractor that has been configured to also extract the position of the XML elements from which it is extracting. It then uses a RegexpFindOffsetTokenizer to identify word tokens, their positions and their character offsets, followed by the required OffsetProxTokenMerger to merge identical tokens into discrete index entries while maintaining the word positions and character offsets identified by the Tokenizer. Both Indexes then send the extracted terms to a CaseNormalizer, which reduces all characters to lowercase. The second Index then passes the lowercase terms to a PossessiveNormalizer, which strips a trailing 's or s' from each term, and finally to an EnglishStemNormalizer, which applies linguistic stemming.
Finally, in the first example, we have a setting called sortStore. When this is provided and set to a true value, it instructs the system to create a map from Record identifiers to terms, so that the Index can be used to quickly re-order ResultSets based on the values extracted.