Class Lucene40TermVectorsFormat


  • public class Lucene40TermVectorsFormat
    extends TermVectorsFormat
    Lucene 4.0 Term Vectors format.

    Term vector support is optional on a field-by-field basis. It consists of three files.

    1. The Document Index or .tvx file.

      For each document, this stores the offset into the document data (.tvd) and field data (.tvf) files.

      DocumentIndex (.tvx) --> Header,<DocumentPosition,FieldPosition>^NumDocs

      • Header --> CodecHeader
      • DocumentPosition --> UInt64 (offset in the .tvd file)
      • FieldPosition --> UInt64 (offset in the .tvf file)
    2. The Document or .tvd file.

      This contains, for each document, the number of fields, a list of the fields with term vector info and finally a list of pointers to the field information in the .tvf (Term Vector Fields) file.

      The .tvd file is used to map out the fields that have term vectors stored and where the field information is in the .tvf file.

      Document (.tvd) --> Header,<NumFields, FieldNums, FieldPositions>^NumDocs

      • Header --> CodecHeader
      • NumFields --> VInt
      • FieldNums --> <FieldNumDelta>^NumFields
      • FieldNumDelta --> VInt
      • FieldPositions --> <FieldPositionDelta>^(NumFields-1)
      • FieldPositionDelta --> VLong
    3. The Field or .tvf file.

      This file contains, for each field that has a term vector stored, a list of the terms, their frequencies and, optionally, position, offset, and payload information.

      Field (.tvf) --> Header,<NumTerms, Flags, TermFreqs>^NumFields

      • Header --> CodecHeader
      • NumTerms --> VInt
      • Flags --> Byte
      • TermFreqs --> <TermText, TermFreq, Positions?, PayloadData?, Offsets?>^NumTerms
      • TermText --> <PrefixLength, Suffix>
      • PrefixLength --> VInt
      • Suffix --> String
      • TermFreq --> VInt
      • Positions --> <PositionDelta, PayloadLength?>^TermFreq
      • PositionDelta --> VInt
      • PayloadLength --> VInt
      • PayloadData --> Byte^NumPayloadBytes
      • Offsets --> <VInt, VInt>^TermFreq

      Notes:

      • The Flags byte stores whether this term vector has position, offset, and payload information stored.
      • Term byte prefixes are shared. The PrefixLength is the number of initial bytes from the previous term which must be prepended to a term's suffix in order to form the term's bytes. Thus, if the previous term's text was "bone" and the term is "boy", the PrefixLength is two and the suffix is "y".
      • If payloads are disabled for the term's field, PositionDelta is the difference between the position of the current occurrence in the document and the position of the previous occurrence (the previous position is taken to be zero for the first occurrence in this document). If payloads are enabled for the term's field, then PositionDelta/2 (integer division) is the difference between the current and the previous position. If payloads are enabled and PositionDelta is odd, then PayloadLength is stored, indicating the length of the payload at the current term position.
      • PayloadData is metadata associated with a term position. If PayloadLength is stored at the current position, then it indicates the length of this payload. If PayloadLength is not stored, then this payload has the same length as the payload at the previous position. PayloadData encodes the concatenated bytes for all of a term's occurrences.
      • Offsets are stored as delta-encoded VInts. The first VInt is the startOffset, the second is the endOffset.
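
Because every .tvx entry is a fixed-width pair of UInt64 values (DocumentPosition, FieldPosition), the entry for a given document can in principle be located by arithmetic alone. The following is a minimal sketch of that calculation, not the reader's actual code: the class name TvxEntryOffset, the method entryOffset, and the headerLength parameter are hypothetical, and the header length used in main is a placeholder rather than the real CodecHeader size.

```java
final class TvxEntryOffset {
    // Each document contributes two UInt64 values:
    // DocumentPosition (offset into .tvd) and FieldPosition (offset into .tvf).
    static final int ENTRY_BYTES = 2 * Long.BYTES;

    /**
     * Returns the byte offset of a document's entry within the .tvx file.
     *
     * @param headerLength size in bytes of the CodecHeader (assumed to be known by the caller)
     * @param docId        zero-based document number within the segment
     */
    static long entryOffset(long headerLength, int docId) {
        if (docId < 0) {
            throw new IllegalArgumentException("docId must be non-negative: " + docId);
        }
        return headerLength + (long) docId * ENTRY_BYTES;
    }

    public static void main(String[] args) {
        long headerLength = 32; // placeholder value for illustration only
        System.out.println(entryOffset(headerLength, 0)); // 32
        System.out.println(entryOffset(headerLength, 3)); // 80
    }
}
```

The multiplication is done in long arithmetic so that large document numbers do not overflow an int.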
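
The shared-prefix rule for TermText can be sketched as follows. This is an illustrative helper under the assumption that the previous term's bytes, the PrefixLength, and the suffix bytes have already been read from the .tvf file; the class TermPrefixDecoder and the method decodeTerm are hypothetical names, not part of the Lucene API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class TermPrefixDecoder {
    /**
     * Rebuilds a term's bytes from the first prefixLength bytes of the
     * previous term followed by the stored suffix bytes.
     */
    static byte[] decodeTerm(byte[] previousTerm, int prefixLength, byte[] suffix) {
        if (prefixLength < 0 || prefixLength > previousTerm.length) {
            throw new IllegalArgumentException("prefixLength out of range: " + prefixLength);
        }
        // Allocate the final length, seeded with bytes from the previous term;
        // everything from prefixLength onward is overwritten by the suffix.
        byte[] term = Arrays.copyOf(previousTerm, prefixLength + suffix.length);
        System.arraycopy(suffix, 0, term, prefixLength, suffix.length);
        return term;
    }

    public static void main(String[] args) {
        byte[] previous = "bone".getBytes(StandardCharsets.UTF_8);
        byte[] suffix = "y".getBytes(StandardCharsets.UTF_8);
        byte[] term = decodeTerm(previous, 2, suffix);
        System.out.println(new String(term, StandardCharsets.UTF_8)); // boy
    }
}
```

A PrefixLength of zero simply yields the suffix on its own, which is the case for the first term of a field.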
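
The PositionDelta and PayloadLength rules can be illustrated with a small decoder. This sketch assumes the VInt values for one term have already been parsed into an int array; the class PositionDeltaDecoder, its methods, and the choice to start the payload length at zero before any PayloadLength has been seen are assumptions made for illustration, not the reader's actual implementation.

```java
final class PositionDeltaDecoder {
    /** Payloads disabled: each value is a plain delta from the previous position. */
    static int[] decodeWithoutPayloads(int[] deltas) {
        int[] positions = new int[deltas.length];
        int position = 0; // previous position is zero for the first occurrence
        for (int i = 0; i < deltas.length; i++) {
            position += deltas[i];
            positions[i] = position;
        }
        return positions;
    }

    /**
     * Payloads enabled: PositionDelta/2 is the delta, and an odd PositionDelta
     * means a PayloadLength value follows. Each returned row holds
     * {position, payloadLength} for one occurrence.
     */
    static int[][] decodeWithPayloads(int[] values, int termFreq) {
        int[][] occurrences = new int[termFreq][];
        int position = 0;
        int payloadLength = 0; // assumed initial value before any length is stored
        int next = 0;
        for (int n = 0; n < termFreq; n++) {
            int code = values[next++];
            position += code >>> 1;           // PositionDelta / 2
            if ((code & 1) != 0) {            // odd: a new PayloadLength is stored
                payloadLength = values[next++];
            }                                 // even: reuse the previous length
            occurrences[n] = new int[] {position, payloadLength};
        }
        return occurrences;
    }

    public static void main(String[] args) {
        // Positions 3, 7, 8 with payload lengths 4, 4, 2.
        int[] values = {7, 4, 8, 3, 2};
        for (int[] occ : decodeWithPayloads(values, 3)) {
            System.out.println("position=" + occ[0] + " payloadLength=" + occ[1]);
        }
    }
}
```

In the sample input, the second occurrence is encoded as the even value 8, so it advances the position by 4 and reuses the payload length of 4 from the first occurrence.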