Class DefaultSimilarity
 java.lang.Object

 org.apache.lucene.search.similarities.Similarity

 org.apache.lucene.search.similarities.TFIDFSimilarity

 org.apache.lucene.search.similarities.DefaultSimilarity

 Direct Known Subclasses:
SweetSpotSimilarity
public class DefaultSimilarity extends TFIDFSimilarity
Expert: Default scoring implementation which encodes norm values as a single byte before being stored. At search time, the norm byte value is read from the index directory and decoded back to a float norm value. This encoding/decoding, while reducing index size, comes at the price of precision loss: it is not guaranteed that decode(encode(x)) = x. For instance, decode(encode(0.89)) = 0.75.
Compression of norm values to a single byte saves memory at search time, because once a field is referenced at search time, its norms (for all documents) are maintained in memory.
The rationale supporting such lossy compression of norm values is that, given the difficulty (and inaccuracy) with which users express their true information need in a query, only big differences matter.
Last, note that search time is too late to modify this norm part of scoring, e.g. by using a different Similarity for search, because the norm values are computed and stored at index time.


Nested Class Summary

Nested classes/interfaces inherited from class org.apache.lucene.search.similarities.Similarity
Similarity.SimScorer, Similarity.SimWeight


Constructor Summary
Constructor Description
DefaultSimilarity()
Sole constructor: parameter-free

Method Summary
Modifier and Type, Method, Description

float coord(int overlap, int maxOverlap)
Implemented as overlap / maxOverlap.

float decodeNormValue(long norm)
Decodes the norm value, assuming it is a single byte.

long encodeNormValue(float f)
Encodes a normalization factor for storage in an index.

boolean getDiscountOverlaps()
Returns true if overlap tokens are discounted from the document's length.

float idf(long docFreq, long numDocs)
Implemented as log(numDocs/(docFreq+1)) + 1.

float lengthNorm(FieldInvertState state)
Implemented as state.getBoost()*lengthNorm(numTerms), where numTerms is FieldInvertState.getLength() if setDiscountOverlaps(boolean) is false, else it's FieldInvertState.getLength() - FieldInvertState.getNumOverlap().

float queryNorm(float sumOfSquaredWeights)
Implemented as 1/sqrt(sumOfSquaredWeights).

float scorePayload(int doc, int start, int end, BytesRef payload)
The default implementation returns 1.

void setDiscountOverlaps(boolean v)
Determines whether overlap tokens (tokens with a position increment of 0) are ignored when computing the norm.

float sloppyFreq(int distance)
Implemented as 1 / (distance + 1).

float tf(float freq)
Implemented as sqrt(freq).

String toString()

Methods inherited from class org.apache.lucene.search.similarities.TFIDFSimilarity
computeNorm, computeWeight, idfExplain, idfExplain, simScorer




Method Detail

coord
public float coord(int overlap, int maxOverlap)
Implemented as overlap / maxOverlap.
Specified by:
coord in class TFIDFSimilarity
Parameters:
overlap - the number of query terms matched in the document
maxOverlap - the total number of terms in the query
Returns:
a score factor based on term overlap with the query
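As a rough illustration of the coord formula, the following standalone sketch computes the same ratio in plain Java. The names CoordSketch and coord are hypothetical and the code does not call Lucene; the cast to float is an assumption made here so that the division is not truncated as integer division.

```java
final class CoordSketch {
    // Fraction of the query's terms that the document matched.
    // The float cast avoids integer division (an assumption of this sketch).
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    public static void main(String[] args) {
        // A document matching 2 of 4 query terms gets half the factor
        // of a document matching all 4.
        System.out.println(coord(2, 4));
        System.out.println(coord(4, 4));
    }
}
```

A document that matches more of the query's terms receives a larger factor, up to 1 when every term matches.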

queryNorm
public float queryNorm(float sumOfSquaredWeights)
Implemented as 1/sqrt(sumOfSquaredWeights).
Specified by:
queryNorm in class TFIDFSimilarity
Parameters:
sumOfSquaredWeights - the sum of the squares of query term weights
Returns:
a normalization factor for query weights
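The queryNorm formula can be sketched in isolation as follows. This is a minimal plain-Java illustration, not Lucene code; the class name QueryNormSketch is hypothetical, and computing in double before narrowing to float is an assumption of the sketch.

```java
final class QueryNormSketch {
    // 1 / sqrt(sumOfSquaredWeights), computed in double and narrowed to float.
    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    public static void main(String[] args) {
        // Two query terms with (hypothetical) weights 3 and 4:
        // the sum of squares is 9 + 16 = 25.
        float sumOfSquaredWeights = 3f * 3f + 4f * 4f;
        System.out.println(queryNorm(sumOfSquaredWeights));
    }
}
```

The factor shrinks as the query weights grow, so the normalized weights of different queries end up on a similar scale.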

encodeNormValue
public final long encodeNormValue(float f)
Encodes a normalization factor for storage in an index.
The encoding uses a three-bit mantissa, a five-bit exponent, and the zero-exponent point at 15, thus representing values from around 7x10^9 to 2x10^-9 with about one significant decimal digit of accuracy. Zero is also represented. Negative numbers are rounded up to zero. Values too large to represent are rounded down to the largest representable value. Positive values too small to represent are rounded up to the smallest positive representable value.
Specified by:
encodeNormValue in class TFIDFSimilarity
See Also:
Field.setBoost(float), SmallFloat
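To make the rounding rules for encodeNormValue concrete, here is a minimal sketch of a one-byte float encoding of this general kind. It is an assumption-laden illustration written for this page, not Lucene's source: the names NormByteSketch, encode and decode are hypothetical, the bit manipulation is one plausible way to get the clamping behavior described, and exact decoded values may differ from those of the library.

```java
final class NormByteSketch {
    // Shift applied to the raw float bits so that the low exponent and
    // high mantissa bits land in one byte (assumption of this sketch).
    private static final int SHIFT = 24 - 3;
    // Offset that moves the float exponent range to a zero-exponent point of 15.
    private static final int FZERO = (63 - 15) << 3;

    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> SHIFT;
        if (small <= FZERO) {
            // Zero and negative inputs become 0; positive values that are
            // too small become the smallest positive code.
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= FZERO + 0x100) {
            // Too large: clamp to the largest code (0xFF).
            return (byte) -1;
        }
        return (byte) (small - FZERO);
    }

    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << SHIFT;
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        // Round-tripping generally loses precision.
        float x = 0.89f;
        System.out.println(x + " -> " + decode(encode(x)));
    }
}
```

Because the low-order mantissa bits are simply dropped, a round trip through encode and decode never increases a positive in-range value; it can only keep it or reduce it.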

decodeNormValue
public final float decodeNormValue(long norm)
Decodes the norm value, assuming it is a single byte.
Specified by:
decodeNormValue in class TFIDFSimilarity
See Also:
encodeNormValue(float)

lengthNorm
public float lengthNorm(FieldInvertState state)
Implemented as state.getBoost()*lengthNorm(numTerms), where numTerms is FieldInvertState.getLength() if setDiscountOverlaps(boolean) is false, else it's FieldInvertState.getLength() - FieldInvertState.getNumOverlap().
Specified by:
lengthNorm in class TFIDFSimilarity
Parameters:
state - statistics of the current field (such as length, boost, etc)
Returns:
an index-time normalization value
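The choice of numTerms for lengthNorm can be sketched as follows. This is a standalone illustration, not Lucene code: the class LengthNormSketch and its parameters are hypothetical stand-ins for the values read from FieldInvertState, and the inner length normalization is assumed here to be 1/sqrt(numTerms), which this page does not spell out.

```java
final class LengthNormSketch {
    // boost, length and numOverlap stand in for the values a
    // FieldInvertState would provide (hypothetical parameters).
    static float lengthNorm(float boost, int length, int numOverlap,
                            boolean discountOverlaps) {
        // Overlap tokens are subtracted only when discounting is enabled.
        int numTerms = discountOverlaps ? length - numOverlap : length;
        // Assumed inner normalization: 1 / sqrt(numTerms).
        return boost * (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        // A field with 5 tokens, one of which is an overlap token.
        System.out.println(lengthNorm(1f, 5, 1, true));
        System.out.println(lengthNorm(1f, 5, 1, false));
    }
}
```

With discounting enabled, the field is treated as shorter and so receives a slightly larger norm than with discounting disabled.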

tf
public float tf(float freq)
Implemented as sqrt(freq).
Specified by:
tf in class TFIDFSimilarity
Parameters:
freq - the frequency of a term within a document
Returns:
a score factor based on a term's within-document frequency
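A small plain-Java sketch of the tf formula shows how the square root dampens repeated occurrences of a term. The class name TfSketch is hypothetical and the code does not depend on Lucene.

```java
final class TfSketch {
    // Square root of the within-document term frequency.
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    public static void main(String[] args) {
        // Quadrupling the frequency only doubles the factor.
        System.out.println(tf(1f));
        System.out.println(tf(4f));
        System.out.println(tf(16f));
    }
}
```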

sloppyFreq
public float sloppyFreq(int distance)
Implemented as 1 / (distance + 1).
Specified by:
sloppyFreq in class TFIDFSimilarity
Parameters:
distance - the edit distance of this sloppy phrase match
Returns:
the frequency increment for this match
See Also:
PhraseQuery.setSlop(int)
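The sloppyFreq formula can be tried out with the standalone sketch below. The class name SloppyFreqSketch is hypothetical; the float literal is an assumption made so that the division is done in floating point.

```java
final class SloppyFreqSketch {
    // An exact phrase match (distance 0) contributes 1; looser matches
    // contribute progressively less.
    static float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }

    public static void main(String[] args) {
        for (int distance = 0; distance <= 3; distance++) {
            System.out.println(distance + " -> " + sloppyFreq(distance));
        }
    }
}
```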

scorePayload
public float scorePayload(int doc, int start, int end, BytesRef payload)
The default implementation returns 1.
Specified by:
scorePayload in class TFIDFSimilarity
Parameters:
doc - The docId currently being scored.
start - The start position of the payload
end - The end position of the payload
payload - The payload byte array to be scored
Returns:
An implementation-dependent float to be used as a scoring factor

idf
public float idf(long docFreq, long numDocs)
Implemented as log(numDocs/(docFreq+1)) + 1.
Specified by:
idf in class TFIDFSimilarity
Parameters:
docFreq - the number of documents which contain the term
numDocs - the total number of documents in the collection
Returns:
a score factor based on the term's document frequency
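The idf formula can be sketched in plain Java as below. This is an illustration under stated assumptions rather than Lucene's code: the class name IdfSketch is hypothetical, log is taken to be the natural logarithm, and the division is assumed to be floating-point.

```java
final class IdfSketch {
    // log(numDocs / (docFreq + 1)) + 1, using the natural logarithm and a
    // floating-point division (both assumptions of this sketch).
    static float idf(long docFreq, long numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        long numDocs = 1000;
        // A rare term scores higher than a common one.
        System.out.println(idf(1, numDocs));
        System.out.println(idf(500, numDocs));
    }
}
```

The +1 inside the logarithm's denominator keeps the expression defined when docFreq is 0.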

setDiscountOverlaps
public void setDiscountOverlaps(boolean v)
Determines whether overlap tokens (tokens with a position increment of 0) are ignored when computing the norm. By default this is true, meaning overlap tokens do not count when computing norms.

getDiscountOverlaps
public boolean getDiscountOverlaps()
Returns true if overlap tokens are discounted from the document's length.
See Also:
setDiscountOverlaps(boolean)

