Package org.apache.tika.detect
Class TextStatistics
- java.lang.Object
  - org.apache.tika.detect.TextStatistics
-
public class TextStatistics extends java.lang.Object
Utility class for computing a histogram of the bytes seen in a stream.
- Since:
- Apache Tika 1.2
-
-
Constructor Summary
Constructor          Description
TextStatistics()
-
Method Summary
Modifier and Type, Method, Description
void addData(byte[] buffer, int offset, int length)
int count()
    Returns the total number of bytes seen so far.
int count(int b)
    Returns the number of occurrences of the given byte.
int countControl()
    Counts control characters (i.e. < 0x20, excluding tab, CR, LF, page feed and escape).
int countEightBit()
    Counts eight bit characters, i.e. bytes with their highest bit set.
int countSafeAscii()
    Counts "safe" (i.e. seven-bit non-control) ASCII characters.
boolean isMostlyAscii()
    Checks whether at least one byte was seen and that the bytes that were seen were mostly plain text (i.e. < 2% control, > 90% ASCII range).
boolean looksLikeUTF8()
    Checks whether the observed byte stream looks like UTF-8 encoded text.
-
-
-
Method Detail
-
addData
public void addData(byte[] buffer, int offset, int length)
-
isMostlyAscii
public boolean isMostlyAscii()
Checks whether at least one byte was seen and that the bytes that were seen were mostly plain text (i.e. < 2% control, > 90% ASCII range).
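The two thresholds above can be written as a pair of integer comparisons. The following is a minimal sketch, not the Tika source: the class name `MostlyAsciiSketch` and the three-argument signature are hypothetical, and which bytes the caller counts as "control" and "ASCII range" is left as an assumption.

```java
// Hypothetical illustration of the documented thresholds; not the Tika implementation.
final class MostlyAsciiSketch {
    /**
     * total:   all bytes seen
     * control: bytes the caller classified as control characters
     * ascii:   bytes the caller classified as being in the ASCII range
     */
    static boolean isMostlyAscii(int total, int control, int ascii) {
        // Cross-multiplying in long arithmetic avoids both floating-point
        // rounding at the thresholds and int overflow for large counts.
        return total > 0
                && control * 100L < total * 2L   // strictly less than 2% control
                && ascii * 100L > total * 90L;   // strictly more than 90% ASCII
    }

    public static void main(String[] args) {
        System.out.println(isMostlyAscii(100, 1, 95)); // true
        System.out.println(isMostlyAscii(0, 0, 0));    // false: no bytes were seen
    }
}
```

Both comparisons are strict, so a stream with exactly 2% control bytes or exactly 90% ASCII bytes is rejected under this reading of the description.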
-
looksLikeUTF8
public boolean looksLikeUTF8()
Checks whether the observed byte stream looks like UTF-8 encoded text.
- Returns:
- true if the seen bytes look like UTF-8, false otherwise
- Since:
- Apache Tika 1.3
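Because the class only keeps a histogram, any UTF-8 check it performs has to work from byte counts rather than byte order. One plausible heuristic of that kind is sketched below; it is an assumption for illustration and is not claimed to match the checks Tika actually performs. The class `Utf8HistogramSketch` and its helper methods are hypothetical names.

```java
// Hypothetical, order-blind UTF-8 plausibility check built on a byte histogram.
final class Utf8HistogramSketch {
    /** Builds a 256-bucket histogram of unsigned byte values. */
    static int[] histogram(byte[] data) {
        int[] counts = new int[256];
        for (byte b : data) {
            counts[b & 0xff]++; // mask because Java bytes are signed
        }
        return counts;
    }

    /** Sums the buckets in the half-open range [from, to). */
    static int range(int[] counts, int from, int to) {
        int n = 0;
        for (int i = from; i < to; i++) {
            n += counts[i];
        }
        return n;
    }

    static boolean looksLikeUtf8(int[] counts) {
        int ascii = range(counts, 0x00, 0x80);
        int continuation = range(counts, 0x80, 0xC0); // 10xxxxxx
        int lead2 = range(counts, 0xC0, 0xE0);        // 110xxxxx, one continuation expected
        int lead3 = range(counts, 0xE0, 0xF0);        // 1110xxxx, two expected
        int lead4 = range(counts, 0xF0, 0xF8);        // 11110xxx, three expected
        int invalid = range(counts, 0xF8, 0x100);     // 0xF8..0xFF never occur in UTF-8
        int expected = lead2 + 2 * lead3 + 3 * lead4;
        return ascii + lead2 + lead3 + lead4 > 0      // at least one character start
                && invalid == 0
                && continuation == expected;          // exact match; truncated input fails
    }

    public static void main(String[] args) {
        byte[] utf8 = {'h', (byte) 0xC3, (byte) 0xA9, 'l', 'l', 'o'}; // "héllo" in UTF-8
        byte[] latin1 = {'h', (byte) 0xE9, 'l', 'l', 'o'};            // "héllo" in ISO-8859-1
        System.out.println(looksLikeUtf8(histogram(utf8)));   // true
        System.out.println(looksLikeUtf8(histogram(latin1))); // false
    }
}
```

A count-based check cannot see ordering, so it accepts some byte sequences that a strict decoder would reject, and this sketch also ignores overlong lead bytes such as 0xC0 and 0xC1. That trade-off is what makes it cheap enough to run on a running histogram.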
-
count
public int count()
Returns the total number of bytes seen so far.
- Returns:
- count of all bytes
-
count
public int count(int b)
Returns the number of occurrences of the given byte.
- Parameters:
- b - byte
- Returns:
- count of the given byte
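To make the relationship between addData, count() and count(int b) concrete, here is a minimal self-contained sketch of a byte histogram with the same three method shapes. `ByteHistogram` is a hypothetical stand-in written for illustration, not the Tika class, and its handling of out-of-range `b` values (masking to the low eight bits) is an assumption.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for illustration; not the Tika source.
final class ByteHistogram {
    private final int[] counts = new int[256]; // one bucket per unsigned byte value
    private int total = 0;

    /** Adds length bytes of buffer, starting at offset, to the histogram. */
    void addData(byte[] buffer, int offset, int length) {
        for (int i = 0; i < length; i++) {
            counts[buffer[offset + i] & 0xff]++; // mask because Java bytes are signed
            total++;
        }
    }

    /** Total number of bytes seen so far. */
    int count() {
        return total;
    }

    /** Occurrences of one byte value; the low eight bits of b select the bucket. */
    int count(int b) {
        return counts[b & 0xff];
    }

    public static void main(String[] args) {
        ByteHistogram h = new ByteHistogram();
        byte[] data = "banana".getBytes(StandardCharsets.US_ASCII);
        h.addData(data, 0, data.length);
        System.out.println(h.count() + " bytes, " + h.count('a') + " x 'a'"); // 6 bytes, 3 x 'a'
    }
}
```

Since the histogram is cumulative, addData can be called once per buffer while reading a stream, and the counts reflect everything seen so far.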
-
countControl
public int countControl()
Counts control characters (i.e. < 0x20, excluding tab, CR, LF, page feed and escape).
This definition of control characters is based on section 4 of the "Content-Type Processing Model" Internet-draft (draft-abarth-mime-sniff-01).
+-------------------------+
| Binary data byte ranges |
+-------------------------+
| 0x00 -- 0x08            |
| 0x0B                    |
| 0x0E -- 0x1A            |
| 0x1C -- 0x1F            |
+-------------------------+
- Returns:
- count of control characters
- See Also:
- TIKA-154
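The byte ranges listed for countControl() translate directly into a range test. The sketch below is one way to write it, assuming only the ranges given in the table; `ControlBytes` and its methods are hypothetical names, and the code counts over a raw byte array rather than a histogram.

```java
// Hypothetical classifier for the "binary data byte ranges" listed above.
final class ControlBytes {
    /** True for 0x00-0x08, 0x0B, 0x0E-0x1A and 0x1C-0x1F. */
    static boolean isControl(int b) {
        int v = b & 0xff; // treat the value as an unsigned byte
        return v <= 0x08
                || v == 0x0B
                || (v >= 0x0E && v <= 0x1A)
                || (v >= 0x1C && v <= 0x1F);
    }

    /** Counts the bytes in data that fall in the control ranges. */
    static int countControl(byte[] data) {
        int n = 0;
        for (byte b : data) {
            if (isControl(b)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        // NUL, tab, LF, vertical tab, escape, unit separator, 'A'
        byte[] sample = {0x00, 0x09, 0x0A, 0x0B, 0x1B, 0x1F, 'A'};
        System.out.println(countControl(sample)); // 3 (NUL, vertical tab, unit separator)
    }
}
```

The ranges cover 27 of the 32 values below 0x20; the five gaps are tab (0x09), LF (0x0A), page feed (0x0C), CR (0x0D) and escape (0x1B), matching the exclusions in the description.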
-
countSafeAscii
public int countSafeAscii()
Counts "safe" (i.e. seven-bit non-control) ASCII characters.
- Returns:
- count of safe ASCII characters
- See Also:
countControl()
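Read together with countControl(), "seven-bit non-control" suggests that the safe set is the printable range plus the five whitespace and escape bytes that the control definition excludes. The sketch below encodes that reading. It is an assumption, including the choice to treat DEL (0x7F) as safe because it is not below 0x20; `SafeAscii` is a hypothetical name.

```java
// Hypothetical reading of "safe ASCII": seven-bit bytes that countControl() would not count.
final class SafeAscii {
    static boolean isSafeAscii(int b) {
        int v = b & 0xff; // treat the value as an unsigned byte
        if (v >= 0x20 && v < 0x80) {
            return true; // space through DEL (DEL inclusion is an assumption)
        }
        switch (v) {
            case 0x09: // tab
            case 0x0A: // LF
            case 0x0C: // page feed
            case 0x0D: // CR
            case 0x1B: // escape
                return true;
            default:
                return false;
        }
    }

    /** Counts the bytes in data that are safe ASCII under the reading above. */
    static int countSafeAscii(byte[] data) {
        int n = 0;
        for (byte b : data) {
            if (isSafeAscii(b)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        byte[] sample = {'o', 'k', '\n', 0x00, (byte) 0xE9};
        System.out.println(countSafeAscii(sample)); // 3 ('o', 'k', LF)
    }
}
```

Under this reading every seven-bit byte is either safe or control, so the two counts together account for all bytes without the high bit set.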
-
countEightBit
public int countEightBit()
Counts eight bit characters, i.e. bytes with their highest bit set.
- Returns:
- count of eight bit characters
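Testing "highest bit set" in Java needs a little care because byte is a signed type, so values 0x80 to 0xFF appear as negative numbers. A minimal sketch of the test follows; `EightBit` is a hypothetical name and the code counts over a raw array rather than a histogram.

```java
// Hypothetical illustration of the "highest bit set" test on signed Java bytes.
final class EightBit {
    static int countEightBit(byte[] data) {
        int n = 0;
        for (byte b : data) {
            // 0x80 masks the top bit; for a Java byte this is equivalent to b < 0.
            if ((b & 0x80) != 0) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        byte[] sample = {'a', (byte) 0xC3, (byte) 0xA9}; // 'a' followed by UTF-8 "é"
        System.out.println(countEightBit(sample)); // 2
    }
}
```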
-
-