Class TextDetector

  • All Implemented Interfaces:
    java.io.Serializable, Detector

    public class TextDetector
    extends java.lang.Object
    implements Detector
    Detects plain text documents by content. The detector reads the beginning of the document input stream and treats the document as text if it finds no ASCII control bytes (as used in ASCII-compatible encodings such as ISO-Latin-1 and UTF-8). As a special case, a small share of control bytes (up to 2% of all characters) is still accepted, provided that few or no characters (less than 10%) lie above the 7-bit ASCII range.

    Note that text documents with a character encoding like UTF-16 are better detected with MagicDetector and an appropriate magic byte pattern.
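
    The rule described above can be approximated with the standard library alone. The following is a minimal sketch, not the Tika implementation: the class name TextHeuristicSketch, the 512-byte prefix size, the choice to exempt tab, line feed, and carriage return from the control-byte count, and the treatment of an empty stream as non-text are all assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class TextHeuristicSketch {

    // Hypothetical prefix size; this page does not state the real default.
    static final int PREFIX_BYTES = 512;

    // Assumption: tab, LF and CR are ordinary text, all other bytes
    // below 0x20 plus DEL (0x7F) count as control bytes.
    static boolean isControl(int b) {
        return (b < 0x20 && b != '\t' && b != '\n' && b != '\r') || b == 0x7F;
    }

    /** Returns true if the first PREFIX_BYTES of the stream look like plain text. */
    static boolean looksLikeText(InputStream in) throws IOException {
        byte[] buf = new byte[PREFIX_BYTES];
        int total = 0;
        // Read until the prefix buffer is full or the stream ends.
        while (total < buf.length) {
            int n = in.read(buf, total, buf.length - total);
            if (n < 0) {
                break;
            }
            total += n;
        }
        if (total == 0) {
            return false; // assumption: an empty prefix is not treated as text
        }

        int control = 0;
        int high = 0;
        for (int i = 0; i < total; i++) {
            int b = buf[i] & 0xFF;
            if (b >= 0x80) {
                high++;        // above the 7-bit ASCII range
            } else if (isControl(b)) {
                control++;
            }
        }

        // Main rule: no control bytes at all means text.
        if (control == 0) {
            return true;
        }
        // Special case: up to 2% control bytes, and less than 10% high bytes.
        return control * 100 <= total * 2 && high * 100 < total * 10;
    }

    public static void main(String[] args) throws IOException {
        byte[] text = "Hello, world!\n".getBytes(StandardCharsets.US_ASCII);
        byte[] binary = {0x00, 0x01, 0x02, 0x03};
        System.out.println(looksLikeText(new ByteArrayInputStream(text)));
        System.out.println(looksLikeText(new ByteArrayInputStream(binary)));
    }
}
```

    The percentage checks use integer arithmetic to avoid rounding surprises on short inputs.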

    Since:
    Apache Tika 0.3
    See Also:
    Serialized Form
    • Constructor Summary

      Constructors 
      Constructor Description
      TextDetector()
      Constructs a TextDetector which will look at the default number of bytes from the beginning of the document.
      TextDetector(int bytesToTest)
      Constructs a TextDetector which will look at a given number of bytes from the beginning of the document.
    • Method Summary

      Modifier and Type Method Description
      MediaType detect(java.io.InputStream input, Metadata metadata)
      Looks at the beginning of the document input stream to determine whether the document is text or not.
      • Methods inherited from class java.lang.Object

        equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • TextDetector

        public TextDetector()
        Constructs a TextDetector which will look at the default number of bytes from the beginning of the document.
      • TextDetector

        public TextDetector(int bytesToTest)
        Constructs a TextDetector which will look at a given number of bytes from the beginning of the document.
    • Method Detail

      • detect

        public MediaType detect(java.io.InputStream input,
                                Metadata metadata)
                         throws java.io.IOException
        Looks at the beginning of the document input stream to determine whether the document is text or not.
        Specified by:
        detect in interface Detector
        Parameters:
        input - document input stream, or null
        metadata - ignored
        Returns:
        "text/plain" if the input stream suggests a text document, "application/octet-stream" otherwise
        Throws:
        java.io.IOException - if the document input stream could not be read
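
        A short usage sketch of both constructors and detect follows. It assumes the Apache Tika core library is on the classpath and that the classes live in the packages org.apache.tika.detect, org.apache.tika.metadata, and org.apache.tika.mime, which this page does not state. The class name TextDetectorExample, the helper detectBytes, and the prefix size of 64 are hypothetical and chosen only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.detect.TextDetector;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;

public class TextDetectorExample {

    // Hypothetical helper: wraps the bytes in an in-memory stream,
    // which supports mark/reset, and runs the detector on it.
    static MediaType detectBytes(TextDetector detector, byte[] data) throws IOException {
        try (InputStream in = new ByteArrayInputStream(data)) {
            // The metadata argument is documented as ignored, so an empty instance is passed.
            return detector.detect(in, new Metadata());
        }
    }

    public static void main(String[] args) throws IOException {
        // Looks at the default number of bytes.
        TextDetector defaultDetector = new TextDetector();
        // Looks at the first 64 bytes only (arbitrary value for this example).
        TextDetector shortDetector = new TextDetector(64);

        byte[] text = "Plain text sample\n".getBytes(StandardCharsets.US_ASCII);
        byte[] binary = {0x00, 0x01, 0x02, 0x03};

        System.out.println(detectBytes(defaultDetector, text));
        System.out.println(detectBytes(shortDetector, binary));
    }
}
```

        An in-memory stream is used here because the detector only inspects the beginning of the input; for file or network input, a buffered stream that supports mark and reset is the safer choice so that later consumers can still read from the start.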