Class DefaultHtmlMapper

  • All Implemented Interfaces:
    HtmlMapper

    public class DefaultHtmlMapper
    extends java.lang.Object
    implements HtmlMapper
    The default HTML mapping rules in Tika.
    Since:
    Apache Tika 0.6
    • Method Summary

      Modifier and Type Method Description
      boolean isDiscardElement(java.lang.String name)
      Checks whether all content within the given HTML element should be discarded instead of including it in the parse output.
      java.lang.String mapSafeAttribute(java.lang.String elementName, java.lang.String attributeName)
      Normalizes an attribute name.
      java.lang.String mapSafeElement(java.lang.String name)
      Maps "safe" HTML element names to semantic XHTML equivalents.
      • Methods inherited from class java.lang.Object

        equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • INSTANCE

        public static final HtmlMapper INSTANCE
        Since:
        Apache Tika 0.8
    • Constructor Detail

      • DefaultHtmlMapper

        public DefaultHtmlMapper()
    • Method Detail

      • mapSafeElement

        public java.lang.String mapSafeElement(java.lang.String name)
        Description copied from interface: HtmlMapper
        Maps "safe" HTML element names to semantic XHTML equivalents. If the given element is unknown or deemed unsafe for inclusion in the parse output, this method returns null. In that case the element itself is ignored, but the content inside it is still processed. See the HtmlMapper.isDiscardElement(String) method for a way to discard the entire contents of an element.
        Specified by:
        mapSafeElement in interface HtmlMapper
        Parameters:
        name - HTML element name (upper case)
        Returns:
        XHTML element name (lower case), or null if the element is unsafe
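        The contract described above can be sketched in isolation. The following is a minimal, self-contained illustration and not Tika's implementation: the class name ElementMappingSketch and the entries in its table are assumptions chosen for the example. It shows upper-case HTML names mapping to lower-case XHTML names, with null returned for any name that is not in the table.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in that illustrates the mapSafeElement contract.
// The table below is illustrative only; it is NOT Tika's actual mapping table.
class ElementMappingSketch {

    private static final Map<String, String> SAFE_ELEMENTS = new HashMap<>();

    static {
        SAFE_ELEMENTS.put("H1", "h1");
        SAFE_ELEMENTS.put("P", "p");
        SAFE_ELEMENTS.put("UL", "ul");
        // Assumed example of a legacy list element mapped to a semantic equivalent.
        SAFE_ELEMENTS.put("MENU", "ul");
    }

    /** Returns the lower-case XHTML name, or null if the element is unknown or unsafe. */
    static String mapSafeElement(String name) {
        return SAFE_ELEMENTS.get(name);
    }

    public static void main(String[] args) {
        System.out.println(mapSafeElement("H1"));    // prints h1
        System.out.println(mapSafeElement("BLINK")); // prints null
    }
}
```

        With Tika on the classpath, the same kind of lookup would go through DefaultHtmlMapper.INSTANCE.mapSafeElement(name). The real set of safe elements is defined by Tika and may differ from the entries assumed here.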
      • mapSafeAttribute

        public java.lang.String mapSafeAttribute(java.lang.String elementName,
                                                 java.lang.String attributeName)
        Normalizes an attribute name. Assumes that the element name is valid and normalized.
        Specified by:
        mapSafeAttribute in interface HtmlMapper
        Parameters:
        elementName - HTML element name (lower case)
        attributeName - HTML attribute name (lower case)
        Returns:
        XHTML attribute name (lower case), or null if the attribute is unsafe
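        One way to picture this method is as a per-element allow-list lookup. The sketch below is a self-contained illustration under that assumption: the class name AttributeMappingSketch and the allowed attributes listed for each element are hypothetical and are not taken from Tika. Both arguments are expected in lower case, as the parameter descriptions state.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in that illustrates the mapSafeAttribute contract.
// The allow-lists below are assumptions for the example, not Tika's actual rules.
class AttributeMappingSketch {

    private static final Map<String, Set<String>> SAFE_ATTRIBUTES = Map.of(
            "a", Set.of("href", "title"),
            "img", Set.of("src", "alt"));

    /** Returns the attribute name if it is allowed on the element, or null otherwise. */
    static String mapSafeAttribute(String elementName, String attributeName) {
        Set<String> safe = SAFE_ATTRIBUTES.get(elementName);
        return (safe != null && safe.contains(attributeName)) ? attributeName : null;
    }

    public static void main(String[] args) {
        System.out.println(mapSafeAttribute("a", "href"));    // prints href
        System.out.println(mapSafeAttribute("a", "onclick")); // prints null
    }
}
```

        A caller that receives null would simply leave the attribute out of the output element.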
      • isDiscardElement

        public boolean isDiscardElement(java.lang.String name)
        Description copied from interface: HtmlMapper
        Checks whether all content within the given HTML element should be discarded instead of including it in the parse output.
        Specified by:
        isDiscardElement in interface HtmlMapper
        Parameters:
        name - HTML element name (upper case)
        Returns:
        true if content inside the named element should be ignored, false otherwise
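        The difference between an ignored element (content kept) and a discarded element (content dropped) can be shown with a toy token stream. This is a rough sketch under stated assumptions, not how Tika parses HTML: the class DiscardSketch, the helper keptText, the flat token format, and the contents of the discard set are all hypothetical. Text tokens are assumed not to start with "<".

```java
import java.util.List;
import java.util.Set;

// Hypothetical stand-in that illustrates the isDiscardElement contract.
class DiscardSketch {

    // Assumed discard set for the example (upper-case element names).
    private static final Set<String> DISCARD = Set.of("STYLE", "SCRIPT");

    static boolean isDiscardElement(String name) {
        return DISCARD.contains(name);
    }

    /**
     * Collects text from a flat token stream, skipping everything nested
     * inside a discard element. Tokens are "<NAME>", "</NAME>", or plain text.
     */
    static String keptText(List<String> tokens) {
        StringBuilder out = new StringBuilder();
        int discardDepth = 0;
        for (String token : tokens) {
            if (token.startsWith("</")) {
                String name = token.substring(2, token.length() - 1);
                if (isDiscardElement(name) && discardDepth > 0) {
                    discardDepth--;
                }
            } else if (token.startsWith("<")) {
                String name = token.substring(1, token.length() - 1);
                if (isDiscardElement(name)) {
                    discardDepth++;
                }
            } else if (discardDepth == 0) {
                // Text outside any discard element is kept, even if the
                // surrounding tag itself is unknown.
                out.append(token);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> tokens = List.of(
                "<P>", "Hello", "</P>",
                "<SCRIPT>", "alert(1)", "</SCRIPT>",
                "<BLINK>", " world", "</BLINK>");
        System.out.println(keptText(tokens)); // prints Hello world
    }
}
```

        In this sketch the text inside SCRIPT is dropped entirely, while the text inside the unknown BLINK element survives even though the tag itself would not appear in the output.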