java.lang.Object
- org.apache.lucene.analysis.standard.UAX29URLEmailTokenizerImpl

All Implemented Interfaces:

StandardTokenizerInterface
```
public final class UAX29URLEmailTokenizerImpl
extends java.lang.Object
implements StandardTokenizerInterface
```
This class implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs.
Tokens produced are of the following types:
- <ALPHANUM>: A sequence of alphabetic and numeric characters
- <NUM>: A number
- <URL>: A URL
- <EMAIL>: An email address
- <SOUTHEAST_ASIAN>: A sequence of characters from South and Southeast Asian languages, including Thai, Lao, Myanmar, and Khmer
- <IDEOGRAPHIC>: A single CJKV ideographic character
- <HIRAGANA>: A single hiragana character
- <KATAKANA>: A sequence of katakana characters
- <HANGUL>: A sequence of Hangul characters
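
Some token types always hold exactly one character, while others span a run of characters, which matters when downstream code decides whether to combine adjacent tokens. The sketch below records that distinction as data. `TokenLabel` is a hypothetical helper and not part of Lucene. It mirrors only the labels and descriptions in the list above, and it deliberately assigns no numeric values, because the values of the `*_TYPE` constants are not given here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: one entry per label listed above.
enum TokenLabel {
    ALPHANUM(false),
    NUM(false),
    URL(false),
    EMAIL(false),
    SOUTHEAST_ASIAN(false),
    IDEOGRAPHIC(true),   // described as "a single CJKV ideographic character"
    HIRAGANA(true),      // described as "a single hiragana character"
    KATAKANA(false),
    HANGUL(false);

    private final boolean singleCharacter;

    TokenLabel(boolean singleCharacter) {
        this.singleCharacter = singleCharacter;
    }

    // Renders the label in the angle-bracket form used in the list.
    String label() {
        return "<" + name() + ">";
    }

    boolean isSingleCharacter() {
        return singleCharacter;
    }

    // Collects the labels whose tokens always hold exactly one character.
    static List<String> singleCharacterLabels() {
        List<String> out = new ArrayList<>();
        for (TokenLabel t : values()) {
            if (t.isSingleCharacter()) {
                out.add(t.label());
            }
        }
        return out;
    }
}
```

Calling `TokenLabel.singleCharacterLabels()` yields the `<IDEOGRAPHIC>` and `<HIRAGANA>` labels, in that order.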

Field Summary

Fields
Modifier and Type	Field	Description
`static int`	`AVOID_BAD_URL`
`static int`	`EMAIL_TYPE`
`static int`	`HANGUL_TYPE`
`static int`	`HIRAGANA_TYPE`
`static int`	`IDEOGRAPHIC_TYPE`
`static int`	`KATAKANA_TYPE`
`static int`	`NUMERIC_TYPE`	Numbers
`static int`	`SOUTH_EAST_ASIAN_TYPE`	Chars in class \p{Line_Break = Complex_Context} are from South East Asian scripts (Thai, Lao, Myanmar, Khmer, etc.).
`static int`	`URL_TYPE`
`static int`	`WORD_TYPE`	Alphanumeric sequences
`static int`	`YYEOF`	This character denotes the end of file
`static int`	`YYINITIAL`	lexical states

Constructor Summary

Constructors
Constructor	Description
`UAX29URLEmailTokenizerImpl(java.io.Reader in)`	Creates a new scanner

Method Summary

Modifier and Type	Method	Description
`int`	`getNextToken()`	Resumes scanning until the next regular expression is matched, the end of input is encountered, or an I/O error occurs.
`void`	`getText(CharTermAttribute t)`	Fills CharTermAttribute with the current token text.
`void`	`yybegin(int newState)`	Enters a new lexical state
`int`	`yychar()`	Returns the current position.
`char`	`yycharat(int pos)`	Returns the character at position `pos` from the matched text.
`void`	`yyclose()`	Closes the input stream.
`int`	`yylength()`	Returns the length of the matched text region.
`void`	`yypushback(int number)`	Pushes the specified amount of characters back into the input stream.
`void`	`yyreset(java.io.Reader reader)`	Resets the scanner to read from a new input stream.
`int`	`yystate()`	Returns the current lexical state.
`java.lang.String`	`yytext()`	Returns the text matched by the current regular expression.

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - YYEOF
```
public static final int YYEOF
```
    This character denotes the end of file
    
    See Also:
    
    Constant Field Values
  - YYINITIAL
```
public static final int YYINITIAL
```
    lexical states
    
    See Also:
    
    Constant Field Values
  - AVOID_BAD_URL
```
public static final int AVOID_BAD_URL
```
    See Also:
    
    Constant Field Values
  - WORD_TYPE
```
public static final int WORD_TYPE
```
    Alphanumeric sequences
    
    See Also:
    
    Constant Field Values
  - NUMERIC_TYPE
```
public static final int NUMERIC_TYPE
```
    Numbers
    
    See Also:
    
    Constant Field Values
  - SOUTH_EAST_ASIAN_TYPE
```
public static final int SOUTH_EAST_ASIAN_TYPE
```
    Chars in class \p{Line_Break = Complex_Context} are from South East Asian scripts (Thai, Lao, Myanmar, Khmer, etc.). Sequences of these are kept together as a single token rather than broken up, because the logic required to break them at word boundaries is too complex for UAX#29.
    See Unicode Line Breaking Algorithm: http://www.unicode.org/reports/tr14/#SA
    
    See Also:
    
    Constant Field Values
  - IDEOGRAPHIC_TYPE
```
public static final int IDEOGRAPHIC_TYPE
```
    See Also:
    
    Constant Field Values
  - HIRAGANA_TYPE
```
public static final int HIRAGANA_TYPE
```
    See Also:
    
    Constant Field Values
  - KATAKANA_TYPE
```
public static final int KATAKANA_TYPE
```
    See Also:
    
    Constant Field Values
  - HANGUL_TYPE
```
public static final int HANGUL_TYPE
```
    See Also:
    
    Constant Field Values
  - EMAIL_TYPE
```
public static final int EMAIL_TYPE
```
    See Also:
    
    Constant Field Values
  - URL_TYPE
```
public static final int URL_TYPE
```
    See Also:
    
    Constant Field Values
- Constructor Detail
  - UAX29URLEmailTokenizerImpl
```
public UAX29URLEmailTokenizerImpl(java.io.Reader in)
```
    Creates a new scanner
    
    Parameters:
    
    in - the java.io.Reader to read input from.
- Method Detail
  - yychar
```
public final int yychar()
```
    Description copied from interface: StandardTokenizerInterface
    
    Returns the current position.
    
    Specified by:
    
    yychar in interface StandardTokenizerInterface
  - getText
```
public final void getText(CharTermAttribute t)
```
    Fills CharTermAttribute with the current token text.
    
    Specified by:
    
    getText in interface StandardTokenizerInterface
  - yyclose
```
public final void yyclose()
                   throws java.io.IOException
```
    Closes the input stream.
    
    Throws:
    
    java.io.IOException
  - yyreset
```
public final void yyreset(java.io.Reader reader)
```
    Resets the scanner to read from a new input stream. Does not close the old reader. All internal variables are reset, and the old input stream cannot be reused (the internal buffer is discarded and lost). The lexical state is set to YYINITIAL. The internal scan buffer is resized down to its initial length if it has grown.
    
    Specified by:
    
    yyreset in interface StandardTokenizerInterface
    
    Parameters:
    
    reader - the new input stream
  - yystate
```
public final int yystate()
```
    Returns the current lexical state.
  - yybegin
```
public final void yybegin(int newState)
```
    Enters a new lexical state
    
    Parameters:
    
    newState - the new lexical state
  - yytext
```
public final java.lang.String yytext()
```
    Returns the text matched by the current regular expression.
  - yycharat
```
public final char yycharat(int pos)
```
    Returns the character at position pos from the matched text. It is equivalent to yytext().charAt(pos), but faster.
    
    Parameters:
    
    pos - the position of the character to fetch. A value from 0 to yylength()-1.
    
    Returns:
    
    the character at position pos
  - yylength
```
public final int yylength()
```
    Returns the length of the matched text region.
    
    Specified by:
    
    yylength in interface StandardTokenizerInterface
  - yypushback
```
public void yypushback(int number)
```
    Pushes the specified number of characters back into the input stream. They will be read again by the next call of the scanning method.
    
    Parameters:
    
    number - the number of characters to be read again. This number must not be greater than yylength()!
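
    The pushback contract is easier to see in motion. The sketch below does not use the Lucene class, which needs Lucene on the classpath. It uses a hypothetical stand-in, `PushbackToyScanner`, that splits on spaces and mimics only the behaviour documented for `yytext()`, `yylength()`, `yycharat(int)` and `yypushback(int)`. The constant values and the exception thrown for an oversized pushback are assumptions made for illustration.

```java
// Hypothetical stand-in, not the Lucene class: splits on spaces only.
final class PushbackToyScanner {
    static final int YYEOF = -1;    // assumed value, for illustration only
    static final int WORD_TYPE = 0; // assumed value, for illustration only

    private final char[] buffer;
    private int start = 0; // first character of the current match
    private int end = 0;   // one past the last character of the current match

    PushbackToyScanner(String input) {
        this.buffer = input.toCharArray();
    }

    // Scans the next run of non-space characters, starting where the last match ended.
    int getNextToken() {
        int i = end;
        while (i < buffer.length && buffer[i] == ' ') i++;
        start = i;
        while (i < buffer.length && buffer[i] != ' ') i++;
        end = i;
        return start == end ? YYEOF : WORD_TYPE;
    }

    String yytext() { return new String(buffer, start, end - start); }

    int yylength() { return end - start; }

    // Same result as yytext().charAt(pos), without building a String.
    char yycharat(int pos) { return buffer[start + pos]; }

    // Shrinks the current match so the trailing characters are scanned again.
    void yypushback(int number) {
        if (number > yylength()) {
            // Toy choice of exception; the real scanner's error handling may differ.
            throw new IllegalArgumentException("pushback exceeds yylength()");
        }
        end -= number;
    }
}

final class PushbackSketch {
    // Returns the match before pushback, after pushback, and after rescanning.
    static String[] demo() {
        PushbackToyScanner s = new PushbackToyScanner("abc de");
        s.getNextToken();               // matches "abc"
        String first = s.yytext();
        s.yypushback(1);                // gives "c" back to the input
        String shortened = s.yytext();  // the match is now "ab"
        s.getNextToken();               // scans the pushed-back "c"
        String rescanned = s.yytext();
        return new String[] { first, shortened, rescanned };
    }
}
```

    In this toy, `PushbackSketch.demo()` returns `"abc"`, `"ab"`, `"c"`: the pushed-back character leaves the current match and becomes the start of the next one.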
  - getNextToken
```
public int getNextToken()
                 throws java.io.IOException
```
    Resumes scanning until the next regular expression is matched, the end of input is encountered, or an I/O error occurs.
    
    Specified by:
    
    getNextToken in interface StandardTokenizerInterface
    
    Returns:
    
    the next token
    
    Throws:
    
    java.io.IOException - if any I/O error occurs
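
    A caller typically drives the scanner in a loop: call `getNextToken()` until it returns `YYEOF`, and read `yytext()`, `yychar()` and `yylength()` after each match. The sketch below shows that loop against a hypothetical stand-in, `ToyScanner`, so that it runs without Lucene on the classpath. The stand-in splits on spaces rather than applying UAX#29, and its constant values are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in, not the Lucene class: splits on spaces instead of applying UAX#29.
final class ToyScanner {
    static final int YYEOF = -1;    // assumed value, for illustration only
    static final int WORD_TYPE = 0; // assumed value, for illustration only

    private final String input;
    private int start = 0; // offset of the current match
    private int end = 0;   // offset just past the current match

    ToyScanner(String input) {
        this.input = input;
    }

    // Advances to the next run of non-space characters, or returns YYEOF.
    int getNextToken() {
        int i = end;
        while (i < input.length() && input.charAt(i) == ' ') i++;
        if (i == input.length()) {
            start = end = i;
            return YYEOF;
        }
        start = i;
        while (i < input.length() && input.charAt(i) != ' ') i++;
        end = i;
        return WORD_TYPE;
    }

    int yychar() { return start; }

    int yylength() { return end - start; }

    String yytext() { return input.substring(start, end); }
}

final class ScanLoopSketch {
    // Drives the scanner until YYEOF and records "text@offset+length" for each token.
    static List<String> scan(String input) {
        ToyScanner scanner = new ToyScanner(input);
        List<String> out = new ArrayList<>();
        while (scanner.getNextToken() != ToyScanner.YYEOF) {
            out.add(scanner.yytext() + "@" + scanner.yychar() + "+" + scanner.yylength());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(scan("mail me  now"));
    }
}
```

    With the real class, the same loop would call `getNextToken()` on a `UAX29URLEmailTokenizerImpl` constructed from a `java.io.Reader`, compare the result against `UAX29URLEmailTokenizerImpl.YYEOF`, and handle the declared `java.io.IOException`.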
