Overview
Analyzers are used for creating fulltext indices. They take the content of a field and split it
into tokens, which are then searched. Analyzers filter, reorder and/or transform the content of
a field before it becomes the final stream of tokens.
An analyzer consists of one tokenizer, zero or more token-filters, and zero or more char-filters.
When field content is analyzed into a stream of tokens, the char-filters are applied first.
They are used to filter special characters from the stream of characters that makes up the content.
The tokenizer then splits the possibly filtered stream of characters into tokens.
Token-filters can add, delete or transform tokens.
With these elements in place, analyzers provide fine-grained control over building the token stream
used for fulltext search. For example, you can use language specific analyzers,
tokenizers and token-filters to get proper search results for data provided in a certain language.
The builtin analyzers, tokenizers, token-filters and char-filters are listed below.
They can be used as-is or can be extended. See Indices and fulltext search.
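As a minimal sketch of how these elements fit together, a custom analyzer could be composed roughly like this, assuming the SQL syntax described in Indices and fulltext search (the analyzer, table and column names are illustrative):

    -- a custom analyzer: one tokenizer, plus optional token filters and char filters
    CREATE ANALYZER example_analyzer (
        TOKENIZER whitespace,        -- splits the character stream into tokens
        TOKEN_FILTERS (
            lowercase,               -- transforms each token to lower case
            kstem                    -- stems the lowercased tokens
        ),
        CHAR_FILTERS (
            html_strip               -- applied first: strips markup from the raw field content
        )
    );

    -- the analyzer can then be referenced by a fulltext index
    CREATE TABLE example_docs (
        content STRING INDEX USING FULLTEXT WITH (analyzer = 'example_analyzer')
    );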
Builtin Analyzers
standard
type='standard'
An analyzer of type standard is built using the standard Tokenizer
with the standard Token Filter,
lowercase Token Filter, and stop Token Filter.
It lowercases all tokens, uses no stopwords, and excludes tokens longer than 255 characters.
This analyzer uses Unicode text segmentation, as defined by UAX#29.
Example:
The quick brown fox jumps Over the lAzY DOG. --> quick, brown, fox, jumps, lazy, dog
Parameters
- stopwords
- A list of stopwords to initialize the stop filter with. Defaults to the
english stop words.
- max_token_length
- The maximum token length. If a token is seen that exceeds this length then it is discarded.
Defaults to 255.
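For example, the standard analyzer could be extended with custom parameters roughly as follows (a sketch; the analyzer name and the stopword list are illustrative):

    CREATE ANALYZER standard_short_tokens EXTENDS standard WITH (
        max_token_length = 100,            -- discard tokens longer than 100 characters
        stopwords = ['the', 'a', 'an']     -- illustrative stopword list
    );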
simple
type='simple'
Uses the lowercase tokenizer.
whitespace
type='whitespace'
Uses the whitespace tokenizer.
stop
type='stop'
Uses the lowercase Tokenizer with the stop Token Filter.
Parameters
- stopwords
- A list of stopwords to initialize the stop token filter with. Defaults to the
english stop words.
- stopwords_path
- A path (either relative to config location, or absolute) to a stopwords file configuration.
keyword
type='keyword'
Creates a single token from the entire field content.
pattern
type='pattern'
An analyzer of type pattern that flexibly separates text into terms via a regular expression.
Parameters
- lowercase
- Whether terms should be lowercased. Defaults to true.
- pattern
- The regular expression pattern, defaults to \W+.
- flags
- The regular expression flags.
Note
The regular expression should match the token separators, not the tokens themselves.
Flags should be pipe-separated, e.g. CASE_INSENSITIVE|COMMENTS. Check the Java Pattern API for
more details about flag options.
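A pattern analyzer that splits on commas and whitespace might look roughly like this (a sketch; the analyzer name and the pattern are illustrative):

    CREATE ANALYZER comma_separated EXTENDS pattern WITH (
        pattern = '[,\s]+',     -- the pattern matches the separators, not the tokens
        lowercase = true
    );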
language
type='<language-name>'
The following types are supported:
arabic, armenian, basque, brazilian, bulgarian, catalan,
chinese, cjk, czech, danish, dutch, english, finnish,
french, galician, german, greek, hindi, hungarian, indonesian,
italian, norwegian, persian, portuguese, romanian, russian,
spanish, swedish, turkish, thai.
Parameters
- stopwords
- A list of stopwords to initialize the stop filter with. Defaults to the english stop words.
- stopwords_path
- A path (either relative to config location, or absolute) to a stopwords file configuration.
The following analyzers support setting a custom stem_exclusion list:
arabic, armenian, basque, brazilian, bulgarian, catalan, czech,
danish, dutch, english, finnish, french, galician, german,
hindi, hungarian, indonesian, italian, norwegian, portuguese,
romanian, russian, spanish, swedish, turkish.
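A language analyzer with a custom stem exclusion list might be declared roughly like this (a sketch; the analyzer name and the word list are illustrative):

    CREATE ANALYZER german_products EXTENDS german WITH (
        stem_exclusion = ['Autobahn', 'Kindergarten']   -- these terms are not stemmed
    );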
snowball
type='snowball'
Uses the standard Tokenizer, with standard filter,
lowercase filter, stop filter,
and snowball filter.
Parameters
- stopwords
- A list of stopwords to initialize the stop filter with. Defaults to the english stop words.
- language
- See the language parameter of the snowball token filter.
Builtin Tokenizers
standard
type='standard'
A tokenizer of type standard providing a grammar based tokenizer, which is a good tokenizer for
most European language documents. The tokenizer implements the Unicode Text Segmentation
algorithm, as specified in Unicode Standard Annex #29.
Parameters
- max_token_length
- The maximum token length. If a token is seen that exceeds this length then it is discarded.
Defaults to 255.
edgeNGram
type='edge_ngram' or type='edgeNGram'
This tokenizer is very similar to the ngram tokenizer but only keeps n-grams which start at
the beginning of a token.
Parameters
- min_gram
- Minimum size in codepoints of a single n-gram. default: 1
- max_gram
- Maximum size in codepoints of a single n-gram. default: 2
- token_chars
- Character classes to keep in the tokens; the tokenizer splits on characters that don't belong
to any of these classes. Default: [] (keep all characters).
Classes: letter, digit, whitespace, punctuation, symbol
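For example, an edge n-gram tokenizer for prefix matching (e.g. autocomplete-style search) could be configured roughly like this (a sketch; the analyzer and tokenizer names are illustrative):

    CREATE ANALYZER autocomplete_prefixes (
        TOKENIZER prefix_tok WITH (
            type = 'edge_ngram',
            min_gram = 2,
            max_gram = 10,
            token_chars = ['letter', 'digit']   -- split on anything that is not a letter or digit
        ),
        TOKEN_FILTERS (
            lowercase
        )
    );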
keyword
type='keyword'
Emits the entire input as a single token.
Parameters
- buffer_size
- The term buffer size. Defaults to 256.
letter
type='letter'
Divides text at non-letters.
lowercase
type='lowercase'
Performs the function of the letter tokenizer and the lowercase token filter together.
It divides text at non-letters and converts the resulting tokens to lower case.
ngram
type='ngram' or type='nGram'
Tokenizes the input into n-grams of configurable minimum and maximum size.
Parameters
- min_gram
- Minimum size in codepoints of a single n-gram. default 1.
- max_gram
- Maximum size in codepoints of a single n-gram. default 2.
- token_chars
- Character classes to keep in the tokens; the tokenizer splits on characters that don't belong
to any of these classes. Default: [] (keep all characters).
Classes: letter, digit, whitespace, punctuation, symbol
whitespace
type='whitespace'
Divides text at whitespace.
pattern
type='pattern'
Separates text into terms via a regular expression.
Parameters
- pattern
- The regular expression pattern, defaults to \W+.
- flags
- The regular expression flags.
- group
- Which group to extract into tokens. Defaults to -1 (split).
Note
The regular expression should match the token separators, not the tokens themselves.
Flags should be pipe-separated, e.g. CASE_INSENSITIVE|COMMENTS. Check the Java Pattern API for
more details about flag options.
uax url email
type='uax_url_email'
Exactly like the standard tokenizer, but treats emails and URLs as single tokens.
Parameters
- max_token_length
- The maximum token length. If a token is seen that exceeds this length then it is discarded.
Defaults to 255.
path hierarchy
type='path_hierarchy'
Takes something like this:
/something/something/else
And produces tokens:
/something
/something/something
/something/something/else
Parameters
- delimiter
- The character delimiter to use, defaults to /.
- replacement
- An optional replacement character to use. Defaults to the delimiter.
- buffer_size
- The buffer size to use, defaults to 1024.
- reverse
- If true, tokens are generated in reverse order. Defaults to false.
- skip
- The number of initial tokens to skip. Defaults to 0.
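A path hierarchy tokenizer could be wired into an analyzer roughly like this (a sketch; names are illustrative). With the settings below, /var/log/app would produce /var, /var/log and /var/log/app:

    CREATE ANALYZER file_path_hierarchy (
        TOKENIZER path_tok WITH (
            type = 'path_hierarchy',
            delimiter = '/'
        )
    );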
Builtin Token Filters
standard
type='standard'
Normalizes tokens extracted with the standard Tokenizer.
ascii folding
type='asciifolding'
Converts alphabetic, numeric, and symbolic Unicode characters
which are not in the first 127 ASCII characters
(the “Basic Latin” Unicode block) into their ASCII equivalents, if one exists.
length
type='length'
Removes words that are too long or too short for the stream.
Parameters
- min
- The minimum token length. Defaults to 0.
- max
- The maximum token length. Defaults to Integer.MAX_VALUE.
lowercase
type='lowercase'
Normalizes token text to lower case.
Parameters
- language
- For options, see language Analyzer.
ngram
type='ngram' or type='nGram'
Parameters
- min_gram
- Defaults to 1.
- max_gram
- Defaults to 2.
edge ngram
type='edgeNGram' or type='edge_ngram'
Parameters
- min_gram
- Defaults to 1.
- max_gram
- Defaults to 2.
- side
- Either front or back. Defaults to front.
porter stem
type='porter_stem'
Transforms the token stream as per the Porter stemming algorithm.
Note
The input to the stemming filter must already be in lower case,
so you will need to use the lowercase token filter or the lowercase tokenizer earlier in the
analyzer chain in order for this to work properly. For example, when using a custom analyzer,
make sure the lowercase filter comes before the porter_stem filter in the list of filters.
shingle
type='shingle'
Constructs shingles (token n-grams), combinations of tokens as a single token, from a token stream.
Parameters
- max_shingle_size
- The maximum shingle size. Defaults to 2.
- min_shingle_size
- The minimum shingle size. Defaults to 2.
- output_unigrams
- If true the output will contain the input tokens (unigrams) as well as the shingles.
Defaults to true.
- output_unigrams_if_no_shingles
- If output_unigrams is false the output will contain the input tokens (unigrams) if no
shingles are available. Note if output_unigrams is set to true this setting has no effect.
Defaults to false.
- token_separator
- The string to use when joining adjacent tokens to form a shingle. Defaults to " " (a single space).
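A shingle filter emitting unigrams plus two- and three-word shingles might be configured roughly like this (a sketch; names are illustrative):

    CREATE ANALYZER shingled_phrases (
        TOKENIZER standard,
        TOKEN_FILTERS (
            lowercase,
            phrase_shingles WITH (
                type = 'shingle',
                min_shingle_size = 2,
                max_shingle_size = 3,
                output_unigrams = true     -- keep the single tokens as well
            )
        )
    );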
stop
type='stop'
Removes stop words from token streams.
Parameters
- stopwords
- A list of stop words to use. Defaults to english stop words.
- stopwords_path
- A path (either relative to config location, or absolute) to a stopwords file configuration.
Each stop word should be in its own “line” (separated by a line break). The file must be
UTF-8 encoded.
- ignore_case
- Set to true to lower case all words first. Defaults to false.
- remove_trailing
- Set to false in order not to ignore the last term of a search if it is a stop word.
Defaults to true.
word delimiter
type='word_delimiter'
Splits words into subwords and performs optional transformations on subword groups.
Parameters
- generate_word_parts
- If true causes parts of words to be generated: “PowerShot” ⇒ “Power” “Shot”. Defaults to true.
- generate_number_parts
- If true causes number subwords to be generated: “500-42” ⇒ “500” “42”. Defaults to true.
- catenate_words
- If true causes maximum runs of word parts to be catenated: “wi-fi” ⇒ “wifi”. Defaults to
false.
- catenate_numbers
- If true causes maximum runs of number parts to be catenated: “500-42” ⇒ “50042”. Defaults to
false.
- catenate_all
- If true causes all subword parts to be catenated: “wi-fi-4000” ⇒ “wifi4000”. Defaults to
false.
- split_on_case_change
- If true causes “PowerShot” to be two tokens (“Power-Shot” remains two parts regardless).
Defaults to true.
- preserve_original
- If true includes original words in subwords: “500-42” ⇒ “500-42” “500” “42”. Defaults to
false.
- split_on_numerics
- If true causes “j2se” to be three tokens; “j” “2” “se”. Defaults to true.
- stem_english_possessive
- If true causes trailing “‘s” to be removed for each subword: “O’Neil’s” ⇒ “O”, “Neil”.
Defaults to true.
- protected_words
- A list of words protected from being delimited.
- protected_words_path
- A relative or absolute path to a file containing protected words (one per line).
If relative, it is resolved against the config/ directory if it exists.
- type_table
- A custom type mapping table.
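A word delimiter filter for product-code style input (e.g. “wi-fi-4000”) might be set up roughly like this (a sketch; names are illustrative):

    CREATE ANALYZER product_codes (
        TOKENIZER whitespace,
        TOKEN_FILTERS (
            code_delimiter WITH (
                type = 'word_delimiter',
                generate_word_parts = true,
                generate_number_parts = true,
                catenate_all = true,          -- also emit "wifi4000"
                preserve_original = true      -- and keep "wi-fi-4000" itself
            ),
            lowercase
        )
    );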
stemmer
type='stemmer'
A filter that stems words (similar to snowball, but with more options).
Parameters
- language/name
- arabic, armenian, basque, brazilian, bulgarian, catalan, czech, danish, dutch, english,
finnish, french, german, german2, greek, hungarian, italian, kp, kstem, lovins, latvian,
norwegian, minimal_norwegian, porter, portuguese, romanian, russian, spanish, swedish,
turkish, minimal_english, possessive_english, light_finnish, light_french, minimal_french,
light_german, minimal_german, hindi, light_hungarian, indonesian, light_italian,
light_portuguese, minimal_portuguese, light_russian, light_spanish, light_swedish.
keyword marker
type='keyword_marker'
Protects words from being modified by stemmers. Must be placed before any stemming filters.
Parameters
- keywords
- A list of words to use.
- keywords_path
- A path (either relative to config location, or absolute) to a list of words.
- ignore_case
- Set to true to lower case all words first. Defaults to false.
kstem
type='kstem'
A high performance stemming filter for English.
All terms must already be lowercased (use the lowercase filter) for this filter
to work correctly.
snowball
type='snowball'
A filter that stems words using a Snowball-generated stemmer.
Parameters
- language
- Possible values: Armenian, Basque, Catalan, Danish, Dutch, English, Finnish, French, German,
German2, Hungarian, Italian, Kp, Lovins, Norwegian, Porter, Portuguese, Romanian, Russian,
Spanish, Swedish, Turkish.
synonym
type='synonym'
Allows synonyms to be handled easily during the analysis process. Synonyms are configured using a
configuration file.
Parameters
- synonyms_path
- Path to the synonyms configuration file.
- ignore_case
- Defaults to false.
- expand
- Defaults to true.
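A synonym filter might be added roughly like this (a sketch; the analyzer name and the synonyms file path are illustrative):

    CREATE ANALYZER with_synonyms (
        TOKENIZER standard,
        TOKEN_FILTERS (
            lowercase,
            product_synonyms WITH (
                type = 'synonym',
                synonyms_path = 'synonyms.txt',   -- illustrative path, relative to the config location
                ignore_case = true
            )
        )
    );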
compound word
type='dictionary_decompounder' or type='hyphenation_decompounder'
Decomposes compound words.
Parameters
- word_list
- A list of words to use.
- word_list_path
- A path (either relative to config location, or absolute) to a list of words.
- min_word_size
- Minimum word size (integer). Defaults to 5.
- min_subword_size
- Minimum subword size (integer). Defaults to 2.
- max_subword_size
- Maximum subword size (integer). Defaults to 15.
- only_longest_match
- Whether to match only the longest subword (boolean). Defaults to false.
reverse
type='reverse'
Reverses each token.
elision
type='elision'
Removes elisions.
Parameters
- articles
- A set of articles to remove, for example ['j', 'l'] for content like J'aime l'odeur.
truncate
type='truncate'
Truncates tokens to a specific length.
Parameters
- length
- The number of characters to truncate tokens to. Defaults to 10.
unique
type='unique'
Indexes only unique tokens during analysis. By default this is applied to the whole token stream.
Parameters
- only_on_same_position
- If set to true, it will only remove duplicate tokens on the same position.
pattern capture
type='pattern_capture'
Emits a token for every capture group in the regular expression.
Parameters
- preserve_original
- If set to true (the default), the original token is also emitted.
pattern replace
type='pattern_replace'
Handles string replacements based on a regular expression.
Parameters
- pattern
- Regular expression whose matches will be replaced.
- replacement
- The replacement, which can reference the original text with $1-like references
($1 refers to the first matched group).
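A pattern replace filter that strips dashes and underscores inside tokens might look roughly like this (a sketch; names and the pattern are illustrative):

    CREATE ANALYZER strip_separators (
        TOKENIZER whitespace,
        TOKEN_FILTERS (
            remove_dashes WITH (
                type = 'pattern_replace',
                pattern = '[-_]+',      -- matches runs of dashes and underscores within a token
                replacement = ''        -- replace them with nothing
            ),
            lowercase
        )
    );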
trim
type='trim'
Trims the whitespace surrounding a token.
limit token count
type='limit'
Limits the number of tokens that are indexed per document and field.
Parameters
- max_token_count
- The maximum number of tokens that should be indexed per document and field. The default is 1.
- consume_all_tokens
- If set to true, the filter exhausts the stream even if max_token_count tokens have already
been consumed. The default is false.
hunspell
type='hunspell'
Basic support for hunspell stemming.
Hunspell dictionaries will be picked up from a dedicated hunspell directory on the filesystem
(defaults to <path.conf>/hunspell).
Each dictionary is expected to have its own directory named after its associated locale (language).
This dictionary directory is expected to hold both the *.aff and *.dic files (all of which will
automatically be picked up).
Parameters
- ignore_case
- If true, dictionary matching will be case insensitive (defaults to false)
- strict_affix_parsing
- Determines whether errors while reading an affix rules file cause an exception or are simply
ignored (defaults to true).
- locale
- A locale for this filter. If this is unset, lang or language is used instead, so one
of these has to be set.
- dictionary
- The name of a dictionary. The path to your hunspell dictionaries should be configured
via indices.analysis.hunspell.dictionary.location in the crate.yml config file.
- dedup
- If only unique terms should be returned, this needs to be set to true. Defaults to true.
- recursion_level
- Configures the recursion level a stemmer can go into. Defaults to 2.
Some languages (for example czech) give better results when set to 1 or 0,
so you should test it out.
common grams
type='common_grams'
Generates bigrams for frequently occurring terms. Single terms are still indexed.
It can be used as an alternative to the stop token filter when common terms should not
be ignored completely.
Parameters
- common_words
- A list of common words to use.
- common_words_path
- A path (either relative to config location, or absolute) to a list of common words.
Each word should be in its own “line” (separated by a line break). The file must be UTF-8
encoded.
- ignore_case
- If true, common words matching will be case insensitive (defaults to false).
- query_mode
- Generates bigrams then removes common words and single terms followed by a common word
(defaults to false).
Note
Either common_words or common_words_path must be given.
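A common grams filter with an inline word list might be configured roughly like this (a sketch; names and the word list are illustrative):

    CREATE ANALYZER common_word_bigrams (
        TOKENIZER standard,
        TOKEN_FILTERS (
            lowercase,
            frequent_terms WITH (
                type = 'common_grams',
                common_words = ['the', 'a', 'of'],   -- illustrative list of frequent terms
                ignore_case = true
            )
        )
    );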
normalization
type='arabic_normalization' or type='persian_normalization'
Applies normalization specific to Arabic or Persian text, respectively.
delimited payload
type='delimited_payload_filter'
Splits tokens at a delimiter (default |) into the actual token being indexed and a payload
that is stored additionally in the index. For example, Trillian|65535 will be indexed as Trillian
with 65535 as payload.
Parameters
- encoding
- How the payload should be interpreted. Possible values are float for float values,
int for integer values and identity for keeping the payload as a byte array (string).
- delimiter
- The string used to separate the token and its payload.
keep
type='keep'
Keeps only the tokens defined in the keep_words or keep_words_path settings of this filter.
All other tokens are removed. This filter works like an inverse of the stop token filter.
Parameters
- keep_words
- A list of words to keep and index as tokens.
- keep_words_path
- A path (either relative to config location, or absolute) to a list of words to keep and index.
Each word should be in its own “line” (separated by a line break). The file must be UTF-8
encoded.
stemmer override
type='stemmer_override'
Overrides stemming with a custom mapping defined by rules or rules_path; terms matched by the
mapping are protected from being modified by stemmers. One of these settings has to be set.
Parameters
- rules
- A list of rules for overriding, in the form of [<source>=><replacement>] e.g. "foo=>bar"
- rules_path
- A path to a file with one rule per line, like above.
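A stemmer override placed before a stemming filter might look roughly like this (a sketch; names and rules are illustrative):

    CREATE ANALYZER stem_with_overrides (
        TOKENIZER standard,
        TOKEN_FILTERS (
            lowercase,
            custom_stems WITH (
                type = 'stemmer_override',
                rules = ['running=>run', 'mice=>mouse']   -- mapped terms are not touched by later stemmers
            ),
            porter_stem
        )
    );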
cjk bigram
type='cjk_bigram'
Handles Chinese, Japanese and Korean (CJK) bigrams.
Parameters
- output_unigrams
- Boolean flag to enable a combined unigram+bigram approach.
Default is false, so single CJK characters that do not form a bigram are passed as unigrams.
All non-CJK characters are output unmodified.
- ignored_scripts
- Scripts to ignore. Possible values: han, hiragana, katakana, hangul
cjk width
type='cjk_width'
A filter that normalizes CJK width differences (e.g. folds fullwidth ASCII variants into their
basic Latin equivalents).
language stem
type='arabic_stem' or
type='brazilian_stem' or
type='czech_stem' or
type='dutch_stem' or
type='french_stem' or
type='german_stem' or
type='russian_stem'
A group of filters that apply language specific stemmers to the token stream.
To prevent terms from being stemmed, put a keyword_marker token filter before this
filter in the token filter chain.