Analyzers are used to create fulltext indexes. They take the content of a field and split it into tokens, which are what is actually searched. Analyzers filter, reorder, and/or transform the content of a field before it becomes the final stream of tokens.
An analyzer consists of one tokenizer, zero or more token-filters, and zero or more char-filters.
When field content is analyzed into a stream of tokens, the char-filters are applied first. They are used to filter special characters out of the stream of characters that makes up the content.
Tokenizers split the possibly filtered stream of characters into tokens.
Token-filters can add tokens, delete tokens or transform them.
With these elements in place, analyzers provide fine-grained control over building the token stream used for fulltext search. For example, you can use language-specific analyzers, tokenizers, and token-filters to get proper search results for data provided in a certain language.
The builtin analyzers, tokenizers, token-filters, and char-filters are listed below. They can be used as-is or extended. See Indices and fulltext search.
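As an illustration, the sketch below combines a tokenizer, token-filters, and a char-filter into a custom analyzer, assuming the CREATE ANALYZER syntax described in Indices and fulltext search. The names myanalyzer and mystop, the stopwords parameter, and the use of html_strip as an example char-filter are illustrative assumptions, not verbatim from this reference:

CREATE ANALYZER myanalyzer (
    TOKENIZER standard,                      -- splits the (char-filtered) character stream into tokens
    TOKEN_FILTERS (
        lowercase,                           -- transforms every token to lower case
        mystop WITH (
            type = 'stop',
            stopwords = ['the', 'and', 'a']  -- removes these tokens from the stream
        )
    ),
    CHAR_FILTERS (
        html_strip                           -- strips HTML markup before tokenization
    )
);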
Note
This documentation is mainly derived from the Elasticsearch Analysis Documentation.
Builtin analyzers

type='standard'
An analyzer of type standard is built using the standard Tokenizer with the standard Token Filter, lowercase Token Filter, and stop Token Filter.
Lowercases all tokens, excludes English stopwords (common words you usually don't search for, like 'the', 'and', 'a', etc.), and excludes tokens longer than 255 characters. Tokens are built using a grammar-based approach which is suited for most European languages.
Example:
The quick brown fox jumps Over the lAzY DOG. --> quick, brown, fox, jumps, lazy, dog
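To use a builtin analyzer, reference it by name when creating a fulltext index; the table and column names below are only illustrative. A minimal sketch:

CREATE TABLE documents (
    id INTEGER PRIMARY KEY,
    content STRING,
    INDEX content_ft USING FULLTEXT (content) WITH (analyzer = 'standard')
);

-- the MATCH predicate searches the tokens produced by the 'standard' analyzer
SELECT id FROM documents WHERE MATCH(content_ft, 'quick fox');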
type='stop'
Uses the lowercase Tokenizer with the stop Token Filter.
type='pattern'
An analyzer of type pattern flexibly separates text into terms via a regular expression.
Note
The regular expression should match the token separators, not the tokens themselves.
Flags should be pipe-separated, e.g. CASE_INSENSITIVE|COMMENTS. Check the Java Pattern API for more details about flag options.
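For example, a custom analyzer could extend the pattern type to split comma-separated values. This is a sketch assuming the EXTENDS syntax described in Indices and fulltext search; the pattern and flags parameter names mirror the Elasticsearch options and are assumptions here:

CREATE ANALYZER csv_values EXTENDS pattern WITH (
    pattern = '\s*,\s*',          -- split on commas, swallowing surrounding whitespace
    flags = 'CASE_INSENSITIVE'    -- pipe-separated Java Pattern flags
);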
type='<language-name>'
The following types are supported:
arabic, armenian, basque, brazilian, bulgarian, catalan, chinese, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, italian, norwegian, persian, portuguese, romanian, russian, spanish, swedish, turkish, thai.
The following analyzers support setting a custom stem_exclusion list:
arabic, armenian, basque, brazilian, bulgarian, catalan, czech, danish, dutch, english, finnish, french, galician, german, hindi, hungarian, indonesian, italian, norwegian, portuguese, romanian, russian, spanish, swedish, turkish.
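As a sketch, such an analyzer could be extended with a stem_exclusion list to protect specific terms from stemming; the analyzer name and word list are illustrative, and the EXTENDS syntax is assumed from Indices and fulltext search:

CREATE ANALYZER english_products EXTENDS english WITH (
    stem_exclusion = ['running', 'jumping']   -- these terms are indexed unstemmed
);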
Builtin tokenizers

type='standard'
A tokenizer of type standard provides grammar-based tokenization, which is a good choice for most European language documents. It implements the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29.
type='edgeNGram'
This tokenizer is very similar to ngram but only keeps n-grams which start at the beginning of a token.
Character classes to keep in the tokens; the tokenizer splits on characters that don't belong to any of these classes. Default: [] (keep all characters).
Classes: letter, digit, whitespace, punctuation, symbol
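A sketch of a custom analyzer built on an edgeNGram tokenizer, e.g. for prefix search. The min_gram, max_gram, and token_chars parameter names mirror the Elasticsearch tokenizer options and are assumptions here:

CREATE ANALYZER autocomplete (
    TOKENIZER edge_tok WITH (
        type = 'edgeNGram',
        min_gram = 2,                        -- shortest prefix n-gram emitted
        max_gram = 10,                       -- longest prefix n-gram emitted
        token_chars = ['letter', 'digit']    -- split on anything that is not a letter or digit
    ),
    TOKEN_FILTERS (
        lowercase
    )
);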
type='keyword'
Emits the entire input as a single token.
type='lowercase'
Performs the function of the letter tokenizer and the lowercase token filter combined: it divides text at non-letters and converts the resulting tokens to lower case.
type='ngram'
Character classes to keep in the tokens; the tokenizer splits on characters that don't belong to any of these classes. Default: [] (keep all characters).
Classes: letter, digit, whitespace, punctuation, symbol
type='pattern'
Separates text into terms via a regular expression.
Note
The regular expression should match the token separators, not the tokens themselves.
Flags should be pipe-separated, e.g. CASE_INSENSITIVE|COMMENTS. Check the Java Pattern API for more details about flag options.
type='uax_url_email'
Exactly like the standard tokenizer, but treats email addresses and URLs as single tokens.
type='path_hierarchy'
Takes something like this:
/something/something/else
And produces tokens:
/something
/something/something
/something/something/else
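This makes the tokenizer useful for indexing file system paths, so that a match on a parent path also covers its sub-paths. A minimal sketch with illustrative names, assuming the CREATE ANALYZER syntax described in Indices and fulltext search:

CREATE ANALYZER path_analyzer (
    TOKENIZER path_tok WITH (
        type = 'path_hierarchy'    -- emits one token per path prefix, as shown above
    )
);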
Builtin token-filters

type='asciifolding'
Converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the “Basic Latin” Unicode block) into their ASCII equivalents, if one exists.
type='length'
Removes tokens that are shorter or longer than configured length limits.
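A sketch of a custom analyzer using the length filter; the min and max parameter names mirror the Elasticsearch options and are assumptions here:

CREATE ANALYZER bounded_tokens (
    TOKENIZER standard,
    TOKEN_FILTERS (
        my_length WITH (
            type = 'length',
            min = 2,      -- drop tokens shorter than 2 characters
            max = 20      -- drop tokens longer than 20 characters
        )
    )
);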
type='edgeNGram'
type='porterStem'
Transforms the token stream as per the Porter stemming algorithm.
Note
The input to the stemming filter must already be in lower case, so you will need to apply the lowercase Token Filter or lowercase Tokenizer before it in the analyzer chain for this to work properly. For example, when using a custom analyzer, make sure the lowercase filter comes before the porterStem filter in the list of filters.
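A sketch of such a custom analyzer, with lowercase listed before the porterStem filter so that the stemmer only sees lowercased tokens (names are illustrative):

CREATE ANALYZER stemming_en (
    TOKENIZER standard,
    TOKEN_FILTERS (
        lowercase,               -- must come before the stemmer
        my_stem WITH (
            type = 'porterStem'
        )
    )
);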
type='shingle'
Constructs shingles (token n-grams), combinations of tokens as a single token, from a token stream.
type='stop'
Removes stop words from token streams.
type='word_delimiter'
Splits words into subwords and performs optional transformations on subword groups.
type='stemmer'
A filter that stems words (similar to snowball, but with more options).
type='keyword_marker'
Protects words from being modified by stemmers. Must be placed before any stemming filters.
type='keyword_repeat'
Emits each incoming token twice, once as a keyword and once as a non-keyword, to allow an unstemmed version of a term to be indexed side by side with the stemmed version of the term. Given the nature of this filter, each token that isn't transformed by a subsequent stemmer will be indexed twice.
type='kstem'
A high-performance stemming filter for English. All terms must already be lowercased (use the lowercase filter) for this filter to work correctly.
type='snowball'
A filter that stems words using a Snowball-generated stemmer.
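For example, a language-specific stemming analyzer could use a snowball filter. In this sketch the language parameter mirrors the Elasticsearch option, and the analyzer and filter names are illustrative:

CREATE ANALYZER german_stems (
    TOKENIZER standard,
    TOKEN_FILTERS (
        lowercase,
        my_snowball WITH (
            type = 'snowball',
            language = 'German'    -- Snowball stemmer to apply
        )
    )
);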
type='phonetic'
type='synonym'
Allows synonyms to be handled easily during the analysis process. Synonyms are configured using a configuration file.
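A sketch of an analyzer using a synonym filter; the synonyms_path parameter name mirrors the Elasticsearch option and the path and names are illustrative:

CREATE ANALYZER synonym_en (
    TOKENIZER standard,
    TOKEN_FILTERS (
        lowercase,
        my_synonyms WITH (
            type = 'synonym',
            synonyms_path = 'analysis/synonyms.txt'   -- configuration file with synonym rules
        )
    )
);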
type='dictionary_decompounder' or type='hyphenation_decompounder'
Decomposes compound words.
type='elision'
Removes elisions.
type='truncate'
Truncates tokens to a specific length.
type='unique'
Used to index only unique tokens during analysis. By default it is applied to the whole token stream.
type='pattern_capture'
Emits a token for every capture group in the regular expression.
type='pattern_replace'
Handles string replacements based on a regular expression.
type='limit'
Limits the number of tokens that are indexed per document and field.
type='hunspell'
Basic support for hunspell stemming. Hunspell dictionaries will be picked up from a dedicated hunspell directory on the filesystem (defaults to <path.conf>/hunspell). Each dictionary is expected to have its own directory named after its associated locale (language). This dictionary directory is expected to hold both the *.aff and *.dic files (all of which will automatically be picked up).
type='common_grams'
Generates bigrams for frequently occurring terms. Single terms are still indexed. It can be used as an alternative to the stop Token Filter when you don't want to completely ignore common terms.
Note
Either common_words or common_words_path must be given.
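A sketch of an analyzer using common_grams with an inline word list; the common_words parameter name mirrors the Elasticsearch option and the word list and names are illustrative:

CREATE ANALYZER common_grams_en (
    TOKENIZER standard,
    TOKEN_FILTERS (
        lowercase,
        my_grams WITH (
            type = 'common_grams',
            common_words = ['the', 'and', 'a']   -- bigrams are generated for these frequent terms
        )
    )
);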
type='arabic_normalization' or type='persian_normalization'
type='type_as_payload'
Stores the type of every token, encoded as a UTF-8 string, as an additional payload. This is currently not usable from within Crate, so it has no effect on how fulltext search works on your tables.