Configuring Elasticsearch Analyzers & Token Filters

Published on May 6, 2018 by Bo Andersen

Elasticsearch ships with a number of built-in analyzers and token filters, some of which can be configured through parameters. In the following example, I will configure the standard analyzer to remove stop words, which implicitly enables the stop token filter.

I will create a new index for this purpose and define an analyzer at index creation time. We do that by adding a settings object.

PUT /existing_analyzer_config
{
  "settings": {
    "analysis": {
      "analyzer": {
        "english_stop": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}

Apart from _english_, the stop token filter ships with predefined stop word lists for a number of other languages.
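For instance, the same configuration could use the predefined Spanish list, or an explicit array of stop words instead of a predefined list. The index and analyzer names below are just ones I made up for illustration.

PUT /stopwords_examples
{
  "settings": {
    "analysis": {
      "analyzer": {
        "spanish_stop": {
          "type": "standard",
          "stopwords": "_spanish_"
        },
        "custom_stop": {
          "type": "standard",
          "stopwords": [ "in", "the", "for" ]
        }
      }
    }
  }
}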

Next, I want to show you how to configure a token filter as well. That’s done by adding a filter object within the analysis object. The syntax is the same as with analyzers.

PUT /analyzers_test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "english_stop": {
          "type": "standard",
          "stopwords": "_english_"
        }
      },
      "filter": {
        "my_stemmer": {
          "type": "stemmer",
          "name": "english"
        }
      }
    }
  }
}

Running the above request will create the index along with a customized version of the standard analyzer and a customized token filter. You can do exactly the same thing for tokenizers and character filters by adding a tokenizer or char_filter key, respectively.
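As a quick sketch of what that looks like (the index and the names my_ngram and my_mapping are my own examples, not part of the index created above), here is an ngram tokenizer and a mapping character filter configured in the same way:

PUT /more_analysis_config
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_ngram": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 3
        }
      },
      "char_filter": {
        "my_mapping": {
          "type": "mapping",
          "mappings": [ ":) => happy" ]
        }
      }
    }
  }
}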

Let’s now test that the analyzer works as we would expect by using the Analyze API.

POST /existing_analyzer_config/_analyze
{
  "analyzer": "english_stop",
  "text": "I'm in the mood for drinking semi-dry red wine!"
}
{
  "tokens": [
    {
      "token": "i'm",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "mood",
      "start_offset": 11,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "drinking",
      "start_offset": 20,
      "end_offset": 28,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "semi",
      "start_offset": 29,
      "end_offset": 33,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "dry",
      "start_offset": 34,
      "end_offset": 37,
      "type": "<ALPHANUM>",
      "position": 7
    },
    {
      "token": "red",
      "start_offset": 38,
      "end_offset": 41,
      "type": "<ALPHANUM>",
      "position": 8
    },
    {
      "token": "wine",
      "start_offset": 42,
      "end_offset": 46,
      "type": "<ALPHANUM>",
      "position": 9
    }
  ]
}

As you can see in the results, the stop words have indeed been filtered out, so everything looks good.

Let’s test that the token filter works as intended as well, just for good measure.

POST /analyzers_test/_analyze
{
  "tokenizer": "standard",
  "filter": [ "my_stemmer" ],
  "text": "I'm in the mood for drinking semi-dry red wine!"
}
{
  "tokens": [
    {
      "token": "I'm",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "in",
      "start_offset": 4,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "the",
      "start_offset": 7,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "mood",
      "start_offset": 11,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "for",
      "start_offset": 16,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "drink",
      "start_offset": 20,
      "end_offset": 28,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "semi",
      "start_offset": 29,
      "end_offset": 33,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "dry",
      "start_offset": 34,
      "end_offset": 37,
      "type": "<ALPHANUM>",
      "position": 7
    },
    {
      "token": "red",
      "start_offset": 38,
      "end_offset": 41,
      "type": "<ALPHANUM>",
      "position": 8
    },
    {
      "token": "wine",
      "start_offset": 42,
      "end_offset": 46,
      "type": "<ALPHANUM>",
      "position": 9
    }
  ]
}

As we can see in the results, the word "drinking" has been stemmed to "drink," so the token filter works. Note that the stop words are still present this time, because this request used only the standard tokenizer and the my_stemmer filter, not the english_stop analyzer.

And that’s how you can configure built-in analyzers, token filters, and more. I haven’t shown you how to apply this analyzer to a field yet, but I will get to that soon. First, let’s see how we can create a custom analyzer from scratch.

Bo Andersen

About the Author

I am a back-end web developer with a passion for open source technologies. I have been a PHP developer for many years, and also have experience with Java and Spring Framework. I currently work full time as a lead developer. Apart from that, I also spend time on making online courses, so be sure to check those out!
