Introduction to Elasticsearch

Elasticsearch (ES) is a search and analytics engine built on Apache Lucene. It is open source and written in Java. Elasticsearch lets you store, search, and analyze large volumes of data quickly and in near real time. It achieves fast search responses because, instead of scanning the text directly, it searches an index. It organizes data as documents rather than tables and rigid schemas, and it comes with extensive REST APIs for storing and searching that data. In short, you can think of Elasticsearch as a server that accepts JSON requests and returns JSON responses.

Features of Elasticsearch

Open-source search server written in Java
Can index heterogeneous data of many kinds
REST API with JSON input and output
Full-text search
Near real-time (NRT) search
Schema-free, distributed document store based on REST and JSON

Basic Architecture of Elasticsearch

1. Cluster :

An Elasticsearch cluster is a group of one or more nodes connected to each other. A cluster provides indexing and search capabilities across all of its nodes for the entire data set. In other words, it is a group of systems that run the Elasticsearch engine together.

2. Node :

A node is a running instance of Elasticsearch that stores data. An Elasticsearch node can be configured in the following ways :

Master Node — Controls the Elasticsearch cluster and is responsible for all cluster-wide operations like creating/deleting an index and adding/removing nodes.
Data Node — Stores data and executes data-related operations such as search and aggregation.
Client Node — Forwards cluster requests to the master node and data-related requests to data nodes.

3. Documents :

A document is the basic unit of data that is indexed in Elasticsearch, expressed in JSON format. A document can hold structured or unstructured data. Documents are comparable to rows in a relational database.

4. Index :

An index is a collection of documents. It is the unit against which you index, search, update, and delete documents. An index can be compared to a database in a relational database system.

5. Types :

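A type can be compared to a table in a relational database. Each type has a list of fields that can be specified for documents of that type. The mapping defines how each field in the documents is analyzed. Note that mapping types were deprecated in Elasticsearch 7.0 and removed in 8.0; the examples in this post use the single default type, _doc.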

6. Fields :

Each document is a collection of fields, which are key-value pairs. A field's value can be of a type such as text, numeric, or date. Fields can be compared to columns in a relational database.

7. Shards :

An index can potentially store a large amount of data that exceeds the hardware limits of a single node, or that is too slow to search from a single node alone. To solve this problem, Elasticsearch lets you subdivide an index into multiple partitions known as shards. Each shard is in itself a fully functional and independent "index" that can be hosted on any node in the cluster. Sharding allows you to distribute and parallelize operations across shards, thus increasing performance and throughput.
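The idea of spreading documents across shards can be sketched in a few lines of Python. Elasticsearch routes a document by hashing its routing value (the document ID by default) and taking the result modulo the number of primary shards. The function name pick_shard and the use of zlib.crc32 are assumptions made for this illustration; Elasticsearch itself uses a different hash function internally.

```python
import zlib


def pick_shard(doc_id: str, number_of_shards: int) -> int:
    """Map a document ID to a primary shard number (illustrative only)."""
    if number_of_shards < 1:
        raise ValueError("number_of_shards must be at least 1")
    # crc32 is a stand-in hash: it is stable across runs,
    # unlike Python's built-in hash() for strings.
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_shards


if __name__ == "__main__":
    # The same ID always lands on the same shard for a given shard count.
    for doc_id in ["1000", "1001", "1002", "1003"]:
        print(doc_id, "-> shard", pick_shard(doc_id, 3))
```

Because the shard number depends on the shard count, changing the count would send existing IDs to different shards, which is one reason the number of primary shards is normally fixed when the index is created.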

8. Replicas :

A replica is a copy of a primary shard. Replicas provide redundant copies of your data to protect against hardware failure, and they increase the capacity to serve read requests such as searching or retrieving a document.

9. Inverted Index :

Elasticsearch uses a data structure called an inverted index, which allows very fast full-text searches. An inverted index lists every unique word that appears in any document and identifies all of the documents each word occurs in.
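As a rough illustration of the data structure described above, the following Python sketch builds a tiny inverted index in memory. The tokenize, build_inverted_index, and search names, the sample documents, and the simplistic lowercase tokenizer are all assumptions made for this example; real Lucene analyzers and index formats are far more elaborate.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens (a very simplified analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map every unique token to the set of document IDs that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index: Dict[str, Set[str]], term: str) -> List[str]:
    """Look up a single term; the original texts are never scanned."""
    return sorted(index.get(term.lower(), set()))


docs = {
    "1": "Elasticsearch is a search engine",
    "2": "Lucene is a search library",
    "3": "Elasticsearch is built on Lucene",
}
index = build_inverted_index(docs)
print(search(index, "search"))         # ['1', '2']
print(search(index, "elasticsearch"))  # ['1', '3']
```

A lookup is a single dictionary access per term, no matter how long the documents are, which is where the speed of full-text search comes from.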

Working of Elasticsearch

Raw data flows into Elasticsearch from a variety of sources, including logs, system metrics, and web applications. Data ingestion is the process by which raw data is parsed, normalized, and enriched before it is indexed in Elasticsearch. An Elasticsearch index is a collection of documents that are related to each other. Elasticsearch stores data as JSON documents. Each document correlates a set of keys (names of fields or properties) with their corresponding values (strings, numbers, Booleans, dates, arrays of values, geolocations, or other types of data).

Elasticsearch uses a data structure called an inverted index. An inverted index maps each unique word (token) to the list of documents that contain it, which allows very fast full-text searches. During the indexing process, Elasticsearch stores documents and builds an inverted index to make the document data searchable in near real time. Index data is stored in one or more partitions called shards. Elasticsearch can distribute and allocate shards dynamically to the nodes in a cluster, as well as replicate them. Having multiple nodes and replicas can increase query performance, because searches can be served in parallel by several shard copies.

Creating an Index

When you create an index with a PUT request, you can pass a request body that defines how the index should be set up. Values you can specify in the body include:

Settings: The configuration options for the index, such as the number of shards and replicas.
Mappings: The mapping for the fields in the index. For each field, a mapping can specify the following:

The field name
The data type
The mapping parameter

curl -XPUT -H "Content-Type: application/json" "http://localhost:9200/employee?pretty" -d '
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 1
    }
  },
  "mappings": {
    "properties": {
      "age": {
        "type": "long"
      },
      "experienceInYears": {
        "type": "long"
      },
      "name": {
        "type": "text"
      }
    }
  }
}'

In the above PUT request we create an index named employee with one primary shard and one replica of that shard. The mapping defines three fields: age, experienceInYears, and name.

The response received after running the above request is :

{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "index" : "employee"
}
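The same create-index request can also be assembled programmatically. The sketch below uses only the Python standard library and builds the request without sending it, so it runs without an Elasticsearch node. The helper name build_create_index_request is hypothetical, and the host and port are taken from the curl example above.

```python
import json
from urllib.request import Request


def build_create_index_request(base_url, index_name, shards, replicas, properties):
    """Build (but do not send) a PUT request that creates an index."""
    body = {
        "settings": {
            "index": {"number_of_shards": shards, "number_of_replicas": replicas}
        },
        "mappings": {"properties": properties},
    }
    return Request(
        url=f"{base_url}/{index_name}?pretty",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_create_index_request(
    "http://localhost:9200",
    "employee",
    shards=1,
    replicas=1,
    properties={
        "age": {"type": "long"},
        "experienceInYears": {"type": "long"},
        "name": {"type": "text"},
    },
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# To actually send it (requires a running Elasticsearch node):
# from urllib.request import urlopen
# print(urlopen(request).read().decode("utf-8"))
```

Building the body from a dictionary and serializing it with json.dumps avoids hand-written JSON mistakes such as unbalanced braces.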

Inserting Data

You can insert data using the PUT or POST request. We use PUT when we want to specify the id of the data item and POST when we want Elasticsearch to generate an id for the data item.

curl -XPUT -H "Content-Type: application/json" "http://localhost:9200/employee/_doc/1000?pretty" -d '
{
"name": "Steve",
"age": 23,
"experienceInYears": 1
}'

The response will be:

{
  "_index" : "employee",
  "_type" : "_doc",
  "_id" : "1000",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}
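The choice between PUT and POST described above can be captured in a small helper. This is a minimal sketch that only builds the request object and does not contact a server. The function name build_index_document_request is hypothetical; the /_doc paths follow the curl example above.

```python
import json
from urllib.request import Request


def build_index_document_request(base_url, index_name, document, doc_id=None):
    """Use PUT when the caller supplies an ID, POST when Elasticsearch should generate one."""
    if doc_id is None:
        method, path = "POST", f"/{index_name}/_doc"
    else:
        method, path = "PUT", f"/{index_name}/_doc/{doc_id}"
    return Request(
        url=base_url + path,
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )


employee = {"name": "Steve", "age": 23, "experienceInYears": 1}

with_id = build_index_document_request("http://localhost:9200", "employee", employee, doc_id="1000")
without_id = build_index_document_request("http://localhost:9200", "employee", employee)

print(with_id.get_method(), with_id.full_url)        # PUT http://localhost:9200/employee/_doc/1000
print(without_id.get_method(), without_id.full_url)  # POST http://localhost:9200/employee/_doc
```

With the POST form, the generated ID is reported in the _id field of the response.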

You can view all the data in an index using:

curl -XGET -H "Content-Type: application/json" "http://localhost:9200/employee/_search?pretty"

The response will be:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "employee",
        "_type" : "_doc",
        "_id" : "1000",
        "_score" : 1.0,
        "_source" : {
          "name" : "Steve",
          "age" : 23,
          "experienceInYears" : 1
        }
      }
    ]
  }
}

Updating Data

Elasticsearch automatically maintains a _version field on every document that you store in it.
When you send an update request, Elasticsearch does not modify the document in place: it writes an entirely new document with an incremented version number and marks the old document for deletion.

curl -XPOST -H "Content-Type: application/json" "http://localhost:9200/employee/_update/1000?pretty" -d '
{
"doc" : {
   "name": "Smith"
  }
}'

The response will be:

{
  "_index" : "employee",
  "_type" : "_doc",
  "_id" : "1000",
  "_version" : 2,
  "result" : "updated",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 1,
  "_primary_term" : 1
}
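The version bookkeeping seen in this response can be mimicked with a toy in-memory model. The ToyVersionedStore class below is purely hypothetical and says nothing about how Elasticsearch is implemented. It only illustrates the behavior described in this section: a partial update produces a new copy of the document with the next version number, and the previous copy is set aside for deletion.

```python
class ToyVersionedStore:
    """In-memory stand-in that mimics the _version behavior described above."""

    def __init__(self):
        self.live = {}        # doc_id -> (version, source)
        self.tombstones = []  # superseded copies, i.e. "marked for deletion"

    def index(self, doc_id, source):
        """Store a full document; an existing copy is superseded, not edited."""
        old = self.live.get(doc_id)
        version = old[0] + 1 if old else 1
        if old:
            self.tombstones.append((doc_id, *old))
        self.live[doc_id] = (version, dict(source))
        return version

    def update(self, doc_id, partial):
        """Merge a partial document into the current one and re-index the result."""
        _, source = self.live[doc_id]
        merged = {**source, **partial}
        return self.index(doc_id, merged)


store = ToyVersionedStore()
print(store.index("1000", {"name": "Steve", "age": 23, "experienceInYears": 1}))  # 1
print(store.update("1000", {"name": "Smith"}))                                    # 2
print(store.live["1000"])
print(store.tombstones)
```

After the update, the live copy carries version 2 with the new name, while the version 1 copy sits in the tombstone list, mirroring the jump from "_version" : 1 to "_version" : 2 in the responses above.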

Deleting Data

You can delete your data using the DELETE request. In the below request we are removing a single document from the index by ID.

curl -XDELETE -H "Content-Type: application/json" "http://localhost:9200/employee/_doc/1000?pretty"

The response will be:

{
  "_index" : "employee",
  "_type" : "_doc",
  "_id" : "1000",
  "_version" : 3,
  "result" : "deleted",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 2,
  "_primary_term" : 1
}

Conclusion

Elasticsearch is a distributed, RESTful search and analytics engine capable of solving a wide variety of problems. This blog covered the basic terminology and the CRUD operations in Elasticsearch.