Elasticsearch notes

Elasticsearch is a real-time distributed search and analytics engine built on top of Apache Lucene.

Installation

brew install elasticsearch
curl http://localhost:9200/?pretty

API

Default port: 9200.

Examples

const axios = require('axios')
const co = require('co')


const request = ({uri, method, body=null}) => {
  console.log(`> ${method} ${uri}`)
  if (body) {
    console.log(JSON.stringify(body, null, 2))
  }
  return axios.request({
    method: method,
    url: `http://localhost:9200${uri}`,
    data: body
  }).then(response => ({
    body: JSON.stringify(response.data, null, 2),
    headers: JSON.stringify(response.headers, null, 2),
    status: response.status
  }))
}

co(function *() {
  let response = yield request({
    uri: '/',
    method: 'GET'
  })
  console.log(response.body)
  // console.log(response.headers)
  // console.log(response.status)
}).catch(error => console.log(error))

Elasticsearch status

GET /
{
  "name": "Angel Salvadore",
  "cluster_name": "elasticsearch_nanvel",
  "version": {
    "number": "2.3.2",
    "build_hash": "b9e4a6acad4008027e4038f6abed7f7dba346f94",
    "build_timestamp": "2016-04-21T16:03:47Z",
    "build_snapshot": false,
    "lucene_version": "5.5.0"
  },
  "tagline": "You Know, for Search"
}

Response headers and status code:

{
  "content-type": "application/json; charset=UTF-8",
  "content-length": "331"
}
200

Cluster health

GET /_cluster/health
{
  "cluster_name": "elasticsearch_nanvel",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 1,
  "number_of_data_nodes": 1,
  "active_primary_shards": 5,
  "active_shards": 5,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 5,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 50
}

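Status names:

green - all primary and replica shards are active
yellow - all primary shards are active, but not all replica shards are
red - not all primary shards are active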

Index a document

PUT /myindex/mytype/1
{
  "attr1": "val1",
  "attr2": 42
}
{
  "_index": "myindex",
  "_type": "mytype",
  "_id": "1",
  "_version": 1,
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "created": true
}

Elasticsearch creates an index and type automatically if they don't exist.

Check whether a document exists

HEAD /myindex/mytype/1

The response status code is 200 if the document was found and 404 otherwise.
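
A minimal sketch of this check in JavaScript. The `send` parameter is a hypothetical transport function (not part of any library) that performs the request and resolves with an object carrying a numeric `status`.

```javascript
// Resolves to true when the HEAD request returns 200 and to false on 404.
// `send` is a hypothetical transport: ({method, uri}) => Promise<{status}>.
const documentExists = async (send, uri) => {
  const response = await send({method: 'HEAD', uri})
  if (response.status === 200) return true
  if (response.status === 404) return false
  // Anything else (for example a 5xx) is not an answer to "does it exist?".
  throw new Error(`Unexpected status ${response.status} for HEAD ${uri}`)
}
```

By default axios rejects the promise on non-2xx responses, so a transport built on it would need `validateStatus: () => true` in the request config, or would have to read `error.response.status` in a catch handler.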

Get a document

GET /myindex/mytype/1
{
  "_index": "myindex",
  "_type": "mytype",
  "_id": "1",
  "_version": 1,
  "found": true,
  "_source": {
    "attr1": "val1",
    "attr2": 42
  }
}

Delete a document

DELETE /myindex/mytype/1
{
  "found": true,
  "_index": "myindex",
  "_type": "mytype",
  "_id": "1",
  "_version": 2,
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  }
}

Search lite

Search Lite expects all search parameters to be passed in the query string.

GET /myindex/mytype/_search?q=attr2:42
{
  "took": 42,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.30685282,
    "hits": [
      {
        "_index": "myindex",
        "_type": "mytype",
        "_id": "1",
        "_score": 0.30685282,
        "_source": {
          "attr1": "val1",
          "attr2": 42
        }
      }
    ]
  }
}

Search with Query DSL

Query DSL is the flexible, powerful query language used by Elasticsearch.
It expects all search parameters to be passed in the JSON request body.

GET /myindex/mytype/_search
{
  "query": {
    "match": {
      "attr2": 42
    }
  }
}
{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.30685282,
    "hits": [
      {
        "_index": "myindex",
        "_type": "mytype",
        "_id": "1",
        "_score": 0.30685282,
        "_source": {
          "attr1": "val1",
          "attr2": 42
        }
      }
    ]
  }
}

There are two DSLs: the filter DSL and the query DSL.

Filter examples:

Query examples:

Filters are more efficient in most cases: they are cheap to calculate and easy to cache.

The goal of filters

The goal of filters is to reduce the number of documents that have to be examined by the query.

As a general rule, use query clauses for full-text search or for any condition that should affect the relevance score, and use filter clauses for everything else.

Available filters:

Available queries:

See also filtered query and combining queries together.

Filter order

More specific filters must be placed before less-specific filters in order to exclude as many documents as possible, as early as possible.
Cached filters are very fast, so they should be placed before filters that are not cacheable.

Validate a query

GET /myindex/mytype/_validate/query[?explain]
{
  "query": {

  }
}

Index settings

PUT /myindex2
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}
{
  "acknowledged": true
}

See also:

Specifying field mapping

Mapping attributes: index and analyzer ("whitespace", "simple", "english", ...).

The index attribute controls how the string will be indexed:

PUT /myindex
{
  "mappings": {
    "mytype": {
      "properties": {
        "mystr": {
          "type": "string",
          "analyzer": "english"
        },
        "mynumber": {
          "type": "long"
        }
      }
    }
  }
}
{
  "acknowledged": true
}

It is possible to add a new field to the mapping of an existing type with

PUT /myindex/_mapping/mytype

Queries

Returns all documents.

GET /_search

All types in the index/indexes:
GET /myindex/_search
GET /myindex,anotherindex/_search
GET /my*,another*/_search

Search type in the index:
GET /myindex/mytype/_search

Search type inside all indexes:
GET /_all/mytype/_search

GET /_search without a body is equivalent to:

GET /_search
{
  "query": {
    "match_all": {}
  }
}
{
  "took": 14,
  "timed_out": false,
  "_shards": {
    "total": 13,
    "successful": 13,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 1,
    "hits": [
      ...
    ]
  }
}

Exact match

GET /_search
{
  "query": {
    "match": {
      "attr2": 42
    }
  }
}
GET /myindex/mytype/_search
{
  "query": {
    "match": {
      "attr1": "ipsum"
    }
  }
}
{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.19178301,
    "hits": [
      {
        "_index": "myindex",
        "_type": "mytype",
        "_id": "_search",
        "_score": 0.19178301,
        "_source": {
          "attr1": "lorem ipsum ...",
          "attr2": 1
        }
      }
    ]
  }
}

See the languages Elasticsearch supports.

For language detection see chromium-compact-language-detector.

{
  "query": {
    "match_phrase": {
      "attr1": "lorem ipsum"
    }
  }
}

Same as:

{
  "query": {
    "match": {
      "attr1": {
        "query": "lorem ipsum",
        "type": "phrase"
      }
    }
  }
}

The match_phrase query first analyses the query string to produce a list of terms. It then searches for all the terms, but keeps only documents that contain all of the search terms, in the same position relative to each other.
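
The idea can be sketched in plain JavaScript. This is only an illustration of positional matching, not how Elasticsearch implements it; `toTerms` is a toy stand-in for a real analyzer and both function names are made up for this example.

```javascript
// Toy analyzer: lowercases the text and splits it on non-alphanumeric characters.
const toTerms = (text) => text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean)

// True when all query terms occur in the field consecutively and in the same order.
const phraseMatches = (fieldText, phrase) => {
  const fieldTerms = toTerms(fieldText)
  const queryTerms = toTerms(phrase)
  if (queryTerms.length === 0) return false
  for (let start = 0; start + queryTerms.length <= fieldTerms.length; start++) {
    if (queryTerms.every((term, offset) => fieldTerms[start + offset] === term)) {
      return true
    }
  }
  return false
}

console.log(phraseMatches('Lorem ipsum dolor sit amet', 'lorem ipsum')) // true
console.log(phraseMatches('ipsum dolor lorem', 'lorem ipsum'))          // false
```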

Wildcard queries

Wildcards available: ? matches any single character, * matches zero or more characters.

{
  "query": {
    "wildcard": {
      "postcode": "w?F*HW"
    }
  }
}
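
To get a feel for which terms a pattern accepts, the wildcard syntax (? for exactly one character, * for zero or more) can be translated to a regular expression. A sketch only: `wildcardToRegExp` is a made-up helper, the comparison is case-sensitive, and the postcodes are sample values.

```javascript
// Converts a wildcard pattern into an anchored RegExp.
const wildcardToRegExp = (pattern) => {
  // Escape regex metacharacters, leaving the two wildcard characters alone.
  const escaped = pattern.replace(/[.+^${}()|[\]\\]/g, '\\$&')
  const source = escaped.replace(/\*/g, '.*').replace(/\?/g, '.')
  return new RegExp(`^${source}$`)
}

const postcode = wildcardToRegExp('W?F*HW')
console.log(postcode.test('W1F 7HW')) // true
console.log(postcode.test('W1V 3DG')) // false
```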

See also regexp query.

Fuzzy query

The fuzzy query is the fuzzy equivalent of the term query. See fuzzy query documentation for details.

Combining multiple clauses

Clauses can be as follows:

Compound clause:

{
  "bool": {
    "must": {},
    "must_not": {},
    "should": {}
  }
}
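
A filled-in version of the skeleton, as a sketch. Documents have to match `must`, have to not match `must_not`, and score higher when they also match `should`. The builder name and the field values are illustrative.

```javascript
// Builds a `bool` clause, omitting the sections that were not supplied.
const boolClause = ({must, mustNot, should} = {}) => {
  const bool = {}
  if (must) bool.must = must
  if (mustNot) bool.must_not = mustNot
  if (should) bool.should = should
  return {bool}
}

// Hypothetical field values, for illustration only.
const clause = boolClause({
  must: {match: {attr1: 'lorem'}},
  mustNot: {match: {attr1: 'dolor'}},
  should: [{match: {attr2: 42}}]
})
console.log(JSON.stringify({query: clause}, null, 2))
```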

Filtering a query

{
  "query": {
    "filtered": {
      "query": {

      },
      "filter": {

      }
    }
  }
}
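
A sketch of the same structure with both parts filled in. The helper name and the field values are made up for illustration. Note that in the 2.x series the `filtered` query is marked as deprecated in favor of a `bool` query with a `filter` clause.

```javascript
// Wraps a scoring query and a non-scoring filter in a `filtered` query body.
const filteredQuery = (query, filter) => ({
  query: {
    filtered: {query, filter}
  }
})

// Hypothetical values: full-text match on attr1, restricted to a range of attr2.
const filteredBody = filteredQuery(
  {match: {attr1: 'lorem ipsum'}},
  {range: {attr2: {gte: 1, lte: 42}}}
)
console.log(JSON.stringify(filteredBody, null, 2))
```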

Filtering multiple values:

{
  "query": {
    "filtered": {
      "filter": {
        "terms": {
          "price": [20, 30]
        }
      }
    }
  }
}

A query as a filter

{
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": {

          },
          "query": {

          }
        }
      }
    }
  }
}

Boosting

{
  "query": {
    "bool": {
      "should": {
        "match": {
          "myfield": {
            "query": "some query",
            "boost": 2
          }
        }
      }
    }
  }
}

Practically, there is no simple formula for deciding on the "correct" boost value for a particular query clause. It's a matter of try-it-and-see.

It is possible to boost an index. The boosting logic can be much more intelligent; refer to the documentation for details.

Sorting / ordering

By default, Elasticsearch orders matching results by their relevance score.

{
  "query": {

  },
  "sort": {
    "myfield": {
      "order": "desc"
    }
  }
}

Multilevel sorting:

{
  "query": {

  },
  "sort": [
    {
      "myfield": {
        "order": "desc"
      }
    },
    {
      "_score": {
        "order": "desc"
      }
    }
  ]
}

Sorting on Multivalue Fields (arrays).

Pagination

Use the size and from parameters.
size is the number of results to return; it defaults to 10.
from is the number of initial results to skip; it defaults to 0.

Deep pagination is inefficient in Elasticsearch. Keep (from + size) under 1000.
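
A small sketch that turns a page number into from and size and enforces the limit suggested above. The helper name is made up, and the 1000 ceiling is the rule of thumb from these notes, not an Elasticsearch constant.

```javascript
const MAX_WINDOW = 1000 // rule of thumb from these notes

// Converts a 1-based page number into the `from` and `size` search parameters.
const pageParams = (page, perPage = 10) => {
  if (!Number.isInteger(page) || page < 1) {
    throw new RangeError('page must be a positive integer')
  }
  if (!Number.isInteger(perPage) || perPage < 1) {
    throw new RangeError('perPage must be a positive integer')
  }
  const from = (page - 1) * perPage
  if (from + perPage >= MAX_WINDOW) {
    throw new RangeError(`from + size must stay under ${MAX_WINDOW}; consider the scroll API`)
  }
  return {from, size: perPage}
}

console.log(pageParams(3, 20)) // { from: 40, size: 20 }
```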

Aggregation

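Two main concepts:

buckets - collections of documents that meet a criterion
metrics - statistics calculated on the documents in a bucket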

GET /myindex/mytype/_search?search_type=count
{
  "aggs": {
    "aggname": {
      "terms": {
        "field": "myfield"
      }
    }
  }
}

Elasticsearch supports nested aggregations and combining aggregations and search.

Features

Highlight

Highlights fragments from the original text. See Highlighting documentation for details.

Aggregations

Aggregations let you generate sophisticated analytics over your data.

Geolocation

Elasticsearch allows us to combine geolocation with full-text search, structured search, and analytics.

There are four geo-point filters available:

There are a lot of geolocation search optimizations including geohashes and geoshapes.

Relations

Elasticsearch, like most NoSQL databases, treats the world as though it were flat.

The "flat world" advantages:

If relations are required, consider these techniques:

Best practices

Index per time period (for time based data)

The period may be one day or one month, for example.
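
A sketch of deriving the index name from a document's timestamp. The `prefix-YYYY.MM.DD` naming is an assumption here (it mirrors the convention Logstash uses for daily indexes); the helper names are made up.

```javascript
const pad = (value) => String(value).padStart(2, '0')

// Returns the name of the index a document dated `date` belongs to.
// period is 'day' or 'month'; the date is interpreted in UTC.
const indexNameFor = (prefix, date, period = 'day') => {
  const year = date.getUTCFullYear()
  const month = pad(date.getUTCMonth() + 1)
  if (period === 'month') return `${prefix}-${year}.${month}`
  if (period === 'day') return `${prefix}-${year}.${month}.${pad(date.getUTCDate())}`
  throw new Error(`Unsupported period: ${period}`)
}

console.log(indexNameFor('logs', new Date(Date.UTC(2016, 3, 21))))          // logs-2016.04.21
console.log(indexNameFor('logs', new Date(Date.UTC(2016, 3, 21)), 'month')) // logs-2016.04
```

Old periods can then be dropped by deleting whole indexes, and searches can target several periods at once with a pattern such as /logs-2016.04.*/_search.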

Index templates

Index templates can be used to control which settings should be applied to newly created indexes. See Index templates.

Index the same data into multiple fields (use different analysis)

A common technique for fine-tuning relevance is to index the same data into multiple fields, each with its own analysis chain.

Dealing with redundant data

PUT /myindex
{
  "mappings": {
    "user": {
      "properties": {
        "first_name": {
          "type": "string",
          "copy_to": "full_name"
        },
        "last_name": {
          "type": "string",
          "copy_to": "full_name"
        },
        "full_name": {
          "type": "string"
        }
      }
    }
  }
}

A search-time solution is also available.

Use scroll with deep pagination

The scroll API can be used to retrieve large numbers of results (or even all results) from a single search request. See Scroll documentation.
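
A sketch of the scroll loop, assuming the 2.x request shapes: the first search carries a `scroll` keep-alive in the query string, and each following request posts the last `_scroll_id` to /_search/scroll until a page comes back empty. `send` is a hypothetical transport that resolves with the parsed JSON response; it is not the `request` helper from the examples above, which returns the body as a string.

```javascript
// Collects every hit for `query` in `index` by following the scroll cursor.
// `send` is hypothetical: ({method, uri, body}) => Promise<parsed JSON response>.
const scrollAll = async (send, index, query, {keepAlive = '1m', size = 100} = {}) => {
  const hits = []
  let page = await send({
    method: 'GET',
    uri: `/${index}/_search?scroll=${keepAlive}`,
    body: {query, size}
  })
  // An empty page means the cursor is exhausted.
  while (page.hits.hits.length > 0) {
    hits.push(...page.hits.hits)
    page = await send({
      method: 'GET',
      uri: '/_search/scroll',
      body: {scroll: keepAlive, scroll_id: page._scroll_id}
    })
  }
  return hits
}
```

Collecting everything in memory is only reasonable for moderately sized result sets; for large ones, process each page inside the loop instead.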

Field-level index-time boost

Don't use it, use query-time boost instead. Query-time boosting is a much simpler, cleaner, more flexible option.

Capacity planning (how many shards do I need?)

There are too many variables: hardware, data size, document complexity, queries, aggregations, etc.

Try to play with a single server node:

Push this single shard until it "breaks". Once you define the capacity of a single shard, it is easy to find the number of primary shards required.
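
The final step is simple arithmetic. A sketch with made-up numbers; the helper name is illustrative, and a real estimate should leave headroom for growth.

```javascript
// Number of primary shards needed to hold the expected data,
// given the measured capacity of a single shard (same unit for both).
const primaryShardsNeeded = (expectedData, singleShardCapacity) => {
  if (!(expectedData >= 0) || !(singleShardCapacity > 0)) {
    throw new RangeError('expectedData must be >= 0 and singleShardCapacity must be > 0')
  }
  return Math.max(1, Math.ceil(expectedData / singleShardCapacity))
}

// Hypothetical: 120 GB of data, one shard "breaks" at about 50 GB.
console.log(primaryShardsNeeded(120, 50)) // 3
```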

Configuration

Change cluster.name (elasticsearch.yml) to stop your nodes from trying to join another cluster on the same network with the same name.

Index rename or update

Use Index aliases.
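
A sketch of the usual pattern: the application talks to an alias, and a reindexed copy is put behind the alias with a single POST /_aliases request. The helper and the index names are hypothetical.

```javascript
// Body for POST /_aliases that moves `alias` from one index to another in one request.
const swapAliasBody = (alias, fromIndex, toIndex) => ({
  actions: [
    {remove: {index: fromIndex, alias}},
    {add: {index: toIndex, alias}}
  ]
})

console.log(JSON.stringify(swapAliasBody('myindex', 'myindex_v1', 'myindex_v2'), null, 2))
```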

9200 port forwarding

ssh -L 9201:localhost:9200 <server user>@<server ip>

server:9200 -> localhost:9201.

Vocabulary

A cluster

Is a group of nodes with the same cluster.name.

One node in the cluster is elected to be the master node, which is in charge of managing cluster-wide changes (does not need to be involved in document-level changes or search).

A node

Is a running instance of Elasticsearch.

A shard

A low-level unit, hosted on a node, that holds a slice of all the data in the index.

The algorithm used to route documents to shards:

shard = hash(routing) % number_of_primary_shards

That's why we can't increase the number of primary shards for an existing index: the same routing value would map to a different shard.
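
A sketch of the routing formula. The hash below is a simple stand-in chosen for illustration (Elasticsearch uses its own hash function), and the routing value defaults to the document id.

```javascript
// Stand-in string hash, for illustration only.
const hashString = (text) => {
  let hash = 0
  for (const char of text) {
    hash = (Math.imul(hash, 31) + char.codePointAt(0)) >>> 0
  }
  return hash
}

// shard = hash(routing) % number_of_primary_shards
const shardFor = (routing, numberOfPrimaryShards) =>
  hashString(String(routing)) % numberOfPrimaryShards

console.log(shardFor('1', 5)) // 4
console.log(shardFor('1', 6)) // 1
```

With this stand-in hash, document 1 lands on shard 4 of 5 but would be looked up on shard 1 of 6, which shows why changing the shard count would make existing documents unreachable.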

An index

A logical namespace that points to one or more physical shards.

Analysis

How full text is processed to make it searchable.

Analysis consists of:

Built-in analyzers:

GET /_analyze?analyzer=standard
"Some text."
{
  "tokens": [
    {
      "token": "some",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "text",
      "start_offset": 5,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}

An analyzer is a wrapper that combines three functions into a single package: character filters, a tokenizer, and token filters.
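
The three stages (character filters, then a tokenizer, then token filters) can be sketched as a small pipeline. This is a toy model, not Elasticsearch code; all names are made up, and the tokenizer only recognises ASCII letters and digits.

```javascript
// Runs text through character filters, a tokenizer, and token filters, in that order.
const analyze = (text, {charFilters = [], tokenizer, tokenFilters = []}) => {
  const filteredText = charFilters.reduce((value, filter) => filter(value), text)
  const tokens = tokenizer(filteredText)
  return tokenFilters.reduce((value, filter) => filter(value), tokens)
}

// Character filter: works on the raw string before tokenization.
const ampersandToAnd = (text) => text.replace(/&/g, ' and ')

// Tokenizer: splits the string into tokens and records offsets and positions.
const wordTokenizer = (text) => {
  const tokens = []
  const pattern = /[A-Za-z0-9]+/g
  let match
  while ((match = pattern.exec(text)) !== null) {
    tokens.push({
      token: match[0],
      start_offset: match.index,
      end_offset: match.index + match[0].length,
      position: tokens.length
    })
  }
  return tokens
}

// Token filter: works on the token stream.
const lowercase = (tokens) => tokens.map(item => ({...item, token: item.token.toLowerCase()}))

const toyAnalyzer = {charFilters: [ampersandToAnd], tokenizer: wordTokenizer, tokenFilters: [lowercase]}
console.log(analyze('Some text.', toyAnalyzer).map(item => item.token)) // [ 'some', 'text' ]
```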

Creating a custom analyzer:

PUT /myindex
{
  "settings": {
    "analysis": {
      "char_filter": {

      },
      "tokenizer": {

      },
      "filter": {

      },
      "analyzer": {

      }
    }
  }
}

Document id

The id is a string that, when combined with the _index and _type, uniquely identifies a document in Elasticsearch.

Document type

In Elasticsearch, a document belongs to a type, and those types live inside an index.

Parallels to a traditional relational database:

Every type has its own mapping or schema definition.
Every field in a document is indexed and can be queried.

Full-text search

Finds all documents matching the search keywords, and returns them ordered by relevance.

Full-text search is a battle between precision - returning as few irrelevant documents as possible, and recall - returning as many relevant documents as possible.

How far apart

The number of times a term has to be moved in order to make the query and the document match.

Indexing

The act of storing data in Elasticsearch.

Primary vs replica shards

The number of primary shards in an index is fixed at the time the index is created (defaults to 5). The number of replica shards can be changed at any time (defaults to 1).

A replica shard is just a copy of a primary shard. Used to provide redundant copies of your data to protect against hardware failure, and to serve more read requests.

Any newly indexed document will first be stored on a primary shard, and then copied in parallel to the associated replica shard(s).

Inverted index

An inverted index consists of a list of all the unique words that appear in any document, and for each word, a list of the documents in which it appears.
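
A minimal sketch of building such an index in JavaScript. The tokenizer is a toy (lowercase, split on non-alphanumerics), and the documents are sample data.

```javascript
const tokenize = (text) => text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean)

// Maps each unique term to the list of ids of the documents that contain it.
const buildInvertedIndex = (documents) => {
  const index = new Map()
  for (const [id, text] of Object.entries(documents)) {
    // A Set ensures a document is listed once per term, however often the term occurs.
    for (const term of new Set(tokenize(text))) {
      if (!index.has(term)) index.set(term, [])
      index.get(term).push(id)
    }
  }
  return index
}

const invertedIndex = buildInvertedIndex({
  1: 'The quick brown fox',
  2: 'Quick brown foxes leap'
})
console.log(invertedIndex.get('quick')) // [ '1', '2' ]
console.log(invertedIndex.get('fox'))   // [ '1' ]
```

Note that "fox" and "foxes" end up as different terms here; normalizing them to a common root is the job of analysis.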

Mapping

How the data in each field is interpreted.

Every type has its own mapping (schema definition).

Simple core field types:

Complex core field types:

Relevance

Relevance is the algorithm that we use to calculate how similar the contents of a full-text field are to a full-text query string. The standard similarity algorithm used in Elasticsearch is known as "Term Frequency / Inverse Document Frequency" (TF/IDF).

Term Frequency - how often does the term appear in the field.
Inverse Document Frequency - how often does each term appear in the index.

Relevance score - how well the document matches the query.
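
A rough sketch of the two factors, assuming the classic Lucene formulas tf = sqrt(frequency) and idf = 1 + ln(numDocs / (docFreq + 1)). The function names are illustrative, and the real score combines these with other factors such as field-length normalization.

```javascript
// Term frequency: the more often the term appears in the field, the higher the weight.
const termFrequency = (occurrencesInField) => Math.sqrt(occurrencesInField)

// Inverse document frequency: the more documents contain the term, the lower the weight.
const inverseDocumentFrequency = (numDocs, docFreq) =>
  1 + Math.log(numDocs / (docFreq + 1))

console.log(termFrequency(4))                          // 2
console.log(inverseDocumentFrequency(1, 1).toFixed(8)) // 0.30685282
```

The second value coincides with the _score of 0.30685282 in the search examples above, which would be consistent with a shard that holds a single document containing the term once.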

Instruments

Kibana - an open source analytics and visualization platform designed to work with Elasticsearch
Sense - a Cool JSON Aware Interface to Elasticsearch (Chrome plugin)
Marvel - enables you to easily monitor Elasticsearch through Kibana

ELK stack (for logging): Elasticsearch, Logstash, Kibana.

Clients:
Elasticsearch Python client
Elasticsearch Node.js client

Services:
Amazon Elasticsearch Service

Elasticsearch: The Definitive Guide by Clinton Gormley and Zachary Tong

Licensed under CC BY-SA 3.0