Elasticsearch notes
- Elasticsearch notes
- Installation
- API
- Queries
- Features
- Best practices
- Index per time period (for time based data)
- Index templates
- Index the same data into multiple fields (use different analysis)
- Dealing with redundant data
- Use scroll with deep pagination
- Field-level index-time boost
- Capacity planning (how many shards do I need?)
- Configuration
- Index rename or update
- 9200 port forwarding
- Vocabulary
- Instruments
- Links
Elasticsearch is a real-time distributed search and analytics engine built on top of Apache Lucene.
- Available under the Apache 2 license
- Created by Shay Banon (the first public release came out in February 2010)
- Provides a simple, coherent, RESTful API
- A distributed document store / search engine
- Horizontal scalability, high reliability
- Used by Wikipedia, The Guardian, Stack Overflow, GitHub, and many others
Installation
brew install elasticsearch
curl http://localhost:9200/?pretty
API
Default port: 9200.
Examples
const axios = require('axios')
const co = require('co')

const request = ({uri, method, body = null}) => {
  console.log(`> ${method} ${uri}`)
  if (body) {
    console.log(JSON.stringify(body, null, 2))
  }
  return axios.request({
    method: method,
    url: `http://localhost:9200${uri}`,
    data: body
  }).then(response => ({
    body: JSON.stringify(response.data, null, 2),
    headers: JSON.stringify(response.headers, null, 2),
    status: response.status
  }))
}

co(function *() {
  let response = yield request({ uri: '/', method: 'GET' })
  console.log(response.body)
  // console.log(response.headers)
  // console.log(response.status)
}).catch(error => console.log(error))
Elasticsearch status
GET /
{ "name": "Angel Salvadore", "cluster_name": "elasticsearch_nanvel", "version": { "number": "2.3.2", "build_hash": "b9e4a6acad4008027e4038f6abed7f7dba346f94", "build_timestamp": "2016-04-21T16:03:47Z", "build_snapshot": false, "lucene_version": "5.5.0" }, "tagline": "You Know, for Search" }
Response headers and status code:
{ "content-type": "application/json; charset=UTF-8", "content-length": "331" } 200
Cluster health
GET /_cluster/health
{ "cluster_name": "elasticsearch_nanvel", "status": "yellow", "timed_out": false, "number_of_nodes": 1, "number_of_data_nodes": 1, "active_primary_shards": 5, "active_shards": 5, "relocating_shards": 0, "initializing_shards": 0, "unassigned_shards": 5, "delayed_unassigned_shards": 0, "number_of_pending_tasks": 0, "number_of_in_flight_fetch": 0, "task_max_waiting_in_queue_millis": 0, "active_shards_percent_as_number": 50 }
Status names:
- green: all primary and replica shards are active
- yellow: all primary shards are active, but not all replica shards are active
- red: not all primary shards are active
Index a document
PUT /myindex/mytype/1
{
  "attr1": "val1",
  "attr2": 42
}
{ "_index": "myindex", "_type": "mytype", "_id": "1", "_version": 1, "_shards": { "total": 2, "successful": 1, "failed": 0 }, "created": true }
Elasticsearch creates an index and type automatically if they don't exist.
Check whether a document exists
HEAD /myindex/mytype/1
Response status code == 200 if the document was found or 404 otherwise.
Get a document
GET /myindex/mytype/1
{ "_index": "myindex", "_type": "mytype", "_id": "1", "_version": 1, "found": true, "_source": { "attr1": "val1", "attr2": 42 } }
Delete a document
DELETE /myindex/mytype/1
{ "found": true, "_index": "myindex", "_type": "mytype", "_id": "1", "_version": 2, "_shards": { "total": 2, "successful": 1, "failed": 0 } }
Search lite
Search Lite expects all search parameters to be passed in the query string.
GET /myindex/mytype/_search?q=attr2:42
{ "took": 42, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 1, "max_score": 0.30685282, "hits": [ { "_index": "myindex", "_type": "mytype", "_id": "1", "_score": 0.30685282, "_source": { "attr1": "val1", "attr2": 42 } } ] } }
Search with Query DSL
Query DSL - the flexible, powerful query language used by Elasticsearch.
Expects all search parameters to be passed in the request body as JSON.
GET /myindex/mytype/_search
{
  "query": {
    "match": {
      "attr2": 42
    }
  }
}
{ "took": 6, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 1, "max_score": 0.30685282, "hits": [ { "_index": "myindex", "_type": "mytype", "_id": "1", "_score": 0.30685282, "_source": { "attr1": "val1", "attr2": 42 } } ] } }
There are two DSLs:
- query DSL (asks: how well does this document match?)
- filter DSL (gives a yes-or-no answer for each document, used only with exact values, does not calculate relevance)
Filter examples:
- the date is in range ...?
- does it contain the field?
- is the coordinates field within 10km of a specified point?
Query examples:
- full text search, best matching results
- documents containing specified tags - the more tags, the more relevant the document
Filters are more efficient in most cases; they are easy to calculate and cache.
The goal of filters
The goal of filters is to reduce the number of documents that have to be examined by the query.
As a general rule, use query clauses for full-text search or for any condition that should affect the relevance score, and use filter clauses for everything else.
Available filters (a combined example follows this list):
- term filter (filter by exact values)
- terms filter (allows specifying multiple exact values to match)
- range filter (find numbers or dates that fall into a specified range)
- exists and missing filters (the field has one or more values, or doesn't have any values)
- bool filter (used to combine multiple filter clauses)
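For example, a bool filter combining a term and a range clause might look like this (attr2 and price are the example fields used elsewhere in these notes):
GET /myindex/mytype/_search
{
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            { "term": { "attr2": 42 } },
            { "range": { "price": { "gte": 20, "lte": 30 } } }
          ]
        }
      }
    }
  }
}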
Available queries:
- match_all query (matches all documents)
- match query (use it for full-text or exact-value search)
- multi_match query (run the same match query on multiple fields; see the sketch below)
- bool query (combines multiple query clauses, calculates a relevance score)
See also filtered query and combining queries together.
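For example, a multi_match query running the same full-text query against several fields might look like this (attr3 is a hypothetical second field):
GET /myindex/mytype/_search
{
  "query": {
    "multi_match": {
      "query": "lorem ipsum",
      "fields": ["attr1", "attr3"]
    }
  }
}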
Filter order
More specific filters must be placed before less-specific filters in order to exclude as many documents as possible, as early as possible.
Cached filters are very fast, so they should be placed before filters that are not cacheable.
Validate a query
GET /myindex/mytype/_validate/query[?explain]
{
  "query": { }
}
Index settings
PUT /myindex2
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}
{ "acknowledged": true }
Specifying field mapping
Mapping attributes: index and analyzer ("whitespace", "simple", "english", ...).
The index attribute controls how the string will be indexed:
- analyzed (default): analyze -> index
- not_analyzed: index the value exactly as specified
- no: don't index the field
PUT /myindex { "mappings": { "mytype": { "properties": { "mystr": { "type": "string", "analyzer": "english" }, "mynumber": { "type": "long" } } } } }
{ "acknowledged": true }
It is possible to add a new field to an existing type's mapping with
PUT /myindex/_mapping/mytype
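For example, adding a hypothetical newfield as a not_analyzed string to the existing mytype mapping might look like this:
PUT /myindex/_mapping/mytype
{
  "properties": {
    "newfield": {
      "type": "string",
      "index": "not_analyzed"
    }
  }
}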
Queries
The empty search
Returns all documents.
GET /_search

All types in the index/indexes:
GET /myindex/_search
GET /myindex,anotherindex/_search
GET /my*,another*/_search

Search type in the index:
GET /myindex/mytype/_search

Search type inside all indexes:
GET /_all/mytype/_search
equal to:
GET /_search
{
  "query": {
    "match_all": {}
  }
}
{ "took": 14, "timed_out": false, "_shards": { "total": 13, "successful": 13, "failed": 0 }, "hits": { "total": 4, "max_score": 1, "hits": [ ... ] } }
Exact match
GET /_search
{
  "query": {
    "match": {
      "attr2": 42
    }
  }
}
Full text search
GET /myindex/mytype/_search
{
  "query": {
    "match": {
      "attr1": "ipsum"
    }
  }
}
{ "took": 6, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 1, "max_score": 0.19178301, "hits": [ { "_index": "myindex", "_type": "mytype", "_id": "_search", "_score": 0.19178301, "_source": { "attr1": "lorem ipsum ...", "attr2": 1 } } ] } }
See the languages Elasticsearch supports.
For language detection see chromium-compact-language-detector.
Phrase Search
{ "query": { "match_phrase": { "attr1": "lorem ipsum" } } }
Same as:
{ "query": { "match": { "attr1": "lorem ipsum", "type": "phrase" } } }
The match_phrase query first analyses the query string to produce a list of terms. It then searches for all the terms, but keeps only documents that contain all of the search terms, in the same position relative to each other.
Wildcard queries
Wildcards available:
- ? matches any single character
- * matches zero or more characters
{ "query": { "wildcard": { "postcode": "w?F*HW" } } }
See also regexp query.
Fuzzy query
The fuzzy query is the fuzzy equivalent of the term query. See fuzzy query documentation for details.
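For example, a fuzzy query tolerating a single edit ("ipsun" standing in for a misspelled search term) might look like this:
GET /myindex/mytype/_search
{
  "query": {
    "fuzzy": {
      "attr1": {
        "value": "ipsun",
        "fuzziness": 1
      }
    }
  }
}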
Combining multiple clauses
Clauses can be as follows:
- Leaf clause (match)
- Compound clause (combine other query clauses, including other compound clauses)
Compound clause:
{ "bool": { "must": {}, "must_not": {}, "should": {} } }
Filtering a query
{ "query": { "filtered": { "query": { }, "filter": { } } } }
Filtering multiple values:
{ "query": { "filtered": { "filter": { "terms": { "price": [20, 30] } } } } }
A query as a filter
{ "query": { "filtered": { "filter": { "bool": { "must": { }, "query": { } } } } } }
Boosting
{ "query": { "bool": { "should": { "match": { "myfield": { "query": "some query", "boost": 2 } } } } } }
Practically, there is no simple formula for deciding on the "correct" boost value for a particular query clause. It's a matter of try-it-and-see.
It is possible to boost an index. The boosting logic can be much more intelligent; refer to the documentation for details.
Sorting / ordering
By default, Elasticsearch orders matching results by their relevance score.
{ "query": { }, "sort": { "myfield": { "order": "desc" } } }
Multilevel sorting:
{ "query": { }, "sort": [ { "myfield": { "order": "desc" }, }, { "_score": { "order": "desc" } } ] }
Sorting on Multivalue Fields (arrays).
Pagination
Use the size and from parameters.
The size parameter indicates the number of results that should be returned (defaults to 10).
The from parameter indicates the number of initial results that should be skipped (defaults to 0).
Deep pagination is inefficient in Elasticsearch. Keep (from + size) under 1000.
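For example, requesting the third page of five results might look like this:
GET /myindex/mytype/_search
{
  "from": 10,
  "size": 5,
  "query": {
    "match_all": {}
  }
}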
Aggregation
Two main concepts:
- buckets: collections of documents that meet a criterion (similar to grouping in SQL)
- metrics: statistics calculated on the documents in a bucket (similar to count(), sum(), etc. in SQL)
GET /myindex/mytype/_search?search_type=count
{
  "aggs": {
    "aggname": {
      "terms": {
        "field": "myfield"
      }
    }
  }
}
Elasticsearch supports nested aggregations and combining aggregations and search.
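For example, a terms bucket with a nested avg metric might look like this (by_myfield and avg_attr2 are arbitrary aggregation names):
GET /myindex/mytype/_search?search_type=count
{
  "aggs": {
    "by_myfield": {
      "terms": { "field": "myfield" },
      "aggs": {
        "avg_attr2": {
          "avg": { "field": "attr2" }
        }
      }
    }
  }
}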
Features
Highlight
Highlights fragments from the original text. See Highlighting documentation for details.
Aggregations
Aggregations allow you to generate sophisticated analytics over your data.
Geolocation
Elasticsearch allows us to combine geolocation with full-text search, structured search, and analytics.
There are four geo-point filters available:
- geo_bounding_box: find geo-points that fall within the specified rectangle
- geo_distance: find geo-points within the specified distance of a central point
- geo_distance_range: find geo-points within a specified minimum and maximum distance from a central point
- geo_polygon: find geo-points that fall within the specified polygon (very expensive)
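For example, a geo_distance filter might look like this, assuming a hypothetical location field mapped as geo_point:
GET /myindex/mytype/_search
{
  "query": {
    "filtered": {
      "filter": {
        "geo_distance": {
          "distance": "10km",
          "location": {
            "lat": 40.73,
            "lon": -74.0
          }
        }
      }
    }
  }
}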
There are a lot of geolocation search optimizations including geohashes and geoshapes.
Relations
Elasticsearch, like most NoSQL databases, treats the world as though it were flat.
The "flat world" advantages:
- indexing is fast and lock-free
- searching is fast and lock-free
- massive amounts of data can be spread across multiple nodes, because each document is independent of the others
If relations are required, consider these techniques: application-side joins, data denormalization, nested objects, and parent-child relationships.
Best practices
Index per time period (for time based data)
It may be one index per day or per month, for example.
Index templates
Index templates can be used to control which settings should be applied to newly created indexes. See Index templates.
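For example, a template applied to every index whose name matches logs-* might look like this (the template name and fields are placeholders):
PUT /_template/mytemplate
{
  "template": "logs-*",
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "mytype": {
      "properties": {
        "timestamp": {
          "type": "date"
        }
      }
    }
  }
}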
Index the same data into multiple fields (use different analysis)
A common technique for fine-tuning relevance is to index the same data into multiple fields, each with its own analysis chain.
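For example, indexing the same string into an english-analyzed main field plus standard-analyzed and not_analyzed sub-fields might look like this:
PUT /myindex
{
  "mappings": {
    "mytype": {
      "properties": {
        "attr1": {
          "type": "string",
          "analyzer": "english",
          "fields": {
            "std": {
              "type": "string",
              "analyzer": "standard"
            },
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}
At search time you can then query attr1, attr1.std, or attr1.raw, depending on the analysis you want.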
Dealing with redundant data
PUT /myindex { "mappings": { "user": { "first_name": { "type": "string", "copy_to": "full_name" }, "last_name": { "type": "string", "copy_to": "full_name" }, "full_name": { "type": "string" } } } }
A search time solution also available.
Use scroll with deep pagination
The scroll API can be used to retrieve large numbers of results (or even all results) from a single search request. See Scroll documentation.
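For example, a scroll might be opened and continued like this (keep re-issuing the scroll request with the most recent _scroll_id until no more hits are returned; sorting by _doc is the most efficient order for scrolling):
GET /myindex/mytype/_search?scroll=1m
{
  "query": { "match_all": {} },
  "size": 1000,
  "sort": ["_doc"]
}

GET /_search/scroll
{
  "scroll": "1m",
  "scroll_id": "<the _scroll_id returned by the previous request>"
}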
Field-level index-time boost
Don't use it, use query-time boost instead. Query-time boosting is a much simpler, cleaner, more flexible option.
Capacity planning (how many shards do I need?)
There are too many variables: hardware, data size, document complexity, queries, aggregations, etc.
Try to play with a single server node:
- create a cluster consisting of a single server
- create an index with one primary shard and no replicas
- fill it with real documents
- run real queries and aggregations
Push this single shard until it "breaks". Once you define the capacity of a single shard, it is easy to find the number of primary shards required.
Configuration
Change cluster.name (in elasticsearch.yml) to stop your nodes from trying to join another cluster on the same network with the same name.
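For example (the values are placeholders):
# elasticsearch.yml
cluster.name: my_cluster_name
node.name: my_node_1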
Index rename or update
Use Index aliases.
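For example, atomically switching an alias from an old index to a new one might look like this (the versioned index names are placeholders):
POST /_aliases
{
  "actions": [
    { "remove": { "index": "myindex_v1", "alias": "myindex" } },
    { "add": { "index": "myindex_v2", "alias": "myindex" } }
  ]
}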
9200 port forwarding
ssh -L 9201:localhost:9200 <server user>@<server ip>
Forwards server:9200 to localhost:9201.
Vocabulary
A cluster
A group of nodes with the same cluster.name.
One node in the cluster is elected to be the master node, which is in charge of managing cluster-wide changes (does not need to be involved in document-level changes or search).
A node
Is a running instance of Elasticsearch.
A shard
A low-level worker unit hosted on a node; it holds a slice of all the data in the index.
The algorithm used to route documents to shards:
shard = hash(routing) % number_of_primary_shards
That's why we can't increase the number of shards for an existing index.
An index
A logical namespace that points to one or more physical shards.
Analysis
How full text is processed to make it searchable.
Analysis consists of:
- tokenizing a block of text into individual terms suitable for use in an inverted index
- normalizing these terms into a standard form to improve their "searchability"
Built-in analyzers:
- Standard analyzer (splits the text on word boundaries, removes most punctuation, and lowercases all terms)
- Simple analyzer (splits the text on anything that isn't a letter, and lowercases the terms)
- Whitespace analyzer (splits the text on whitespace, doesn't lowercase)
- Language analyzers (language-specific analyzers)
GET /_analyze?analyzer=standard "Some text."
{ "tokens": [ { "token": "some", "start_offset": 0, "end_offset": 4, "type": "<ALPHANUM>", "position": 0 }, { "token": "text", "start_offset": 5, "end_offset": 9, "type": "<ALPHANUM>", "position": 1 } ] }
Analyzer is a wrapper that combines three functions into a single package:
- character filters (remove HTML tags, etc.)
- tokenizers (break up a string into individual terms)
- token filters (change, add, or remove tokens)
Creating a custom analyzer:
PUT /myindex { "settings": { "analysis": { "char_filter": { }, "tokenizer": { }, "filter": { }, "analyzer": { } } } }
Document id
The _id is a string that, when combined with the _index and _type, uniquely identifies a document in Elasticsearch.
Document type
In Elasticsearch, a document belongs to a type, and those types live inside an index.
Parallels to a traditional relational database:
- Databases -> Indexes
- Tables -> Types
- Rows -> Documents
- Columns -> Fields
Every type has its own mapping or schema definition.
Every field in a document is indexed and can be queried.
Full-text search
Finds all documents matching the search keywords, and returns them ordered by relevance.
Full-text search is a battle between precision - returning as few irrelevant documents as possible, and recall - returning as many relevant documents as possible.
How far apart
How many times do you need to move a term in order to make the query and document match.
Indexing
The act of storing data in Elasticsearch.
Primary vs replica shards
The number of primary shards in an index is fixed at the time that an index is created (defaults to 5). The number of replica shards can be changed at any time (defaults to 1).
A replica shard is just a copy of a primary shard. Used to provide redundant copies of your data to protect against hardware failure, and to serve more read requests.
Any newly indexed document will first be stored on a primary shard, and then copied in parallel to the associated replica shard(s).
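For example, changing the number of replicas on an existing index might look like this:
PUT /myindex/_settings
{
  "number_of_replicas": 2
}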
Inverted index
An inverted index consists of a list of all the unique words that appear in any document, and for each word, a list of the documents in which it appears.
Mapping
How the data in each field is interpreted.
Every type has its own mapping (schema definition).
Simple core field types:
- string: string
- number: byte, short, integer, long
- floating point: float, double
- boolean: boolean
- date: date
Complex core field types:
- null
- arrays
- objects
Relevance
Relevance is the algorithm that we use to calculate how similar the contents of a full-text field are to a full-text query string. The standard similarity algorithm used in Elasticsearch is known as "Term Frequency / Inverse Document Frequency" (TF/IDF).
Term Frequency - how often does the term appear in the field.
Inverse Document Frequency - how often does each term appear in the index.
Relevance score - how well the document matches the query.
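To see how the score of each hit is calculated, the explain flag can be added to a search, for example:
GET /myindex/mytype/_search?explain
{
  "query": {
    "match": { "attr1": "ipsum" }
  }
}
Each hit then carries an _explanation describing its TF/IDF calculation.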
Instruments
Kibana - an open source analytics and visualization platform designed to work with Elasticsearch
Sense - a Cool JSON Aware Interface to Elasticsearch (Chrome plugin)
Marvel - enables you to easily monitor Elasticsearch through Kibana
ELK stack (for logging):
- Elasticsearch
- Logstash (collects, parses, and enriches logs before indexing them into Elasticsearch)
- Kibana (a graphical frontend that makes it easy to query and visualize what is happening across your network in near real time)
Clients:
Elasticsearch Python client
Elasticsearch Node.js client
Services:
Amazon Elasticsearch Service
Links
Elasticsearch: The Definitive Guide by Clinton Gormley and Zachary Tong