In the early days of my playing with ElasticSearch, I remember struggling with some of the basic terminology and concepts. Naturally, many of us try to equate ElasticSearch with what we know of RDBMSs. To that end, I thought I'd post this simple topic as a reference for others:
Node = DB Instance
One Database Instance
A node is simply one ElasticSearch instance (1 java process).
Consider this a running instance of MySQL. Just as you can have more than one MySQL instance running per machine on different ports, you can have more than one ElasticSearch node running per machine on different ports.
Cluster = Database Cluster
1..N Nodes with the same Cluster Name.
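If you want to see which cluster a node has joined, and how many nodes are in it, you can ask any node for the cluster health. Here is a minimal Python sketch, assuming a node is listening on localhost:9200; the helper names `fetch_cluster_health` and `summarize` are my own, not part of any ElasticSearch client.

```python
import json
from urllib.request import urlopen


def fetch_cluster_health(base_url="http://localhost:9200"):
    """Ask one node for the health of the cluster it belongs to."""
    # Assumption: a node is reachable at base_url. Any node in the
    # cluster can answer this request.
    with urlopen(base_url + "/_cluster/health") as resp:
        return json.load(resp)


def summarize(health):
    """Turn a _cluster/health response into a one-line summary."""
    return "{0}: {1} node(s), status {2}".format(
        health["cluster_name"], health["number_of_nodes"], health["status"]
    )


# Usage against a live node (not executed here):
#   print(summarize(fetch_cluster_health()))
```

Two nodes started with the same cluster name should report the same `cluster_name` and a `number_of_nodes` of 2, whichever one you ask.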
Index = Database Schema
Similar to a database, or schema. Consider it a set of tables with some logical grouping. In ElasticSearch terms, an index is a collection of documents, where each document type is similar to a DB table.
Mapping Type = Database Table
ElasticSearch uses document definitions that act as tables. If you PUT (“Index”) a document in ElasticSearch, you will notice that it automatically tries to determine the property types. This is like inserting a JSON blob in MySQL, and MySQL determining the number of columns and column types (int, string, datetime, etc…) as it creates the DB table for you, on-the-fly.
Note: I’ve heard this referred to as “Type”, “Document Type”, and “Mapping Type”.
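To make the "table created on-the-fly" idea concrete, here is a toy Python sketch that guesses a type for each property of a JSON document. The rules and the function names `guess_field_type` and `guess_mapping` are my own simplification for illustration; ElasticSearch's real detection is more involved.

```python
from datetime import datetime


def guess_field_type(value):
    """Guess a column-like type for one JSON value (toy rules only)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        try:
            # Assumption: only plain YYYY-MM-DD strings count as dates here.
            datetime.strptime(value, "%Y-%m-%d")
            return "date"
        except ValueError:
            return "string"
    return "object"


def guess_mapping(document):
    """Build a {property: type} mapping from a single document."""
    return {field: guess_field_type(value) for field, value in document.items()}


# A made-up document, purely for illustration.
doc = {"title": "Intro to ES", "views": 120, "rating": 4.5,
       "published": "2013-09-17", "draft": False}
mapping = guess_mapping(doc)
print(mapping)
```

The point is only that the "columns" and their types come from the first document you index, rather than from a CREATE TABLE statement you wrote up front.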
Shard = Uhhh…
I don’t think this one has a DB equivalent, but it’s likely the most important concept listed here. A shard is the smallest unit of work in your cluster: one running Lucene instance. Shards are distributed across all of the nodes in your cluster, and they are what makes ElasticSearch elastic, so to speak, giving your data and your ES processes redundancy.
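A rough way to picture shards: each document is routed to one shard by hashing its ID modulo the number of shards, and the shards themselves are spread over the nodes. The Python sketch below is a toy model of that idea. `zlib.crc32` stands in for ElasticSearch's own hash function, and the round-robin allocation and node names are assumptions made for illustration, not how the real allocator decides.

```python
import zlib


def shard_for(doc_id, number_of_shards):
    """Pick a shard for a document: hash(id) modulo the shard count."""
    # zlib.crc32 is a stand-in; ElasticSearch uses its own hash function.
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_shards


def assign_shards(number_of_shards, nodes):
    """Spread shards over nodes round-robin (toy allocation)."""
    return {shard: nodes[shard % len(nodes)] for shard in range(number_of_shards)}


nodes = ["node-1", "node-2", "node-3"]  # hypothetical node names
layout = assign_shards(5, nodes)

for doc_id in ["a1", "b2", "c3"]:
    shard = shard_for(doc_id, 5)
    print(doc_id, "-> shard", shard, "on", layout[shard])
```

Because the shard is derived from the ID and the shard count, the same document always lands on the same shard, and adding nodes only changes where shards live, not which shard a document belongs to.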
Pleased to announce the release of ElasticHQ v1.0.0.
This release added:
- Support for ElasticSearch v1.0.0RC1 and unbroke the breaking changes.
- Support for monitoring multiple file systems
- Support for G1 GC
- Allow user to select which nodes are displayed on the Diagnostics Screen
Hats off to the ElasticSearch team, as this release seemed to tie up a lot of loose ends and added some great bits like the _cat API, federated search, and the ability to back up and restore on a running instance.
ElasticSearch recently announced some of the numbers behind their ever-increasing rate of adoption. You can see the post by their CEO here. Hitting the 5m download mark is an impressive milestone, and with 500,000 downloads per month, it seems to be increasing (at an increasing rate). The buzz surrounding ES seems to have some legs under it after all.
That blog post inspired me to take a look at some of the ElasticHQ numbers and see if we can dig a bit deeper into ElasticSearch usage patterns. We are in a unique position of being able to gather and analyze generic usage and environmental data. ElasticHQ is less than a year old, but is widely used by Fortune 100 companies and smaller companies alike. The user-base is widely distributed across developers and system engineers / sysops. I mention the two previous points because they effectively skew the data… when analyzing user patterns, one has to take the user (actor) into account. Unfortunately, ElasticHQ can’t read job titles or intent, so I had to make do with raw data and assume a margin of error across ~10,000 unique clusters.
Now… enough typing. More numbers and pretty charts…
% Clusters vs. ElasticSearch version
Distribution: # Nodes per Cluster
# Documents per Cluster
- Quartile 1: 20,718 Documents
- Median: 1,134,029 Documents
- Quartile 3: 30,047,243 Documents
- Maximum: 4,294,967,295 Documents
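For anyone wanting to produce the same kind of summary for their own clusters, here is a small Python sketch using the standard library. The document counts in it are made up for illustration and are not the ElasticHQ data.

```python
import statistics

# Made-up per-cluster document counts, NOT the real data set.
doc_counts = [120, 4500, 20000, 1100000, 2500000, 31000000, 90000000]

# n=4 splits the data into quartiles and returns the three cut points.
q1, median, q3 = statistics.quantiles(doc_counts, n=4, method="inclusive")

print("Quartile 1:", q1)
print("Median:", median)
print("Quartile 3:", q3)
print("Maximum:", max(doc_counts))
```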
# Indices per Cluster
ElasticSearch distribution across the globe…
- United States: 31.95%
- France: 7.34%
- Germany: 6.29%
- United Kingdom: 5.92%
- India: 3.98%
- Brazil: 2.67%
- Russia: 2.61%
- Netherlands: 2.48%
- China: 2.34%
- Canada: 2.19%
There is a lot more data to share in this respect, but I have only so many free hours in a week. It’s interesting to see what is detailed here so far, as in summary it hints at ElasticSearch use and deployment patterns:
- v0.90.5 as the most common version used. (Admittedly, I didn’t take version adoption over time into account.)
- Most clusters have a small # of nodes (hardware pricey? Are we tracking a large # of dev boxes?)
- Over half of the deployments here have what I consider to be medium to large document stores
- A small (un-complex?) number of indices per cluster.
If this can be trusted as a gauge for ElasticSearch usage in the wild, it will be interesting to see how it changes over time, and more importantly… where it leads ElasticSearch (the company), as it may give a hint as to the user-base make-up. ElasticHQ sees daily use by large companies (Disney, eBay, Goldman Sachs, Siemens, etc…), yet usage is heavily skewed toward SMBs and startups. I can only assume the data gathered here and the companies using HQ every day are an accurate depiction.
A few months ago, I delivered a presentation at the local Atlanta Java User Group, covering an Introduction to ElasticSearch. Below, you can find the video and the slides (uploaded to SlideShare)… Warning: the presentation clocks in at two hours!
The presentation covers configuring ElasticSearch clusters, how the underlying Lucene engines work, how nodes communicate, JSON & REST API examples, and a short Q&A session at the very end.
Introduction to ElasticSearch
AJUG – Introduction to ElasticSearch – Roy Russo (09.17.2013) from AJUG on Vimeo.