Last month, our team hosted a hackathon for about a dozen Data Scientists who work for one of our customers. The Scientists are a mixed group of econometricians and general statisticians and we wanted to see what they could do with some time series data - in particular news articles and social media (mostly tweets).

A few days before the start of the hackathon, one of our customer contacts asked if I would load a few billion tweets into a temporary Elasticsearch cluster so that the hackathon participants could use it when they arrived.

I quickly violated some well learned muscle memories:

  1. I said: "yes, I can do that."
  2. I chose to install the most recent version of Elasticsearch (2.0 at the time.)

Installation of 2.0 is as you would expect if you have installed ES before. Very notable differences are the way in which you install Marvel and the new Marvel user interface. I want to use the image of Marvel below to tell this story.


To ingest, I wrote a python script that iterated through a massive repository of zipped GNIP files in Amazon's Simple Storage Service (S3). The first pain point you go through with an ingest like this is one of defining the Elasticsearch mapping. I thought I had it just right, let the ingest script rip, then checked back in on it in the morning. Turns out I missed quite a few fields (GNIP data has shifted formats over time) so I had to reingest about 40 million tweets. You can see my frustration in the first peak on the indexing rate graph below (also mirrored in the Index Size and Lucene Memory graphs.)


After ingesting all weekend, it was clear that I was never going to make it to a billion tweets by the start of the hackathon. I reached out to some of the HumanGeo Gurus and got some advice on how to tweak the cluster to improve ingest speed. The one visible piece of advice in the graph is in the blue circles: at that time I set index.number_of_replicas=0. You can tell that the size on disk was dramatically smaller (expected) and there is a small inflection point in the rate of ingest which you can only see in the document count line graph. Very disconcertingly, the (blue rectangle) indexing rate decreased! But the document count over time has clearly increased. I think this is because marvel is double(ish)-counting your index rate since it sees indexing happening into multiple replicas at the same time.


I was now resigned to have a database of ~800 million tweets which was still great but short of my 1bn personal goal. Additional fun occurred at the red circle. One of the nodes ran out of disk space, and in doing that it corrupted its transaction log. This ground indexing to a halt - bulk indexes weren't happening cleanly any more because Elasticsearch was missing shards. The cluster was missing shards because this failed node had the only copy (remember when I turned off replicas?!) and that copy was now offline.

The transaction log is one of those data store concepts that is supposed to save you from situations like this. I was running the cluster on AWS EC2, so the first thing I did was to stop Elasticsearch on that node, provision and move the index to a larger EBS volume, and start it back up. Elasticsearch tried to recover the shard by reading the transaction log, then discovered that it was corrupted, then gave up, then repeated the process.

One of the tools in your arsenal at this point is to say: forget about those transactions. So I removed the transaction log and restarted Elasticsearch. No dice - because of this bug in Elasticsearch 2.0 - Elasticsearch can neither recover from a corrupted transaction log nor reinitialize a missing transaction log.

My goal of 800 million tweets was now 250 million shy. But those tweets were still indexed! I was just being held up by a few bad eggs in that transaction log!

It was several paragraphs ago that the harder-core reader was literally punching their monitor in frustration because I hadn't considered trying to hack around the transaction log. The transaction log is a binary format specific to Elasticsearch and it creates a new one whenever you create a new index. What if I created an empty transaction log? Could I get my shard back online?

To get started, I created a new index in Elasticsearch which dropped a pristine transaction log on disk. I copied that transaction log into the right place in my broken index and restarted Elasticsearch. Elasticsearch complained that the UUID in the copied transaction log didn't match the UUID that the index was expecting. A UUID is just a unique identifier - the transaction log in a brand new index is otherwise identical to the transaction log in any other brand new index. In the log, Elasticsearch said it saw a UUID of 0xDEADBEEF when it expected 0x00C0FFEE. I opened the transaction log in hexedit and could see 0xDEADBEEF.  I copied the correct UUID over the incorrect UUID, saved it, restarted Elasticsearch, the shard came online, and then that gap in the red circle was filled in!

With a repaired transaction log and all shards online I was able to turn on the replicas to get back to a cluster with some durability and better search performance. Elasticsearch took a few more hours to build the replicas and balance the cluster.

The hackathon went live at the green circle. Immediately our sharp Scientists started issuing queries that completely blew up the fielddata in Elasticsearch. Funny thing about fielddata - it's the special part of memory that Elasticsearch uses to keep aggregations fast. It's slow to load data into fielddata so it tries to only do that once. So by default, the policy is unbounded. Fielddata will grow until you're out of memory and never evict data. Which basically means that once you've done an aggregation that pushed data into fielddata, it's there until that data is deleted or you restart Elasticsearch. So if you don't actually have enough RAM to hold all of the possible fielddata, it will by definition be fully used after a hard query and then newer (possibly more relevant) data can never fit in.

I think in many cases that scenario makes sense but I didn't have the luxury of adding additional resources. So I set indices.fielddata.cache.size=55%. 55% is a special consideration, since the hard limit without eviction is 60%. When you do this, you're accepting that ES will evict the least recently used fielddata when it is under memory pressure situations. I suspect that in most cases our users were doing queries in the beginning that were basically bad ideas so I didn't want to punish the future generations of queries with the mistakes of the past. (That sounds weirdly political.) Anyway, you see the huge spike drop once I put that policy in place.


Hopefully the above helps you if you end up in similar situations. Once the hackathon was underway it was a great success. I even overheard one of the participants say how great it was that there was already data available in the database - that's not normally how these things go, he said.

Some key takeaways from my recent Elasticsearch experience:

  • For ingest performance, turn off replicas
    • Beware! You are flying without an autopilot here. If something goes wrong, it will go very wrong.
  • For greenfield data exploration clusters, set indices.fielddata.cache.size=55% (or get moar RAM.)
  • Learn the internals of your database. This was one such exercise.
    • We see this one all the time with our clients. If you don't know the failure modes, the query considerations, and the ingest concerns of your database, you'll be buried under the weight of an early, non-data-driven decision.

Sounds Fun?

We're hiring. Do Data Science using Elasticsearch and Social Media with us.

Engineering Coda

If you're the kind of person who likes to be able to make computers do things while also writing the code that makes computers do things, I can't say enough about two tools that help me do this stuff. One is a clusterssh clone for iTerm2 called i2cssh and the other is called htop. i2cssh lets you login to a bunch of computers at the same time and have your typing go to all of them simultaneously (or individually.) Check out the matrix like screen shots below where I'm running htop on 4 nodes and then 7 nodes (I increased the size of the cluster at one point.)

Thanks to Michele Bueno, Scott Fairgrieve, and Aaron Gussman for reviewing drafts of this post.