Search is a common building block for applications. Whether we are searching Wikipedia or our log files, the behaviour is similar. A query is entered and the most relevant documents are returned. The core data structure for search is an inverted index.
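The inverted index idea can be shown in a few lines: instead of scanning every document for a term, we map each term to the set of documents containing it. This is a toy sketch with whitespace tokenization; real engines like Lucene do far more analysis.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, *terms):
    """Return ids of documents containing ALL query terms."""
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

docs = {
    1: "the quick brown fox",
    2: "the lazy brown dog",
    3: "a quick red fox",
}
index = build_inverted_index(docs)
print(sorted(search(index, "quick", "fox")))  # -> [1, 3]
```

A query is then just a set intersection over the posting lists of its terms, which is why lookups stay fast even over large corpora.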
Elasticsearch is a scalable, resilient search tool that shards and replicates a search index.
We know what a search engine is. We all use Google to search webpages, but there are many other things that can be searched through.
Pretty much everywhere where we have lots of information, we want to search for something. Elasticsearch, for example, is widely used in lots of websites and applications.
One of them would be Wikipedia. If we search anything in the search box on Wikipedia, our search actually goes through Elasticsearch in the background. If we search on GitHub or Stack Overflow, all of those searches actually go through Elasticsearch in the backend. If we don’t want to rely on our search going through Google, but want to have it on our own servers or within our own site, then we need some extra software to do that for us.
What is Elasticsearch?
Elasticsearch is one of these full-text search engines. It’s based on Apache Lucene, which is actually doing all the hard work in the background. Lucene is the thing storing the data on disk and making it searchable. It’s doing the indexing process, so stop words, stemming, tokenization: all of that is Lucene’s work. Around that, as a kind of shell, we have Elasticsearch. That outer layer provides the Query DSL and a REST interface, and it does the replication and distribution of our data over multiple nodes. The actual search work is done by Apache Lucene, and a nice way to interact with it is provided by Elasticsearch.
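The analysis steps mentioned above (tokenization, stop-word removal, stemming) can be sketched as a small pipeline. The stop-word list and the suffix-stripping "stemmer" here are deliberately naive stand-ins; Lucene ships real analyzers and stemming algorithms such as Porter.

```python
# Toy stand-ins for Lucene's analysis chain; not Lucene's actual lists
# or algorithms.
STOP_WORDS = {"the", "a", "an", "is", "of"}

def naive_stem(token):
    # Crude suffix stripping; a placeholder for a real stemmer.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(text):
    """Tokenize, drop stop words, then stem each surviving token."""
    tokens = text.lower().split()
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

print(analyze("The searching of the indexed documents"))
# -> ['search', 'index', 'document']
```

The important point is that the same analysis runs at index time and at query time, so "searching" in a query matches "searched" in a document.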
In the background, we store our data in one index; an index is pretty much like a database in the relational world. That one index can be split into multiple so-called shards; by default, there would be five shards. Each of these shards can either live on one server, or, if we have multiple servers, the data will be distributed over those servers automatically. That is how we split up data. If we have more than five servers and want to use them all for storing that one index, we need to change the number of shards when we create the index.
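The routing of a document to a shard boils down to hashing its routing value modulo the shard count. Elasticsearch actually uses murmur3; plain `md5` stands in here as an assumption, but the shape of the calculation is the same.

```python
import hashlib

def shard_for(doc_id, num_shards=5):
    """Pick the primary shard for a document: hash modulo shard count."""
    h = int(hashlib.md5(str(doc_id).encode()).hexdigest(), 16)
    return h % num_shards

for doc_id in ["user-1", "user-2", "user-3"]:
    print(doc_id, "-> shard", shard_for(doc_id))
```

This also shows why the shard count is fixed at index creation: changing `num_shards` changes where every existing document would land, which would break lookups.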
In addition to that sharding, we also replicate the data. By default, we have a replica count of one, meaning that for each of our five shards there is one copy on another node. If one node dies and a few shards go missing with it, we still have those replicated on another instance. That is the basic concept of how to distribute data and also be resilient to failures. With that approach, it’s very easy to scale horizontally: we just add more nodes to our cluster, our shards will be automatically distributed and replicated, and that is basically all we need to care about.
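The allocation described above can be sketched as follows. This is a simplified placement scheme of my own, not Elasticsearch's actual allocator, but it keeps the one rule that matters for resilience: a replica never lands on the same node as its primary, otherwise one node failure would lose both copies.

```python
def allocate(num_shards, num_replicas, nodes):
    """Assign each shard's primary and replicas round-robin to nodes."""
    placement = {}  # shard id -> list of nodes holding a copy
    for shard in range(num_shards):
        copies = [nodes[shard % len(nodes)]]  # primary
        for r in range(1, num_replicas + 1):
            # Offset by r so a replica lands on a different node.
            copies.append(nodes[(shard + r) % len(nodes)])
        placement[shard] = copies
    return placement

placement = allocate(num_shards=5, num_replicas=1,
                     nodes=["node-a", "node-b", "node-c"])
for shard, copies in placement.items():
    print(f"shard {shard}: primary on {copies[0]}, replica on {copies[1]}")
```

With this layout, killing any single node still leaves at least one copy of every shard alive, which is exactly the resilience property the text describes.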
Why would we shard? How many shards do we need and what is our replication factor?
The thing about replicas is that they help with two things. First, they make our data more resilient to failures: depending on how many copies of our data we have, we can tolerate more servers failing. Second, in the case of Elasticsearch, the replicas can also be used to increase our read capacity. We can read either from the primary shard or from any of its replicas.
If we have some high-traffic website, for example a shop, and Black Friday comes along and lots of people want to read data from our shop, we could just dynamically increase the number of replicas to scale up our reads, since we have more sources to get our data from.
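Why extra replicas raise read capacity can be shown with a tiny round-robin balancer over a shard's copies: a search can be served by the primary or any replica, so adding a copy adds another server that can answer reads. The names here are illustrative, not Elasticsearch API.

```python
import itertools

def make_read_balancer(copies):
    """Return a function that cycles reads across all shard copies."""
    cycler = itertools.cycle(copies)
    return lambda: next(cycler)

copies = ["primary@node-a", "replica@node-b", "replica@node-c"]
pick = make_read_balancer(copies)

served = [pick() for _ in range(6)]
print(served)  # each of the three copies serves two of the six reads
```

Unlike the shard count, the replica count can be changed on a live index, which is what makes the Black Friday scenario workable.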
On the other hand, to scale up writes, we need more shards so that more servers can spread the write load evenly between them. What further dictates the number of shards we want to have is: A, how much data one node can store, and therefore how many servers we will need to store the data; and B, what write throughput we want to achieve across all our nodes. Those will dictate our number of replicas and shards. There are no fixed numbers.
Often, people ask, “What should those numbers be?” Our answer is always, “It depends.” One thing we should definitely avoid is thinking, “Well, I don’t know how many shards we’ll use in the future, so I will just add a hundred,” which is a totally random value. We should not do that, because every shard has a specific overhead in terms of file handles and memory. If we have a rather small cluster but lots and lots of shards, we call that oversharding, and we will just lose a lot of resources on sharding our data.
[Diagram: primary shards (P0, P1) and replica shards (R0, R1) distributed across nodes]
Also, if we search over all our shards, we need to combine the results from all of them, which adds quite some overhead. Sharding allows us to scale up very far, but if we never actually have to scale that much, we are just wasting lots of resources on the number of shards. There is no final answer for the right shard or replica number. It will always depend on our use case: what our write load is, what our read load is, and what growth we anticipate for the future.
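The combine step mentioned above is a scatter-gather: each shard returns its own local top-k hits, and the coordinating node merges them into a global top-k. A sketch of that merge, using made-up scores and document ids:

```python
import heapq

def merge_top_k(per_shard_hits, k):
    """Merge per-shard (score, doc_id) lists into the global top k."""
    all_hits = [hit for hits in per_shard_hits for hit in hits]
    # Tuples compare by score first, so nlargest ranks by relevance.
    return heapq.nlargest(k, all_hits)

shard_results = [
    [(0.9, "doc-1"), (0.4, "doc-2")],  # shard 0's local top 2
    [(0.8, "doc-7"), (0.7, "doc-9")],  # shard 1's local top 2
    [(0.3, "doc-4"), (0.1, "doc-5")],  # shard 2's local top 2
]
print(merge_top_k(shard_results, k=3))
# -> [(0.9, 'doc-1'), (0.8, 'doc-7'), (0.7, 'doc-9')]
```

The work in this merge grows with the number of shards, which is the overhead the paragraph above warns about when a cluster is oversharded.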