Blog

IR: From Lucene To Elasticsearch – Part I

Introduction

You surely must have heard about Apache Lucene, Apache Solr, Elasticsearch, Kibana and Logstash. Every one is talking about how Lucene is considered a revolution in information retrieval systems, how elasticsearch is fast and scalable and how kibana is easy and intuitive. So what is this all about?
Over the last few years, a lot of companies have shifted to Elasticsearch and for that more people have started talking about it and its ecosystem.

Hopefully, this series of articles would be an in depth guide to how IR systems work and we will get to learn state of art technologies that are widely adopted and recognized as the best in the field.

In this part, we will talk about the indexing process. Since Lucene and Elasticsearch are all about **indexing** data and retrieving them, it is **mandatory** to understand what does this term mean and how it works.

I. Why Indexing

So, to explain indexing, I would like to start with the a real example demonstrating why we need it.

Consider the case of searching for a particular word in a document. One would take an entire document, split it by white-spaces and iterate over the resulted words to compare them with the searched one. This is called **burte force search**.

Suppose that comparing two words takes about `t=100µs` (which is a real life example [https://thisdata.com/blog/timing-attacks-against-string-comparison/](https://thisdata.com/blog/timing-attacks-against-string-comparison/)).
Let’s define:

  • `n`: number of documents
  • `m`: average number of words in each document
  • `w`: number of words in our query

Search complexity would be: `O(n*m*w)` and total searching time can be calculated as: `n*m*w*t`

Now let’s measure the performance! :

  • For `n = 1k`
  • For `m = 10k`
  • For `w = 5`

Total time: `n*m*w*t = 1k*10k*5*t = 50k²t = 50.000.000t` = `50.000.000 * 100µs = 1.38 hours`

But ..

 

Google does it in less than a second. There is no magic behind it, just some very good engineered algorithms for real time information retrieval.

The way it works is that google has several bots that navigate through web pages, **indexing** their content.

An index is very similar to that of a book.

Indexing is a challenging task. It’s about storing information in a manner that allows as-fast-as-possible information retrieval with as high as possible precision.

It consists of creating a new data structure in the form of a look-up table or rather called **inverted index**. :arrow_right: Instead of storing words by book, we store books by words.

II. Indexing Process

Indexing a document is done usually done by first extracting the document’s content and each word will be represented in an **inverted index** as a `term` **=** `posting list` (as in **map-reduce**).

When Searching for a word, we search in our look-up table.

This how an inverted index would look like:

Term Posting List
`Lucene` `{1, 5, 3}`
`Java` `{1, 6}`

This indicates that the term `Lucene` exists in all of documents 1, 3 and 5, while the term `Java` exists only on documents 1 and 6.

To learn how an inverted index is created, let’s look at a concrete example:

Let’s consider the following documents:

  • Document[1]: `The big brown fox jumped over the lazy dog`
  • Document[2]: `The brown fox is Firefox`

This time we will improve our inverted index by adding the term’s position and offset in the document, for a more precise IR.

We define a new representation of a posting list as the couple: `(doc_id, posision)`. So `(5, 10)` would mean that a term exists in document 5 at the position 10.

Our inverted index would then look like this:
Inverted Index

Term Posting List
`The` `{(1, 0), (1, 32), (2, 0)}`
`big` `{(1, 4)}`
`brown` `{(1, 8), (2, 4)}`
`fox` `{(1, 10), (2, 14)}`
`jumped` `{(1, 18)}`
`over` `{(1, 27)}`
`lazy` `{(1, 36)}`
`dog` `{(1, 41)}`
`is` `{(2, 14)}`
`Firefox` `{(2, 17)}`

This data structure is sufficient to retrieve the document containing a particular term, but it is not enough. We can still enhance it by adding more information such as:

  • Sort the Inverted Index by alphabet over terms’ field (improve searching)
  • Remove meaningless words like `the`, `a`, `at`, … (language specific)

As you can see indexing can be a **language specific task**. It is very challenging and there is no one solution to rule them all.

In fact some of the challenges can be:

  • Should numbers be indexed?
  • Capitalization, would `Lucene` = `lucene`?
  • Handling punctuation`?!:;,.` etc.
  • Word stemming (index word root, such as `use` instead of `using`)
  • Variations in spelling
  • And lot more challenges.

Thus, to better present indexing workflow, it would look more like this:

1. Analysis: The analysis phase consists of splitting the input into **tokens**, removing unnecessary data such as numbers, separators (commas, dots, etc), filtering (such as removing blacklisted words).
2. Indexing: Transforming each term into a posting list and appending it to the inverted index.

Querying an IR system will trigger a task that search for a term in an inverted index. Modern indexing engines use cutting edge algorithms that are constantly evolving to optimize searching performance. One common solution to optimize term searching is the use of **Trie**s, also commonly known as **prefix tree**.

A Trie is a tree data structure for storing terms that can be easily searched later on.

 

This is for instance a trie representing the following terms `A`, `to`, `tea`, `ted`, `ten`, `i`, `in`, and `inn`. (*Image source:*[https://en.wikipedia.org/wiki/Trie#/media/File:Trie_example.svg](https://en.wikipedia.org/wiki/Trie#/media/File:Trie_example.svg))

A node in the tree has a value (in this case a number) if it is a complete term that exists and that number would for instance refer to the position on which the term exist in the following inverted index

# Term Posting List
`…` `…` `…`
3 `tea` `{(1, 51), (4, 155)}`
4 `ted` `{(15, 6544), (412 1887)}`
`…` `…` `…`
12 `ten` `{(98, 1566)}`

You can read more about Tries here: [https://en.wikipedia.org/wiki/Trie](https://en.wikipedia.org/wiki/Trie).

III. Scoring

Now that we understand better how indexing works, it is also important to evaluate the IR system. For instance, by querying our inverted for `brown firefox`, we will receive both documents since at least a term in our query exists in both of them. So which result is more pertinent?

For this purpose we need to define a **relevance metric**. This called, **scoring**.

There are two values that need to be evaluated separately:

1. TF: **Term Frequency**: The more terms (that exists in our query) the document contains, the better it is :arrow_right: higher score
2. DF: **Document Frequency**: The more documents containing that term, the less **special** this document is (lower score)

We define another term **IDF** as the inverse of `DF`, `IDF = 1 / DF`.
Since higher DF would result in lower IDF and less score.

So the final scoring formula becomes => `Score = TF*IDF`.

 Scoring Example

Let’s retake the same example we used before to present the data structure. Our inverted index looked like this:

Term Posting List
`The` `{(1, 0), (1, 32), (2, 0)}`
`big` `{(1, 4)}`
`brown` `{(1, 8), (2, 4)}`
`fox` `{(1, 10), (2, 14)}`
`jumped` `{(1, 18)}`
`over` `{(1, 27)}`
`lazy` `{(1, 36)}`
`dog` `{(1, 41)}`
`is` `{(2, 14)}`
`Firefox` `{(2, 17)}`

Now, let’s query our inverted index for the following terms: {`the`, `Firefox`}, and calculated `TF` and `IDF` for each term (note that IDF is related to all document and not a particular one):

– **For Document 1:**

Term TF IDF
`the` 2 1/3
`Firefox` 0 1|

– **For Document 2:**

Term TF IDF
`the` 1 1/3
`Firefox` 1 1

Let’s measure each document’s score for the term `the`:
– Document 1: `TF = 2, IDF = 1/2, Score = 1` (since `the` is present twice in our document and also present in two different documents).
– Document 2: `TF = 1, IDF = 1/2, Score = 1/2`

So when it comes to term `the`, document 1 is more relevant.

Let’s do the same for the term `Firefox`:
– Document 1: `TF = 0`, `IDF = 1`, `Score = 0`
– Document 2: `TF = 1`, `IDF = 1`, `Score = 1`

Now, the overall score becomes:

– Document 1: `1 + 0 = 1`
– Document 2: `1/2 + 1 = 1.5` :star:

As we can see, document 2 has higher score than the first one, thus it is more relevant.

The scoring forumla is in fact more complicated. Let’s say that some one paid us a fair amount of money to **boost** the first document, in other words, it would have higher (or the highest) score as long at it exists in the search result list.

This is often called, **score boosting**. There are various types of **boosting**, to give some documents or terms higher scores than others. Google for instance does this to display paid ads on its search results list, it is a key part of their business model.

Score boosting involves adding additional called **weights** to influence score based on various options such as the money paid to display a particular result.

IV. Querying

There is one major information left untold, which is querying. Now that have index multiple documents, we may perform more complex searches.

There are multiple types of queries, let’s site some of the most used:

1. Term Query

This is a normal query, searching for the following terms {`Apache`, `Lucene`} would result in all documents containing **both** of terms.

2. Wildcard Query

This type query searches for terms that follows a specific patterns using wild cards. A wild card is denoted as `*` which means it can be any kind of values, example:
– Inverted index containing: `lucene`, `luke`, `light`, `linear`.
– Query: `l*n*`
– Result: `lucene`, `linear`

3. Fuzzy Query

Fuzzy query is an approximate type of search. It returns documents that are closest to query terms. The **similarity** is usually measured with [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance). Resulted documents does not necessary contain exact terms entered in the query.

4. Boolean Query

This type of query specifies some terms does must or must not exist in retrieved documents.

5. Range Query

Range queries simply indicates that some fields need to be in a particular range.

Conclusion

In this part, we have presented the general need of indexing and explained in depth how does indexing works from creating an inverted index to search result scoring and querying. We have also talked about more advanced searching techniques such as prefix trees.

I hope this article was very informative. In the next part, we will get to know Apache Lucene and dive into some code.

Write a comment