I learned about search two years ago, when I attended a lecture on information retrieval, and I have wanted to build a search engine ever since. I finally took the time to build one this year. The project indexes HackerNews and is deployed here.
In this post, I will walk you through the theory behind building a search engine and the process I followed to build mine. This is not a code walkthrough; however, you should be able to follow the code here after reading this post.
A search engine helps you find documents that contain information you are looking for. A simple implementation would just loop through all the documents and return the ones that contain the words we are looking for.
documents.filter(doc -> doc.contains(query))
This approach is slow and treats every document as equal. If your search engine returned all the documents without any ranking, you would probably not be very happy. Let’s try to address these problems.
We will use something called an inverted index to make it faster. Instead of scanning the entire document to find a word, we will invert the process and match a word to the documents that contain it. We can achieve this by using a hashmap that maps a word to a list of document IDs.
apple -> [1, 4, 5]
cat -> [3, 5]
This allows us to look up all the documents containing a word in O(1) time. Even though building the index still takes O(N) time, we only need to build the inverted index once, and we can serve every user query from the same index.
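To make this concrete, here is a minimal sketch of building such an index in Python; the whitespace tokenization and the variable names are illustrative assumptions, not the project's actual code:

```python
from collections import defaultdict

def build_inverted_index(documents: dict[int, str]) -> dict[str, list[int]]:
    """Map each word to the list of document IDs that contain it."""
    index: dict[str, list[int]] = defaultdict(list)
    for doc_id, text in documents.items():
        # set() so a document is listed once per word, no matter how often the word repeats
        for word in set(text.lower().split()):
            index[word].append(doc_id)
    return index

docs = {1: "apple pie", 3: "cat videos", 4: "apple farm", 5: "apple and cat"}
index = build_inverted_index(docs)
print(sorted(index["apple"]))  # [1, 4, 5]
print(sorted(index["cat"]))    # [3, 5]
```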
We will refer to each word (key) in the map as a token. A token is how we choose to represent a word in the map.
Our index currently matches the exact word, and this is a problem. What if our user searches for “apples”? The current index would return no documents, even though documents that mention “apple” might be exactly what the user wants. We can address this by processing our text before adding it to the index. This processing is called normalization.
In our search engine, we will use two types of normalization.
Case folding: we convert every word to lowercase, so that “Apple” and “apple” map to the same token.
Stemming: we reduce each word to its root form, so that “apples” and “apple” map to the same token. There are several stemming algorithms. We will use the Porter stemming algorithm.
Note: We will need to use the same pre-processing on user queries to find the correct tokens in our index.
Some words, such as “and” and “the”, are used very frequently and are probably not words you search for. We will exclude these stop words in our processing stage, since it helps us speed up queries.
So our final processing stage looks like:
tokens = words.filter(word -> !word.isStopWord()).map(word -> word.lower()).map(word -> word.stem())
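Here is a minimal sketch of this pipeline in Python. It assumes the nltk package for the Porter stemmer and a small hand-picked stop word list; the actual project may use a different stemmer implementation or stop word set.

```python
import re
from nltk.stem import PorterStemmer  # assumed dependency: pip install nltk

STOP_WORDS = {"a", "an", "and", "are", "is", "of", "the", "to"}  # illustrative, not exhaustive
stemmer = PorterStemmer()

def tokenize(text: str) -> list[str]:
    """Lowercase, split into words, drop stop words, and stem what is left."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [stemmer.stem(word) for word in words if word not in STOP_WORDS]

print(tokenize("The happy cats are eating apples"))
# ['happi', 'cat', 'eat', 'appl']
```

The same tokenize function is applied to both documents and user queries, which is the pre-processing symmetry noted above.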
Now that we can find the documents faster, how do we rank them? We will use a simple ranking algorithm that relies on two principles: a document that mentions a query term more often is more relevant (frequency), and a document that mentions rarer query terms is more relevant (rarity).
We will rank each document using the formula:
rank(document, token) = token_frequency(tf) * inverse_document_frequency(idf)
We will learn what these two terms mean in the following sections.
We can calculate and store the frequency of each token in a document while we are building our index. However, raw token frequency (tf) is not very useful, as it favors longer documents. Instead, we will use log(frequency) to dampen this effect.
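As a small sketch, a log-damped term-frequency weight might look like this; adding 1 inside the log (so that a frequency of zero stays well-defined) is a detail I am assuming here:

```python
import math

def tf_weight(frequency: int) -> float:
    """Dampened term frequency: extra mentions help, but with diminishing returns."""
    return math.log(1 + frequency)

print(tf_weight(1), tf_weight(10), tf_weight(100))
# 0.693..., 2.397..., 4.615...  -> 100 mentions is nowhere near 100x better than 1
```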
What do I mean by rarity? Imagine you search for “happy frieren”. The term “happy” appears in far more documents than “frieren”, and you are most likely looking for documents that mention “frieren”, so we need to give those documents more importance. That is, documents that mention the rarer terms should rank higher.
We will determine rarity using the “inverse document frequency” (idf). Idf decreases if a token appears in many documents, and increases if only a few documents contain it. It can be defined as:
(Total documents in our index) / (no. of documents containing the token)
Again, to avoid overly favoring rarity, we will use the log of this value.
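And a matching sketch for idf, again assuming a plain natural log of the ratio:

```python
import math

def idf(total_documents: int, documents_with_token: int) -> float:
    """Rarer tokens (present in fewer documents) get a larger weight."""
    return math.log(total_documents / documents_with_token)

# With 1000 documents in the index:
print(idf(1000, 900))  # ~0.105 -> a common term like "happy" barely moves the score
print(idf(1000, 5))    # ~5.298 -> a rare term like "frieren" dominates it
```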
Combining frequency and rarity gives us a way to rank documents. Here’s what the final ranking algorithm looks like:
// these are retrieved using the inverted index
relevant_docs = set of documents that mention a token in the user query
for token in query:
    for doc in relevant_docs:
        // score for each doc starts at 0
        score[doc] += tf(doc, token) * idf(token)
return relevant_docs sorted by score
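Put together, a runnable Python version of this loop could look like the sketch below. It reuses the tokenize, tf_weight, and idf helpers from the earlier sketches, and the choice to store per-document token counts alongside the postings is my own representation, not necessarily the project's:

```python
from collections import Counter, defaultdict

def build_index(documents: dict[int, str]):
    """Return (postings, term_counts): token -> doc IDs, and doc ID -> token frequencies."""
    postings: dict[str, set[int]] = defaultdict(set)
    term_counts: dict[int, Counter] = {}
    for doc_id, text in documents.items():
        counts = Counter(tokenize(text))
        term_counts[doc_id] = counts
        for token in counts:
            postings[token].add(doc_id)
    return postings, term_counts

def search(query: str, postings, term_counts) -> list[int]:
    """Rank documents by the sum of tf * idf over the query tokens."""
    total_docs = len(term_counts)
    scores: dict[int, float] = defaultdict(float)  # score for each doc starts at 0
    for token in tokenize(query):                  # same normalization as at index time
        for doc_id in postings.get(token, set()):
            tf = tf_weight(term_counts[doc_id][token])
            scores[doc_id] += tf * idf(total_docs, len(postings[token]))
    return sorted(scores, key=scores.get, reverse=True)
```

Only documents that share at least one token with the query are ever scored, which is exactly what the inverted index buys us: unrelated documents are never touched.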
This is all we need to build a simple search engine. By now, we have addressed both problems with the original naive search: we made it faster, and we added a ranking algorithm that sorts the documents by relevance.
Despite its simplicity, the search engine we built in this post returns good results. I indexed some HackerNews submissions using this implementation, and I was pretty happy with the results. You can try it here.
I have skipped lots of details to keep this short. I hope you learned something from this post, and it encourages you to learn more about search.
Token : string
DocumentID : int
Frequency : int