A static search for my blog

Posted on Apr 30, 2023

More than once I’ve thought about adding a static search to this blog.

This site uses Hugo as its site generator. There are existing search solutions for Hugo sites, but I thought it would be a fun project to build one of my own.

So I split the problem into three parts:

  1. Create a JSON representation of my blog.
  2. Create an index out of this JSON representation.
  3. Search the index on the client side.
flowchart LR
    subgraph js["Site generator"]
        direction TB
        Blog(Page content) --> Hugo
        Hugo -->|generate| JSON1(page.json)
    end
    subgraph ig["Index generator"]
        JSON1 --> IG[Index Generator]
        IG -->|generate| IDX(index.json)
    end
    subgraph se["Search engine"]
        IDX --> SE[Search Engine]
        Terms(Search terms) --> SE
        SE -->|output| Res(Results)
    end
    UI --> Terms

The following sections describe the implementation and give insight into my thought process.

1. Create the JSON representation

The first part is easy to do in Hugo, as it can output content in different formats, one of them being JSON.

As this output is the input for the index generator, it has to contain all the needed data. In my case the output for one page looks like this:

{
    "name": "",
    "content": "",
    "description": "",
    "permalink": ""
}

Hugo outputs an array of these objects, one per page.

The good thing about this definition is that it would also work for any other site, as long as the input for the index generator has this format. That’s why this approach can be applied to any other content.
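
In Hugo, one way to get such an output is to enable a JSON output format for the home page and add a matching template. Here is a minimal sketch; the exact field mapping is an assumption, not necessarily my setup:

# config.toml
[outputs]
  home = ["HTML", "RSS", "JSON"]

{{/* layouts/index.json */}}
[{{ range $i, $p := .Site.RegularPages }}{{ if $i }},{{ end }}
{"name": {{ $p.Title | jsonify }}, "content": {{ $p.Plain | jsonify }}, "description": {{ $p.Description | jsonify }}, "permalink": {{ $p.Permalink | jsonify }}}
{{ end }}]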

2. Implement the index generator

The index generator is written in Java, in a functional programming (FP) style. The first reason is that I wanted to work on my FP skills. The second reason is that most of the operations transform data.

The output format

Defining the output format declares the interface between the index generator and the search engine. It also helps to work out the steps needed to produce that output.

In a first version I wanted the output to look like this:

[
    {
        "key":"wasm",
        "docs":[
            {
                "title":"Optimizing Wasm and JS module size and load time.",
                "url":"/development/optimizing-wasm-js-size/"
            }
        ]
    }
]

Every object contains the key and all the docs that belong to that key. When does a doc belong to a key? The answer is: when the key appears in the document. So it won’t be a contextual search (= vector search), but for the beginning it should be fine!

Index optimizations

To be honest, I hadn’t been thinking much about the format, until I wrote the post about optimizing Wasm and JS modules.

I noticed that I hadn’t optimized the index yet. So what can be done better to get a more lightweight index?

One idea is to split the index up by the alphabet and only load the part for the first letter of the search term.

But this alone will not necessarily make the index smaller. The answer lies in the DRY principle: in the data structure above, the information for a page is repeated for every term it appears in. So when a page appears under 50 keys starting with the letter w, the index contains the doc information 50 times. Such a waste of space and resources!
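
For example, a single post gets repeated under every one of its terms:

[
    {
        "key":"wasi",
        "docs":[{"title":"Under the surface of WASI in Rust", "url":"/development/wasi-rust/"}]
    },
    {
        "key":"wasm",
        "docs":[{"title":"Under the surface of WASI in Rust", "url":"/development/wasi-rust/"}]
    }
]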

References can help with that: at the top of the index is a list of all pages, which can be referenced by their position (that’s like a foreign-key reference in relational databases):

{
    "docs": [{"title":"Under the surface of WASI in Rust", "url":"/development/wasi-rust/"}],
    "index": {"wasi": [0]}
}

In this example there is one matching document for the key wasi, at position 0: the article titled ‘Under the surface of WASI in Rust’.

The only thing that then needs to be done is to dereference all these references. But that’s worth it, as it saves a lot of space. In the following image you can see the index files for some letters: in the front, the ones created with the reference structure; in the back, the ones with the structure above.

It saves up to 10x in space!
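
On the search side, dereferencing boils down to a cheap lookup. Here is a minimal sketch in Java; the method and type names are assumptions, and in my setup this happens in the search engine:

// Resolve the integer references for one key back to full Doc objects.
static List<Doc> resolve(String key, Map<String, List<Integer>> index, List<Doc> docs) {
    return index.getOrDefault(key, List.of())
            .stream()
            .map(docs::get) // each number is a position in the docs list
            .toList();
}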

The index generator

The idea for the data flow in the index generator is the following:

flowchart TD
    PP(Pre process)
    TK(Tokenize)
    Syn(Add synonyms)
    Idx(Add to index)
    subgraph Aggregate
        PP -->|Processed page| TK -->|Tokenized page| Syn -->|Page with synonyms| Idx
    end
    Trigger -->|Page| PP

A Page is the representation of a blog post and contains one object of the Hugo output from the first section.
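
The pipeline stages below call a handful of methods on Page. Here is a minimal sketch of how such a class could look; the method names are taken from the code below, while the exact fields and constructor are assumptions:

public class Page {
    private final String title;
    private final String url;
    private final String content;
    private final Set<String> tokens;
    private final Set<String> synonyms;

    public Page(String title, String url, String content,
                Set<String> tokens, Set<String> synonyms) {
        this.title = title;
        this.url = url;
        this.content = content;
        this.tokens = tokens;
        this.synonyms = synonyms;
    }

    public String getTitle() { return title; }
    public String getUrl() { return url; }
    public String getContent() { return content; }
    public Set<String> getTokens() { return tokens; }
    public Set<String> getSynonyms() { return synonyms; }

    // The with-methods return modified copies, keeping the pipeline functions pure.
    public Page withContent(String content) {
        return new Page(title, url, content, tokens, synonyms);
    }
    public Page withTokens(Set<String> tokens) {
        return new Page(title, url, content, tokens, synonyms);
    }
    public Page withSynonyms(Set<String> synonyms) {
        return new Page(title, url, content, tokens, synonyms);
    }
}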

Pre-processing

First, the page gets pre-processed, which means that all unnecessary content (like HTML tags) is removed. Furthermore, all content is lower-cased, so that different casings of the same word don’t end up as different tokens:

public class TextPreProcessor implements Function<Page, Page> {
    @Override
    public Page apply(Page page) {
        String processedContent = page.getContent()
                .replaceAll("<[^>]*>", " ")  // strip HTML tags
                .replaceAll("[?!.,:]", " ")  // strip punctuation marks
                .replaceAll("\\s+", " ")     // collapse newlines and repeated whitespace
                .toLowerCase()
                .trim();

        return page.withContent(processedContent);
    }
}

Tokenize

The tokenizer splits the text into words and creates a set of all of them:

public class Tokenizer implements Function<Page, Page> {

    @Override
    public Page apply(Page page) {
        // Split on whitespace; the set drops duplicate tokens.
        Set<String> tokenizedContentAsSet = new HashSet<>();
        Collections.addAll(tokenizedContentAsSet, page.getContent().split("\\s+"));
        return page.withTokens(tokenizedContentAsSet);
    }
}

Add synonyms

The synonyms fetcher takes every token of the page and fetches all synonyms for it. This gives the index more power, as other words with the same meaning can also be used in a search!

public class SynonymsFetcher implements Function<Page, Page> {
    private final SynonymsRepository synonymsRepository = new SynonymsRepositoryImpl();

    @Override
    public Page apply(Page tokenizedPage) {
        // Look up the synonyms of every token and flatten them into one set.
        Set<String> synonyms = tokenizedPage.getTokens()
                .stream()
                .map(synonymsRepository::getSynonymsFor)
                .flatMap(Collection::stream)
                .collect(Collectors.toSet());
        return tokenizedPage.withSynonyms(synonyms);
    }
}

Place page in index

Now the page can be placed in the index: for every token and every synonym, the page is added under the corresponding key.

public class DocPlacer implements Function<Page, DocIndex> {
    private final DocIndexRepository docIndexRepository = new DocIndexRepositoryImpl();

    @Override
    public DocIndex apply(Page synonymizedPage) {
        DocIndex docIndex = docIndexRepository.getDocIndex();

        // The doc carries only what a search result needs: title and URL.
        Doc doc = new Doc(synonymizedPage.getTitle(), synonymizedPage.getUrl());

        // Add the page under every synonym and every token.
        synonymizedPage
                .getSynonyms()
                .forEach(synonym -> docIndex.addEntry(new DocIndexKey(synonym), doc));

        synonymizedPage
                .getTokens()
                .forEach(token -> docIndex.addEntry(new DocIndexKey(token), doc));

        return docIndex;
    }
}

All of them together

Putting it all together, the flow looks like the following:

@Override
public DocIndex apply(Page p) {
    return preprocess
            .andThen(tokenize)
            .andThen(addSynonyms)
            .andThen(addToIndex)
            .apply(p);
}

This is just the core function and shall give an idea of how the indexing works. The code that stores the index and writes a JSON file per letter is not shown here.

3. Implement the search engine

The output from above can be used to search for all documents that match one or more terms.

As I mentioned above and in another post, the search engine loads the JSON representation for the starting letter of the search term as soon as an input is made. The index and the terms are passed to the search engine, which returns the result.

In my case the ‘search engine’ is a Wasm module (initially written in Rust). But the logic could be implemented in any other language as well.
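
To give an idea of that logic, here is a minimal sketch in Java, using the resolve method from the sketch further up. The names are assumptions, my actual implementation is the Rust/Wasm module, and whether terms combine with AND or OR is a design choice:

// Search for the docs that match every given term (AND semantics assumed).
static Set<Doc> search(List<String> terms, Map<String, List<Integer>> index, List<Doc> docs) {
    Set<Doc> result = new HashSet<>();
    boolean first = true;
    for (String term : terms) {
        // Doc needs equals/hashCode for the set operations to work.
        Set<Doc> matches = new HashSet<>(resolve(term.toLowerCase(), index, docs));
        if (first) {
            result = matches;
            first = false;
        } else {
            result.retainAll(matches); // keep only docs matching every term so far
        }
    }
    return result;
}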

What’s next?

The next step is to add the search to my blog. For that I want to automate the index generation.

My blog gets built by a Jenkins job every time I publish a new post. Within this job I’ll implement a step that re-generates the index.

And while talking about re-generating the index: my posts usually don’t change much, and generating the index for my blog doesn’t take many resources. But for larger sites it would be better to generate the index incrementally. This will be part of a future version of the index generator.

Furthermore, I’ll investigate whether there’s a way to build a vector search, or at least to use TF-IDF to rank the results.

Splitting the problem up into small parts helped to think about and work on it in a functional way.

One thing I didn’t mention: my solution may be overly complex. Using one codebase (e.g. JS/TS or Rust) instead of several would lead to less complexity. Anyhow, I learned a lot during this project, and it amazes me that the result works just fine!

There are some related posts I wrote in the last weeks, even though they don’t mention the static search:

  - Optimizing Wasm and JS module size and load time
  - Functional Programming in Java