What is the best way to index more generic HTML documents in Algolia?

There seem to be a few articles in the documentation that reference this, but I’m struggling to understand how these pieces fit together.

These are the insights I’ve managed to gather so far:

Tika is a useful tool for document conversion/parsing
Originally referenced in Indexing non-HTML documents | Algolia, but I’ve also seen it mentioned in other articles about HTML document parsing.

Semantic markup is helpful for relevance
Mentioned in an Algolia blog post on the topic.

The blog post indicates that splitting on header tags can produce some form of section, but it doesn’t really discuss how that splitting produces a hierarchy of sections. Based on the language in other documents, I would guess a custom Tika parser, but I can’t be sure.

Long documents (like HTML) are worth splitting into hierarchical sections
Mentioned here: Indexing long documents | Algolia

The direct quote:

Besides, it’s better to avoid indexing much content in a single record, as it degrades search relevance. A better approach is to create small, hierarchical objects based on the structure of the page.
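
For what it’s worth, my reading of “small, hierarchical objects” is something like the following: one record per section, with each record carrying its ancestor headings. The field names here are illustrative, loosely modeled on the records DocSearch produces:

```python
# Illustrative records only; field names loosely follow DocSearch's shape.
records = [
    {
        "objectID": "getting-started-0",  # hypothetical ID scheme
        "hierarchy": {"lvl0": "Getting started", "lvl1": None},
        "content": "Intro paragraph under the page title...",
        "url": "https://example.com/docs/getting-started",
    },
    {
        "objectID": "getting-started-1",
        "hierarchy": {"lvl0": "Getting started", "lvl1": "Installation"},
        "content": "Run the installer, then...",
        "url": "https://example.com/docs/getting-started#installation",
    },
]
```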

An example

It seems like the best example of this capability is Algolia’s DocSearch. The blog post mentions it, so I imagine DocSearch was the natural extension of this functionality.

I would like to index HTML documents in a similar way, but I cannot rely on DocSearch (or the Crawler) as it stands currently, since I get my HTML documents in a slightly different way.

Question

Does anyone have any more insight into how Algolia’s DocSearch parses generic HTML documents to generate smaller hierarchical records, as recommended in their docs?

Or otherwise, is anyone aware of best practices for approaching this type of problem?

I guessed that something like this might be open source, but I couldn’t find anything online.

Hi @rbhalla – there are a multitude of strategies for parsing and indexing content!

DocSearch historically used a Python scraper built on Scrapy. You can see it in action here: docsearch-scraper/index.py at master · algolia/docsearch-scraper · GitHub
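
If you haven’t used Scrapy before, a spider is essentially a class with some start URLs and a parse callback. A minimal sketch looks something like this (the URL and selectors are placeholders, not what DocSearch actually uses):

```python
import scrapy

class DocsSpider(scrapy.Spider):
    """Minimal Scrapy spider sketch; URL and selectors are placeholders."""
    name = "docs"
    start_urls = ["https://example.com/docs/"]

    def parse(self, response):
        # Emit one item per heading found on the page.
        for heading in response.css("h1, h2, h3"):
            yield {"heading": heading.css("::text").get(), "url": response.url}
        # Queue up linked pages for crawling with the same callback.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```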

I think you’re right to focus on using the structure of the HTML itself to identify document structure and extract relevant pieces for indexing, making sure to break up large chunks of text into multiple records (as mentioned on that Indexing long documents page).

Since you’re parsing your documents server-side rather than via a client, I would recommend a tool like BeautifulSoup for parsing the HTML and extracting the relevant pieces: Beautiful Soup Documentation — Beautiful Soup 4.9.0 documentation

It’s my go-to for dealing with HTML content.
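
For example, here’s a rough sketch of what header-based splitting could look like with BeautifulSoup. The heuristics are my own (just one way to do it, not how DocSearch works internally): h1/h2/h3 define hierarchy levels lvl0–lvl2, and paragraph/list text accumulates under the most recent heading path.

```python
from bs4 import BeautifulSoup

def html_to_records(html, url):
    """Split an HTML page into one record per section, keyed by heading path."""
    soup = BeautifulSoup(html, "html.parser")
    levels = {"h1": 0, "h2": 1, "h3": 2}
    hierarchy = [None, None, None]
    records, buffer = [], []

    def flush():
        # Turn the text gathered since the last heading into one record.
        text = " ".join(buffer).strip()
        if text:
            records.append({
                "hierarchy": {f"lvl{i}": h for i, h in enumerate(hierarchy)},
                "content": text,
                "url": url,
            })
        buffer.clear()

    body = soup.body or soup
    for el in body.find_all(["h1", "h2", "h3", "p", "li"]):
        if el.name in levels:
            flush()  # close out the previous section
            depth = levels[el.name]
            hierarchy[depth] = el.get_text(strip=True)
            # Reset any deeper levels when a shallower heading appears.
            for i in range(depth + 1, len(hierarchy)):
                hierarchy[i] = None
        else:
            buffer.append(el.get_text(" ", strip=True))
    flush()
    return records
```

Each record then carries its full heading path, so you can push them with the API client (something like index.save_objects(records) in the Python client) and weight the lvl0/lvl1/lvl2 attributes above content in searchableAttributes.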

Hi @chuck.meyer!

Thanks for that. I’m more curious about how to convert generic HTML structures into hierarchical sections.

I am familiar with BeautifulSoup, but like Tika, it requires some general understanding of your document structure to work from. I was curious to know what strategies/heuristics are involved in that type of parsing.

I will check out the docsearch-scraper repo, since that looks promising, although I’m still curious whether this is documented anywhere.

“This is internal magic sauce” is a completely fine answer here too :smile:

Definitely not “internal magic sauce”! But honestly, our “sweet spot” use cases (outside DocSearch) tend to skew more toward integrations with ecommerce platforms and backend structured content, which is why you’re not seeing a lot around HTML parsing.

The rule of thumb, though, is that we assume you are familiar with the structure of your pages – even DocSearch requires you to define a configuration with the structure of your documents. We’re not designed as a general-use scraper.
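
For reference, a legacy docsearch-scraper config is essentially a mapping from hierarchy levels to CSS selectors. Expressed as a Python dict (the real thing is a JSON file), a trimmed-down one looks roughly like this, with the index name, URL, and selectors as placeholders:

```python
# Roughly the shape of a legacy docsearch-scraper config (normally a JSON
# file); all values here are placeholders.
config = {
    "index_name": "my_docs",
    "start_urls": ["https://example.com/docs/"],
    "selectors": {
        "lvl0": "h1",
        "lvl1": "h2",
        "lvl2": "h3",
        "text": "p, li",
    },
}
```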

I’ll poke around and see if we have any other blogs/docs that might be useful to you.


Ahh, that’s very useful to understand. Massively appreciate your help here, @chuck.meyer!
