There seem to be a few articles in the documentation that reference this, but I’m struggling to understand how these pieces fit together.
These are the insights I’ve managed to gather so far:
Tika is a useful tool for document conversion/parsing
Originally referenced in Indexing non-HTML documents | Algolia, but I’ve also seen it mentioned in other articles about HTML document parsing.
Semantic markup is helpful for relevancy
Mentioned in:
- Indexing non-HTML documents | Algolia
- Building search for documentation — Laravel doc search | Algolia Blog
The blog post indicates that splitting on header tags can produce some form of section, but it doesn’t explain how that split produces a hierarchy of sections. Based on language in other documents, I would guess a custom Tika parser, but I can’t be sure.
Long documents (like HTML) are worth splitting into hierarchical sections
Mentioned in Indexing long documents | Algolia.
The direct quote:
Besides, it’s better to avoid indexing much content in a single record, as it degrades search relevance. A better approach is to create small, hierarchical objects based on the structure of the page.
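To make the idea concrete, here is a minimal sketch of what I imagine header-based splitting could look like. This is my own guess using only Python's standard library, not DocSearch's actual implementation: the `SectionSplitter` class, the `split_html` function, and the `hierarchy`/`lvlN`/`content` field names are all hypothetical and chosen for illustration.

```python
from html.parser import HTMLParser

# Map heading tags to their depth in the hierarchy.
HEADINGS = {"h1": 1, "h2": 2, "h3": 3, "h4": 4, "h5": 5, "h6": 6}


class SectionSplitter(HTMLParser):
    """Collects one record per heading-delimited section (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.records = []    # finished records
        self.path = {}       # heading level -> current heading text
        self.heading = None  # level of the heading being read, if any
        self.buffer = []     # text fragments of the heading or section body

    def _text(self):
        # Join the buffered fragments and collapse runs of whitespace.
        return " ".join(" ".join(self.buffer).split())

    def _flush(self):
        # Emit the body text gathered under the current heading path.
        content = self._text()
        if content:
            hierarchy = {f"lvl{k}": v for k, v in sorted(self.path.items())}
            self.records.append({"hierarchy": hierarchy, "content": content})
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self._flush()  # a new heading closes the previous section
            self.heading = HEADINGS[tag]

    def handle_endtag(self, tag):
        if tag in HEADINGS and self.heading == HEADINGS[tag]:
            title = self._text()
            # Keep only shallower headings; deeper ones belonged to the old section.
            self.path = {k: v for k, v in self.path.items() if k < self.heading}
            self.path[self.heading] = title
            self.heading = None
            self.buffer = []

    def handle_data(self, data):
        self.buffer.append(data)


def split_html(html):
    """Return a list of small records, one per section of the page."""
    parser = SectionSplitter()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit the final section
    return parser.records
```

Each record carries the chain of headings above it plus that section's text, so a hit on a paragraph can still be displayed in context. The sketch ignores things a real page would need handled, such as skipping `script` and `style` content, navigation chrome, and very long sections that should be split further.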
An example
It seems like the best example of this capability is Algolia’s DocSearch. The blog post mentions it, so I imagine DocSearch was the natural extension of this functionality.
I would like to be able to index HTML documents in a similar way, but I cannot rely on DocSearch (or the crawler) as it currently stands, since I get my HTML documents in a slightly different way.
Question
Does anyone have any more insight into how Algolia’s DocSearch parses generic HTML documents to generate smaller hierarchical records as recommended in their docs?
Or otherwise, is anyone aware of best practices for approaching this type of problem?
I assumed something like this might be open source, but I couldn’t find anything online.