Great job, and glad to know you managed to run something similar quickly.
About the JSON structure: Algolia is schema-less, so you can push anything you want. But based on past experience with other projects that search into HTML content (e.g. DocSearch), we’ve found that this format works well.
The important thing to note is that the JSON contains three different kinds of attributes: attributes that are displayed, attributes that are searched, and attributes used for custom ranking. Let me quickly go through them.
`text` and `html` are very similar, `text` being a stripped-down version of `html`. I recommend searching into `text` and using `html` for the display. Because `html` can contain specific markup (classes, tags) of your site, it has the advantage that, when displayed, it will re-use any CSS rules your website is already using. It might not fit all designs, though, which is why you can always fall back to using `text` for display.
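To make the relation between the two attributes concrete, here is a minimal sketch of deriving `text` from `html`. It is not how the gem does it; `html_to_text` and `make_content_attributes` are hypothetical helper names, and the whitespace handling is an assumption.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment, dropping tags and attributes."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Hypothetical helper: return the fragment's text without any markup."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left behind by the removed markup.
    return " ".join("".join(parser.parts).split())


def make_content_attributes(html):
    """Build the two content attributes: `html` for display, `text` for search."""
    return {"html": html, "text": html_to_text(html)}
```

Searching the `text` value avoids matching on class names or tag names, while the `html` value keeps the markup your CSS expects.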
`hierarchy` contains the “breadcrumb” path in the page hierarchy up to your paragraph. For example, if this specific paragraph is under an `h3` that is itself under an `h2` and an `h1`, this will be reflected in the `hierarchy`. You can use this info for display purposes (showing the user the context of the match), but you can also search into those titles.
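As a rough illustration of how such a breadcrumb could be built, the sketch below walks a flat list of elements and remembers the headings currently in effect. The input shape (a list of `(tag, text)` tuples), the `attach_hierarchy` name, and the exact layout of the `hierarchy` value are assumptions for the example, not the gem's actual output.

```python
HEADINGS = ("h1", "h2", "h3", "h4", "h5", "h6")


def attach_hierarchy(elements):
    """elements: list of (tag, text) tuples in document order.

    Returns one record per paragraph, each carrying the headings above it.
    """
    current = {}  # heading tag -> title currently in effect
    records = []
    for tag, text in elements:
        if tag in HEADINGS:
            level = HEADINGS.index(tag)
            current[tag] = text
            # A new heading invalidates every deeper heading seen so far.
            for deeper in HEADINGS[level + 1:]:
                current.pop(deeper, None)
        elif tag == "p":
            records.append({"text": text, "hierarchy": dict(current)})
    return records
```

For a page with an `h1` "Guide", an `h2` "Install" and an `h3` "Linux", a paragraph under "Linux" would carry all three titles, while a paragraph under a later `h2` would carry only the `h1` and that `h2`.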
The `weight` hash is used for the Algolia custom ranking. What happens when two records match your query? Which one should be displayed first? To handle those cases, I’ve added the `heading` and `position` attributes to order them.
`heading` is a score based on the paragraph’s place in the hierarchy. One that is right under an `h1` will have 90, one under an `h2` will have 80, and so on. A higher score means a more generic paragraph, a lower score a more specific one.
`position` is nothing more than the order of the paragraph in the page. The first element will have `1`, the second `2`, and so on.
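Here is a small sketch of how the two weights could be computed and how they would break a tie between equally relevant records. The formula `100 - 10 * level` is my reading of “90 for `h1`, 80 for `h2`, and so on”, and the sort direction (more generic first, then earlier in the page) is an assumption; in Algolia itself this ordering would be expressed through the index's custom ranking setting rather than in your own code.

```python
def heading_weight(level):
    """level is the depth of the closest heading: 1 for h1, 2 for h2, ..."""
    return 100 - 10 * level


def add_weights(paragraphs):
    """paragraphs: list of (text, heading_level) tuples in page order."""
    return [
        {
            "text": text,
            "weight": {"heading": heading_weight(level), "position": index},
        }
        for index, (text, level) in enumerate(paragraphs, start=1)
    ]


def rank(records):
    """Local stand-in for the tie-break: highest heading score first,
    then lowest position (assumed directions)."""
    return sorted(
        records,
        key=lambda r: (-r["weight"]["heading"], r["weight"]["position"]),
    )
```

With three paragraphs under an `h2`, an `h1` and another `h2`, the one under the `h1` would come first, followed by the two `h2` paragraphs in page order.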
Something that is missing from the gem, that is added by the Jekyll plugin, and that should be added by yours as well, is the `url` key. It is used mainly so people can click on your results to get to the relevant page (and I’ve also added the `anchor` attribute so they can jump to the closest part of the page). The `url` is also used for the `distinct` feature, but I think you already figured that out.
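A sketch of what adding those keys could look like on your side. `with_location` and `result_link` are hypothetical helpers, and the settings dictionary is only an illustration of de-duplicating by page: `attributeForDistinct` and `distinct` are Algolia index settings, but the values shown are an assumption about your setup, not something taken from the plugin.

```python
def with_location(record, page_url, anchor=None):
    """Return a copy of the record with the page URL and the closest anchor."""
    located = dict(record)
    located["url"] = page_url
    located["anchor"] = anchor
    return located


def result_link(record):
    """Build the link a search result should point to."""
    if record.get("anchor"):
        return f'{record["url"]}#{record["anchor"]}'
    return record["url"]


# Assumed settings: keep only the best-matching record per page.
INDEX_SETTINGS = {
    "attributeForDistinct": "url",
    "distinct": True,
}
```

A record for a paragraph under a heading with the id `install` on `/docs/setup.html` would then link to `/docs/setup.html#install`.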
`uuid` is a unique identifier of the record. It is generated from the record itself, so two exactly identical records will have the same `uuid`. I added this because, in the very first version of the Jekyll plugin, every time I re-indexed a Jekyll website I was deleting and then pushing all records again, even those that didn’t change. That was quickly eating up the number of operations available on the free plans. Now, with this `uuid` and the `lazy_update` feature, the plugin tries to be smarter: it does a `diff` between the local set of records and what is already in the Algolia index, and only adds/removes the relevant ones.
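The idea can be sketched in a few lines. This is not the plugin's code: the hashing scheme (SHA-256 over the record serialized as sorted JSON) and the helper names `record_uuid` and `diff_records` are assumptions, and fetching the identifiers already in the index is left out.

```python
import hashlib
import json


def record_uuid(record):
    """Content-based identifier: identical records always get the same value."""
    payload = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def diff_records(local_records, remote_uuids):
    """Compare local records with the identifiers already in the index.

    Returns (records to add, identifiers to delete); unchanged records
    appear in neither list, so they cost no indexing operation.
    """
    local_by_uuid = {record_uuid(r): r for r in local_records}
    remote = set(remote_uuids)
    to_add = [
        dict(record, uuid=uuid)
        for uuid, record in local_by_uuid.items()
        if uuid not in remote
    ]
    to_delete = sorted(remote - set(local_by_uuid))
    return to_add, to_delete
```

Because the identifier depends on the content, an edited paragraph shows up as one deletion (the old identifier) plus one addition (the new one), and everything else is left alone.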
Have fun playing with it