Gatsbyjs / phenomic - Algolia site search

Hi

Looking for a way to hook up Algolia to the Markdown pages generated by Gatsbyjs.

I found this for Jekyll - https://blog.algolia.com/instant-search-blog-documentation-jekyll-plugin which sounds like it falls roughly in line with what’s needed.

Ideally this would be something done with NodeJS

Here’s a few issues from Gatsbyjs & Phenomic looking for a solution.

Just opening discussion for now - to see what might be the best way to go about this.

Hello Tim, I’m Tim :slight_smile:

I did the Algolia Jekyll plugin, so if there’s anything I could help you with in creating a similar plugin for Gatsby let me know, I’d be happy to help.

In a nutshell, I take each page generated by Jekyll (excluding the sitemap, RSS feed and a few useless ones) and split its content by paragraph. Indexing each page as one raw wall of text is very inefficient in terms of relevance, so splitting by paragraph (and then grouping on a shared key, like the post id) lets you have a more fine-grained search, while still returning each page only once in the results.

To make the parsing easier, I do let Jekyll convert the Markdown to HTML, and I then use Nokogiri (HTML/CSS selection engine) to do my splitting on the HTML content (excluding the layout).

I’ve written a Ruby gem that takes HTML as input and returns JSON objects compatible with Algolia, that include all the relevant title hierarchy information of each paragraph.

I’d love to see the same feature available in other static website generators, so let me know if I can help.

Could you provide roughly the json “schema” for the data that is getting pushed to Algolia?

I’m new to Algolia so curious about the shared key i.e. “grouping on a shared key”

Also that article talks about saving the raw paragraph + the sanitized text etc… so thinking if I mimic the same structure you have that would make sense.

Something like this?

[{id: 1, key: '12421', txt: '1st paragraph ... ', html: '<p>1st paragraph ...</p>', title: 'Some Title'},
{id: 2, key: '12421', txt: '2nd paragraph ... ', html: '<p>2nd paragraph ...</p>', title: 'Some Title'},
]

Thanks @pixelastic

Looks like you note this structure in your html-hierarchy-extractor

{
  :uuid => "1f5923d5a60e998704f201bbe9964811",
  :tag_name => "p",
  :html => "<p>The hero quit his jobs, hit the road, or whatever cuts him from his previous life.</p>",
  :text => "The hero quit his jobs, hit the road, or whatever cuts him from his previous life.",
  :node => #<Nokogiri::XML::Element:0x11a5850 name="p">,
  :anchor => nil,
  :hierarchy => {
    :lvl0 => "The Hero's Journey",
    :lvl1 => "Part One: Departure",
    :lvl2 => "Crossing the Threshold",
    :lvl3 => nil,
    :lvl4 => nil,
    :lvl5 => nil,
    :lvl6 => nil
  },
  :weight => {
    :heading => 70,
    :position => 3
  }
}

===================================

So sounds like this will work

  1. get the list of markdown files

  2. loop over them and parse the markdown to html

  3. select relevant tags (h1, h2 …, p)

  4. build up a json structure for each tag to send to Algolia

  5. follow the API docs to send data in batches to Algolia
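Steps 3 to 5 could be sketched roughly like this in NodeJS. This is only an illustration, not the code used in this thread: it assumes the Markdown has already been rendered to HTML, it uses a regex instead of a real HTML parser to stay dependency-free, and the names extractRecords and chunk are made up for the example. The actual upload call is left out.

```javascript
// Hypothetical sketch: one record per <p>, carrying the headings above it.
// A regex keeps this dependency-free; a real HTML parser is more robust.
function extractRecords(html, url) {
  const tagPattern = /<(h[1-6]|p)\b[^>]*>([\s\S]*?)<\/\1>/gi;
  const headings = [null, null, null, null, null, null]; // h1..h6
  const records = [];
  let position = 0;
  let match;
  while ((match = tagPattern.exec(html)) !== null) {
    const tagName = match[1].toLowerCase();
    const text = match[2].replace(/<[^>]+>/g, '').trim();
    if (tagName !== 'p') {
      const level = Number(tagName[1]); // 1..6
      headings[level - 1] = text;
      headings.fill(null, level); // a new heading resets every deeper level
      continue;
    }
    position += 1;
    records.push({
      url,
      tag_name: tagName,
      html: match[0],
      text,
      // lvl0 holds the h1, lvl1 the h2, and so on.
      hierarchy: Object.fromEntries(headings.map((h, i) => [`lvl${i}`, h])),
      position,
    });
  }
  return records;
}

// Split the records into batches before sending them to the indexing API.
function chunk(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}
```

Each batch returned by chunk would then be handed to whatever client call pushes objects to the index.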

@pixelastic

Just FYI - I got a draft version of this running for myself. It is able to parse the html etc… and push the results to Algolia.

I’ll see later if it makes sense to package something generic up - for now I need to play around with the tags I’m saving back, weights etc…

Thanks for the initial info. :grinning:

Great job, and glad to know you managed to run something similar quickly.

About the JSON structure, Algolia is schema-less so you can actually push anything you want. But based on past experience of other projects that search into HTML content (e.g. DocSearch), we’ve found that this format is one that works well.

The important thing to note is that the JSON contains 3 different kinds of attributes: attributes that are displayed, attributes that are searched and attributes used for custom ranking. Let me quickly go through them.

text and html are very similar, text being a stripped down version of html. I recommend searching into text and using html for the display. Because html can contain specific markup (classes, tags) of your site, it has the advantage that when displayed, it will re-use any CSS rule your website is already using. It might not fit all designs, though, which is why you can always fall back to using text for display.

hierarchy contains the “breadcrumb” path in the page hierarchy up to your paragraph. For example if this specific paragraph is under an h3 that is itself under an h2 and an h1, this will get reflected in the hierarchy. You can use this info for display purposes (indicating to the user the context of the match), but you can also search into those titles.

The weight hash is used for the Algolia custom ranking. What happens when two records match your query? Which one should be displayed first? To handle those cases, I’ve added the heading and position attributes to order them.

  • heading is a score based on the paragraph’s place in the hierarchy. One that is right under an h1 will have 90, one under an h2 will have 80 and so on. Higher score means more generic paragraph, lower score means more specific.
  • position is nothing more than the order of the paragraph in the page. The first element will have 1, the second 2 and so on.
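As a rough illustration of those two values, here is a minimal sketch. The formula 100 - 10 × depth is inferred from the numbers above (90 under an h1, 80 under an h2, 70 in the earlier example with three levels set); the value 100 for a paragraph with no heading, and the sort direction (more generic first, then earlier in the page), are assumptions. The function names are hypothetical.

```javascript
// Hypothetical helper: score a record from the deepest heading above it.
function headingWeight(hierarchy) {
  let depth = 0; // 0 means no heading above the paragraph (assumed score: 100)
  for (let i = 0; i <= 5; i += 1) {
    if (hierarchy[`lvl${i}`] != null) depth = i + 1;
  }
  return 100 - depth * 10;
}

// Local comparator mirroring a custom ranking of
// "heading descending, then position ascending" (assumed direction).
function compareByWeight(a, b) {
  return (
    b.weight.heading - a.weight.heading ||
    a.weight.position - b.weight.position
  );
}
```

Sorting an array of records with compareByWeight gives a quick local preview of how ties between matching records would be broken.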

Something that is missing from the gem, that is added by the Jekyll plugin, and that should be added by yours as well, is the url key. It is used mainly so people can actually click on your results to get to the relevant page (and I’ve also added the anchor attribute so they can jump to the closest part of the page). The url is also used for the distinct feature, but I think you already figured that out :slight_smile:
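To make the three kinds of attributes concrete, index settings along these lines could be used. This is a sketch under assumptions, not the Jekyll plugin’s actual configuration: the setting names (searchableAttributes, attributeForDistinct, distinct, customRanking) are standard Algolia index settings, but the specific values and the resultLink helper are illustrative.

```javascript
// Illustrative settings object; the values are assumptions for this thread.
const indexSettings = {
  // Attributes that are searched: headings first, then the stripped text.
  searchableAttributes: [
    'hierarchy.lvl0',
    'hierarchy.lvl1',
    'hierarchy.lvl2',
    'text',
  ],
  // Group paragraph records by page so each page is returned only once.
  attributeForDistinct: 'url',
  distinct: true,
  // Tie-breakers between records that match equally well.
  customRanking: ['desc(weight.heading)', 'asc(weight.position)'],
};

// Hypothetical helper: build the link a search result should point to.
function resultLink(record) {
  return record.anchor ? `${record.url}#${record.anchor}` : record.url;
}
```

The settings object would be passed to the index’s settings call of whichever API client is used.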

Finally, the uuid is a unique identifier of the record. It is generated from the record itself, so two exactly identical records will have the same uuid. The reason I added this was because in the very first version of the Jekyll plugin, every time I re-indexed a Jekyll website I was deleting then pushing all records again, even those that didn’t change, and it was quickly using up the number of operations available on the free plans. Now, with this uuid and the lazy_update feature, the plugin tries to be smarter: it does a diff of the local set of records and what is already in the Algolia index, and only adds/removes the relevant ones.
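The idea of a content-derived id plus a diff could look roughly like this in NodeJS. It is a sketch of the technique, not the plugin’s implementation: MD5 is an assumption (the example uuid above is 32 hex characters), and canonical, recordId and diffRecords are hypothetical names. objectID is the attribute Algolia uses to identify a record.

```javascript
const { createHash } = require('node:crypto');

// Serialize with sorted keys so that key order does not change the hash.
function canonical(value) {
  if (Array.isArray(value)) return `[${value.map(canonical).join(',')}]`;
  if (value !== null && typeof value === 'object') {
    const body = Object.keys(value)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${canonical(value[key])}`);
    return `{${body.join(',')}}`;
  }
  return JSON.stringify(value);
}

// Identical records always get the same id (hash choice is an assumption).
function recordId(record) {
  return createHash('md5').update(canonical(record)).digest('hex');
}

// Compare local records with the ids already in the index, and return only
// what has to be pushed and what has to be deleted.
function diffRecords(localRecords, remoteIds) {
  const remote = new Set(remoteIds);
  const local = new Map(localRecords.map((r) => [recordId(r), r]));
  const toAdd = [...local]
    .filter(([id]) => !remote.has(id))
    .map(([id, record]) => ({ objectID: id, ...record }));
  const toDelete = remoteIds.filter((id) => !local.has(id));
  return { toAdd, toDelete };
}
```

A changed paragraph shows up as one deletion (the old id) and one addition (the new id), while untouched records cost no operations.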

Have fun playing with it :slight_smile:

Fantastic thanks for breaking all that down further. In particular after looking at your example schema I wasn’t sure how hierarchy was being used. My initial work on this is for the OS project - https://reactfaq.site (might be a fit for docsearch).

Overall I’m finding that, because it’s mostly a link site at this point, weighting based on paragraphs likely doesn’t make sense. I’m playing around with the idea of weighting based on tag type, with H1, H2…, strong, li and a being the main ones, and more weight given to the headings + strong. It will probably make sense to work that in along the lines of how you’re scoring the ‘heading’. I’m not sure how well that will work out, so I’d appreciate your thoughts if you have any on that.

Noted on the url - thought of that last night.

I’ll dig into the lazy_update feature thanks for pointing that out.

As you said, your website is more of a link website than a content one so the split by paragraph might not be the best fit. Actually, parsing the output HTML to get the content might be the wrong approach. It has some great strengths, but it also has one main weakness.

The main thing to keep in mind is that, in search, if you want to provide great relevance, you need to feed your engine with data (weight, priorities). And when you start going the HTML-parsing way, you’re limited to the data that is actually available in the HTML markup. From there, I see two solutions:

Solution 1: Adding all needed data to the HTML markup

You could wrap each of your links in a specific HTML tag, something like <span data-priority="3">, and parse this data-priority attribute to add some weight to your indexed links.
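For instance, reading that attribute back could be sketched as follows. The markup shape (a span directly wrapping a single link) and the function name extractPrioritizedLinks are assumptions, and a regex is used only to keep the example dependency-free; a proper HTML parser would be the safer choice on real pages.

```javascript
// Hypothetical sketch: collect links wrapped in <span data-priority="N">.
function extractPrioritizedLinks(html) {
  const pattern =
    /<span\b[^>]*\bdata-priority="(\d+)"[^>]*>\s*<a\b[^>]*\bhref="([^"]*)"[^>]*>([\s\S]*?)<\/a>\s*<\/span>/gi;
  return [...html.matchAll(pattern)].map((match) => ({
    name: match[3].replace(/<[^>]+>/g, '').trim(),
    url: match[2],
    priority: Number(match[1]),
  }));
}
```

Links without the wrapping span are simply skipped here; they could instead be indexed with a default priority.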

Solution 2: Build a separate data-to-algolia pipeline

Instead of generating your website from markdown files, and then parsing the HTML to push to Algolia, you might want to first extract your relevant data (the list of links) into an easier to parse format (like a JSON file).

Then, you can use this data file to generate the website (it’s a pretty common pattern; Jekyll has collections, Middleman has data for that). And you can also use the same file to push records to Algolia. This file could be as simple as a JSON object with a key for each category, containing a list of links with name, url and weight for each.

The second solution is more work, but will also give you more flexibility. That being said, if you have a small (< 100) number of links, you might not need custom weights as all your links will already be curated and people will usually find what they’re looking for with keywords only.
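A data file of that shape, and the flattening into one record per link, might look like the sketch below. The category names, URLs and weights are invented placeholders, and toRecords is a hypothetical helper.

```javascript
// Placeholder data: one key per category, each holding a list of links.
const linkData = {
  'Getting started': [
    { name: 'Intro', url: 'https://example.com/intro', weight: 3 },
    { name: 'Setup', url: 'https://example.com/setup', weight: 2 },
  ],
  Tools: [{ name: 'Linter', url: 'https://example.com/linter', weight: 1 }],
};

// Flatten the categories into one record per link, keeping the category
// on each record so it can be searched or displayed.
function toRecords(data) {
  return Object.entries(data).flatMap(([category, links]) =>
    links.map((link) => ({ category, ...link }))
  );
}
```

The same linkData object can feed the site templates, so the site and the index never drift apart.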

I ended up getting DocSearch hooked up which works well for my initial use case.

Solution 2 could have been achieved fairly easily with the direction I was headed using this as a parser (https://github.com/cheeriojs/cheerio). I need to look into it further to see how well it deals with hierarchy.

My initial code was just grabbing all the <li>, <a> etc… and ranking based on the <li> or number vs position on the page. Position on the page makes more sense (like what you set up).

I do plan on doing more work on static sites which will have more typical <p> type content in the future. Thanks for all the valuable information / insights on how to approach that.