Experience and advice for dealing with indexing of code in software documentation?

Recently, I noticed that a particular term (gflag) did not appear in my search results for the technical documentation for YugaByte DB: nothing was returned. When I did a Google search of the docs using site:docs.yugabyte.com "gflag", I got about 40 results. Google found the term even though it was originally encoded in Markdown as code using backticks, either within single backticks for inline code (for example, gflag) or within a multiple-line code block with three backticks before and after the code. Here’s an example of a “fenced code block”:

All of the content within
this code block is treated
as monospace.

A search for restrictHighlightAndSnippetArrays in the Algolia API Reference finds only one result, and fails to return the API Parameters page, which has that term inside of a <code> tag.

Based on my reading of issues and comments by others:

  • Code tag (<code>) content is not indexed by default.
  • @Sylvain.PACE has told users that Algolia "does not recommend to index code since it will introduce a lot of noise." For example, see: https://github.com/algolia/docsearch-configs/pull/491#issuecomment-404233169.
  • Multi-line code blocks are most often rendered with <pre> tags and may be indexed, unless they also contain a <code> tag (I found this in one example).
  • Document conventions for most technical documentation say that the “monospace” format is used to “indicate commands, URLs, code in examples, or text that appears on the screen.” (from Oracle)
  • With the popularity of Markdown, the easiest way to generate “monospace” text is to use backticks. And because inline backticks cause the text to be wrapped in <code> tags, important terms and phrases are all too often passed over by the Algolia indexing and ignored.
  • As technical writers know, important functions, parameters, options, etc. are included in text, headings, tables, and lists.
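As a rough illustration of the backtick-to-<code> conversion described above, here is a toy Python sketch. It is not a Markdown parser (real renderers handle many more cases); the regular expression and the function name are assumptions made only for this example.

```python
import re

# Toy pattern: one backtick, some non-backtick text on the same line,
# then one closing backtick. Real Markdown renderers handle far more cases.
INLINE_CODE = re.compile(r"`([^`\n]+)`")


def render_inline_code(markdown_text):
    """Wrap backtick-delimited spans in <code> tags, as renderers typically do."""
    return INLINE_CODE.sub(r"<code>\1</code>", markdown_text)


rendered = render_inline_code("Set the `gflag` value at startup.")
print(rendered)
```

The term itself is unchanged; only the wrapping element differs, which is why an indexer that skips <code> elements never sees it.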

I noticed that Algolia documentation has introduced a code snippet convention that doesn’t generate the <code> tags and thus these terms are indexed unless intentionally excluded. But, as the example from the Algolia API Reference above shows, it can be difficult for users to find a term when it is encoded with <code> tags and ignored.

I disagree that code (and inline functions, parameters, etc.) introduces a lot of “noise,” especially in software documentation. The failure to index important terms explains why users are often frustrated when searching DocSearch-indexed documentation. Google doesn’t ignore these terms by default, and it deals with a lot more noise than typical Algolia users do.

The docs are weak at explaining what happens to code and other “monospace” terms created using Markdown backticks. Can anyone offer some good advice about how to enable and better manage search within code blocks or inline code?

Thanks,
Steve

:wave: @steveband,

Just for the record, I had a look at your website, and it seems that it is not available. Is that expected?

I couldn’t find any hosted DocSearch index scraping this website either. Are you running DocSearch on your own?

Using <code/> tags to wrap long snippets of code will introduce some noise, since these snippets are very similar from one page to another. They will introduce duplicates and interfere with the indexed data. However, you can highlight some parts of this code using a specific tag that tells the DocSearch crawler which parts are important. Indexing these parts will be useful for the search experience, and you can add them as a text selector in your DocSearch configuration. These elements should not be longer than 2-3 lines of textual content, to avoid indexing big chunks of code and diluting the impact of the query’s keywords.

Also, the crawler doesn’t remove every code element by default. It only does so when code is part of the selectors_exclude array.
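To make the two points above concrete, here is a minimal sketch of what such a configuration might look like, written as a Python dict and dumped to JSON. It assumes the selectors / selectors_exclude layout of the DocSearch scraper config; the index name, URL, CSS selectors, and the search-highlight class are hypothetical placeholders, and the two helper functions are only illustrative checks, not part of DocSearch.

```python
import json

# Hypothetical config in the style of the DocSearch scraper.
# Index name, URL, CSS selectors, and the "search-highlight" class
# are placeholders for illustration only.
config = {
    "index_name": "example-docs",
    "start_urls": ["https://docs.example.com/"],
    "selectors": {
        "lvl0": "article h1",
        "lvl1": "article h2",
        "lvl2": "article h3",
        # Paragraph text, plus short snippets marked up for search.
        "text": "article p, article li, article .search-highlight",
    },
    # Elements matching these selectors are removed before extraction.
    # `code` is deliberately absent so inline code stays searchable.
    "selectors_exclude": ["pre"],
}

MAX_SNIPPET_LINES = 3  # upper bound suggested in the reply above


def is_excluded(cfg, selector):
    """Return True if `selector` appears in the config's exclusion list."""
    return selector in cfg.get("selectors_exclude", [])


def is_short_enough(snippet, limit=MAX_SNIPPET_LINES):
    """Return True if a highlighted snippet stays within the line limit."""
    return len(snippet.strip().splitlines()) <= limit


print(json.dumps(config, indent=2))
```

If inline terms are missing from results, checking whether code appears in selectors_exclude is probably the first thing to look at.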

Sorry about the site being offline (it is online now) — we just started migrating to Netlify and a software engineer inadvertently caused the temporary downtime. I just started at YugaByte recently and we do our own DocSearch index scraping. I am familiar with DocSearch from my previous position, where we used Algolia for index scraping.

I will take a closer look at the selectors_exclude array to get a better sense of its usage with code. On most software documentation sites, large code blocks appear to be rendered in HTML with <pre> tags. I found one DocSearch customer that included <code> tags in those sections too, and maybe that was to help reduce some of the “noise.”

  • Is content within <pre> tags indexed by Algolia by default?
  • In the example above, the gflag term was rendered with <code> tags and almost all instances were in text paragraphs. As I understand it, adding code to the text selector will get those code tags indexed.
  • Function names, parameters, and options are often marked with backticks in Markdown, resulting in these terms being rendered within <code> tags in HTML. Is this correct?: If those terms (with <code> tags) are included in headings (for example, h3), then if I want them indexed, I need to add code (possibly with a class name) to the level selector in the indexing configuration file.

Thanks, Sylvain, for helping me (and others) better understand DocSearch!

:wave: @stevebang,

Thank you for the details and the update.

  • Is content within <pre> tags indexed by Algolia by default?

DocSearch indexes only elements matching the selectors defined in your config. If pre is not defined there, it will not be indexed. The only things we extract silently are the DocSearch meta tags.
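The "only what matches a selector gets indexed" behaviour can be modelled with a toy extractor. This is not the DocSearch scraper: it matches plain tag names rather than CSS selectors, and the page content and command line are hypothetical.

```python
from html.parser import HTMLParser


class SelectedTextExtractor(HTMLParser):
    """Toy model: collect text only inside tags named in `selected`."""

    def __init__(self, selected):
        super().__init__()
        self.selected = set(selected)
        self.depth = 0  # number of selected tags currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.selected:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.selected and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract(html, selected):
    """Return the text chunks found inside the selected tags."""
    parser = SelectedTextExtractor(selected)
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical page fragment; the command line is made up.
PAGE = (
    "<h2>Flags</h2>"
    "<p>Set the flag at startup.</p>"
    "<pre>./server --example_flag=1</pre>"
)

without_pre = extract(PAGE, ["h2", "p"])
with_pre = extract(PAGE, ["h2", "p", "pre"])
print(without_pre)
print(with_pre)
```

In this model the <pre> content only shows up once pre is added to the selected set, which mirrors the answer above.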

  • In the example above, the gflag term was rendered with <code> tags and almost all instances were in text paragraphs. As I understand it, adding code to the text selector will get those code tags indexed.

Yes

  • Function names, parameters, and options are often marked with backticks in Markdown, resulting in these terms being rendered within <code> tags in HTML. Is this correct?: If those terms (with <code> tags) are included in headings (for example, h3), then if I want them indexed, I need to add code (possibly with a class name) to the level selector in the indexing configuration file.
  1. It depends on which website generator you are using, but most of the time that is the case.
  2. No, you do not. As long as these elements are included within the heading <hX/> tags, they will be scraped.
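The second answer follows from how element text is usually extracted: the text of a heading includes the text of its nested inline elements. The sketch below shows this with Python's standard ElementTree rather than the DocSearch scraper itself, and the heading markup is a hypothetical example.

```python
import xml.etree.ElementTree as ET

# Hypothetical heading as a Markdown renderer might emit it.
heading = ET.fromstring("<h3>Set the <code>gflag</code> value</h3>")

# itertext() walks the element and all of its descendants in document order,
# so text inside the nested <code> element is included.
heading_text = "".join(heading.itertext())
print(heading_text)
```

So a term wrapped in <code> inside an h3 is picked up through the heading selector, as long as code is not removed by selectors_exclude.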

Hope it helps, let us know if you need anything :slight_smile: