Docsearch / Topic based authoring - How to work with several sets of meta tags per page


We write our documentation in DITA topics (one adoc file per concept, reference, or task). Each topic has meta tags for description and keywords, which by default are written into <head> as meta keywords/description when the topic is converted to HTML. However, those topics are ‘bundled’ for web view, because one ‘.html’ per topic would be too fine-grained and would force the user through an endless number of clicks.

During ‘bundling’, the keywords and description meta tags of every topic except the first are dropped. Their <h1> titles become <h2> titles, and below each <h2> we bring the keywords back as plain text, wrapped in a .keywords class with display:none; the description text gets the same treatment in a .description class with display:none.

As an example, the structure of a topic-bundle may look like this (closing tags per line omitted):

<h1> Concept 1
<p> Preamble text
<h2 id="Reference1"> Reference 1
<p class="keywords"> Keyword1, Keyword2, Keyword3
<p class="description"> Description Text of Reference 1
<p> Reference 1 content
<h2 id="Task1"> Task 1
<p class="keywords"> Keyword4, Keyword5, Keyword6
<p class="description"> Description Text of the Task 1
<h2 id="Task2"> Task 2
<p class="keywords"> Keyword7, Keyword8, Keyword9
<p class="description"> Description Text of the Task 2
<h2 id="ConceptB"> Concept B
<p class="keywords"> Keyword10, Keyword11, Keyword12
<p class="description"> Description Text of Concept B

I am running the docsearch scraper locally and assume that the best method for indexing these extra sets of keywords is adding them to the search configuration:

"lvl0": ".doc h1",
"lvl1": ".doc h2",
"lvl2": ".doc .keywords",
"lvl3": ".doc .description",
"lvl4": ".doc h3",
"lvl5": ".doc h4",

However, I am not sure how to ‘combine’ lvl1 with lvl2 and lvl3. As an example, let’s assume that Keyword4 also appears in the h2 title of Task 1. Keywords are often part of the title, but they also include synonyms. If the user searches for “Keyword4”, Algolia returns a lvl1 result, and lvl2 and lvl3 come back as ‘null’. So far so good: the user can guess from the title whether the result matches their intent.

But to improve the result, what we actually want to show the user who searches for Keyword4 (or a synonym) is the h2 heading (with anchor, lvl1) and the description (lvl3) of the topic, not the matched keyword.

I would therefore be very thankful for a hint on how to tell the scraper to index the additional meta information per anchor as part of the search result, and how to display that information in docsearch.js or instantsearch.js. I could swap lvl1 and lvl3, i.e. place description and keywords before the h2 anchor, but then a click on the search result would not point to the correct anchor. Alternatively, I could omit the description from the index entirely and pre-process the search results somehow. I found transformData: is there a way I could use it to pull in the lvl3 description class after the lvl1 search result?

:wave: @mx2

Thank you for reaching out and for the details.

If I understand your issue correctly, you want the UI to display extra contextual information from the description related to a specific keyword, without displaying the matching keyword itself. Unfortunately, DocSearch was not designed for this use case, since we want our users to understand why the displayed hit matches the query. The lvlX and text attributes are always retrieved and displayed.

Given your issue, I would recommend editing the default strategy of the DocSearch scraper. You will need to add a new attribute, such as keyword, to the records it builds. This keyword attribute should be listed in searchableAttributes but left out of attributesToRetrieve, since you do not want to display this information.
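To make that suggestion more concrete, here is a minimal sketch of what such records and settings could look like. It is not the DocSearch scraper's actual strategy code: it uses Python's standard html.parser on a cleaned-up version of the bundle from the question (with closing tags), and the names BundleParser, INDEX_SETTINGS and the example.com URL are made up for illustration. Only searchableAttributes and attributesToRetrieve are real Algolia index settings.

```python
from html.parser import HTMLParser


class BundleParser(HTMLParser):
    """Builds one record per <h2> section of a bundled page (hypothetical helper)."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.lvl0 = ""
        self.records = []
        self._target = None  # which field the next text node belongs to

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        css_class = attrs.get("class") or ""
        if tag == "h1":
            self._target = "lvl0"
        elif tag == "h2":
            # Every <h2 id=...> starts a new record that links to its own anchor.
            self.records.append({
                "url": f"{self.page_url}#{attrs.get('id') or ''}",
                "hierarchy": {"lvl0": self.lvl0, "lvl1": ""},
                "keywords": [],
                "description": "",
            })
            self._target = "lvl1"
        elif tag == "p" and css_class in ("keywords", "description") and self.records:
            self._target = css_class
        else:
            self._target = None  # ordinary content is ignored in this sketch

    def handle_endtag(self, tag):
        self._target = None

    def handle_data(self, data):
        text = data.strip()
        if not text or self._target is None:
            return
        if self._target == "lvl0":
            self.lvl0 = text
        elif self._target == "lvl1":
            self.records[-1]["hierarchy"]["lvl1"] = text
        elif self._target == "keywords":
            self.records[-1]["keywords"] = [k.strip() for k in text.split(",")]
        else:
            self.records[-1]["description"] = text


# Assumed settings: keywords can match a query but are never sent back to the UI.
INDEX_SETTINGS = {
    "searchableAttributes": ["hierarchy.lvl0", "hierarchy.lvl1", "keywords"],
    "attributesToRetrieve": ["hierarchy", "description", "url"],
}

SAMPLE = """
<h1>Concept 1</h1>
<p>Preamble text</p>
<h2 id="Reference1">Reference 1</h2>
<p class="keywords">Keyword1, Keyword2, Keyword3</p>
<p class="description">Description Text of Reference 1</p>
<p>Reference 1 content</p>
<h2 id="Task1">Task 1</h2>
<p class="keywords">Keyword4, Keyword5, Keyword6</p>
<p class="description">Description Text of the Task 1</p>
"""

parser = BundleParser("https://example.com/concept1.html")
parser.feed(SAMPLE)
parser.close()
records = parser.records
```

With records shaped like this, a query for Keyword4 would match the Task 1 record through its keywords attribute, while the hit returned to the front end carries only the heading, the description and the anchor URL. The settings dictionary would be passed to whichever Algolia API client you use.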

You might also be interested in trying out the Enterprise crawler, which lets you define your own strategy for parsing the content of your web pages.

I would definitely recommend adding this logic before sending records to Algolia. Letting the search engine do the work gives you more flexibility and avoids repeating the same operation on every query. Please note that the transformData method runs on the end-user side, after the search engine has returned its results. Rearranging results there is less accurate, because you only process the small set of hits returned for that query, which is a fraction of the whole index. Relying on transformData alone does not use the full power of Algolia and consumes resources that could be saved.

I hope these hints help you make the most out of Algolia :rocket:

Let us know if you need anything,


Hello, thank you for replying. I ended up going in that direction. It all feels like a bit of a hack, but it does work:

Instead of scraping the final website with the combined topics, I first convert the topics individually from adoc to HTML and scrape them directly with Scrapy to create JSON records, without using the DocSearch scraper. I pass the hierarchy, URLs, and some other specifics that I need in the final records to Scrapy as arguments.
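For readers who want to try something similar, here is a rough sketch of the idea, not the actual spider. To stay self-contained it uses Python's standard html.parser rather than Scrapy, and the names MetaParser and topic_to_record, the record layout, and the example.com URL are assumptions made for illustration. The hierarchy, page URL and anchor are passed in as arguments because the standalone topic page does not know where it will end up in the bundle.

```python
import json
from html.parser import HTMLParser


class MetaParser(HTMLParser):
    """Reads the keywords/description meta tags and the first <h1> of one topic page."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_h1 = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") in ("keywords", "description"):
            self.meta[attrs["name"]] = attrs.get("content") or ""
        elif tag == "h1" and not self.title:
            self._in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1:
            self.title += data


def topic_to_record(topic_html, parent_title, page_url, anchor):
    """Turn one standalone topic page into a search record (hypothetical layout)."""
    parser = MetaParser()
    parser.feed(topic_html)
    parser.close()
    keywords = [k.strip() for k in parser.meta.get("keywords", "").split(",") if k.strip()]
    return {
        "objectID": f"{page_url}#{anchor}",
        "url": f"{page_url}#{anchor}",  # points into the bundled page, not the topic file
        "hierarchy": {"lvl0": parent_title, "lvl1": parser.title.strip()},
        "keywords": keywords,
        "description": parser.meta.get("description", ""),
    }


TOPIC = """
<html><head>
<meta name="keywords" content="Keyword4, Keyword5, Keyword6">
<meta name="description" content="Description Text of the Task 1">
</head><body><h1>Task 1</h1><p>Task 1 content</p></body></html>
"""

record = topic_to_record(TOPIC, "Concept 1", "https://example.com/concept1.html", "Task1")
print(json.dumps(record, indent=2))
```

Because each topic is read before bundling, its meta tags are still in <head>, so nothing has to be hidden in the page with display:none just for the indexer.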

So in the end, I can combine three Asciidoctor text documents into one HTML page with Antora, while keeping the :keywords: and :description: meta tags from each topic “alive” in the search records.

To display the additional metadata in the search result dropdown, I had to switch to autocomplete.js. If anyone is interested in seeing that in action, I’ll post a link once it’s live.

Thank you for your feedback and the follow up.

Happy to see it live yes :slight_smile: