Extract language from html lang attribute

The initial docsearch config for our site docs.contao.org was set up so that the language attribute is extracted from the URL.

However, the Contao docs consist of the developer documentation under docs.contao.org/dev and the manual under docs.contao.org/manual. Only the latter is available in multiple languages and thus only the latter will have the language present in the URL.

This also means that the language attribute is currently missing from the scraped dev docs entirely, which makes facet filtering more difficult. However, both the dev docs and the manual have a lang attribute in their <html> tag. Since this information is already available and using it would be more consistent: is there a way to tell the docsearch config to extract the language from the <html lang="…"> attribute instead? I am not sure how the config would need to be changed to achieve this.

You can add a custom lang attribute to the selectors object in your config.
Most likely you’ll want to mark it global: true.
Since you’re interested in the value of the lang attribute rather than the element’s text content, you’ll need to use XPath, like so:

{
  "selector": "/html/@lang",
  "type": "xpath"
}
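
For context, here is a rough sketch of where that selector might sit in a complete config. The index_name value and the lvl0 and text selectors are placeholders, not taken from the actual Contao config, and the custom_settings block assumes the attribute still has to be declared for faceting on the index. The key is named lang here; if your existing facet filters use language, you would probably want to name the key language instead.

```json
{
  "index_name": "example-docs",
  "start_urls": ["https://docs.contao.org/"],
  "selectors": {
    "lvl0": "h1",
    "text": "p",
    "lang": {
      "selector": "/html/@lang",
      "type": "xpath",
      "global": true
    }
  },
  "custom_settings": {
    "attributesForFaceting": ["lang"]
  }
}
```

With global: true the value is matched once per page and applied to every record extracted from that page, which is why it suits a page-wide attribute such as the one on <html>.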

I’ve just answered a ticket about how to create a facet in a docsearch config: How to set up facets with Docsearch - #2 by Jerska. You might find more useful information there.

Hm, no, I am interested in its content. If the attribute is lang="de", then the language variable should be de. If it is lang="en", then the language variable should be en for the indexed page. Though it appears Algolia already does this by default anyway? Even though nothing is configured for the language of the https://docs.contao.org/dev/ URL, the search still works there when using 'facetFilters': ["language:en"].