Crawler not detecting content in web component shadow DOM

Hello -

I’m having trouble with the Algolia Crawler detecting my content. I have a web component that renders a table of information sourced from an external JSON file; the final output is a table rendered in the shadow DOM of the web component, i.e. <custom-element file=""></custom-element>

Examining the indexed records, the content comes back empty on these pages; these particular pages only contain the web component, so the crawler thinks there’s no content. I configured the plugin in my netlify.toml with renderJavaScript = true, but still no luck.

Is there any way to hint to the crawler to collect what’s in the shadow DOM? Perhaps there’s a timing issue at play and the crawler isn’t waiting long enough for the web component to fully render. If you have any other suggestions, I’m all ears.
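For context, the component works roughly like this minimal sketch (the element name, attributes, and table markup are illustrative, not the real implementation):

```javascript
// Rough sketch of the component described above: fetch an external JSON
// file and render a table into the element's shadow DOM.
// The HTMLElement fallback only lets this file load outside a browser.
const Base = globalThis.HTMLElement ?? class {};

class CustomTableElement extends Base {
  async connectedCallback() {
    const root = this.attachShadow({ mode: 'open' });
    const rows = await this.loadRows();
    // Everything below lives in the shadow DOM,
    // invisible to a plain HTML crawl of the page source.
    root.innerHTML =
      '<table>' +
      rows.map((r) => `<tr><td>${r.name}</td><td>${r.value}</td></tr>`).join('') +
      '</table>';
  }

  async loadRows() {
    // The JSON file comes from the `file` attribute,
    // e.g. <custom-element file="data.json"></custom-element>
    const res = await fetch(this.getAttribute('file'));
    return res.json();
  }
}
```

Because the table only exists inside the shadow root after JavaScript runs, nothing crawlable appears in the light DOM at all.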

Thank you :slight_smile:

Hey @davehudson52

Wonder if you have looked at the renderJavaScript documentation, specifically waitTime?
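For reference, on the standard (non-Netlify) Crawler the option looks roughly like this; a sketch based on the docs, so check the exact shape there:

```javascript
// Sketch of a standard Algolia Crawler config (not the Netlify plugin):
// renderJavaScript can take an object with a waitTime, giving the page
// extra time to finish rendering before extraction. Values are examples.
new Crawler({
  // ...other crawler settings...
  renderJavaScript: {
    enabled: true,
    waitTime: {
      min: 2000,  // always wait at least 2s
      max: 10000, // give up waiting after 10s
    },
  },
});
```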

Hi @davehudson52, could you share an example URL with this behaviour so we can check?

Indeed, it could help if it’s confirmed to be a timing issue, but this parameter is not exposed to the Netlify plugin. Let’s see what the issue is first :slight_smile:


Hello -

Thank you for the replies!

Here is a functional link, with the page I’d like crawled and indexed:

If you inspect the code on the Breadcrumbs page’s default tab (the API tab), you’ll see a <customelement-manifest-element> web component that renders a table in its shadow root:

Here’s a screenshot of the contents of this page according to its index record; none of the web component content is captured in the record:

If you click on the Demos tab from the Breadcrumbs page, there are demos loaded via JS, and these are being correctly indexed. It’s worth noting the demos are also generated via a web component, <docs-demos>, but one created without a shadow DOM, so this content lives in the light DOM:

Screenshot of ‘Demos’ tab record:

I’ll take a shot at playing with waitTime; maybe a buffer will give it time to detect the additional content.

Thank you both for the help and suggestions!

Hi Dave, so after a few tests it seems you are using tech that’s a bit too advanced hehe:

  • Firefox 92: Doesn’t work; I see the following error in the console: Uncaught SyntaxError: unexpected token: identifier (customelement-manifest-element.js:7:55)
  • Chromium 93: Doesn’t work: Uncaught TypeError: "css" is not a valid module type.
  • It works with Chrome 93 or Chromium 96 though.

We are using Puppeteer for the renderJavaScript feature, and the latest version uses Chromium 93:

So it will eventually work when Puppeteer is updated to a newer Chromium version, but personally I think it would be great to make it available for Firefox users too :fox_face: :slight_smile:


First of all thank you for the help, I really appreciate it!

  1. Advanced tech

    @sylvain.bellone totally valid points! We will have to do some refactoring for better support of users and Puppeteer; thanks for pointing it out!

  2. Web component demo

    Here is the latest deployed preview:

    To get a better understanding of what might be happening, I added 2 new, simple web components (WCs) that render text strings to the page; the WCs are on every tab now.

    The WCs added were:

  • <test-shadow-dom-wc>

  • <test-light-dom-wc>

    • No shadow root here; it simply renders out a text string

      Page DOM render:

      Indexed record example:

      Notice that the record only captured the <test-light-dom-wc>’s text, ‘test-light-dom-wc’ (content in the light DOM), and is missing the other WC’s content.

  3. Potential problem

    I wonder if the crawler isn’t currently set up to retrieve items in an element with a shadow DOM? i.e. element.shadowRoot, which would pierce the encapsulated shadow DOM.

  4. Does netlify.toml renderJavascript only support boolean values?

    In the documentation it looks this way, but I just wanted to confirm, since the docs also link to another page that allows a more robust config for renderJavascript (Type: boolean | string[] | object).

    The issue at hand probably wouldn’t be fixed by a waitTime buffer, but figured why leave a stone unturned.

    Thank you for your time and input!!
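For anyone following along, the two test components behave roughly like this sketch (illustrative, not the exact source; the HTMLElement fallback is only there so the snippet parses outside a browser):

```javascript
const Base = globalThis.HTMLElement ?? class {};

class TestLightDomWc extends Base {
  connectedCallback() {
    // Renders straight into the light DOM: a plain HTML crawl can see this text.
    this.textContent = 'test-light-dom-wc';
  }
}

class TestShadowDomWc extends Base {
  connectedCallback() {
    // Renders behind a shadow root: invisible unless the crawler pierces it.
    const root = this.attachShadow({ mode: 'open' });
    root.textContent = 'test-shadow-dom-wc';
  }
}

// In a browser these would be registered with:
// customElements.define('test-light-dom-wc', TestLightDomWc);
// customElements.define('test-shadow-dom-wc', TestShadowDomWc);
```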

Hi @davehudson52, sorry I wasn’t notified of your answers!

Notice that the record only captured the <test-light-dom-wc>’s text, ‘test-light-dom-wc’ (content in the light DOM), and is missing the other WC’s content.

I’m not sure what you’re getting at here. As previously tested, it indeed doesn’t work with Chromium 93, which is what we currently use to render the JavaScript.

I wonder if the crawler isn’t currently set up to retrieve items in an element with a shadow DOM? i.e. element.shadowRoot, which would pierce the encapsulated shadow DOM.

Same answer as above, I think. If I understand correctly, shadow DOM needs JavaScript rendering, which doesn’t work with Chromium 93.

Does netlify.toml renderJavascript only support boolean values?

Yes, only boolean; it doesn’t offer the same granularity as the standard Crawler, indeed. But waiting longer wouldn’t change anything; once again, it’s simply not supported in the Chromium version we currently use.

Hi @sylvain.bellone - that’s alright, thanks for following up regardless.

There are 2 issues at play here:

  1. CSS modules being unsupported by Chrome 93 / Puppeteer as is
    This is a reality we have to accept; obviously nothing can be done other than refactor or wait. However, only one of our web components relies on a CSS module import, and the other WCs’ content is still not being crawled.

  2. Web components with a shadow DOM / shadow root attached are not crawled
    I should’ve specified this better in my reply above, my apologies. For WCs that have a shadow root, the content in the shadow root is completely ignored by the crawler; I don’t think the crawler has been set up to account for these types of custom elements. I think this type of enhancement would be fantastic, considering how much more commonplace web components are becoming.

Check out the example again if you don’t mind: WCs without a shadow root (in the light DOM) are properly crawled, while WCs with a shadow root (in the shadow DOM) are skipped.

Let me know if you need additional info, and thank you :slight_smile:
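To illustrate the kind of enhancement being asked for, here is a hedged sketch of a shadow-piercing text extractor (collectText is a made-up helper, not crawler API):

```javascript
// Hypothetical sketch of shadow-piercing text extraction. A crawler that
// walks childNodes alone misses shadow DOM content; checking
// node.shadowRoot first lets it descend into the rendered tree instead.
function collectText(node) {
  const TEXT_NODE = 3;
  if (node.nodeType === TEXT_NODE) return node.textContent;
  // If the element has a shadow root, its rendered content lives there.
  const children = node.shadowRoot
    ? node.shadowRoot.childNodes
    : node.childNodes || [];
  let out = '';
  for (const child of children) out += collectText(child);
  return out;
}
```

(A real implementation would also need to handle <slot> elements, which project light-DOM children into the shadow tree; this sketch skips that.)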

It’s not available anymore, can you redeploy it?

Spun up a new demo here… WC preview

Added 2 WCs:

  • ‘test-shadow-dom-wc’, renders text to a shadow root; isn’t crawled by Algolia
  • ‘test-light-dom-wc’, renders text without a shadow root; its text is indexed correctly

Ok, thank you for that. So I confirm: shadow DOM is not supported by the crawler, sorry!

Hi @sylvain.bellone,

Is the crawler open source? We can grab the shadow DOM content this way:

Supported since Chrome 90.
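If that refers to Chromium’s shadow-DOM serialization, it presumably looks something like this sketch (getInnerHTML with includeShadowRoots shipped around Chrome 90; newer Chromium versions expose getHTML instead):

```javascript
// Hedged sketch: serialize an element *including* its shadow roots using
// Chromium's getInnerHTML (available from roughly Chrome 90, since
// replaced by getHTML). Falls back to plain innerHTML where unavailable.
function serializeWithShadow(el) {
  if (typeof el.getInnerHTML === 'function') {
    // Shadow roots are emitted as inline <template> blocks.
    return el.getInnerHTML({ includeShadowRoots: true });
  }
  return el.innerHTML;
}

// In a crawler this could run inside the page, e.g.:
// await page.evaluate(
//   () => document.body.getInnerHTML({ includeShadowRoots: true })
// );
```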

Hi @mantou, the crawler is not open source, but its JavaScript rendering engine is. It is based on Puppeteer: GitHub - algolia/renderscript: A custom JavaScript rendering engine based on puppeteer