Crawlers have started visiting URLs with an idx parameter

Hi,

For the past week, a news site (https://sisanjuan.gob.ar) that uses Algolia search has been receiving a massive number of crawler visits to internal pagination pages.

These are some of the URLs from the web server logs:

/ultimas-noticias/itemlist?idx=sisanjuan-gob-ar&p=3&start=8790
/ultimas-noticias/itemlist?amp=1&idx=sisanjuan-gob-ar&p=2&start=11808
/noticias-ciencia-tecnologia-e-innovacion/itemlist/category/34-prensa?amp=1&amp=1&idx=sisanjuan-gob-ar&p=2&start=1136

All of the URLs have the idx=sisanjuan-gob-ar parameter, which is why I trace the source of these references to Algolia. However, these pages are not in the Algolia index, and they can’t be retrieved through the site search.

To stop the visits to these pages, I have just added a 301 redirect rule.

Do you have any idea how these pagination URLs are being generated with the index parameter and ending up in the crawlers’ databases?

Best Regards,
Anibal

Hi Anibal,

Thanks for your message!

If I understand correctly, you are using Algolia for a news site and have noticed that, since you started using it, the website is being crawled heavily.

First, in case you need to share some confidential information, please contact our support by email (support@algolia.com) rather than through the forum.

The only Algolia crawler is the one built for our Site Search feature. This is a paid feature, for which we run some legal checks to ensure that the data being crawled can legitimately be crawled.

To help us clarify the issue:

  • Could you detail the link between the URLs and Algolia?
  • Could you provide us with the bot UserAgent?

Best,
Marc

Hi Marc,

  • Could you detail the link between the URLs and Algolia?

The URL parameter idx=sisanjuan-gob-ar is included in these URLs. Our site does not generate URLs with this query parameter, and the Algolia search service is the only place I am aware of where the parameter could be added to URLs. It’s certainly strange that the pagination URLs carry this parameter with the name of the index (these pages are not even indexed by Algolia).

As far as I know, neither our site nor the Algolia service generates these URLs. Still, crawlers are visiting them.

  • Could you provide us with the bot UserAgent?

Several user agents are visiting these URLs, mostly crawlers:

[24/Dec/2019:16:01:14 +0000] "GET /secciones/ministerio-de-turismo-y-cultura?amp=1&idx=sisanjuan-gob-ar&p=5&start=1408 HTTP/1.1" 301 663 "-" "Mozilla/5.0 (compatible; AhrefsBot/6.1; +ht.p://ahrefs.com/robot/)"
[24/Dec/2019:16:01:18 +0000] "GET /ultimas-noticias?amp=1&amp=1&idx=sisanjuan-gob-ar&p=6&start=10182 HTTP/1.1" 301 613 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +ht.p://www.google.com/bot.html)"
[24/Dec/2019:16:01:25 +0000] "GET /ultimas-noticias?amp=1&idx=sisanjuan-gob-ar&p=1&start=2718 HTTP/1.1" 301 613 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +ht.p://www.google.com/bot.html)"
[24/Dec/2019:16:01:27 +0000] "GET /san-juan-en-noticias/2018-11-07/11099-noticias-en-un-minuto/noticias-ministerio-de-gobierno/item/11056-hasta-el-20-de-noviembre-podran-inscribirse-en-las-carreras-de-seguridad/noticias-ministerio-de-gobierno/item/11019-daran-apertura-a-la-direccion-de-regularizacion-y-consolidacion-dominial HTTP/1.1" 301 651 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +ht.p://www.google.com/bot.html)"
[24/Dec/2019:16:01:31 +0000] "GET /secciones/deportes/itemlist?amp=1&idx=sisanjuan-gob-ar&p=6&start=392 HTTP/1.1" 301 635 "-" "Mozilla/5.0 (compatible; SemrushBot/6~bl; +ht.p://www.semrush.com/bot.html)"
[24/Dec/2019:16:01:33 +0000] "GET /secciones/deportes/itemlist?amp=1 HTTP/1.1" 301 654 "-" "Mozilla/5.0 (compatible; SemrushBot/6~bl; +ht.p://www.semrush.com/bot.html)"
[24/Dec/2019:16:01:38 +0000] "GET /ultimas-noticias?amp=1&amp=1&idx=sisanjuan-gob-ar&p=1&start=13512 HTTP/1.1" 301 613 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +ht.p://www.google.com/bot.html)"
[24/Dec/2019:16:01:40 +0000] "GET /secciones/prensa/item/9611-san-juan-avanza-en-la-modernizacion-del-estado-con-nuevas-tecnologias-en-salud HTTP/1.1" 301 496 "-" "Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +ht.p://megaindex.com/crawler)"
[24/Dec/2019:16:01:45 +0000] "GET /secciones/ciencia-tecnologia-e-innovacion?amp=1&idx=sisanjuan-gob-ar&p=1&start=116 HTTP/1.1" 301 663 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +ht.p://www.google.com/bot.html)"
[24/Dec/2019:16:01:55 +0000] "GET /ultimas-noticias/itemlist?amp=1&idx=sisanjuan-gob-ar&p=1&start=8094 HTTP/1.1" 301 631 "-" "Mozilla/5.0 (compatible; SemrushBot/6~bl; +ht.p://www.semrush.com/bot.html)"
[24/Dec/2019:16:01:59 +0000] "GET /secciones/deportes/itemlist/category/14-deportes?amp=1&amp=1&idx=sisanjuan-gob-ar&p=4&start=436 HTTP/1.1" 301 677 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +ht.p://www.google.com/bot.html)"
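To put numbers on "several user agents", the excerpt can be tallied per crawler. The following is a minimal sketch, not the tooling used in this thread: it assumes the log format shown above (timestamp, request line, status, size, referer, user agent), and the function name and the list of bot tokens are illustrative choices.

```python
import re
from collections import Counter
from urllib.parse import urlsplit, parse_qs

# Four entries copied verbatim from the excerpt above (including "ht.p" as posted).
SAMPLE_LOG = '''\
[24/Dec/2019:16:01:14 +0000] "GET /secciones/ministerio-de-turismo-y-cultura?amp=1&idx=sisanjuan-gob-ar&p=5&start=1408 HTTP/1.1" 301 663 "-" "Mozilla/5.0 (compatible; AhrefsBot/6.1; +ht.p://ahrefs.com/robot/)"
[24/Dec/2019:16:01:31 +0000] "GET /secciones/deportes/itemlist?amp=1&idx=sisanjuan-gob-ar&p=6&start=392 HTTP/1.1" 301 635 "-" "Mozilla/5.0 (compatible; SemrushBot/6~bl; +ht.p://www.semrush.com/bot.html)"
[24/Dec/2019:16:01:33 +0000] "GET /secciones/deportes/itemlist?amp=1 HTTP/1.1" 301 654 "-" "Mozilla/5.0 (compatible; SemrushBot/6~bl; +ht.p://www.semrush.com/bot.html)"
[24/Dec/2019:16:01:40 +0000] "GET /secciones/prensa/item/9611-san-juan-avanza-en-la-modernizacion-del-estado-con-nuevas-tecnologias-en-salud HTTP/1.1" 301 496 "-" "Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +ht.p://megaindex.com/crawler)"
'''

# Matches the fields present in the excerpt; there is no client IP column.
LOG_LINE = re.compile(
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Bot tokens seen in the excerpt; extend as needed (illustrative list).
BOT_TOKEN = re.compile(r'(AhrefsBot|Googlebot|SemrushBot|MegaIndex\.ru)')


def tally_idx_requests(log_text):
    """Count requests per crawler whose query string carries an idx parameter."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LOG_LINE.search(line)
        if match is None:
            continue  # skip lines that do not follow the expected format
        query = parse_qs(urlsplit(match['url']).query)
        if 'idx' not in query:
            continue  # only the URLs under investigation are counted
        bot = BOT_TOKEN.search(match['agent'])
        counts[bot.group(1) if bot else 'other'] += 1
    return counts


idx_counts = tally_idx_requests(SAMPLE_LOG)
print(idx_counts)
```

On these four sample lines, only the AhrefsBot request and the first SemrushBot request are counted, because the other two URLs carry no idx parameter.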

  • Could you detail the link between the URLs and Algolia?

All of these URLs include Algolia’s idx parameter, set to the site’s index name. As far as I know, neither our site nor the Algolia service generates these pagination URLs with the idx parameter; the parameter is only present in Algolia search queries (fired via Ajax calls).

  • Could you provide us with the bot UserAgent?

Please check the log entries quoted above.

Thanks for your reply and the details, Anibal.
The diversity of user agents is intriguing.
I notice that most of the logged URLs contain ?amp=1; could the issue actually be linked to the AMP implementation?

Our code does not generate the idx=sisanjuan-gob-ar parameter, so we have no idea where it is coming from.

Hi @AnibalSanchez,

I’m seeing ‘Googlebot’ in a lot of these queries. We have documentation on how to monitor and exclude some of these search queries, which may be helpful.

The article is giving me ideas for a solution. For instance, I could exclude URLs containing “idx=sisanjuan-gob-ar” in robots.txt.

However, I don’t think that these URLs are generated by mistakenly submitted search queries.

Thanks for the suggestion.

P.S. Disallow: /?*idx=sisanjuan-gob-ar
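The rule in the P.S. can be checked against the logged URLs with a small matcher. This is a minimal sketch that assumes the wildcard semantics described in RFC 9309 and in Google's robots.txt documentation: a rule matches from the start of the path, `*` matches any sequence of characters, and a trailing `$` anchors the end. The helper names and the broader pattern `/*idx=sisanjuan-gob-ar` are hypothetical and were not proposed in this thread.

```python
import re


def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a regular expression."""
    anchored = pattern.endswith('$')
    if anchored:
        pattern = pattern[:-1]
    # Escape everything literally, then turn each "*" into ".*".
    body = '.*'.join(re.escape(part) for part in pattern.split('*'))
    return re.compile(body + ('$' if anchored else ''))


def is_blocked(pattern, path_and_query):
    """True if the Disallow pattern matches from the start of the URL path."""
    return robots_pattern_to_regex(pattern).match(path_and_query) is not None


posted_rule = '/?*idx=sisanjuan-gob-ar'   # as written in the P.S.
broader_rule = '/*idx=sisanjuan-gob-ar'   # hypothetical alternative

logged_url = '/ultimas-noticias/itemlist?amp=1&idx=sisanjuan-gob-ar&p=1&start=8094'
search_url = '/busqueda-avanzada?q=gobierno&hPP=9&idx=sisanjuan-gob-ar&p=6'

print(is_blocked(posted_rule, logged_url))    # False
print(is_blocked(broader_rule, logged_url))   # True
print(is_blocked(broader_rule, search_url))   # True
```

Under these assumptions, the posted pattern only matches URLs whose path is `/` followed directly by the query string, so the logged pagination URLs would not be covered. The broader pattern covers them, but it also matches the legitimate /busqueda-avanzada search URLs, which may or may not be desirable.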

Hi @AnibalSanchez,

Let us know how that works for you. If you continue to have issues, let us know so that we can investigate further. Another thing that may help is to investigate your logs. You can see those in your dashboard or using the ‘getLogs’ method.

If you need to send private information, you can email us at support@algolia.com
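For readers who prefer to query the logs directly, the ‘getLogs’ method corresponds to the REST endpoint `GET /1/logs`. The sketch below only builds the request and does not send it. The application ID and API key are placeholders, the helper name is illustrative, and the key is assumed to carry the `logs` ACL.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_logs_request(app_id, api_key, index_name, length=100):
    """Build (but do not send) a request for recent query logs of one index."""
    params = urlencode({
        'length': length,         # number of log entries to return
        'indexName': index_name,  # restrict the logs to a single index
        'type': 'query',          # only search queries, not indexing operations
    })
    return Request(
        f'https://{app_id}.algolia.net/1/logs?{params}',
        headers={
            'X-Algolia-Application-Id': app_id,
            'X-Algolia-API-Key': api_key,
        },
    )


# Placeholder credentials; the index name is the idx value from the URLs above.
request = build_logs_request('YOUR_APP_ID', 'YOUR_LOGS_API_KEY', 'sisanjuan-gob-ar')
print(request.full_url)
```

Sending it with `urllib.request.urlopen(request)` and decoding the JSON body would show whether any logged search query corresponds to the stray pagination URLs.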

I’ve checked all the available logs, but I still can’t figure out where the crawlers are picking up these URLs. They are not in the Algolia index, but they look as if they were produced as search results of the Instant Search widget on our site.

This is a valid search page in the search results:

/busqueda-avanzada?q=gobierno&hPP=9&idx=sisanjuan-gob-ar&p=6

This is one of these URLs with the idx parameter:

/ultimas-noticias/itemlist?amp=1&idx=sisanjuan-gob-ar&p=1&start=8094
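Comparing the query parameters of the two URLs makes the overlap explicit. This is a small sketch using only the two URLs quoted above; the helper name is illustrative, and the comparison shows which parameters coincide without proving where the stray URLs originate.

```python
from urllib.parse import urlsplit, parse_qs


def query_keys(url):
    """Return the set of query parameter names present in a URL."""
    return set(parse_qs(urlsplit(url).query))


search_url = '/busqueda-avanzada?q=gobierno&hPP=9&idx=sisanjuan-gob-ar&p=6'
stray_url = '/ultimas-noticias/itemlist?amp=1&idx=sisanjuan-gob-ar&p=1&start=8094'

shared = query_keys(search_url) & query_keys(stray_url)
only_search = query_keys(search_url) - query_keys(stray_url)
only_stray = query_keys(stray_url) - query_keys(search_url)

print(sorted(shared))       # ['idx', 'p']
print(sorted(only_search))  # ['hPP', 'q']
print(sorted(only_stray))   # ['amp', 'start']
```

The stray URL shares idx and p with the search page but lacks q and hPP, and it adds amp and start. That is at least consistent with search-state parameters being carried over onto listing pages that have their own pagination.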

Hi @AnibalSanchez,

Can you write to us at support@algolia.com and provide your app ID and URL so that we can take a look at your logs for further investigation?

Sure. Thank you for your help.