Injecting silly values in URLs

Hi there,
I’m wondering if there’s something we can do to prevent silly values in GET params. If you go here, for example: https://community.algolia.com/instantsearch.js/v2/examples/e-commerce/?refinementList[brand][0]=ThisIsNotSomethingWeWantToSee&refinementList[brand][1]=Insignia™ you’ll see in the brand section a brand that doesn’t exist in the index (ThisIsNotSomethingWeWantToSee). One could manually craft nasty URLs and spread them easily.

I suspect we should handle the logic to strip bad values, but it seems complicated using react-instantsearch.

Any thoughts?

thx

Hey Nicolas, you can use the routing option instead of the urlSync. It’s possible to modify this more so it’s less generic and works for your specific app. Read more in the guide here

I think I’ve added urlsync tags in my post by mistake.

I’m already using something like the routing options to rewrite the URL the way I want and parse it. But that doesn’t solve the problem if a bad value is entered in the URL.

My URL looks like /search?categories=Data,Design. If someone enters /search?categories=Data,F**kYou, I’m screwed: F**kYou will appear in my UI.

With the advanced routing options you can filter out values you don’t want. You can filter them out in the routeToState function itself.
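A minimal sketch of that kind of filtering, assuming the URL shape from the post above (`?categories=Data,Design`). The `RouteState` and `UiState` shapes and the `ALLOWED_CATEGORIES` list are hypothetical and only there for illustration; they are not taken from the library.

```typescript
// Parsed URL for ?categories=Data,Design (shape assumed from the example above).
interface RouteState {
  categories?: string;
}

// Only the part of the UI state used here (assumed shape).
interface UiState {
  refinementList: { categories: string[] };
}

// Hypothetical allow-list; a real app has to get it from somewhere.
const ALLOWED_CATEGORIES = new Set(["Data", "Design", "Pedagogy"]);

function routeToState(route: RouteState): UiState {
  // Split the comma-separated value and drop empty entries.
  const requested = (route.categories ?? "")
    .split(",")
    .map((value) => value.trim())
    .filter((value) => value.length > 0);

  return {
    refinementList: {
      // Anything that is not on the allow-list never reaches the UI.
      categories: requested.filter((value) => ALLOWED_CATEGORIES.has(value)),
    },
  };
}
```

With this, `routeToState({ categories: "Data,F**kYou" })` keeps only `Data`. It only works if the list of valid values is known when the URL is parsed.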

Sure, but I need to know beforehand which values are invalid.

What I’m wondering is how users are actually entering these invalid URLs. If they’re manually changing them, wouldn’t it be normal if that shows no results?

What you can do on refinementList, for example, is set the operator to or, so that a value which doesn’t exist is ignored.

Can you give some examples of real values which cause your search to break?

It’s not a matter of having results or not, or of breaking the search. It would be perfectly OK if someone played with the URL and got no results.

But first of all, ANYBODY can forge a URL and give it to anybody. It may not be your user who changes the URL, but a nasty spammer who forges bad URLs on their own website, pointing to YOUR website, and has them crawled by Google bots, for example. So bad URLs get indexed by Google
(you can search for OPSS030 in Google, for example: you’ll find a lot of results leading to a lot of different websites, including the one I’m working on). And because the GET args are injected as is in the page (well, hopefully stripped of HTML), it could lead to bad things.

  1. misleading ads for random websites on YOUR website. I used ThisIsNotSomethingWeWantToSee in my example, but spammers usually use nastier things like “If you want PORN go to xxx.website.com”, and we don’t want that to appear on our website, ever :slight_smile:
  2. then it’s indexed by Google (we use prerendering, so Google DOES index the content of our fully client-rendered page)
  3. someone types “PORN” in Google, and our website is in the result list

My main concern right now is that our prerendering tool is full of bad URLs for our search. They are mostly harmless because they don’t use any of the GET arguments that are injected in the page; it’s just costing us more money because we have LOTS of those. I’m going to fix that by preventing prerendering and caching if the URL doesn’t use correct GET arg NAMES. BUT if the spammers inject data using correct arg names and incorrect values, I’m screwed.
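The name check described here could look roughly like the sketch below. `KNOWN_PARAMS` and `shouldPrerender` are hypothetical names; only `categories` comes from this thread, and `q` and `page` are made-up examples of other parameters a search page might accept.

```typescript
// Hypothetical list of query parameter names the search page understands.
// Only "categories" appears in this thread; "q" and "page" are assumptions.
const KNOWN_PARAMS = new Set(["categories", "q", "page"]);

// Returns true only when every query parameter name is a known one.
function shouldPrerender(url: string): boolean {
  // The base is a placeholder so that relative paths can be parsed too.
  const params = new URL(url, "https://example.invalid").searchParams;

  let allKnown = true;
  params.forEach((_value, name) => {
    if (!KNOWN_PARAMS.has(name)) {
      allKnown = false;
    }
  });
  return allKnown;
}
```

This only rejects unknown names. A URL such as `/en/search?categories=Design,Nasty%20Strings` still passes, which is exactly the remaining problem.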

To summarize and answer your question: we already use OR, so non-existing facet values are ignored in the search.
Here is a real live URL we are using (use desktop; the UI is a bit different on mobile):
https://openclassrooms.com/en/search?categories=Design
-> will show a list of online courses about “Design”

https://openclassrooms.com/en/search?categories=Design,Pedagogy
-> will show a list of online courses about Design OR Pedagogy

https://openclassrooms.com/en/search?categories=Design,Nasty%20Strings
-> will show a list of online courses about Design OR Nasty%20Strings (which is not a valid category…)

Look at the ui:
image

Correct strings for categories can change over time and vary with the selected language, so I can’t strip out Nasty%20Strings easily unless I query Algolia beforehand to know all the correct values.
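For what it’s worth, here is a sketch of that “ask Algolia first” approach. It assumes an extra search request has already been made for the current language with facets requested for `categories`, and that its response carries a `facets` object mapping attribute to value to count. `validValues` and `stripUnknown` are hypothetical helper names.

```typescript
// `facets` field of a search response: attribute -> value -> count.
type FacetCounts = Record<string, Record<string, number>>;

// Collects every value the index currently knows for one attribute.
function validValues(facets: FacetCounts, attribute: string): Set<string> {
  return new Set(Object.keys(facets[attribute] ?? {}));
}

// Keeps only the requested values that exist in the index.
function stripUnknown(requested: string[], valid: Set<string>): string[] {
  return requested.filter((value) => valid.has(value));
}
```

The cost is one more request before the real search, and the facet list has to be fetched per language. If the attribute has many values, the response may not list all of them (it is capped by settings such as `maxValuesPerFacet`), so valid values could be stripped by mistake.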


This is indeed the case, and thanks for being so thorough. I’m thinking about it, and I can’t see an obvious solution that doesn’t require you to fetch the possible values first.

Since you are prerendering, one idea I have is the following:

  1. detect if you’re prerendering
  2. render
  3. remove everything with count of 0

Note that simply removing a filter with count of 0 (either in the widgets or in the routing itself) may have further implications that need to be investigated.

The removal can be done in routeToState, where you’d filter out the values without results. That function is synchronous today, while searching is asynchronous and happens afterwards, but a solution could be found for this (maybe we need to change routing to allow for async returns too).
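Until routing can wait for results, the same idea can be approximated on the rendering side. The sketch below assumes a refined value that doesn’t exist in the index reaches the widget as an item with a count of 0, and that items look like `{ label, count, isRefined }`. Both are assumptions to check against your version, and `dropEmptyItems` is a hypothetical name.

```typescript
// Assumed shape of one entry in a refinement list.
interface RefinementItem {
  label: string;
  count: number;
  isRefined: boolean;
}

// Hides every item that matches no record, refined or not.
function dropEmptyItems(items: RefinementItem[]): RefinementItem[] {
  return items.filter((item) => item.count > 0);
}
```

If the widget exposes a hook to transform items before rendering, this function could be plugged in there. It only hides the value in the list: the bad value is still in the URL and still part of the search filters, so the further implications mentioned above still need to be investigated.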

Is this a possibility for you or am I still somehow missing the point?

I think there are no easy solutions :slight_smile:

To be more precise, server side rendering is only done for search engine bots (using prerender.io service), so I don’t have SSR for regular users :stuck_out_tongue:

Thanks for investigating, anyway!
