Exact matching for certain phrases

Hi all,

I’m Vijay, Product Manager for Apartment Therapy and The Kitchn. I am wondering how to handle a specific case where Algolia seems to be searching for two words in a phrase separately instead of as a single string.

For example, a user searches for baked beans. Instead of ranking results where both words are strung together (i.e. “baked beans”), Algolia seems to be ranking results where baked and beans are searched as separate terms.

The effect is that a user sees a result for Recipe: Oven-Baked Green Bean Fries (because it has both “baked” and “bean”) far above a result for Family Recipe: Baked Beans with Pineapple and Bacon.

Does anyone have ideas of settings I could tweak to solve this?
Thanks!

Hi Vijay!

Thanks for your question!

Tell us, what customizations have you done to the ranking formula?

By default, this behavior isn’t typical.

Thanks,
Jason

Hi Jason,

Thanks for the response. Here are the customizations I have made:

  1. Ranked searchable attributes
  2. Sort by Attribute = True, sorted by publish date in descending order
  3. Typo tolerance = true
  4. Min characters to accept 1 typo = 8
  5. Allow typos on numeric tokens = true
  6. Ignore plural = true
  7. Disable typo tolerance set for certain words
  8. Advanced syntax = true

Note: all of the things I’ve listed here are things that I’ve tried to modify or tweak and then see if this particular search improved. It did not. Here’s what I’m finding:

We are indexing posts on our site. I have the top searchable attribute set to be “categories.” The second attribute is “post title.” A post can have multiple categories assigned to it. In this case, it looks like what is happening is this:

  • when searching for “baked beans” it is ranking posts that have categories of “baked goods” and of “beans & lentils” above posts that have a post title with “baked beans” in the title
  • because of that, I tried moving “post title” above “categories” in the searchable attribute ranking but this did nothing.

Consider me officially stumped =)

Happy Friday Vijay!

In our Support channel, an engineer answered your question and I wanted to have it here for visibility to the community should anyone else have a similar question:

I can think of at least two directions to help you solve this:

  1. Explicitly search for phrase queries

By default, words from the search query must all appear in a record for it to match, but they need not appear contiguously, nor even inside the same attribute for that matter. (Of course, if they do, they should rank higher—see point #2 below.)

You can however ask for two words to be contiguous by using the advanced syntax (see the advancedSyntax parameter) and surround the expression with double quotes.
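
As a rough illustration of the phrase-query approach, here is a minimal Python sketch that builds the search parameters. The helper name `phrase_search_params` is made up for this example; `advancedSyntax` is the parameter mentioned above, and the resulting dictionary is what you would pass along with the search request in whichever API client you use.

```python
def phrase_search_params(user_query: str) -> dict:
    """Build search parameters that ask for the words as one contiguous phrase.

    Hypothetical helper for illustration; not part of any Algolia client.
    """
    # Drop any double quotes the user typed so the phrase delimiters stay balanced.
    cleaned = user_query.replace('"', "").strip()
    return {
        "query": f'"{cleaned}"',  # surrounding double quotes mark a phrase query
        "advancedSyntax": True,   # the quote syntax is only interpreted when this is on
    }


params = phrase_search_params("baked beans")
print(params)
```

Note that wrapping every query this way would also turn deliberately loose searches into strict phrase matches, so it probably only makes sense when you control the query string.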

If the query string is directly entered by your end-users, however, this may not be possible, so read on.

  2. Ensure proper ranking

By default, Algolia uses word proximity as one of the criteria in the ranking formula, which should ensure that contiguous words rank higher than non-contiguous words. If they don’t, it is likely that some other factor is interfering here. Most likely, your ranking formula contains other criteria before the “proximity” criterion that cause records to rank differently: for example, a sort by attribute, or geo location (if you are using geo search).
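
To make the interference concrete, here is a toy Python model of criteria applied in order as tie-breakers. It is a simplified sketch, not Algolia's actual engine: the timestamps and proximity scores are invented for illustration, and only two criteria are modeled.

```python
from typing import NamedTuple


class Post(NamedTuple):
    title: str
    published_at: int  # Unix timestamp (invented values)
    proximity: int     # lower is better; 1 means the matched words are adjacent


posts = [
    Post("Recipe: Oven-Baked Green Bean Fries", 1_500_000_000, 2),
    Post("Family Recipe: Baked Beans with Pineapple and Bacon", 1_400_000_000, 1),
]


def rank(candidates, date_first):
    """Sort by an ordered tuple of criteria; later ones only break ties."""
    if date_first:
        # A sort by attribute placed first: newest wins, proximity rarely matters.
        key = lambda p: (-p.published_at, p.proximity)
    else:
        # Proximity decides first; recency only separates equally close matches.
        key = lambda p: (p.proximity, -p.published_at)
    return [p.title for p in sorted(candidates, key=key)]


date_first = rank(posts, date_first=True)
proximity_first = rank(posts, date_first=False)
print(date_first[0])
print(proximity_first[0])
```

With the date criterion first, the newer fries recipe wins even though its matched words are further apart; with proximity first, the baked beans recipe comes out on top.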

Check your ranking formula and let us know if we can be of more help.

Also, check out the Ranking Guide to help you better understand how ranking works: https://www.algolia.com/doc/guides/relevance/ranking/

Hi Jason,

Thanks to a tip from your support team, I was able to track down what was going on. I put it in a screenshot here to explain:

It seemed that the crux of the issue was that I had “Sort by Attribute” = TRUE = search_published_at (a value for “publish date”), which meant that newer posts were being forcibly ranked above older posts, even if older posts were a more direct match to the words “baked beans.”

Hello Vijay,

I would advise against tweaking the ranking formula in the way you described. The order of the criteria in the default formula has been carefully crafted, and unless you have compelling reasons to do so, it’s better left untouched. For example:

  • Having proximity before words means that, for a 3-word query, two contiguous words will rank higher than three non-contiguous words.
  • Having proximity before typo means that two contiguous misspelled words will rank higher than two non-contiguous correctly spelled words.

In most cases, this is not what best conveys the end user’s intent.

So, instead, I would advise using two separate indices using the master-replica feature:

  • One index (e.g. master) would have the default ranking formula.
  • One index (e.g. replica) would have a sort by attribute at the top-level.

You can read more about this in our Sorting Guide.
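
For reference, the two sets of index settings might look roughly like the Python dictionaries below. The index names `posts` and `posts_by_date` are placeholders invented for this sketch; `search_published_at` is the attribute mentioned earlier in the thread. Each dictionary would be sent to the corresponding index through the settings API or the dashboard.

```python
# The default order of ranking criteria.
DEFAULT_RANKING = [
    "typo", "geo", "words", "filters",
    "proximity", "attribute", "exact", "custom",
]

PRIMARY_INDEX = "posts"          # placeholder name
REPLICA_INDEX = "posts_by_date"  # placeholder name

# Primary index: default formula, plus a declaration of its replica.
primary_settings = {
    "ranking": DEFAULT_RANKING,
    "replicas": [REPLICA_INDEX],
}

# Replica index: same data, but sorted by publish date before anything else.
replica_settings = {
    "ranking": ["desc(search_published_at)"] + DEFAULT_RANKING,
}

print(primary_settings)
print(replica_settings)
```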

Hi Clement,

Thanks for the feedback. Let me try adjusting some of those things and see what happens. We very much value recency (newer posts) so I need to keep that in there somewhere.

In your idea of using a replica index, how would the user experience Master vs. Replica?

The idea would be to switch the targeted index depending on what sorting you need.
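
In practice that switch can be a small lookup on the front end or back end. A minimal Python sketch, assuming a "sort by" control with two options and the placeholder index names `posts` and `posts_by_date` (neither comes from your actual setup):

```python
# Maps the user's sort choice to the index that should receive the query.
INDEX_BY_SORT = {
    "relevance": "posts",        # primary index, default ranking formula
    "newest": "posts_by_date",   # replica sorted by publish date
}


def index_for(sort_choice: str) -> str:
    """Return the index name for a sort choice, falling back to relevance."""
    return INDEX_BY_SORT.get(sort_choice, INDEX_BY_SORT["relevance"])


print(index_for("newest"))
print(index_for("anything else"))
```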

Please note that if you put sort by recency first, then textual relevance will only be taken into account if two articles have the exact same publication date, which may not be very likely, depending on the precision of your timestamps (is it a date + time or just a date?).

If textual relevance matters, then I encourage you to place the sort on recency in your custom ranking, which means it will come into play only for similarly relevant records.
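
A sketch of what those settings could look like, written as a Python dictionary. It assumes the publish date lives in `search_published_at` as mentioned earlier in the thread; the ranking list is the default order, with no sort by attribute in front.

```python
settings = {
    # Default formula: textual criteria first, custom ranking last.
    "ranking": [
        "typo", "geo", "words", "filters",
        "proximity", "attribute", "exact", "custom",
    ],
    # Recency only breaks ties between records the textual criteria rank equally.
    "customRanking": ["desc(search_published_at)"],
}

print(settings)
```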