proximityDistance issue with an array

Hi,

We might have found a bug around the array ranking.
Let’s take the following objects:

[
  {
    ObjectId: '1',
    keywords: ['foo', 'bar'],
  },
  {
    ObjectId: '2',
    keywords: ['foo', 'cat', 'bar'],
  },
]

and the following configuration:

{
  removeWordsIfNoResults: 'none',
  searchableAttributes: ['keywords'],
  ignorePlurals: true,
  typoTolerance: 'min',
  attributesToHighlight: ['*'],
  attributesForFaceting: [
    'searchable(keywords)',
  ],
  customRanking: [],
  removeStopWords: false,
}

Searching for the foo bar query will return ObjectId=2 first then ObjectId=1. Both results have the same proximityDistance=8 while the documentation makes me wondering if this ranking criterion is working at all. We expect the opposite output order as maximising the proximity.

proximityDistance (integer): When the query contains more than one word, the sum of the distances between matched words (in meters). Corresponds to the proximity criterion in the ranking formula.

Thanks for your time!

Hi there,

This result may look surprising but is indeed the expected result.
We are actually taking additional precautions to ensure that the arrays are not subjected to proximity.

Basically, 8 is the cutoff for the proximity: if two words are separated by at least 8 others, the proximity is just set to 8, and that’s what we are doing for all the elements of an array, to not take proximity into account.
If proximity is important in your use case, then you should add an additional field, with those keywords in a single string, and search on this string.

Let me know if this answers your question!

Regards,

If proximity is important in your use case, then you should add an additional field, with those keywords in a single string, and search on this string.

We have also tried to concatenate the keywords into a single string. It’s working much better (still not perfect, we need two keywords to change the ranking) but then, we hit another wall. The result of the queries are no longer permutation invariant. Meaning that searching for foo bar and bar foo will not return the same output.

Any idea how to get permutation invariant results while maximizing the proximity?

To add some context, aside from the keywords concatenation and the ordered array, we have also tried two other solutions:

  • the weight pattern in Nested custom ranking
  • the k searchableAttributes solution with an object like the following:
{ ObjectId: "1", "k1": "foo", "k2": "bar", "k3": "cat", ... }

None of the 4 previous pattern works. I’m starting to think that Algolia only cover trivial use cases. Unless we find a solution, we are going to stick to Elasticsearch.

Hi again,

The thing with ignoring the permutations is that it is based on proximity as well:
if two results are not in the right order, the proximity is increased by one.

I see two ways of fixing this:

  1. Use minProximity=2 in the settings. That will consider all proximities of 1 or 2 to be equal, in that case, the permutation won’t affect the results. However, as a side effect, if two words are separated by only one, they will count as adjacent.

  2. Generate a more complex tag attribute. This one is likely to increase the size of your records by a lot, if you have a lot of tags. The idea is the following:

"foo bar" -> ["foo bar", "bar foo"]
"foo bar baz" -> ["foo bar", "bar foo", "bar baz", "baz bar"]

This way you still have a boost on adjacent words, without being impacted by permutations. The drawback is that it’s a bit more work to generate those permutations, and you won’t have a difference between tags separated by one or two words. They will only be considered as “not adjacent”.

Your use case seems pretty peculiar (wanting proximity without permutation), though. Would you mind describing it a bit more ? Maybe there is an entire other way to solve this.

Regards,

Your use case seems pretty peculiar (wanting proximity without permutation), though. Would you mind describing it a bit more ? Maybe there is an entire other way to solve this.

I agree, let’s describe what we are looking for.

The input
Each record has a list of keywords sorted by importance.
Let’s say keywords: ['foo', 'bar', 'baz', ...].
In our example, foo is more important than bar.
The median length of this keywords array in our dataset is around 20 keywords.

The output
We want to display the most relevant records first:

  • A record with 2 matching keywords is more relevant than a keyword with one matching keyword.
  • If two records have the same number of matching keywords, we move to the next criterion.
    The lowest the sum of the indexes is in the array, the most relevant the record is. For instance, for the query bar baz and the above input record the sum equals 3 (index of bar = 1 + index of baz = 2).
    Similarly, this record keywords: ['bar', 'foo', 'baz', ...] has a sum that equal to 2. So it’s more relevant than keywords: ['foo', 'bar', 'baz', ...].

Following those rules makes the output “query permutation independent”. You have the same output by querying bar baz or baz bar.

So, from my understanding on the situation, this is not something Algolia support.
Do you know if there is some story in the ROADMAP that we could take advantage of in the future?

Indeed, Algolia doesn’t support query time sorting: you can’t change the sorting strategy depending on what is matched by the query. We can try however we want, we won’t be able to match your exact requirements, and we are maybe not the right tool for your use case.

That being said, we really rarely encounter such strong requirements, and while it may be perfectly legitimate in your use case, sometimes in practice an extremely complex strategy may be disturbing for the end-user.

There is something I don’t understand about your expected: let’s suppose I have two records:

  • ["foo", "bar", "baz"]
  • ["bar", "foo", "baz"]

Which record should be first when the query "foo bar" is sent, what about when "bar foo" is sent? Considering the permutation should not have any impact?

Both records will have the same score for the number of words tie breaking and the sum of the positions tie breaking. I would expect the next tie breaking criterion to discriminate between the two records: so desc(ObjectId) as it’s the last tie breaking used by Algolia.

Going back to very first solution, it looks like it rather well satisfies your use case, except for the proximity that is not taken into account, but there is no mention of this in your description of what you need:

  1. use arrays: "foo bar baz" => ["foo", "bar", "baz"], "bar foo baz" => ["bar", "foo", "baz"]
  2. Put this field as Ordered to prioritize early matches in the array

This way, searching for:

  • “foo bar”, yield the two records with the same score.
  • “bar baz”, yield the second record first, because bar is second in the first record and first in the second one.
  • “baz foo” yield the first record first, because foo is first in the first record and second in the second one.

Is there something I am missing?

Is there something I am missing?

Your example do work with 2 records. Now, let’s add a third record or more:

[
  {
    "objectId": "1",
    "keywords": ["foo", "bar", "baz"]
  },
  {
    "objectId": "2",
    "keywords": ["bar", "foo", "baz"]
  },
  {
    "objectId": "3",
    "keywords": ["foo", "baz", "bar"],
  }
]

I see what you mean, indeed, [“foo”, “bar”, “baz”] is not guaranteed to be before [“foo”, “baz”, “bar”], because Algolia is only retaining the position of the matching item, and not the full list of position.

We are going to use this non ranking behaviour to take advantage of a custom ranking.
In the end, I wish the ranking behaviour was better documented. It’s a strength and a weakness at the same time depending on the use cases.
A weakness when you want to rank strictly by the search attributes.
A strength when you need some space to use a custom ranking.

Thanks.