Scandinavian special characters are not counted as typos

I use Algolia for a product targeting the Scandinavian market. As an example, the letters æ, ø, and å are used in the Danish language.

My problem is that these special characters are automatically converted into other characters without being counted as typos. For reference, minWordSizefor1Typo = 6 in my settings, so words of this length should be subject to typo counting.

Doing a search for ‘nødværge’, which is a correctly spelled Danish word, I get the same number of hits as when I spell the word wrong: ‘nodværge’. If I instead search for ‘nidværge’, I get different results, and Algolia reports the ‘i’ as a typo.

It gets even more problematic in the following example. There is a word ‘modværge’ in Danish with a completely different meaning than ‘nødværge’, but because Algolia treats ø = o, ‘modværge’ is equivalent to ‘mødværge’, which is only one typo away from ‘nødværge’. This means we get irrelevant search results.

Either this is a bug, or it is a feature I would like to turn off. Can you help?

Hello @jmad,

As you discovered yourself, two factors come into play here:

  1. Algolia normalizes ø as o. While this may sound odd to Danish speakers, it is actually a way to search for Danish words from non-Nordic keyboards (as you cannot easily input Nordic characters on a US keyboard, for example).

  2. Typo tolerance is based on the normalized form. It’s the same in languages with diacritics. For example, in French, côté, coté, côte, and cote are treated identically.
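To make this concrete with your examples, here is a toy sketch (the folding table is just assumed from the behavior you observed; this is not our actual normalization code):

```ts
// Toy illustration of normalization + typo counting (NOT Algolia's real code).
// Folding table assumed from the behavior described in this thread.
const FOLDING: Record<string, string> = { "ø": "o", "å": "a", "æ": "a" };

function normalize(word: string): string {
  return word
    .toLowerCase()
    .split("")
    .map((ch) => FOLDING[ch] ?? ch)
    .join("");
}

// Classic Levenshtein edit distance, which is what "one typo" boils down to.
function editDistance(a: string, b: string): number {
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) =>
      i === 0 ? j : j === 0 ? i : 0
    )
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1, // deletion
        dp[i][j - 1] + 1, // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

console.log(normalize("nødværge"));                           // "nodvarge"
console.log(normalize("nødværge") === normalize("nodværge")); // true → same hits, no typo
console.log(editDistance(normalize("nødværge"), normalize("modværge"))); // 1 → one typo
```

After normalization, ‘nødværge’ and ‘nodværge’ are the same word, so no typo is counted; and ‘modværge’ is a single substitution away, which is within tolerance once the word reaches minWordSizefor1Typo.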

There is unfortunately no way to fine tune this behavior at the moment.

Ok. It is just that a lot of the terminology is very similar to Elasticsearch :-)

That’s a shame that you don’t allow users to customize what counts as a special character.

Could I hack it by using synonyms? Another idea would be to replace Scandinavian characters with some weird combination of characters and then translate that in the front end, i.e. ø = 1234aaa: make that replacement when indexing, and do the same in the front end when submitting queries and when showing search results. Very ugly, though.
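Something roughly like this, I imagine (the token values are just made up for illustration, and this is untested):

```ts
// Sketch of the substitution hack (token values are made up; untested).
// Swap each Nordic character for a token the engine will keep verbatim,
// both when indexing records and when sending queries; reverse the mapping
// before displaying results. Caveats: uppercase Ø/Å/Æ would need handling
// too, and the tokens must never occur in real data.
const ENCODE: Record<string, string> = { "ø": "xx1", "å": "xx2", "æ": "xx3" };
const DECODE: Record<string, string> = Object.fromEntries(
  Object.entries(ENCODE).map(([char, token]) => [token, char])
);

function encode(text: string): string {
  return text.replace(/[øåæ]/g, (ch) => ENCODE[ch]);
}

function decode(text: string): string {
  return text.replace(/xx[123]/g, (token) => DECODE[token]);
}

console.log(encode("nødværge"));         // "nxx1dvxx3rge"
console.log(decode(encode("nødværge"))); // "nødværge"
```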

By the way, I just posted this question to support@algolia.com as well. Sorry!

@jmad No problem! I’m answering here because it may be of some interest to other Scandinavian users. :slight_smile:

Our back-end is not based on Lucene, Elasticsearch, or Solr at all. It is a custom C++ engine that we wrote from scratch. It differs quite a bit from Lucene, notably when it comes to ranking: we use a tie-breaking algorithm rather than a scoring function. More on that in the Concepts > Ranking section of our documentation.
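As a rough illustration of the difference (a toy sketch, not our actual engine): tie-breaking compares hits criterion by criterion in a fixed order of importance, and a later criterion only matters when two hits are tied on all earlier ones, whereas a scoring function collapses everything into a single weighted number up front.

```ts
// Toy illustration of tie-breaking ranking (NOT the actual Algolia engine).
// Each hit carries a vector of criteria ordered by importance, normalized
// here so that lower is better. Comparison is lexicographic: a later
// criterion is only consulted when the earlier ones are tied.
interface RankedHit {
  objectID: string;
  criteria: number[]; // e.g. [typos, proximity, ...], lower is better
}

function tieBreakCompare(a: RankedHit, b: RankedHit): number {
  for (let i = 0; i < a.criteria.length; i++) {
    if (a.criteria[i] !== b.criteria[i]) return a.criteria[i] - b.criteria[i];
  }
  return 0; // tied on every criterion
}

// A scoring engine would instead compute something like
//   score = w1 * typos + w2 * proximity + ...
// where a big enough advantage on a minor criterion can outweigh a major one.
const hits: RankedHit[] = [
  { objectID: "A", criteria: [1, 0] }, // one typo, perfect proximity
  { objectID: "B", criteria: [0, 9] }, // no typo, poor proximity
];
hits.sort(tieBreakCompare); // "B" ranks first: fewer typos trumps proximity
```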

Algolia supports all Unicode characters, so we like to consider that we support all languages out of the box—but we do that in a mostly language-agnostic way. (Although we do have specific handling of Asian languages, especially regarding segmentation.) Now, that may not lead to optimal results in all languages. We are aware of some limitations. The problem you are mentioning (normalization of characters specific to Nordic languages) is one of them.

However, we are not in the process of tackling this specific limitation at the moment. Sorry if I’m disappointing you. :confused:

Dear Clement

Thank you for your reply. A related question is how Algolia treats special characters such as § and $. As far as I understand, these characters are ignored? This is very problematic for our use case. Is that right, and how can we avoid this behavior?

Best

Hello @jmad,

By default, Algolia indexes “letters” (i.e. anything that is normally part of a word, so this includes ideograms for example) and numbers. The rest is treated as separators and is ignored.

However, you may force some characters to be indexed using the separatorsToIndex setting. The dollar sign is a very common use case indeed; the paragraph sign should work just the same.
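For example, with the JavaScript API client (the credentials and index name below are placeholders):

```ts
import algoliasearch from "algoliasearch";

// Placeholders: substitute your own application ID, admin API key, and index name.
const client = algoliasearch("YourApplicationID", "YourAdminAPIKey");
const index = client.initIndex("products");

// Index '$' and '§' as regular characters instead of treating them as separators.
await index.setSettings({
  separatorsToIndex: "$§",
});
```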

Perfect. Thank you :-*

Hi

On our store we have a lot of products with æøå in them. If I start the autocomplete with ‘då’, I also get results that include products starting with ‘dæ’, probably because both ‘å’ and ‘æ’ are normalized to ‘a’. I have read the thread, but I cannot see a use case where this makes sense instead of just treating those special characters as is, or at least being able to change this via a setting.

Since we have 60,000 products, of which maybe 20% include these special characters, we need to know whether a fix for this is being considered before we buy your product.

Can you give us a heads up? Thanks

Add this to your settings:

```
'separatorsToIndex': 'æøå',
```

And you should be fine :slight_smile:

Okay, great. We will try that. And thanks :slight_smile:

Sorry Johan. It does not work as it is supposed to. My advice is flawed: the characters æøå are still normalized.

Please, Algolia. We would like to use your product in Scandinavia, but it is really difficult when Scandinavian characters are not supported. Are there any plans to provide an option for characters not to be normalized? It should be an easy fix, no?