[Japanese specific] How to search for kanji by kana

Hello! I apologize if I’m asking something incredibly obvious, but even after scouring the Internet thoroughly, I was unable to find an answer to my question.

I discovered Algolia a few days ago and absolutely love the user-friendliness of it so far! I want to propose using Algolia instead of Elasticsearch in my company for future projects, but before I can even start thinking about that, I’d need to know how to perform one extremely common task that we do in Elasticsearch for nearly all of our clients.

That task is finding search results that contain kanji by searching only for the kanji’s reading, written in either hiragana or katakana.

For example, a simple sentence such as 「今日はいい天気ですね。」 should be findable by searching for 「きょう」 or 「てんき」 (in hiragana), and 「キョウ」 or 「テンキ」 (in katakana), even though these characters are not present in the original sentence at all. In Elasticsearch, I believe we were using the Kuromoji plugin to do this sort of substitution.

I was wondering if this sort of text processing is possible in Algolia as well? I can think of two possibilities right now, both with their own drawbacks:

  1. Convert the entire string from kanji to kana before storing it in Algolia, and store each string in three separate variants: kanji, hiragana, katakana. I wonder whether that would affect the accuracy of search results when a user mixes scripts in one query, for example 「天気」 and 「きょう」 (one term in kanji, one term in hiragana), since each of the three variants would have a different priority configured in Algolia even though the priority should be the same.
  2. “Teach” Algolia the various kanji readings by uploading a massive list of synonyms. But would that really work?

Are there better options than these two? Thank you so much for answering!

Hi there,

Unfortunately, Algolia doesn’t provide any transliteration feature, meaning that the Algolia search engine is not able to search using one alphabet if the object has been indexed using another alphabet.

Of course, as you mention, you could use our Synonyms feature to add 今日 (kanji) <> きょう (hiragana) <> キョウ (katakana), but I assume this might not be enough for your use case (and we don’t have such a massive list on our side)?
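
For a small, hand-curated vocabulary, one synonym group per word could look roughly like the sketch below. The `objectID`, `type` and `synonyms` fields follow Algolia's regular synonym object format as far as I know; the helper name `synonym_group` and the `objectID` value are made up for illustration, and uploading the object through an API client is left out.

```python
def synonym_group(object_id: str, kanji: str, hiragana: str, katakana: str) -> dict:
    """Build one regular (multi-way) synonym object for a single word.

    Hypothetical helper: it only assembles the payload, it does not upload it.
    """
    return {
        "objectID": object_id,
        "type": "synonym",
        "synonyms": [kanji, hiragana, katakana],
    }


# Example group for the word used in this thread.
kyou = synonym_group("syn-kyou", "今日", "きょう", "キョウ")
```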

What we’ve seen our customers do in the past is run a transliteration tool on their side and push records to Algolia with all 3 versions:

 "my_attribute_orig": "今日",
 "my_attribute_kata": "キョウ",
 "my_attribute_hira": "きょう",

Would that work for you?
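
As a rough sketch of how such records might be produced before indexing: the hiragana-to-katakana step is a fixed Unicode offset and needs no library, but the kanji-to-kana step needs a real morphological analyzer. `READINGS` below is a tiny hypothetical lookup table standing in for that analyzer, and `build_record` is an illustrative name, not part of any Algolia client.

```python
# Assumption: READINGS is a stand-in for a real morphological analyzer.
# A lookup table cannot pick the right reading from context.
READINGS = {"今日": "きょう", "天気": "てんき"}

HIRAGANA_START, HIRAGANA_END = 0x3041, 0x3096  # ぁ .. ゖ
KANA_OFFSET = 0x60  # distance between the hiragana and katakana blocks


def hira_to_kata(text: str) -> str:
    """Shift every hiragana character into the katakana block."""
    return "".join(
        chr(ord(ch) + KANA_OFFSET)
        if HIRAGANA_START <= ord(ch) <= HIRAGANA_END
        else ch
        for ch in text
    )


def to_hiragana(text: str) -> str:
    """Replace known kanji words with their reading, longest word first."""
    for word in sorted(READINGS, key=len, reverse=True):
        text = text.replace(word, READINGS[word])
    return text


def build_record(object_id: str, text: str) -> dict:
    """Return one record holding the original, hiragana and katakana variants."""
    hira = to_hiragana(text)
    return {
        "objectID": object_id,
        "my_attribute_orig": text,
        "my_attribute_hira": hira,
        "my_attribute_kata": hira_to_kata(hira),
    }


record = build_record("1", "今日はいい天気ですね。")
```

The resulting `record` can then be pushed with whichever API client you already use.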

Hello, thank you so much for your answer!

You’re right, the synonym-based approach would not work very well, as there are many ways of reading each kanji, and synonyms cannot determine the proper reading from context. My bad!

I think the other approach (transliterating the string before indexing it with Algolia) would work. After reading the documentation further, I can now see that all attributes (in your example, my_attribute_orig, my_attribute_kata & my_attribute_hira) can be given the same priority, so I suppose this is indeed the solution for the time being. Thank you so much!
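
For anyone finding this later, here is a minimal sketch of the setting I mean, based on my reading of the documentation: attributes listed in the same comma-separated string of `searchableAttributes` share one priority level. Only the settings payload is built here; applying it through an API client is not shown.

```python
import json

# One comma-separated entry: all three variants share the same priority.
# Separate list entries would instead rank them in list order.
settings = {
    "searchableAttributes": [
        "my_attribute_orig,my_attribute_kata,my_attribute_hira",
    ],
}

# Print the payload as it would be sent in a settings update.
print(json.dumps(settings, ensure_ascii=False, indent=2))
```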

May I kindly ask whether any improvements in this area are planned? I believe I came across a few articles saying that Algolia relatively recently opened a new office here in Japan, so I suspect expansion into this region is desired. Based on my very limited experience with the Japanese market, automatic transliteration and conversion between alphabets would be a highly sought-after feature. I understand if there’s no (public) ETA at the moment, but it would be good to know whether this is on Algolia’s “to-do list”. Thank you!