How to add a new Japanese word to the dictionary

Hi,

Having evaluated AWS CloudSearch, I find your product really easy to set up, and creating a search screen was simple. However, one feature I found useful in CloudSearch seems to be missing, or I can’t find it: a Japanese tokenization dictionary, that is, the ability to add Japanese words to the dictionary. Without it, your search breaks up Japanese words it doesn’t know.

For example, こみつ is a brand name for apples in Japan. Because it consists only of hiragana characters and doesn’t exist in the dictionary, the search matches not only こみつ but also こ, み, and つ, so the results include items I don’t want.

I wonder if there is any way to solve this problem.

Thanks in advance.

Hi Kenji,

I think the best way would be to maintain the dictionary on your side and alter the query if you find a match.

You will need to activate Advanced Syntax in your dashboard and then if you find こみつ in the query, you can add quotes around it to make sure it’s processed as a single word.

You can find more documentation here: https://www.algolia.com/doc/guides/searching/query-expansion/#advanced-syntax-overview
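
A minimal sketch of this suggestion in Python, assuming a small client-side word list (`CUSTOM_WORDS` and `quote_custom_words` are hypothetical names, not part of any Algolia client). It wraps each known word in double quotes before the query is sent. It only has an effect when Advanced Syntax is enabled, and it does not try to handle phrases the user has already quoted:

```python
import re

# Hypothetical client-side list of words the engine would otherwise split.
CUSTOM_WORDS = ["こみつ"]


def quote_custom_words(query, words=CUSTOM_WORDS):
    """Wrap every known custom word found in the query in double quotes."""
    if not words:
        return query
    # Longest words first so a longer entry wins over a shorter prefix.
    alternatives = "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))
    # Optional quotes on both sides let an already-quoted word pass through unchanged.
    pattern = re.compile(f'"?({alternatives})"?')
    return pattern.sub(r'"\1"', query)


print(quote_custom_words("こみつ りんご"))  # "こみつ" りんご
```

A single regular-expression pass is used instead of repeated `str.replace` calls so that a short entry cannot insert quotes inside a longer entry that was already quoted.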

Do you think that would work for you?

Julien

Julien!

It’s very clever!

However, when I tried it, quoting worked for English phrases like “good health” but not for a Japanese word like “こみつ”. With or without quotes, the results are the same.

Am I doing anything wrong?

kenji

Hi Kenji,

Sorry to get back to you so late. You did indeed discover a bug! The morphological analysis was performed before the advanced syntax elements were parsed, so those elements were pretty much ignored. The fix has been written and should be available in production next week!

We still don’t have a way for you to plug in your own dictionary, unfortunately.

Sorry for the inconvenience!

Thanks! I will check it next week.

Is there any plan to support a custom dictionary?

The bug fix is good, but maintaining our own dictionary and checking every user input against it will surely slow down the search.

We have a hiragana name for each product name and will use it for this purpose. I believe the number of custom words will be more than 3,000. What database engine do you recommend for speed?

I’m sure your dictionary already includes some of the words. For example, りんご (the generic noun for an apple) seems to be in your dictionary, while こみつ (an apple brand name) is not. How can I make sure there are no duplicates in our dictionary?

Thanks for your answer!

Today we are using the ICU dictionary; you can find a copy here: https://github.com/svn2github/libicu/blob/master/data/brkitr/cjdict.txt. Lately we have been working on adding a bunch of additional words from Wiktionary. That hasn’t landed in production yet, but it should in the upcoming weeks. Note that this still doesn’t cover brand names.

The way the feature is currently implemented makes it difficult to quickly add a “manage a user dictionary” feature. However, we understand that there is indeed a real need here, and I have opened an issue in our internal feature tracking system. I can’t give you an ETA, however.

Regards,

Thanks for the reply.

I didn’t know such a dictionary existed, but it has more than 300,000 words. That’s kind of too much for us to maintain and to check our custom words against, although it wouldn’t be so hard.

I have a better idea. This might be naive thinking, but I wonder if you could change your program so that:

If a search word is Japanese and contains no kanji, that is, it consists entirely of hiragana or katakana characters, you search only for an exact match, so we don’t need to maintain our own dictionary or add double quotes.

I know it’s not the best solution, but it might be enough. It isn’t true 100% of the time, but when a hiragana word is typed, we usually want an exact match, not a segmented one.

After I wrote this, I realized that I can write a program to take care of this just by adding double quotes, with no need to manage or consult the dictionary. I will try it after you have fixed the bug!

Thanks for your swift reply!

My message may have been misleading: this dictionary is what we use to perform the segmentation, so including it in your front end would only be redundant.
My Japanese skills are unfortunately limited, but I think some words are composed of both kanji and hiragana and start with a hiragana character (お酒, お知らせ, etc., I think). Grouping all hiragana together would make us fail to segment them, because the お would be swallowed by this grouping.
Once again, my Japanese skills are rather weak, and I wouldn’t be able to decide whether this makes sense in the general case or not. I will have to investigate some more.

Regards,

Leo,

It’s getting more interesting because you guys really respond!

An all-hiragana noun like こみつ includes no kanji characters. As you said, when お酒 is searched, we want to search for 酒 (by the way, 酒 is a kanji), so your search is working correctly.

I asked my clients about this. They gave me an example that might confuse the search engine:

うに is the word for sea urchin. とうようにくてん is the hiragana spelling of 東洋肉店. とうようにくてん includes うに.

But your search engine won’t segment とうようにくてん into うに, so there is no wrong result.

I assume you’re doing word parsing like this:

http://atilika.org/kuromoji/

Anyway, the last case shows that we should register more synonyms. So my question is: what Japanese synonym dictionary do you use? I would like to use it to avoid redundancies when we register synonyms.

Kenji,

You have done a good analysis of the way the feature is implemented.

The issue I wanted to highlight about お酒 and unknown hiragana words (such as こみつ) is the following:
Let’s suppose we have something such as 『こみつのお酒』 in the middle of your sentence. If we group hiragana characters together when we don’t find a word, we would end up searching for 『こみつのお』 and 『酒』, which would be wrong. We would likely have to do something more complex than what we have today: maybe a grammatical analysis to understand that の is here to express possession, etc.
While this is maybe something we will do one day, it is a huge undertaking, and I can’t commit to whether or when we will do it. Once again, I note that those missing words are a pain, and I will make sure to explore the possibility of adding custom words to our segmentation dictionaries, which I think is a much easier step.
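
The failure described above is easy to reproduce with a naive grouping rule. The following Python sketch is only an illustration of that rule (it is not how the engine segments text, and `naive_group` is a hypothetical name): it splits a string into maximal runs of hiragana and runs of everything else.

```python
import re

# Hiragana block, U+3041 to U+309F; everything else is treated as "other".
RUNS = re.compile(r"[\u3041-\u309F]+|[^\u3041-\u309F]+")


def naive_group(text):
    """Split text into maximal hiragana runs and non-hiragana runs."""
    return RUNS.findall(text)


print(naive_group("こみつのお酒"))  # ['こみつのお', '酒']
```

The brand name こみつ, the particle の, and the prefix お all end up in one token, which is exactly the wrong split described in the post.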

Regarding the dictionary used, it is the one I linked before, from the ICU library: https://github.com/svn2github/libicu/blob/master/data/brkitr/cjdict.txt. This list will be expanded in the upcoming weeks with the words we extracted from Wiktionary (I don’t have a public link to the list).

I am happy to answer any further questions!

Leo,

The dictionary you pointed to is just a dictionary, that is, a list of Japanese words. What I need is the Japanese synonym dictionary your system is using. Please let me know which one it is. I would like to use it to avoid duplicates when we upload our synonyms.

I would also like to know the status of the bug fix.

thanks.

Leo,

As I read the guide, I found this:

Algolia does not provide any built-in synonym dictionary

So I guess there is none. But it would be nice to have one for the Japanese language.

For example,

豚、ぶた、ブタ

The last two are phonetic (kana) spellings of the first, kanji word, so they are not exactly synonyms, but there are so many cases like this.

I’m jumping into this thread because my team and I are running into the exact same issue. I’ll file a formal ticket, but I wanted to add to the discussion here.

Our site has multi-language support with all search running through Algolia; however, we are running into some major issues with Japanese.

While we haven’t spent too much time analyzing kanji, we are seeing a good number of instances where people input common searches that should match records in our database, but zero results are returned because our record is in hiragana while the search was in katakana (or vice versa).

For example, searching チョコ yields 295 hits while ちょこ yields only 11 hits.

With kanji, something like お菓子 yields 20 hits while おかし yields none.

Ideally these terms would return the same, aggregate results.

Hi there,

Sorry for the delay; it looks like I missed the notifications about your answers. My apologies.

@kenji First things first, the bug fix for the advanced syntax is in production.

Indeed, Algolia does not ship any synonym dictionary, nor does it do any kind of transliteration between scripts. We have seldom had feedback from Japanese users in the past, and none of it mentioned this as a major issue. As the number of use cases is apparently rising, we are currently thinking about adding it to our roadmap.

We don’t have an ETA yet, though.

Regards,

Leo,

I just tested it and it works!

All I need to do is recognize all-hiragana or all-katakana words in the input and surround them with double quotes for the search. Although I haven’t tried it yet, I’m thinking of using this library for the purpose:

http://wanakana.com/
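
A minimal sketch of that plan in Python, using plain Unicode ranges as a stand-in for the library linked above (it does not use wanakana, and `quote_kana_words` is a hypothetical name). It assumes the words in the query are separated by whitespace, and it does not cover half-width katakana:

```python
import re

# Hiragana U+3041-U+309F and katakana U+30A0-U+30FF
# (the katakana range includes the long-vowel mark ー, U+30FC).
KANA_ONLY = re.compile(r"[\u3041-\u309F\u30A0-\u30FF]+")


def quote_kana_words(query):
    """Quote every whitespace-separated token made only of hiragana/katakana."""
    tokens = query.split()  # also splits on the full-width space U+3000
    return " ".join(f'"{t}"' if KANA_ONLY.fullmatch(t) else t for t in tokens)


print(quote_kana_words("お酒 こみつ"))  # お酒 "こみつ"
```

Tokens that contain kanji or Latin characters, such as お酒, are left untouched so the engine still segments them as usual.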

Glad to hear it! We are also currently rolling out an improved Japanese segmentation dictionary, which should slightly improve the accuracy of the word segmentation. Don’t hesitate to get back to us if you see obvious, general improvements that could be made.

As for flagging words, I have personally never used http://wanakana.com/, but Tofugu LLC, which maintains the library, also runs a website called http://wanikani.com that I use in my attempts to improve my Japanese: they definitely have the data to back up their library!
Let me know how this goes for you! Apart from brand names, we should still be able to handle common hiragana words such as きれい, それ, どこ, etc.

@steven.simonitch To go back to your issue, you should have access to a tab allowing you to find the queries returning 0 results. Thanks to this, you could create new synonyms to handle those queries one by one. Depending on the size of the catalog, it may be rather quick to do!

Leo,

Speaking of word segmentation, I found a case of incorrect segmentation:

シンガポールではチリクラブ、タイでは

If you search for ブタ in the sentence above, it matches ブ、タ, which is not correct because there is a zenkaku (full-width) comma between the two characters.

I would like to share the tool I’ve used to create entries for my synonym dictionary. Unfortunately, the documentation is all in Japanese, and you need to get your API key by creating an account via Japanese forms.

This API converts a word or sentence that includes kanji characters into all-hiragana characters. For example,

漢字 to かんじ

I convert the hiragana word to a katakana word with a PHP function and create a two-way synonym: 漢字, かんじ, カンジ.
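
The same step can be sketched in Python (the post uses PHP; this is an equivalent illustration, not the original code). It relies on the fact that hiragana ぁ to ゖ (U+3041 to U+3096) sit exactly 0x60 below their katakana counterparts in Unicode. The hiragana reading is assumed to come from the conversion API mentioned above, and the record shape and `objectID` naming are assumptions to check against the Algolia synonyms documentation:

```python
def hiragana_to_katakana(text):
    """Shift each hiragana character (U+3041-U+3096) to its katakana twin."""
    return "".join(
        chr(ord(c) + 0x60) if "\u3041" <= c <= "\u3096" else c for c in text
    )


def make_synonym(kanji, reading):
    """Build one two-way synonym record from a kanji word and its hiragana reading."""
    return {
        "objectID": f"reading-{kanji}",  # hypothetical naming scheme
        "type": "synonym",
        "synonyms": [kanji, reading, hiragana_to_katakana(reading)],
    }


print(make_synonym("漢字", "かんじ")["synonyms"])  # ['漢字', 'かんじ', 'カンジ']
```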

However, there is a problem that has nothing to do with the API: whether I should add this synonym to the dictionary at all.

A typical Japanese person types 漢字 if he wants to search for 漢字. He doesn’t use かんじ or カンジ because they may mean something else like 幹事 or 感じ. So, should I not add this entry to the synonym dictionary?

Hi @kenji ,

Thanks for the contribution! This is indeed an interesting API!

Regarding your question, the main issue, as you noticed, is that a lot of words are homophones, and 感じ/漢字 or 初めに/始めに are good examples.
Fortunately there are two cases:

  1. the two words have completely different meanings (the case of 感じ/漢字). If your dataset is a bit specialised, you may not encounter any issue, as one of the two words may never appear in your records, keeping the amount of confusion quite low.

  2. the two words have meanings that are close to each other (the case of 初めに/始めに). In that case, that may not be that bad as the overall idea of the word is kept, and having one for the other would not drastically change the search.

Of course, those are only two contrived examples, and there are potentially thousands of common words that are homophones, and that may create a lot of noise in some cases.
What we still have to keep in mind is that there is no “synonym chaining”: if you have the synonyms 漢字 <> かんじ and かんじ <> 感じ, that won’t imply 感じ <> 漢字. Let’s take an example:

Records:
[
    {"name": "漢字", "objectID": "A"},
    {"name": "感じ", "objectID": "B"},
    {"name": "かんじ", "objectID": "C"}
]

Queries:
"感じ" -> B, C
"漢字" -> A, C
"かんじ" -> A, B, C

Which seems pretty reasonable to me, as かんじ alone doesn’t allow us to make any assumption on which one is “correct”.
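
The behaviour in this example can be reproduced with a toy model in Python. This is only a sketch of "one hop, no chaining" under the two synonym sets above, not the engine's actual matching logic; `expand` and `search` are hypothetical helpers.

```python
RECORDS = [
    {"name": "漢字", "objectID": "A"},
    {"name": "感じ", "objectID": "B"},
    {"name": "かんじ", "objectID": "C"},
]

# Two separate two-way synonym sets: 漢字 <> かんじ and かんじ <> 感じ.
SYNONYM_SETS = [{"漢字", "かんじ"}, {"かんじ", "感じ"}]


def expand(query):
    """Return the query plus its direct synonyms (one hop only, no chaining)."""
    terms = {query}
    for group in SYNONYM_SETS:
        if query in group:
            terms |= group
    return terms


def search(query):
    """Return the objectIDs of records whose name equals the query or a direct synonym."""
    terms = expand(query)
    return [r["objectID"] for r in RECORDS if r["name"] in terms]


for q in ("感じ", "漢字", "かんじ"):
    print(q, "->", search(q))
```

Because the expansion only looks at sets that contain the query itself, 感じ never reaches 漢字 through かんじ.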

Note that you can also tweak the alternativesAsExact setting to make sure records that show up because of synonym substitution are ranked lower than records that match exactly.
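
As a hedged illustration, the settings payload might look like the following. It assumes the documented default of `["ignorePlurals", "singleWordSynonym"]` still applies to your index, so please check the current settings reference before relying on it. The payload would then be applied through the dashboard or the settings API.

```python
import json

# Assumption: the default value is ["ignorePlurals", "singleWordSynonym"].
# Dropping "singleWordSynonym" means a match obtained through a single-word
# synonym is no longer counted as an exact match by the "exact" criterion.
settings = {"alternativesAsExact": ["ignorePlurals"]}

print(json.dumps(settings))
```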

Let me know what you think!

Regarding your question on ブタ, that’s an interesting one. I will check whether we give any weight to the presence of a comma during segmentation (my guess is that we don’t, obviously).
I will have a look!