Identify similar items in index

I have a website where users can create small pieces of user-generated content on a variety of topics. The content consists of news items on topics such as politics, sports, and business, which other users can then search for. Every content item has 5 attributes:

  1. topic
  2. title
  3. summary
  4. links
  5. createdAt (time)

It is quite possible for more than one user to create content about the same news story. In that case the “title” and “summary” attributes of these items would be similar but not necessarily identical. For example, one user could create a news item with the title “Trump says global warming is a hoax” and another user could create a similar news item with the title “No global warming - Trump”.

I need a mechanism to identify such similar content by comparing the “title” or “summary” attributes of every news item, so that similar items can be either removed from the index or grouped together. Ideally, every time a user creates a new piece of content, I would check my Algolia index and identify all similar news items already in it. I then want to review these similar items together and decide whether to keep the new content in the index or remove it.

I need some assistance on how to go about enabling this on my Algolia index.

Hi @dwivedi.a

I can’t think of a simple way to do this but let me try to suggest something.

Every time there is a new entry, you can extract some of its keywords and store them as an attribute. For example:

  • No global warming - Trump would have keywords: [“global warming”, “Trump”]
  • Trump says global warming is a hoax would have keywords: [“global warming”, “Trump”, “hoax”]

Then Algolia can help you group all these records by that attribute.
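As a rough illustration of the keyword idea, here is a minimal Python sketch. It assumes a very naive approach: lowercase the title, split it into words, and drop a small hand-written stopword list. The names `STOPWORDS`, `extract_keywords` and `keyword_overlap` are hypothetical helpers made up for this example, not part of any Algolia client, and the sketch yields single words rather than phrases such as “global warming”.

```python
import re

# Tiny illustrative stopword list (an assumption; a real one would be much longer).
STOPWORDS = {"a", "an", "the", "is", "are", "no", "says", "say",
             "of", "on", "in", "to", "and", "for"}


def extract_keywords(title):
    """Lowercase the title, split it into words, and drop stopwords."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return sorted({w for w in words if w not in STOPWORDS})


def keyword_overlap(a, b):
    """Share of keywords the two lists have in common (Jaccard index)."""
    set_a, set_b = set(a), set(b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


first = extract_keywords("No global warming - Trump")
second = extract_keywords("Trump says global warming is a hoax")

# Each record could carry its keywords as an extra attribute before indexing.
record = {"title": "No global warming - Trump", "keywords": first}

print(first)                           # ['global', 'trump', 'warming']
print(second)                          # ['global', 'hoax', 'trump', 'warming']
print(keyword_overlap(first, second))  # 0.75
```

Two items whose overlap is above some threshold you choose could then be flagged for review together. The threshold and the stopword list would need tuning on real titles.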

Do you think this can help?

Hi @Youcef, thanks for your suggestion. The challenge I see in implementing it is picking out the keywords for every story and title. This is impossible to do manually, and I could not think of a way to do it automatically for each story. I was hoping that, given that Algolia runs a full-text search, there would be a way to identify records where a given attribute (e.g. title or summary) has “similar” text…