Hi there, I have an interesting use case that I can achieve in several ways, but none of them seems sustainable at scale (which is the sticking point). There was a similar post a few years ago, but it seems to have gone stale and is slightly different.
We are building a content library where users purchase a subset of the total library and then need to perform search on that subset. This is a many-to-many association. We have on the order of hundreds of users and hundreds of pieces of content.
- We are using React InstantSearch Hooks v6
- On the frontend we have access to each user's owned-content data
- Scalability is the top priority
1: Separate indexes (for each user)
The initial thought was to solve this with multiple indexes, one per user, programmatically creating and managing those indexes.
- There is a 1,000-index limit
- Keeping this many indexes in sync for manual changes (searchable attributes, ranking, etc.) is painful
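To make the management pain concrete, here is a minimal sketch of what keeping per-user index settings in sync might look like. The index naming scheme (`content_user_<id>`), the settings object, and `planSettingsSync` are all hypothetical illustrations, not our real schema:

```typescript
// Hypothetical sketch of approach 1: one index per user, with shared
// settings fanned out to every index programmatically.

interface IndexSettings {
  searchableAttributes: string[];
  customRanking: string[];
}

// Illustrative shared settings, not our real configuration.
const SHARED_SETTINGS: IndexSettings = {
  searchableAttributes: ["title", "description"],
  customRanking: ["desc(popularity)"],
};

// Every settings tweak has to be applied to every user's index.
function planSettingsSync(
  userIds: string[]
): { indexName: string; settings: IndexSettings }[] {
  return userIds.map((id) => ({
    indexName: `content_user_${id}`, // assumed naming scheme
    settings: SHARED_SETTINGS,
  }));
}

// With the algoliasearch v4 client, the sync itself would be roughly:
//   for (const { indexName, settings } of planSettingsSync(userIds)) {
//     await client.initIndex(indexName).setSettings(settings);
//   }
// i.e. hundreds of API calls per settings change, on top of the index limit.
```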
2: GIANT filter
The next thought was to use a GIANT “OR” filter. Because we have this information on the frontend, we can attach every objectID that should be returned to each search request - docs link.
- The filter will be a stupidly long string. For N content items the filter must include “objectID:5fda0c09-434d-438b-a196-6935a5e34448” (45 chars) N times and " OR " (4 chars) N−1 times. For hundreds of items, every search sent to Algolia will carry a filter of 5,000+ characters.
I would appreciate some advice on whether this IS actually reasonable. Instinctively, sending 5,000+ characters to Algolia on every search seems stupid, but perhaps text is small enough that it's a non-issue.
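For reference, building that filter and estimating its size is trivial. `buildObjectIdFilter` and `filterLength` below are hypothetical helpers of mine, not Algolia APIs:

```typescript
// Sketch of approach 2: construct the giant OR filter from the
// objectIDs the user owns.

function buildObjectIdFilter(objectIds: string[]): string {
  return objectIds.map((id) => `objectID:${id}`).join(" OR ");
}

// Rough size check: each 36-char UUID costs 45 chars per clause
// ("objectID:" + UUID), plus 4 chars per " OR " separator.
function filterLength(count: number, uuidLength = 36): number {
  return count * ("objectID:".length + uuidLength) + (count - 1) * " OR ".length;
}
```

The resulting string would then, as far as I can tell, be passed to React InstantSearch Hooks via `<Configure filters={...} />`.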
3: User-restricted access to data
After searching the docs I found a page titled “User-restricted access to data” - docs link. Its first paragraph describes EXACTLY what I want to do:
Sometimes, you don’t want your users to search your entire index, but only a subset that concerns them.
The page suggests creating an array on each object detailing which users / user groups can access it. Given the recommendation to minimise record size for optimal performance I was already concerned, and a little testing showed this to be unscalable.
- Testing revealed that every UUID added 37 bytes to a record. The 10 KB record-size limit therefore allows roughly a list of 270 UUIDs, not to mention all of the rest of the record data. This puts a cap on how many users can own each piece of content that we are already approaching.
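Here is the record shape this approach implies, plus the back-of-envelope size math. The field names (`viewableBy`, etc.) are my assumptions following the docs' pattern, and `uuidListBytes` is a hypothetical helper using the 37-bytes-per-UUID figure from my testing:

```typescript
// Sketch of approach 3's record shape and the size math that rules it out.

interface ContentRecord {
  objectID: string;
  title: string;
  viewableBy: string[]; // one UUID per owning user (assumed field name)
}

// Using the ~37 bytes measured per UUID entry in my tests:
function uuidListBytes(userCount: number, bytesPerUuid = 37): number {
  return userCount * bytesPerUuid;
}
// 270 owners already consume ~10 KB before any actual content fields.
```

Searching would then attach something like `filters: 'viewableBy:<userID>'`, ideally baked into a secured API key so users can't lift the restriction.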
4: Data relationships and “distinct”
As I mentioned at the start, there was a similar post a few years ago that has gone stale, with a few slight differences - discussion link. I want to stress again that the reason we don’t want to use a relational database is that we want a full search experience, not just unsorted data that fits the parameters.
The reply on that post, from “anthony.seure” of the Algolia team, said this was possible via three indexes, data relationships, and the “distinct” attribute. Sadly, from reading the suggested documentation, it is unclear how to do this with React InstantSearch Hooks.
5: Front end filtering
Assuming there is no way to do this in Algolia, filtering content on the frontend is possible. This is what I am currently doing, but instinctively it seems ridiculous that a good portion (70%+ ish) of hits are discarded on the frontend after paying the network cost of sending that data.
Furthermore, going forward we want to transition to the “Autocomplete” product, and if we keep filtering on the frontend, invalid items would be suggested.
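For completeness, the client-side discard I'm doing today amounts to the following. `Hit` is a minimal stand-in for Algolia's hit shape and `filterOwnedHits` is my own helper, not a library API:

```typescript
// Sketch of approach 5 as currently implemented: drop un-owned hits
// on the client after they have already crossed the network.

interface Hit {
  objectID: string;
}

function filterOwnedHits<T extends Hit>(hits: T[], ownedIds: Set<string>): T[] {
  return hits.filter((hit) => ownedIds.has(hit.objectID));
}
```

With 70%+ of hits thrown away here, the bytes for the discarded hits are still downloaded on every query, and Autocomplete would surface them before this filter could run.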
The end goal is a full search experience over a subset of an index, in a scalable and performant manner. Ideally, more information on how to make approach 4 work would be fantastic. Otherwise, I would be very interested to know which of these five approaches is recommended, why, and what performance penalties / limitations it may have.