How to best structure this data structure to avoid hitting index size too big issue

Hi, I am working on setting my company’s course catalog index and here is the data I have so far:

As you can see, the credits attr has quite a large amount of data as each course has credits attached and breaking down by profession state and each state can offer multiple credit types and each credit type can link to multiple credit categories.

I currently hit the roadblock of index size too big since courses often times can have more than 50 profession credit states.

I did try to follow the doc to split up the credits into multiple records with the provided Laravel Scout Extended package functionality to add splitCredits method in my Course model just so each course record will have one item under credits attr. The approach definitely resolves the index size issue. But I’ve encountered issues having the facets to show the proper count (I’ve already setup attributeForDistinct to id our internal unique identifier for the course and set distinct to true)

For example, when I set facet credits.status (possible value is either Self Apply or Approved), both facet item counts show a lot more than the total number of courses and obviously that’s because I’ve separate credits into multiple records for the same course and the count I assume reflected that.

So, my question is if splitting the record as I mentioned the right approach? If yes, how do I resolve the count issue to make it count on a distinct course level. Otherwise, any insights on the better way of managing this data structure and be able to easily setup different facets within credits attr for other things like credits.credit_types.name?

Let me know if it’s clear enough.

Thanks a lot!

Hello, in order to understand what’s the issue with the facet counts the best would be to give us access to your index. Please send me a private message with your app id and index name after allowing us access to it. In order to do so, head to https://www.algolia.com/account/support and enable read access :slight_smile:
Thanks!

Alright, I found your account. You should remove your email from your message otherwise your address might be used by bots to send you spam.
Could you provide us access to you application as I described it in my previous comment?

Got it thanks. Having a look at your app. Could you describe an example of a search with a wrong faceting?

Just the overall facet count even without any search applied. There are total of 114 courses in index so far, but since I split it on the credits attr, it has total of 4139 records and each facet count is so much higher. I’ve found that once I applied the setting facetingAfterDistinct to true, I seem to get the correct course counts on each facets.

Does this mean I should attach the setting whenever I make a search to get the correct count?

Also, is it possible for me to get back all the credits related to a course once the search result came back since it’s splitted? The reason I want to do that because I want to show all credits for the course on the front end once a search is done.

Hopefully this makes sense.

Thanks!

When distinct is used, facetingAfterDistinct is in general recommended because you usually want faceting to be computed after the deduplication. The important requirement is that the facets of the documents sharing the same distinct key must be the same as documented here: https://www.algolia.com/doc/api-reference/api-parameters/facetingAfterDistinct/#usage-notes. We don’t fully support distinct with different faceting inside a group of documents sharing the same distinct key yet: the returned facet could be the one of a document that doesn’t appear in the result set.

facetingAfterDistinct cannot be set in the index settings so you must pass it in your search queries.

Faceting shows the facets of the documents in the result set. So if you want to show all the facets corresponding to a single course which is characterized by the distinct id you could perform an extra search without distinct, empty query and filter on id (which is your distinct id) and grab the facets that come out of it.

Does that makes sense?

Yes, it does make sense. I will attach facetingAfterDistinct as a default setting in my search query then so that I can always get the correct count on display.

In regards to my second question about retrieving the course’s all credit profession states, to make it clear, let’s say I search with a term ethics, the result it came back has something like this:

As you can see, the credits attr in the course contains only one profession state object since it’s splitted. So not sure if you’ve already answered the question, but all I want is for this course to return all other profession states under credits. Is it possible?

I believe what you were getting at is if I wanted to retrieve all facets with all facets items and this requires doing a separate query?

Also, since I set distinct on the global configuration level through setting attributeForDistinct, can I still not do distinct when doing search query if I wanted to?

Thanks a lot!

Yes but you need a separate query. Because what you want here is some kind of “mixed” faceting where you have facetingAfterDistinct for all facets but you want all facet values for a specific facets (under credit if I understood correctly).

Since you splitted the docs on this attribute yes. Or you can not use facetingAfterDistinct but it would return all the facets even for docs that are evictect by distinct.

Yes, you can just override the distinct parameter at query time for the second query :slight_smile:

Thanks for the quick response and I really appreciate it!

I might be confused on the term facet, but the credit that I am referring to is the actual content of the course. If I didn’t split the credits, I would have gotten a course with this credits object:

As you can see, it’s an array of objects under the credits attr and not on the facet level. Once my index is splitted on credits, the resulting courses only return one of the credits object. In this case on the second query, how should I retrieve all the objects under credits for the course? Do I have to first pluck all the course ids from the first query after the search, then on second query, do a filter on all those course ids and apply not distinct just so I can retrieve multiple records on the same course then I can get each credits info back for the course?

Sorry if my wording is confusing.

From my point of view this would be indeed one approach to achieve your results. Have you tried it yet? Does it gives the expected outcome?

Thanks for checking in here! I just tried it and realized this approach requires me to do multiple search query without distinct to loop through all paginated results since each course can possibly have 50 records after the split which doesn’t seem ideal.

Wonder if there’s something else more effiicent I could do if all I wanted to retrieve is the course that has all credits attached for my frontend display purposes.

Hi there!

I’ll try to summarize your issue to be sure I understand it well:

  1. You have records that represent courses, and each course has a long list of credits. You need that list of credits to show up when you display records.
  2. Because the credits list is so long, you’re reaching the record size limit, so you decided to index each credit separately and use distinct on the course so that you only get each course once.
  3. The problem you’re facing right now is that you only have a single credit per course, not the entire list, and you’re looking for a solution to retrieve it.

There are two main approaches that I would recommend: keep on using distinct and do a second query to retrieve all credits, or further reduce the credits list.

Keep on using distinct and do a second query to retrieve all credits

This is the approach that you’ve used so far. Its main benefit is that it scales well; however, you need a second query to get all credits. This means that your UI won’t instantly showcase all credits, there will be a short latency.

To retrieve all records in a second call, you can perform an empty search on your index and set a filter on the course identifier. For example, let’s imagine that your records look like this:

[
  {
    "course": "My first course",
    "id": "1",
    "credit": "Credit 1",
    "objectID": "1"
  },
  {
    "course": "My first course",
    "id": "1",
    "credit": "Credit 2",
    "objectID": "2"
  },
  {
    "course": "My first course",
    "id": "1",
    "credit": "Credit 3",
    "objectID": "3"
  },
  {
    "course": "My second course",
    "id": "2",
    "credit": "Credit 1",
    "objectID": "4"
  }
]

You’ve just retrieved the first object (with objectID="1") but now you need to retrieve all credits for that course. Since every record has a course identifier (with the id attribute), you can do a second, empty search with a filter on the id attribute to retrieve all matching records. Then, you can map over them to return a list of credits and display them on your front end.

$index->search('', [
  'filters' => 'id:1'
]);

Note that the id attribute must be set as a filter with the attributesForFaceting parameter, and the filterOnly modifier (for performance reasons.

Further reduce the credits list

Keeping all credits in your records is useful because you don’t need an extra query to retrieve them all. However, you face the record size limit.

I’m seeing that your credits attribute in one of your screenshots is particularly complex; there’s a lot of information in there. With Algolia, you only need to index data that is either useful for search, filtering, custom ranking, or display. If any piece of data doesn’t apply, you should leave it out. Do you need the profession_abbr, or everything that’s under credit_types? I’d recommend reviewing this and seeing whether you can remove data that’s not useful for search. If you trim a lot, maybe you can keep the whole credits list in the record.

Let us know how it goes!

Best,

@sarah.dayan You’ve summarized my use case perfectly! Really appreciate you taking the time to help me out on this.

I see, didn’t even know you could get back your splitted records that way using filters. That’s very good to know.

So, I’ve decided like you mentioned in second approach to remove some of the data that are not useful for search or filtering and I seem to be able to keep the credits without splitting it up. The average record size being shown is around ~7KB now, seems fine but just worried if more data we want to push to the record down the road.

Keeping everything in one record definitely simplified the search a lot and I will keep it this way at least for my initial integration.

Thanks for the useful response!