Text chunk records and facets

Hello!

I am indexing transcripts of interviews, which are very long text documents. As recommended in your docs, I’ve broken each transcript up into smaller chunks for the index. Each interview has an ID, so chunks from the same interview all have the same value for the “interviewId” field.

Each interview, in a way the “parent” of its transcript chunks, has data for the categories, participants, location, date, etc. of the interview.

Say I am faceting on the city of the interview. I have 1 interview from Brooklyn. Since the Brooklyn interview was broken into 344 chunks, the facet on city shows the number 344 next to Brooklyn. This is highly misleading, as there is only 1 Brooklyn interview.

I am puzzling over how best to organize the data. If I did not chunk the transcripts this would not be an issue, but they are very long. Thanks for any ideas!

Hi @data, you can use the facetingAfterDistinct parameter so that facet counts are computed after deduplication.

Hi, thanks for your reply! I read the documentation of facetingAfterDistinct.
The different chunks are distinct and have different content; they just share the same interviewId.
Can you explain a bit more about how I can change the counts with facetingAfterDistinct? I appreciate it.

Here is some fake but representative data from my project:

Transcript chunks:

[
  {
    objectId: "chunk111",
    content: "Paragraph 1...",
    interviewId: "brooklyn123",
    type: "transcript",
    date: "2018-02-04",
    topics: ["family", "sports", "health"],
  },
  {
    objectId: "chunk112",
    content: "Paragraph 2...",
    interviewId: "brooklyn123",
    type: "transcript",
    date: "2018-02-04",
    topics: ["family", "sports", "health"],
  },
  {
    objectId: "chunk222",
    content: "Paragraph Different...",
    interviewId: "greensboro123",
    type: "transcript",
    date: "2019-05-04",
    topics: ["school", "politics", "health"],
  },
]

Interviews:

[
  {
    objectId: "brooklyn123",
    date: "2018-02-04",
    topics: ["family", "sports", "health"],
    type: "interview",
  },
  {
    objectId: "greensboro123",
    date: "2019-05-04",
    topics: ["school", "politics", "health"],
    type: "interview",
  },
]

Hi @data, I assume you are using the distinct parameter and the attributeForDistinct setting for deduplication. Adding facetingAfterDistinct to your search request should change the facet counts so that they are computed on the deduplicated results rather than on every chunk. If you are using one of our InstantSearch libraries, you can add facetingAfterDistinct to the configure widget as a search parameter.
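
To illustrate the difference, here is a minimal local sketch (plain JavaScript, not the Algolia client) of counting a facet before and after deduplicating on interviewId. The "city" attribute is an assumption for illustration, since it does not appear in the sample data above, and the helper names countFacet and dedupe are hypothetical.

```javascript
// Records shaped like the transcript chunks above. The "city" attribute is
// hypothetical: it is not in the sample data and is added for illustration.
const chunks = [
  { objectId: "chunk111", interviewId: "brooklyn123", city: "Brooklyn" },
  { objectId: "chunk112", interviewId: "brooklyn123", city: "Brooklyn" },
  { objectId: "chunk222", interviewId: "greensboro123", city: "Greensboro" },
];

// Count how many records carry each value of the given attribute.
function countFacet(records, attribute) {
  const counts = {};
  for (const record of records) {
    const value = record[attribute];
    counts[value] = (counts[value] || 0) + 1;
  }
  return counts;
}

// Keep only the first record seen for each value of the distinct attribute.
function dedupe(records, attributeForDistinct) {
  const seen = new Set();
  return records.filter((record) => {
    const key = record[attributeForDistinct];
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}

// Counting every chunk inflates the facet; counting after deduplication
// yields one hit per interview.
const countsPerChunk = countFacet(chunks, "city");
const countsPerInterview = countFacet(dedupe(chunks, "interviewId"), "city");

console.log(countsPerChunk);     // { Brooklyn: 2, Greensboro: 1 }
console.log(countsPerInterview); // { Brooklyn: 1, Greensboro: 1 }
```

In an actual search request the corresponding setup would be attributeForDistinct set to interviewId in the index settings, with distinct: true and facetingAfterDistinct: true passed as search parameters. This approach fits best when every chunk of an interview carries the same facet values as the others, which is the case for date and topics in the sample data.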