Splitting long text into records

We are using the searchplus plugin for Craft CMS, and during indexing we got a “too big” error.
This is caused by a very large body text.

Algolia's suggestion is to break big records into chunks, but there is no good explanation of how to do this. Should I basically create another record in the same index for every paragraph of the content, with the rest of the data duplicated?

Could somebody direct me to an example of how to split very long text into smaller records?


Hi @janic,

Indeed, the only guide we have on the subject is this one: https://www.algolia.com/doc/guides/ranking/distinct/#distinct-to-index-large-records

The splitting implementation then depends on the language you are using.

Could you tell us which language you are using? For some of them we have small libraries that help you split the records.

Silly me, it looks like you are using PHP :wink:
Let me put together a quick example for you ASAP.

Here is a full example, let me know if that helps.

<?php
/**
 * @param string $content
 * @param int    $maxSize
 *
 * @return array
 */
function chunk($content, $maxSize)
{
    $parts = [];
    $prefix = '';
    while (true) {
        $content = trim((string)$content);
        if (strlen($content) <= $maxSize) {
            $parts[] = $prefix . $content;
            break;
        }
        $offset = -(strlen($content) - $maxSize);
        $cut_at_position = strrpos($content, ' ', $offset);
        if (false === $cut_at_position) {
            $cut_at_position = $maxSize;
        }
        $parts[] = $prefix . substr($content, 0, $cut_at_position);
        $content = substr($content, $cut_at_position);
        $prefix = '… ';
    }

    return $parts;
}

$data = [
    'page_id' => 666,
    'title'   => 'Page title',
];

$content = 'This is my large content';

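// Maximum chunk size in bytes. 1000 is only an example value; pick one that
// keeps each full record under your plan's record size limit.
$chunks = chunk($content, 1000);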

$records = [];
foreach ($chunks as $i => $chunk) {
    $records[] = array_merge(
        $data,
        [
            'objectID' => $data['page_id'] . '-' . $i,
            'content'  => $chunk,
        ]
    );
}

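// Assumption: the Algolia PHP client was installed with Composer,
// so its classes are loaded through the Composer autoloader.
require __DIR__ . '/vendor/autoload.php';

$client = new \AlgoliaSearch\Client('appID', 'adminAPIKey');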
$index = $client->initIndex('pages');

// Init the settings. Only has to be done at index creation.
// Can be done on the Algolia dashboard.
$index->setSettings(
    [
        'searchableAttributes'  => ['title', 'content'],

        // This will only return the most relevant record for a given page.
        'distinct'              => true,
        'attributeForDistinct' => 'page_id'
    ]
);

// Push the records.
$index->addObjects($records);

// Search.
$results = $index->search('query');
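
On the original question about creating one record per paragraph: a minimal sketch of paragraph-based splitting could look like the block below. It is not part of any Algolia library. `chunkByParagraph` and `$maxBytes` are hypothetical names, paragraphs are assumed to be separated by blank lines, and lengths are counted in bytes because `wordwrap()` is byte-based.

```php
<?php
/**
 * Hypothetical helper: split text into chunks of at most $maxBytes bytes,
 * one or more chunks per paragraph.
 *
 * @param string $content
 * @param int    $maxBytes
 *
 * @return string[]
 */
function chunkByParagraph(string $content, int $maxBytes): array
{
    $chunks = [];

    // Assumption: paragraphs are separated by at least one blank line.
    $paragraphs = preg_split('/\R{2,}/', trim($content), -1, PREG_SPLIT_NO_EMPTY);

    foreach ($paragraphs as $paragraph) {
        // Collapse line breaks and repeated whitespace inside the paragraph.
        $paragraph = trim(preg_replace('/\s+/', ' ', $paragraph));
        if ($paragraph === '') {
            continue;
        }

        // wordwrap() breaks on spaces; the last argument forces a cut
        // inside any single word that is longer than $maxBytes.
        $wrapped = wordwrap($paragraph, $maxBytes, "\n", true);
        foreach (explode("\n", $wrapped) as $piece) {
            $chunks[] = $piece;
        }
    }

    return $chunks;
}

// Usage: a paragraph that fits stays whole, a longer one is split on spaces.
$chunks = chunkByParagraph("one two three four\n\nfive", 8);
// $chunks is ['one two', 'three', 'four', 'five']
```

The returned chunks can be fed to the same foreach loop that builds $records above, so every chunk still carries the duplicated page data and the shared page_id.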


Does that help @janic?

I still need to reproduce the solution to make sure it works,
but your example is very helpful.

Thanks, @rayrutjes!


@janic, please share your work and let me know if I can help with something.

Hi there, where does all this code go? I’m using Laravel 5.6 and I’m wondering how to split records. All I have is the toSearchableArray() method on my model; I have never had to instantiate Algolia or use methods like initIndex(), setSettings() or addObject() before.

    public function toSearchableArray()
    {
        $post = $this->toArray();

        $response = [
            'title' => $post['title'],
            'body'  => $post['body'],
        ];

        return $response;
    }

My $post['body'] gets too long…
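
As far as I understand, toSearchableArray() returns a single record per model, so a long body has to be turned into several records somewhere else. The block below is a minimal, framework-free sketch of that step, not a Laravel or Algolia API. `postToRecords`, the `post_id` attribute and the `id` key are assumptions made for illustration, and lengths are counted in bytes.

```php
<?php
/**
 * Hypothetical helper: turn one post into several Algolia records.
 * $post is assumed to look like the output of toArray(), with
 * 'id', 'title' and 'body' keys.
 *
 * @param array $post
 * @param int   $maxBytes
 *
 * @return array[]
 */
function postToRecords(array $post, int $maxBytes): array
{
    // Flatten line breaks so that "\n" can be used as the chunk separator.
    $body = str_replace(["\r\n", "\r", "\n"], ' ', $post['body']);

    // wordwrap() breaks on spaces and, with the last argument set to true,
    // cuts inside words that are longer than $maxBytes.
    $pieces = explode("\n", wordwrap($body, $maxBytes, "\n", true));

    $records = [];
    foreach ($pieces as $i => $piece) {
        $records[] = [
            'objectID' => $post['id'] . '-' . $i, // unique per chunk
            'post_id'  => $post['id'],            // shared by all chunks of a post
            'title'    => $post['title'],
            'body'     => $piece,
        ];
    }

    return $records;
}

// Usage with a made-up post; 10 is only an example size.
$records = postToRecords(
    ['id' => 7, 'title' => 'T', 'body' => 'alpha beta gamma'],
    10
);
// Two records: '7-0' with body 'alpha beta' and '7-1' with body 'gamma'.
```

Records built this way could be pushed with addObjects() as in the PHP example earlier in the thread, with attributeForDistinct set to post_id so that only the best-matching chunk of each post is returned.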