Technical/Service technology-specific questions

Greetings,

I was wondering if I could ask a few technical questions, to satisfy my intellectual curiosity. I asked @dzello and he suggested I ask them here.

  • What’s the index partitioning and sizing strategy? i.e how large is an index segment or partition managed by a single server/node, in terms of say, number of documents or some other metric of ‘size’ ?
  • What’s the latency overhead when multiple OR expressions need to be evaluated, which presumably requires scanning multiple posting lists (or radix tree branches in your case) in parallel/lockstep?
  • What are some of the largest users/customers of Algolia, in terms of index/documents size, that I could check out? I toyed around with hacker news search yesterday and it was quite impressive how fast the results were returned, although I tried to use OR and NOT operators and that didn’t work (presumably because it is configured to handle tokens differently), and for some random queries with many, many, tokens no results were returned after a few seconds.

Thank you very much in advance.

Hi @markpapadakis! Thanks for posting over here and great questions. I’m at a conference this morning but one of my colleagues will try to get these answered for you asap.

1 Like

Hello @markpapadakis,

Wow, great set of questions! :slight_smile: I will try to answer them.

Partitioning

While every Algolia cluster has three nodes, each node contains a full copy of the data. This is how we achieve very high availability: whenever one node is down, two nodes remain and can handle any write or read operation; whenever two nodes are down (which never happens in practice), the cluster stops accepting write operations (since the quorum is not reached) but can still fulfill read operations.

Inside the same node, I would say “the hardware is the limit”. The engine maps index files to memory for faster access, so we are fine until one index becomes bigger than the available RAM. (That’s just an order of magnitude, because other factors come into play, but it’s a good indicator.) We have servers with 64GB or 128GB of RAM.

Indices are not split in “segments” as can be the case in Lucene, for example. We have our own unique way of splitting indices when needed, which would probably be too lengthy to explain here. Our CTO Julien Lemoine regularly blogs about the internals of our search engine. My little finger tells me that the next installment in the “Inside the Engine” series might be about partitioning, so stay tuned! :wink:

Overhead of disjunctions (OR)

It’s true that each disjunction needs to be handled separately when looking up the index. And the results need to be merged, resulting in a linear complexity of this part of the algorithm.

However, the lookup itself does not account for the entire latency. There is first a fixed overhead spent parsing the query, then a linear cost spent building the results (parsing the objects, highlighting and snippeting, restricting the returned attributes, etc.). For “simple enough” queries, building results can actually account for most of the processing time!

Also, keep in mind that the results are paginated, which means that even if the query matches a million objects, we can stop processing after identifying the first N best matches (which can be done quickly thanks to Algolia’s pre-computed ranking strategy).

Large data sets

I am afraid I cannot name names here, but some of our enterprise customers have hundreds of millions of records and/or hundreds of gigabytes of data. Sorry for not being more specific; we cannot disclose any numbers without our customers’ agreement.

It’s important to note that the response time is little affected by the size of the data set, because indices are updated asynchronously (“eventual consistency”) and read operations have higher CPU priority than write operations. This is why you should observe little correlation between the total data size and the response time. The complexity of queries has a far stronger impact on response time.


I hope this answers your questions. Let me conclude by saying that if you have concerns about a specific use case, I would highly recommend that you get in touch with one of our Solutions Engineer, who can help you fine-tune Algolia to your needs.

4 Likes

Thank you so much Clement; I really appreciate the insight. This is more than I hoped for. Thank you again.

Hm. If you are going to rely on static rank scores computed offline or otherwise not during query expression evaluation, doesn’t that mean that you can’t take into account the record’s search score which would e.g depend on which terms match what fields and their proximity ? I take it you are computing a score for each distinct (term, record) ? Thanks !

That’s correct! Only the business relevance is pre-computed at build time; the textual relevance has to be computed at query time. But the business relevance is still useful to break ties between records ranking identically with respect to textual relevance.

Again, for more insight, I am going to redirect you to Julien Lemoine’s blog articles, in particular Inside the Engine, part 4: Textual Relevance.

2 Likes

Thank you again Clement. I 've added a few links here https://github.com/phaistos-networks/Trinity/wiki/IR-Search-Links – I mostly wanted to understand the tradeoffs you had to make to accomplish your goals and the fundamental differences from a general-puprose full-text search engine.

1 Like

Thanks @markpapadakis for including Algolia in your list!

Just one thing (quoting you):

Only downside as far as I can tell is limited if any support for operators, OR, NOT, etc.

I would like to clarify a few things:

  • Algolia does have support for AND, OR and NOT operators as regards filters (i.e. when acting on structured data). The only restriction is that you cannot have ANDs inside ORs (conjunctions inside disjunctions). This is essentially for performance reasons. In our experience, it doesn’t really hurt, as such use cases are often either dubious or can be reformulated otherwise. Of course, you can always stumble upon a perfectly valid edge case that requires it. :slight_smile:

  • Algolia does also support operators on full text search… it’s just that they don’t appear as such:

Although this is arguably less complete than a full-fledged query language with operators, it suits many use cases well. In practice, most query strings are entered by laypeople, who do not think in terms of operators. For example, searching “to be or not to be” most probably doesn’t mean "to be" OR NOT("to be")… which would be a tautology… but instead refers to some famous theatre play. :wink:

1 Like