Wow, great set of questions! I will try to answer them.
While every Algolia cluster has three nodes, each node contains a full copy of the data. This is how we achieve very high availability: whenever one node is down, two nodes remain and can handle any write or read operation; whenever two nodes are down (which never happens in practice), the cluster stops accepting write operations (since the quorum is not reached) but can still fulfill read operations.
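The majority rule above can be sketched in a few lines. This is a hypothetical illustration of quorum arithmetic for a 3-node cluster, not Algolia's actual implementation; the function names are mine:

```python
# Hypothetical sketch of quorum-based write acceptance for a 3-node
# cluster. Names are illustrative, not Algolia's actual code.

CLUSTER_SIZE = 3
QUORUM = CLUSTER_SIZE // 2 + 1  # 2 of 3 nodes must acknowledge a write

def can_write(nodes_up: int) -> bool:
    """Writes are accepted only while a majority of nodes is reachable."""
    return nodes_up >= QUORUM

def can_read(nodes_up: int) -> bool:
    """Reads need only one surviving replica, since each node holds all data."""
    return nodes_up >= 1

print(can_write(2))  # True: one node down, writes still accepted
print(can_write(1))  # False: two nodes down, quorum lost
print(can_read(1))   # True: reads still served
```

With three full replicas, losing one node costs nothing; losing two only blocks writes.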
Inside the same node, I would say “the hardware is the limit”. The engine maps index files to memory for faster access, so we are fine until one index becomes bigger than the available RAM. (That’s just an order of magnitude, because other factors come into play, but it’s a good indicator.) We have servers with 64GB or 128GB of RAM.
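Memory mapping is a standard OS facility, so the idea is easy to demonstrate outside of any search engine. A minimal sketch using Python's `mmap` module (the file name and contents are invented for the example):

```python
# Minimal illustration of memory-mapping a file, the technique used for
# index files: the OS pages data in on demand and caches it in RAM, so
# access stays fast as long as the hot set fits in memory.
import mmap
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "index.bin")
with open(path, "wb") as f:
    f.write(b"algolia" * 1000)  # stand-in for an index file

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Random access without an explicit read(): the kernel faults
        # the relevant page into memory transparently.
        first_bytes = bytes(mm[0:7])

print(first_bytes)  # b'algolia'
```

Once the mapping exceeds available RAM, the kernel starts evicting pages and performance degrades, which is why index size relative to RAM is the useful indicator mentioned above.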
Indices are not split into “segments” as can be the case in Lucene, for example. We have our own unique way of splitting indices when needed, which would probably be too lengthy to explain here. Our CTO Julien Lemoine regularly blogs about the internals of our search engine. A little bird tells me that the next installment in the “Inside the Engine” series might be about partitioning, so stay tuned!
Overhead of disjunctions
It’s true that each disjunct needs to be handled separately when looking up the index, and the results then need to be merged, so this part of the algorithm is linear in the total size of the posting lists involved.
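The merge step can be sketched independently of any engine. This is a hedged illustration (the posting lists are invented) of taking the union of sorted doc-ID lists in a single linear pass:

```python
# Sketch of the merge step for a disjunctive query: each term's posting
# list (sorted doc IDs) is looked up separately, then the lists are
# merged and deduplicated in time linear in their total length.
import heapq

postings = {
    "laptop":   [1, 4, 7, 9],
    "notebook": [2, 4, 8],
    "macbook":  [4, 9, 12],
}

def union(lists):
    merged, last = [], None
    for doc_id in heapq.merge(*lists):  # single linear pass over all lists
        if doc_id != last:              # skip duplicates across lists
            merged.append(doc_id)
            last = doc_id
    return merged

result = union(postings.values())
print(result)  # [1, 2, 4, 7, 8, 9, 12]
```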
However, the lookup itself does not account for the entire latency. There is first a fixed overhead spent parsing the query, then a linear cost spent building the results (parsing the objects, highlighting and snippeting, restricting the returned attributes, etc.). For “simple enough” queries, building results can actually account for most of the processing time!
Also, keep in mind that the results are paginated, which means that even if the query matches a million objects, we can stop processing after identifying the first N best matches (which can be done quickly thanks to Algolia’s pre-computed ranking strategy).
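The early-termination idea can be sketched as follows. This is a toy model under the stated assumption that candidates can be visited in pre-computed ranking order; the function and data are hypothetical:

```python
# Illustrative sketch of early termination: if candidates are visited in
# pre-computed ranking order (best first), we can stop after filling one
# page instead of scoring every match.

def first_page(ranked_candidates, matches, page_size=10):
    """Scan candidates best-first and stop once a page of hits is full."""
    hits = []
    for doc_id in ranked_candidates:
        if doc_id in matches:
            hits.append(doc_id)
            if len(hits) == page_size:
                break  # no need to look at the remaining candidates
    return hits

ranked = range(1, 1_000_001)            # 1M docs, already in ranking order
matching = set(range(1, 1_000_001, 3))  # every third doc matches the query
page = first_page(ranked, matching, page_size=5)
print(page)  # [1, 4, 7, 10, 13]
```

Even though a million documents match, the scan stops after a handful of candidates.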
Large data sets
I am afraid I cannot name names here, but some of our enterprise customers have hundreds of millions of records and/or hundreds of gigabytes of data. Sorry for not being more specific; we cannot disclose any numbers without our customers’ agreement.
It’s important to note that response time is largely unaffected by the size of the data set: indices are updated asynchronously (“eventual consistency”), and read operations have higher CPU priority than write operations. As a result, you should observe little correlation between total data size and response time; the complexity of queries has a far stronger impact.
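The asynchronous-update model can be pictured with a toy queue. This is purely illustrative (the data structures are mine, not the engine's): a write is acknowledged as soon as it is enqueued, and a background step applies it to the searchable snapshot later, so reads never wait on writes.

```python
# Toy model of asynchronous indexing ("eventual consistency"): writes
# are acknowledged immediately and applied in the background, so reads
# are never blocked by indexing work.
from collections import deque

index = {}         # the searchable snapshot
pending = deque()  # acknowledged but not-yet-applied writes

def write(object_id, record):
    pending.append((object_id, record))  # returns immediately

def apply_pending():
    while pending:                       # background indexing step
        object_id, record = pending.popleft()
        index[object_id] = record

write("42", {"name": "laptop"})
print("42" in index)  # False: acknowledged, but not yet searchable
apply_pending()
print("42" in index)  # True: eventually consistent
```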
I hope this answers your questions. Let me conclude by saying that if you have concerns about a specific use case, I would highly recommend getting in touch with one of our Solutions Engineers, who can help you fine-tune Algolia to your needs.