It is important to understand that SolrCloud is horizontally scalable, but each node still needs a certain capacity; the CPU, disk, and RAM required for each node must be estimated so that resources can be allocated efficiently. No fixed numbers can be assigned to these parameters, because each application is unique. Every index within an application has its own indexing pattern: the number of documents that need to be indexed per second, the size of those documents, and the fields, tokenization, and storage parameters defined for them. Search patterns likewise differ across indexes belonging to different applications: the number of queries per second and the search parameters vary, and the amount of data retrieved from Solr, along with the faceting and grouping parameters, plays an important role in the resources consumed during querying.
It is therefore difficult to assign fixed numbers to the RAM, CPU, and disk requirements of each node. Ideally, sharding should be driven by the size of each shard rather than the size of the collection as a whole. Routing is another very important parameter of the index: routing documents and queries to specific shards saves a lot of network I/O, and the routing key determines the weight a particular shard carries, both in terms of the number of documents it holds and the number of queries per second it serves.
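As a minimal sketch of document routing (assuming SolrJ 7.x and a collection created with the compositeId router; the collection name mycollection, the field title_t, the prefix customerA, and the ZooKeeper address zk1:2181 are placeholders, not values from this text), the prefix before the ! in a document ID decides which shard the document lands on, and the _route_ query parameter confines a query to the shard holding that prefix:

import java.util.Collections;
import java.util.Optional;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class RoutingExample {
    public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper ensemble address; point this at your cluster.
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                Collections.singletonList("zk1:2181"), Optional.empty()).build()) {
            client.setDefaultCollection("mycollection");

            // With the compositeId router, the prefix before '!' in the id
            // determines the shard the document is placed on.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "customerA!doc-1");
            doc.addField("title_t", "A routed document");
            client.add(doc);
            client.commit();

            // _route_ restricts the query to the shard(s) holding this prefix,
            // avoiding a fan-out to every shard of the collection.
            SolrQuery q = new SolrQuery("*:*");
            q.set("_route_", "customerA!");
            System.out.println(client.query(q).getResults().getNumFound());
        }
    }
}

Without _route_, the query fans out to every shard and the results are merged on the coordinating node; that fan-out is exactly the network I/O that routing saves.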
It is important to provision disk space of at least two to three times the size of the index, because index optimization can temporarily consume more than twice the index size on disk. For example, optimizing a 100 GB index may briefly require an additional 100 to 200 GB of free space.
The ideal way to go about sizing is to set up a few commodity machines as the nodes of SolrCloud and monitor their resource usage. For each node, we need to monitor the following parameters:
CPU usage
RAM usage (the JVM heap as well as the OS cache)
Disk usage and disk I/O
Network I/O
Queries per second and response times
Once these measurements are in place, the nodes hosting certain shards may turn out to be overweight. Such shards need to be split further in order to address scalability issues before they occur; a sketch of a shard split follows.
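As one hedged illustration (not prescribed by the text), the Collections API SPLITSHARD action can be invoked through SolrJ to divide an overweight shard into two sub-shards; mycollection, shard1, and the ZooKeeper address zk1:2181 are assumed placeholder names:

import java.util.Collections;
import java.util.Optional;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class SplitShardExample {
    public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper ensemble address; point this at your cluster.
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                Collections.singletonList("zk1:2181"), Optional.empty()).build()) {
            // SPLITSHARD divides shard1 of "mycollection" into two sub-shards
            // (shard1_0 and shard1_1); the parent shard keeps serving until
            // the sub-shards are active and in sync.
            CollectionAdminRequest.SplitShard split =
                    CollectionAdminRequest.splitShard("mycollection")
                            .setShardName("shard1");
            split.process(client);
        }
    }
}

Because the parent shard remains available until the sub-shards take over, the split can run against a live collection; the inactive parent can be deleted afterwards.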
The health of SolrCloud can be monitored via the following files or directories in the ZooKeeper server:
clusterstate.json
live_nodes
We need to constantly watch the state of each core in the cluster; live_nodes provides us with the list of currently available nodes. If any node or core goes offline, a notification has to be sent out (a minimal watcher sketch appears at the end of this section). Additionally, it is important to have enough replicas, distributed in such a fashion that a core or a node going down does not affect the availability of the cloud. The following points need to be considered while planning out the nodes, cores, and replicas of SolrCloud:
Each shard should have at least one replica hosted on a node other than the one hosting the shard's leader
Replicas should be spread across nodes so that the failure of any single node or core leaves every shard with at least one active core
Watches on clusterstate.json and live_nodes should trigger notifications whenever a node or core goes down
These action points will keep SolrCloud functioning smoothly.
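To make the live_nodes watch concrete, here is a minimal sketch against the plain ZooKeeper client API (the ensemble address zk1:2181 is a placeholder, and the alerting hook is left as a comment). ZooKeeper watches fire only once, so the sketch re-registers the watch each time it triggers:

import java.util.List;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class LiveNodesWatcher {

    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Placeholder ZooKeeper ensemble address; point this at your cluster.
        ZooKeeper zk = new ZooKeeper("zk1:2181", 15000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();
        watchLiveNodes(zk);
        Thread.sleep(Long.MAX_VALUE); // keep the process alive to receive events
    }

    private static void watchLiveNodes(ZooKeeper zk) throws Exception {
        // getChildren registers a one-shot watch; re-arm it on every event.
        List<String> nodes = zk.getChildren("/live_nodes", event -> {
            try {
                watchLiveNodes(zk);
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
        System.out.println("Live nodes: " + nodes);
        // If the list has shrunk since the last read, send a notification here.
    }
}

In practice the same information can also be read through SolrJ's CloudSolrClient, which keeps a cached view of the cluster state; a direct ZooKeeper watch like this one is useful for a standalone monitoring agent.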