For capacity planning and stability, it is important to be able to estimate how Cassandra will behave in a multi-DC environment under adverse network conditions. This includes planning WAN bandwidth and ensuring that, after a network partition, hints stream fast enough that they don't expire (TTL out) before delivery, but not so fast that they saturate your WAN and crash the cluster. In this post we'll go through how to compute the streaming speeds. In the next post we will look at fine-tuning the number of threads needed to stream data safely after a network partition.
Regular WAN Bandwidth
First we need some baseline numbers for WAN bandwidth requirements. Just by watching a single-DC cluster under its average load you can work out the WAN bandwidth necessary for day-to-day operations. A rough, back-of-the-envelope estimate can be had by disabling thrift/native_transport on a node and measuring its internal network traffic. At that point the node is handling only internal Cassandra reads and writes. Divide this number by your replication factor (because writes are sent just once over the WAN) to get a high-water mark for WAN bandwidth during a normal day. In practice this estimate will be higher than actual WAN traffic, because the measured figure includes reads while the WAN sees only writes.
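As a minimal sketch of that arithmetic: the class name, method name, and the 3000 KB/s figure below are hypothetical, chosen only for illustration, and the input is assumed to be the internode traffic you measured on one node with client transports disabled.

```java
// Back-of-the-envelope baseline: measured internode traffic divided by replication factor.
// All names and numbers here are illustrative assumptions, not Cassandra APIs.
final class WanBaseline {

    /**
     * @param internalKbPerSec  internode traffic measured on one node with
     *                          thrift/native_transport disabled (KB/s)
     * @param replicationFactor replication factor used for the division
     * @return a high-water-mark estimate of that node's WAN traffic (KB/s)
     */
    static double wanHighWaterMarkKbPerSec(double internalKbPerSec, int replicationFactor) {
        if (replicationFactor < 1) {
            throw new IllegalArgumentException("replicationFactor must be >= 1");
        }
        return internalKbPerSec / replicationFactor;
    }

    public static void main(String[] args) {
        double measuredInternalKbPerSec = 3000.0; // assumed measurement, not real data
        int replicationFactor = 3;                // assumed replication factor
        System.out.printf("Estimated WAN high-water mark: %.1f KB/s%n",
                wanHighWaterMarkKbPerSec(measuredInternalKbPerSec, replicationFactor));
    }
}
```

Treat the result as an upper bound only, since the measured input still includes read traffic that never crosses the WAN.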
WAN Bandwidth and Hints
Next comes the much harder part: how much network bandwidth will be required after a temporary network partition? Once the network is restored you'll have both regular traffic and hint traffic.
The primary settings for hints are in cassandra.yaml and look like this:
# See http://wiki.apache.org/cassandra/HintedHandoff
hinted_handoff_enabled: true
# this defines the maximum amount of time a dead host will have hints
# generated. After it has been dead this long, new hints for it will not be
# created until it has been seen alive and gone down again.
max_hint_window_in_ms: 10800000 # 3 hours
# Maximum throttle in KBs per second, per delivery thread. This will be
# reduced proportionally to the number of nodes in the cluster. (If there
# are two nodes in the cluster, each delivery thread will use the maximum
# rate; if there are three, each will throttle to half of the maximum,
# since we expect two nodes to be delivering hints simultaneously.)
hinted_handoff_throttle_in_kb: 1024
# Number of threads with which to deliver hints;
# Consider increasing this number when you have multi-dc deployments, since
# cross-dc handoff tends to be slower
max_hints_delivery_threads: 2
The note for max_hints_delivery_threads, "Consider increasing this number when you have multi-dc deployments", is less than helpful, and there is currently no good documentation on the web for tuning this value. So let's dig into the source and see how this all works so we can plan for it with multiple datacenters.
Into the Source
First off, when Cassandra starts it creates a thread pool sized to max_hints_delivery_threads (read via the getMaxHintsThread method):
java/org/apache/cassandra/db/HintedHandOffManager.java:
private final JMXEnabledScheduledThreadPoolExecutor executor =
    new JMXEnabledScheduledThreadPoolExecutor(
        DatabaseDescriptor.getMaxHintsThread(),
        new NamedThreadFactory("HintedHandoff", Thread.MIN_PRIORITY),
        "internal");
Next, when a node is seen alive via gossip, Cassandra schedules a hint delivery to it:
cassandra/src/java/org/apache/cassandra/service/StorageService.java:
public void onAlive(InetAddress endpoint, EndpointState state)
{
    ...
    HintedHandOffManager.instance.scheduleHintDelivery(endpoint, true);
    for (IEndpointLifecycleSubscriber subscriber : lifecycleSubscribers)
        subscriber.onUp(endpoint);
That call submits a task to the executor to handle the handoff:
cassandra/src/java/org/apache/cassandra/db/HintedHandOffManager.java:
public void scheduleHintDelivery(final InetAddress to, final boolean precompact)
{
    // We should not deliver hints to the same host in 2 different threads
    if (!queuedDeliveries.add(to))
        return;

    logger.debug("Scheduling delivery of Hints to {}", to);

    executor.execute(new Runnable()
And finally, here's the code which divides our hinted_handoff_throttle_in_kb by the number of other nodes in the cluster (cluster size minus one):
cassandra/src/java/org/apache/cassandra/db/HintedHandOffManager.java:
// rate limit is in bytes per second. Uses Double.MAX_VALUE if disabled (set to 0 in cassandra.yaml).
// max rate is scaled by the number of nodes in the cluster (CASSANDRA-5272).
int throttleInKB = DatabaseDescriptor.getHintedHandoffThrottleInKB()
                   / (StorageService.instance.getTokenMetadata().getAllEndpoints().size() - 1);
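To make the scaling concrete, here is a small standalone sketch of the same arithmetic. It does not call any Cassandra classes. The class and method names are hypothetical, and the KB-to-bytes conversion (times 1024) and the two-node guard are assumptions of this sketch, based only on the comments in the excerpt above.

```java
// Standalone illustration of the per-thread hint throttle arithmetic.
// Names are hypothetical; this is not the Cassandra implementation.
final class HintThrottle {

    /** Per-delivery-thread throttle in KB/s: configured value / (cluster size - 1). */
    static int throttlePerThreadKb(int configuredThrottleKb, int clusterNodeCount) {
        if (clusterNodeCount < 2) {
            // Guard added for this sketch: with one node the divisor would be zero.
            throw new IllegalArgumentException("need at least two nodes");
        }
        // Integer division, mirroring the int arithmetic in the excerpt.
        return configuredThrottleKb / (clusterNodeCount - 1);
    }

    /** Rate limit in bytes/s; 0 is treated as "disabled" per the source comment. */
    static double rateLimitBytesPerSec(int throttlePerThreadKb) {
        // Assumption: KB is converted to bytes by multiplying by 1024.
        return throttlePerThreadKb == 0 ? Double.MAX_VALUE : throttlePerThreadKb * 1024.0;
    }

    public static void main(String[] args) {
        int configuredKb = 1024; // default hinted_handoff_throttle_in_kb from the yaml above
        for (int nodes : new int[] {2, 3, 6}) {
            int perThreadKb = throttlePerThreadKb(configuredKb, nodes);
            System.out.println(nodes + " nodes: " + perThreadKb + " KB/s per delivery thread");
        }
    }
}
```

With the default of 1024, two nodes get the full rate and three nodes get half, which matches the comment in cassandra.yaml. Note that the integer division truncates, so a six-node cluster gets 204 KB/s per thread rather than 204.8.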
The Math for Hints
So let's put all this together. Say you have a cluster with nodes split between 2 DCs, DC1 and DC2. DC2 goes down for a time, then returns to service.
Your maximum outbound hint streaming speed per node is computed by
max_streaming_speed_per_node = ( hinted_handoff_throttle_in_kb / ( node_count - 1 ) ) * max_hints_delivery_threads
But because Cassandra only allows one outbound hint thread per remote node, the maximum inbound hint streaming rate per node will still be hinted_handoff_throttle_in_kb. This is important because it means you can safely increase max_hints_delivery_threads without worrying about overwhelming a single node.
Then, in the case of a network partition, we'd expect hint delivery to be queued for the entire DC that went offline. So the expected WAN bandwidth usage would be
max_wan_hint_speed = max_streaming_speed_per_node * DC1_node_count
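As a worked example of the two formulas, the sketch below plugs in an assumed topology of 6 nodes (3 in DC1, 3 in DC2) with the default throttle of 1024 KB/s. The class and method names are hypothetical, the topology is made up for illustration, and integer division is used on the assumption that it matches the int arithmetic in the source excerpt.

```java
// Worked example of the hint streaming formulas. Illustrative names and numbers only.
final class HintWanEstimate {

    /** (hinted_handoff_throttle_in_kb / (node_count - 1)) * max_hints_delivery_threads */
    static int maxStreamingSpeedPerNodeKb(int throttleKb, int nodeCount, int deliveryThreads) {
        if (nodeCount < 2) {
            throw new IllegalArgumentException("need at least two nodes");
        }
        return (throttleKb / (nodeCount - 1)) * deliveryThreads;
    }

    /** max_streaming_speed_per_node * DC1_node_count */
    static long maxWanHintSpeedKb(int perNodeKb, int dc1NodeCount) {
        return (long) perNodeKb * dc1NodeCount;
    }

    public static void main(String[] args) {
        int throttleKb = 1024; // hinted_handoff_throttle_in_kb default
        int nodeCount = 6;     // assumed: 3 nodes in DC1 + 3 nodes in DC2
        int dc1Nodes = 3;      // assumed: nodes in the DC that stayed up

        for (int threads : new int[] {2, 4}) {
            int perNode = maxStreamingSpeedPerNodeKb(throttleKb, nodeCount, threads);
            long wan = maxWanHintSpeedKb(perNode, dc1Nodes);
            System.out.println(threads + " threads: " + perNode
                    + " KB/s per node, " + wan + " KB/s across the WAN");
        }
    }
}
```

Under these assumptions, the default of 2 delivery threads gives 408 KB/s per node and 1224 KB/s across the WAN. Doubling the threads to 4 doubles both figures, which is the number to compare against your spare WAN capacity.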
The next post looks at taking these numbers and figuring out how long it'll take hints to replay with different max_hints_delivery_threads settings.