Pandemic Scaling Adventures at Nearpod. Part 1: Pushing Redis Cluster to its limits, and back
At Nearpod we make extensive use of Redis. The majority of that use comes in the form of a significantly sized Redis Cluster which we use to manage the millions of virtual classroom sessions that go on throughout the day. We run our Redis Cluster in Elasticache, the AWS managed service for in-memory datastores.
This past back-to-school period, Nearpod has seen unprecedented growth. For what was already a popular EdTech tool, you can only imagine the increase in users we’ve seen as students returned to school — virtually — and the world found itself urgently in need of solutions to bridge the learning divide. Our cloud platform immediately underwent a 10x scale event. From this experience, we have some lessons learned about scaling Redis Cluster that are definitely worth sharing.
As we grew, we quickly found ourselves scaling up our Redis Cluster to match the load of millions upon millions of new students joining classes every day. Before long, we had scaled all the way up to the default Elasticache limit of 90 nodes. While we were pleased with the scale our SaaS platform had achieved, we were also concerned about how far we could scale the cluster before it would break, not to mention the cost of running such a behemoth.
As student activity continued beyond anyone’s wildest expectations, we found ourselves looking to stay ahead of the game. Out of an abundance of caution, we opened discussions with Amazon about raising the maximum number of nodes they’d allow in a cluster. We were already sitting at 45 shards, each with 1 master and 1 replica. Amazon agreed to help; however, it took AWS a few attempts to determine how to change the configuration, which told us what we already suspected: we were probably running one of the largest Redis Clusters in Elasticache.
That’s when things got even more interesting. After the limit increase was applied by AWS, we noticed something very peculiar. The nodes hosting shards 46 and beyond were experiencing significantly less load than the nodes for shards 1–45, despite serving a comparable number of keys.
Puzzled as to why this would be, we drilled into the issue. Once we got to the heart of it, it came down to two things: one very detailed (and potentially costly) nuance specific to the open-source code we use to access Redis, and one bug in how Elasticache handles clusters beyond their default node limit, which helped expose the first issue.
A Refresher on Redis Cluster:
If you are so inclined, you can read the Redis Cluster specification, but the TL;DR version of how it works is as follows:
There are 16,384 “key slots” and each key slot is assigned to a master node in the cluster (as well as to any replicas of said master). Keys are mapped to a slot using a hash function on the key name (CRC16 of the key, modulo 16384).
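To make the slot mapping concrete, here is a minimal sketch (hypothetical node host name and key) that asks a cluster node which slot a given key hashes to via the CLUSTER KEYSLOT command; the server performs the same CRC16-modulo-16384 calculation described above.

```php
<?php
// Hypothetical host name; any reachable cluster node will do.
$node = new Redis();
$node->connect('redis-node-1.example.internal', 6379);

// CLUSTER KEYSLOT returns the slot (0-16383) the key hashes to,
// i.e. CRC16("session:12345") mod 16384.
$slot = $node->rawCommand('CLUSTER', 'KEYSLOT', 'session:12345');
echo "session:12345 maps to slot {$slot}\n";
```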
When an application connects to Redis Cluster, it needs to do two things:
- Locate any working node in the cluster
- Issue the CLUSTER SLOTS command to ask which nodes have which slots
Once the application does that, the client has all the info it needs to route each key request to the correct node.
It’s important to note here that the CLUSTER SLOTS command is “O(N) where N is the total number of cluster nodes.” In other words, it gets slower the larger the cluster is.
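As an illustration of that bootstrap step, here is a rough sketch (hypothetical seed host name) of what a cluster-aware client does under the hood: connect to any node, issue CLUSTER SLOTS, and build a slot-to-node map from the reply. The reply contains one entry per slot range along with its master and replicas, which is why it grows with the size of the cluster.

```php
<?php
// Hypothetical seed host; in practice this is any node the client can reach.
$seed = new Redis();
$seed->connect('redis-seed.example.internal', 6379);

// Ask the node for the full slot map.
$slotMap = $seed->rawCommand('CLUSTER', 'SLOTS');

// Each entry looks like:
//   [startSlot, endSlot, [masterIp, masterPort, ...], [replicaIp, replicaPort, ...], ...]
foreach ($slotMap as $range) {
    [$start, $end, $master] = $range; // replica entries ignored for brevity
    printf("slots %5d-%5d -> %s:%d\n", $start, $end, $master[0], $master[1]);
}
```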
Determining what was different:
As we investigated, the first thing we did, of course, was look to see if there was a difference in the workload hitting the new nodes. The get/set metrics looked identical to the other nodes. Running MONITOR on the nodes didn’t show anything that stood out as different, and nothing in the INFO output pointed to anything being wrong. We were exhausting all the usual Redis monitoring avenues.
We wondered to ourselves: “Is there anything like a TOP command for Redis?” It was then that we recalled that Redis does indeed have some additional internal CPU usage metrics that aren’t exposed by default.
Those metrics are accessible via the INFO COMMANDSTATS command, which shows how many calls have been made to each type of Redis command, along with the total and average time spent on each. Once we had that data, we could see that across the first 45 shards, the CLUSTER family of commands accounted for an astounding 87% of the CPU time (!!!), versus virtually zero on the new shards.
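Here is roughly how that tally comes together (hypothetical host name; a sketch, not the exact script we ran): parse the usec totals from INFO COMMANDSTATS and compare the time spent in the CLUSTER family against the time spent on everything else.

```php
<?php
// Hypothetical host name for one of the cluster nodes.
$node = new Redis();
$node->connect('redis-node-1.example.internal', 6379);

// Raw COMMANDSTATS output: one line per command, e.g.
//   cmdstat_get:calls=...,usec=...,usec_per_call=...
$raw = $node->rawCommand('INFO', 'COMMANDSTATS');

$totalUsec = 0;
$clusterUsec = 0;
foreach (explode("\n", $raw) as $line) {
    if (preg_match('/^cmdstat_(\S+?):.*usec=(\d+)/', trim($line), $m)) {
        $totalUsec += (int) $m[2];
        if (stripos($m[1], 'cluster') === 0) {
            $clusterUsec += (int) $m[2];
        }
    }
}

printf("CLUSTER commands: %.1f%% of total command time\n",
    100 * $clusterUsec / max(1, $totalUsec));
```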
Getting to a Root Cause:
After the initial shock subsided, we knew we had two questions to answer:
- Why were our nodes spending more time on the CLUSTER commands than on doing actual application work?
- Why weren’t shards 46 and beyond doing the same?
For question 1, after a closer look, the answer turned out to be quite simple, but hidden deep within the nuances of the open-source client we use to interact with Redis Cluster. Even though we intentionally use persistent connections, the default behavior of the PHP client, phpredis, is that it doesn’t keep a copy of the slots mapping between different instantiations of the client. That means that every time we created a new client, it would indeed reuse an existing connection as expected; however, it would also ask for the cluster slots again. Luckily for us, there is a configuration option to keep a local copy of the slots that persists alongside the connection.
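As a sketch of what that looks like in practice (hypothetical host names; consult the phpredis docs for your version), the client is built with persistent connections and phpredis’s slot cache is switched on in php.ini, so a freshly instantiated RedisCluster reuses the cached slot map instead of issuing CLUSTER SLOTS all over again.

```php
<?php
// In php.ini (phpredis slot-cache setting; see the phpredis cluster docs):
//   redis.clusters.cache_slots = 1

$cluster = new RedisCluster(
    null,                                  // no named cluster from php.ini
    ['redis-seed.example.internal:6379'],  // hypothetical seed node(s)
    1.5,                                   // connect timeout (seconds)
    1.5,                                   // read timeout (seconds)
    true                                   // persistent connections
);

// Normal application traffic; the slot map is resolved from the cache.
$cluster->set('session:12345', 'some-session-state');
echo $cluster->get('session:12345'), "\n";
```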
Once we turned that on, the CPU usage on our Redis nodes dropped by nearly half. And, thanks to no longer needing to wait for the additional command, we saw a welcome decrease in our application response times.
The second question turned out to be a little more interesting from a cloud infrastructure perspective:
There are many ways to do service discovery to locate a working node in order to bootstrap your client. You can maintain a list of nodes that the application can reach, have a configuration endpoint that returns a suitable set of available candidates, or have one large DNS response with the IP addresses of all of the nodes. Elasticache takes the latter approach, returning all nodes in the DNS response, i.e. one big list of A records.
The problem with this approach is that there is a limit to how big a DNS reply can or should be. It certainly makes sense to impose one, and the way Elasticache goes about it is to cap the response at the first 90 nodes. As you’d guess, this means shards 46 and above are never among the nodes a client selects for its initial connection, which in turn means they never receive a CLUSTER SLOTS command, causing the significant difference in load experienced by those nodes. We brought this to Amazon’s attention, along with possible solutions that would help engineers scale more effectively at these cluster sizes.
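One way to see this from the client’s side (a sketch with a hypothetical discovery endpoint name) is to resolve the endpoint yourself and count the A records that come back; any node past the cap simply never appears as a bootstrap candidate.

```php
<?php
// Hypothetical discovery endpoint for the cluster.
$records = dns_get_record('redis-cluster-discovery.example.internal', DNS_A);

// Each A record carries an 'ip' field; a capped response means
// some nodes never show up here at all.
$ips = array_column($records, 'ip');
printf("Discovery endpoint returned %d A records\n", count($ips));
```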
Luckily for us, however, solving the first part of the problem (drastically reducing the CPU time consumed by the costly and frequent CLUSTER SLOTS commands) meant we could aggressively begin reducing our cluster size, making the node discovery issue moot and our AWS utilization costs much more pleasant.
Today, we are gladly serving our ever-increasing number of teachers and students through this pandemic, and continuing to scale as always.