Our last post covered some lessons we learned while building our backend with Akka. In this post, we’re going to go into detail on some of those lessons. First, we’ll show how futures can violate Akka’s guarantees about concurrent state modifications, then we’ll talk about how to avoid that trap. Following that, we’ll touch on design considerations when coordinating work across a cluster.
(Shameless plug: To join Conspire and get a personalized weekly update about your email relationships, sign up at goconspire.com.)
Be Careful with Futures
We’ll start with an example. First, let’s get some definitions out of the way. A Worker actor just performs some sort of processing for a user; we aren’t particularly concerned with its details right now.
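The original snippet isn’t shown here, but a minimal sketch of such a Worker might look like this (the Work and Result message types are assumptions for illustration):

```scala
import akka.actor.Actor

// Hypothetical message types; the post's actual definitions aren't shown.
case class Work(user: String)
case class Result(user: String, output: String)

// A Worker performs one unit of processing for a user, replies with
// the result, and shuts itself down. Its internals don't matter here.
class Worker extends Actor {
  def receive = {
    case Work(user) =>
      sender() ! Result(user, process(user))
      context.stop(self) // this Worker stops itself when finished
  }

  private def process(user: String): String = s"processed $user"
}
```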
Imagine we want to have an actor to create and coordinate work requests and worker actors. We might try something like this:
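A sketch of that problematic first attempt, assuming the message types above (the actor and member names are our own, since the original gist isn’t shown):

```scala
import akka.actor.{Actor, Props}
import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.duration._

// A sketch of the flawed coordinator described below.
class BadWorkerCoordinator extends Actor {
  import context.dispatcher // execution context for the future's callback
  implicit val timeout: Timeout = Timeout(30.seconds)

  def receive = {
    case work: Work =>
      val worker = context.actorOf(Props[Worker])
      // DANGER: this callback closes over sender, which may refer to a
      // different actor by the time the future completes.
      (worker ? work).foreach { result =>
        sender() ! result
      }
  }
}
```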
When work requests come in, we create a new Worker, pass it the unit of Work within an ask(), and then register a callback on the resulting future in order to pass the result back to the requesting actor. What’s the issue with this implementation? For that, let’s take a look at the Akka API for the Actor class.
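In Akka’s classic Actor trait, sender is defined along these lines:

```scala
// From akka.actor.Actor: sender is a method, not a stable value.
final def sender(): ActorRef = context.sender()
```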
Look at the definition of sender: it’s a function, not a variable. The callback for our future may or may not be processed on the same thread, and by the time it runs, sender may or may not have the value we expect, especially if other messages come in to the Coordinator before the callback is executed. We could end up sending the result to the wrong actor.
Sender isn’t the only potential problem. If our actor contained some piece of local mutable state, say a Map of current workers, and we tried to modify that state within the callback, we could wind up with a ConcurrentModificationException. Passing state into the callback of a future violates a basic guarantee provided by Akka: the internal state of an actor will not be accessed or modified outside of the actor’s receive function. Futures make violations of this rule easy to write.
Closing over sender, or any other form of mutable state, within an actor is a recipe for disaster. At the moment, Akka and Scala can’t enforce this rule (though efforts are being made in that direction); as a programmer, you have to ensure you don’t close over mutable state. Scala makes it very easy to write code like this, so it’s a trap to be aware of.
State management is not the only task made more difficult by futures. As outlined in our previous post, futures can lead to what we term “rogue concurrency.” This occurs when far more work is done at once than is expected. On a beefy local development machine, this isn’t an issue but once you move your application into the cloud, you’ll quickly find that your JVM becomes unresponsive. Futures make controlling the amount of work done concurrently more difficult.
How can we fix this implementation?
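A sketch of the corrected coordinator, using the hypothetical message types from above plus a WorkIsDone completion message (again, names are assumptions since the original gist isn’t shown):

```scala
import akka.actor.{Actor, ActorRef, Props}

// Hypothetical completion message carrying the user and the result.
case class WorkIsDone(user: String, result: Result)

class GoodWorkerCoordinator extends Actor {
  // Mutable state, but only ever touched inside receive.
  var workers = Map.empty[String, ActorRef]

  def receive = {
    case work @ Work(user) =>
      workers += user -> sender()          // remember who asked
      context.actorOf(Props[Worker]) ! work

    // Pattern guard: ignore completions for unknown users.
    case WorkIsDone(user, result) if workers.isDefinedAt(user) =>
      workers(user) ! result               // route result to the requester
      workers -= user
  }
}
```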
First of all, we’ve eliminated the future. Without futures, violating the rule mentioned above is much more difficult to do accidentally. Now we use an internal map, workers, to keep track of requesters so we can route results to the appropriate actors. The map is only ever accessed or modified within the actor’s receive function: we will never throw any sort of ConcurrentModificationException or find that our expectation of the world within this actor doesn’t match reality.
Note our use of a pattern guard within the receive (if workers.isDefinedAt(user)). This allows us to gracefully handle unexpected WorkIsDone messages for users we don’t know to be in progress; such messages will go to Akka’s unhandled message queue for handling elsewhere. In the WorkIsDone case, we safely pass the result to the appropriate requesting actor and remove the user reference from our map.
This code still has a number of problems, which are left as an exercise for the reader: concurrent requests for the same user are not gracefully handled, and while this pattern would typically be used to implement some form of supervision or throttling, neither is present here. The implementation also assumes that the Worker shuts itself down upon completion; if we don’t know this to be true, we should call context.stop on the worker when we get WorkIsDone back in order to avoid leaking actors.
We can slightly improve upon GoodWorkerCoordinator by eliminating the workers variable. This trick uses Akka’s become() function to dynamically replace the receive function of an actor. Instead of keeping the map of requesters as a variable, we pass it as a parameter to our receive function and update it with context.become. This approach is almost identical to GoodWorkerCoordinator, and if it feels like clever-for-clever’s-sake, there is little downside to using the previous implementation. It’s mostly presented to show one use of context.become, an underutilized tool in Akka’s kit.
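A sketch of that variant, under the same assumed message types as before: the requester map lives in the receive function’s parameter list rather than in a member variable, and context.become swaps in a new receive with the updated map.

```scala
import akka.actor.{Actor, ActorRef, Props}

class BecomeWorkerCoordinator extends Actor {
  // Start with an empty requester map; no mutable members at all.
  def receive = working(Map.empty)

  def working(workers: Map[String, ActorRef]): Receive = {
    case work @ Work(user) =>
      context.actorOf(Props[Worker]) ! work
      // Replace receive with one that knows about the new requester.
      context.become(working(workers + (user -> sender())))

    case WorkIsDone(user, result) if workers.isDefinedAt(user) =>
      workers(user) ! result
      context.become(working(workers - user))
  }
}
```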
Don’t Split Supervision Across Your Cluster
First, a brief refresher on our architecture: we have one supervisor node which dispatches work to a cluster of worker nodes. Each worker node has a given role denoting its service (e.g., “analytics”) and the supervisor remotely creates worker actors for the appropriate service on the worker node as needed.
The original design for our backend had each role supervising itself: using Akka’s cluster-aware routers and ClusterSingletonManager, roles would use the oldest member node of the same role as their leader which would remotely deploy workers to nodes of the given role. That is, the oldest node with the “analytics” role held the AnalyticsLeader which remotely deployed analytics workers to all nodes with that role. If that node died, the next oldest node with that role would take over. This approach is roughly the same as that described here. This pattern works but poses some problems.
- Work has to be tracked by both the master supervisor and the role leader. In our design, the leader kept a queue of work requests; if it died, that queue was lost and the supervisor had to start over. Tracking state in both places is redundant, but without it, a leader’s death means lost state. (We never actually added work tracking to the supervisor, because by then we had realized the pattern was broken.)
- The supervisor and all worker nodes for a given role must track the oldest node. Should that node die, a hole opens: until the dead member is removed from the cluster, nodes may try to send work requests or route completion notifications through the now-dead leader, and those messages can be lost.
This pattern gave us multiple single points of failure. Ultimately, our cluster requires the supervisor to be running for any work to be dispatched; given the nature of our work, this is a point of failure we’re comfortable with. Introducing role leaders as another potential point of failure pushed us into a corner: nodes could be perfectly ready for work, and the supervisor ready to dispatch it, but that work could still be lost because the leader had crashed and had not yet been removed from the cluster. We ran into all sorts of headaches with this.
Ultimately, we moved the leaders onto the supervisor. Akka made this change extremely simple: we changed one line in our config. Cluster-aware routers have a setting to toggle the deployment of routees on the same node as the router. By toggling this off, the role leader could run on the supervisor node while only creating workers on analytics nodes.
The only change was to the router’s deployment configuration.
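A sketch of what that deployment configuration might look like, using Akka’s classic cluster-aware router settings; the actor paths, router type, and instance count here are assumptions, and the key line is allow-local-routees:

```
# Hypothetical paths and counts; the real config isn't shown in the post.
akka.actor.deployment {
  /analyticsLeader/workerRouter {
    router = round-robin-pool
    nr-of-instances = 10
    cluster {
      enabled = on
      allow-local-routees = off  # never deploy routees on this (supervisor) node
      use-role = "analytics"     # only deploy routees on analytics nodes
    }
  }
}
```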
Our initial design caused problems, but Akka’s power made the solution simple: fleshing out this change took almost no time at all. As a side benefit, we eliminated the cluster singleton pattern from our workers, allowing each node of a given role to be treated exactly the same. In fact, all our worker nodes are spun up with nothing more than a basic ActorSystem and a single actor that facilitates communication with the supervisor. We also eliminated work pushing in favor of work pulling, a change we will detail in our final post.
In our final post for this series, we’ll go into more detail about the design of our actor system and hierarchy and how we arrived at our final architecture.
(Hint: pushing work is easy to do but hard to do right)