Reaxial Update – On Stages And Actors

Since I last wrote about Reaxial we’ve come up with some new abstractions that make it easier to write reactive handlers, and have been busy transitioning our code to use the new architecture. I thought I’d take this opportunity to share our progress with you.

As we started transitioning to Reaxial, we realized that creating an entire service for each reactive component was a bit of overkill. Many features we have implemented with reactive components run sporadically and are not particularly time sensitive, and typically there are a number of features that depend on the same updates. Having a separate process and a separate connection to Kafka is wasteful and inefficient in these cases. However, other features have to react in a timely fashion, so for those we do want a dedicated process with its own Kafka connection.

To accommodate these different use cases, we came up with the concept of a “stage” service that can host one or more “actors”. An “actor” is our basic building block for reactive components. Each actor is a python class that derives from this abstract base class:

class Actor(object):
 def topics(self):
 """ Return a list of the topic(s) this actor cares about. """
 raise NotImplemented

def interval(self):
 """ Return the batching interval for this actor. This is the maximum
 interval. If another actor on the same stage has a shorter interval,
 then the batching interval will match that interval.
 return 30

def process(self, topic, messages):
 """ Called periodically for this actor to process messages that have been
 received since the last batching interval. If messages for multiple
 different topics have been received, then this method will be called
 once for each different topic. The messages will be passed as an array
 of tuples (offset, message).
 raise NotImplemented

 def log(self):
 return getLogger(self.__module__)

All that is required for an actor class to override is topics() and process(). The topics() method simply returns a list of Kafka topics that the actor wants to handle, and the process() method is then called periodically by the stage service with a set of messages from one of these topics. The stage service works by collecting a batch of messages (1000 by default) across all the topics that all the actors within that stage care about, and then invoking each actor’s process() method with the messages in the topics that that actor cares about. If the batching interval expires while the stage is collecting messages, then the messages that have already been collected are processed immediately.

Once an actor is defined, it has to be configured to run within a specific stage. We are using a simple INI-style config file using betterconfig to define the various stages. Each stage is a section in the config file and the actors are specified by adding the python dotted path to the actor class to a list inside the section. In addition, the batch size for the stage can be changed here too.

We are still in the middle of the process of converting the functionality in our legacy platform to Reaxial, but we have already defined 30 actors running on 7 different stages. Having the infrastructure to easily decompose a feature into reactive components like actors improves the modularity and reliability of our system, and also improves testability. We can very easily write unit tests that pass specific messages to an actor and by mocking out the methods that the actor calls, we can test arbitrary scenarios without having to set up anything in the database. Plus, because actors only implement one feature, or one piece of a feature, they are straightforward unit testing targets.

One obvious area for improvement is to enhance the stage service so that it dynamically decides which actors to run on which stages by observing their behavior. This has always been in our plans, but because it is a complicated optimization problem and carries significant risks if not implemented properly, we decided to stick with the manual stage configuration for now, coupled with monitoring of the stages to ensure that time-sensitive messages are being handled within the expected time. So far this is working well, and as we improve this system we’ll keep you updated on our progress.

Reaxial – A reactive architecture for Axial

Software engineering is hard. Even a small software project involves making countless trade-offs between the ideal solution and “good enough” code. Software engineering at a startup is even harder because the requirements are often vague and in constant flux, and economic realities force us to release less-than-perfect code all the time.

Over time these decisions pile up and the technical debt becomes overwhelming. Axial has hit this point a few times. Until about two years ago, Axial’s platform was a single monolithic Django application that was becoming increasingly bloated, slow and unmaintainable. At that time, the decision was made to “go SOA” and we started to decompose this Django application into smaller services mostly based on Flask, and, more recently, Pyramid.

Some of the services that we’ve broken out since then are small and focused and make logical sense as independent services, but others turned out to be awkward and inefficient and resulted in brittle and tightly-coupled code. Further, it has become clear that our current architecture does not align well with the features and functionality in our roadmap, such as real-time updates to our members. We realized that we needed a new architecture. Thus was born Reaxial.

The design keywords for Reaxial are reactive, modular and scalable. Reactive means that the system responds immediately to new information, all the way from the backend to the frontend. Modular means that parts of the system are decoupled, making it easier and faster to develop and test (and discard) new features without disturbing existing code. Scalable means that as we add members, we can simply add hardware to accommodate the additional load without slowing down the site.

To achieve these goals, we knew that we needed to use some sort of messaging middleware. After researching the various options out there, including commercial solutions like RTI Connext and WebSphere, and open-source packages like RabbitMQ, nanomsg and NATS, we settled on Apache Kafka. Kafka was developed by LinkedIn and offers a very attractive combination of high throughput, low latency and guaranteed delivery. LinkedIn has over 300 million users, which is an order of magnitude more than we ever expect to have to support, so we are confident that Kafka will scale well as we grow. Further, because Kafka retains messages for a long time (days or weeks), it is possible to replay messages if necessary, improving testability and modularity. With Kafka as the underlying message bus, the rest of the architecture took shape:

Reaxial Architecture

Probably the most important new service is the entity service. The entity service handles CRUD for all top-level entities, including classes like Company, Member and Contact, among many others. Whenever an entity is created or updated, a copy of it is published on the message bus, where it can be consumed in real-time by other services. Simple CRUD does not handle the case where multiple entities need to be created or updated in a transaction, so to handle that the entity service also offers a special API call, create_entity_graph, that can create and update a set of related entities atomically. In the Reaxial architecture, most features will be implemented as a service that subscribes to one or more entity classes, and then reacts to changes as they occur by either make further updates to those entities or by creating or updating some other entity.

Recall that our design goals were to enable real-time updates all the way to the member. To accomplish this, we created a subscription service that uses SockJS to support a persistent bidirectional socket connection to the browser. This service, written in NodeJS, subscribes to changes in all the entity classes and allows the browser to subscribe to whatever specific entities on which it wants updates, and for which the user session is permissioned to see, of course.

We have deployed these components to production and are just starting to use them for some new features. As we gain confidence in this new infrastructure, we will slowly transition our existing code base to the Reaxial model. We decided to deploy Reaxial gradually so that we can start to reap the benefits of it right away, and to give us an opportunity to detect and resolve any deployment risks before we are fully dependent on the new architecture. We have a lot of work ahead of us but we are all quite excited about our modular, scalable and reactive future.