From Monolith to Microservices
Many companies are making the transition from monolith to microservices these days. There are plenty of advantages to making the switch, and not a few disadvantages. In this post I’ll show how we diverted some of our traffic from our monolith to a new microservice and the benefits we reaped from that change.
The problem
SysAid’s traffic can be split into two sources: human and machine. Our human traffic originates with our users, while our machine traffic comes from software agents that our customers install on their computers. The agents report their status by sending messages to our servers, which are then written to our database; this information helps system administrators perform asset management and IT service management (ITSM) more efficiently. Some of the agent traffic is served by the same application servers as our human traffic. Here’s a diagram illustrating the situation.
Our software agent sends an HTTP message (1) that gets routed from our load balancer to one of our application servers (2). From there it’s written to the file system (3), and a background task (4) writes the contents of the message to our database (5).
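To make that old flow concrete, here’s a minimal sketch of what such a background task could look like, assuming the application server spools each agent message to disk as a JSON file. The spool directory and the write_message_to_db helper are hypothetical, purely for illustration:

```python
import json
from pathlib import Path

SPOOL_DIR = Path("/var/spool/agent-messages")  # hypothetical spool directory


def write_message_to_db(message: dict) -> None:
    """Hypothetical helper: persist one agent message to the database."""


def process_spool_once() -> None:
    """One pass of the background task (4): read spooled files, write them to the DB (5)."""
    for path in sorted(SPOOL_DIR.glob("*.json")):
        message = json.loads(path.read_text())
        write_message_to_db(message)
        path.unlink()  # remove the file only after it has been stored
```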
It’s worth noting that the load characteristics of human and machine traffic differ. The former follows on-hours and off-hours; the latter is burstier, with no particular pattern. It’s quite common for many agents from one client to report every hour on the hour, causing a sudden, short traffic spike. This causes two issues. The first is that these sudden bursts of traffic cause a CPU spike on our servers and slow down all users, machine and human. The second is that the database can sometimes become overloaded, forcing us to drop messages and affecting the stability of our application. We wanted to solve both of these problems.
Possible alternatives
There were three alternatives we considered:
- Have a cluster dedicated to machine traffic.
- Have our agents write their messages to a queue, instead of going through our application servers and being written to the file system. These messages would then be processed by our background services.
- Have the messages written to a queue by a server-side component and then handled as in option 2.
The solution
In the end we decided to go with the last option. Option 1 would’ve been very easy to set up, since no code changes were necessary, but expensive and hard to maintain. Option 2 would’ve required changes to our software agents, and rolling those out would’ve been a very long process. Option 3 required no changes to our agents and very little change to our backend code; the only major modification was to our infrastructure. Here’s what we came up with.
The traffic flows from the client to the load balancer as usual. Instead of routing the traffic to our application servers, the load balancer routes it to an AWS API Gateway, which writes it to a queue. We decided to use standard SQS rather than a FIFO queue, since writing to the database is idempotent and we had no complicated requirements, such as being able to replay traffic. The scheduler then reads messages from the queue and writes them to the database as usual. The only code change required was to read from SQS instead of from the file system. One final thing to note about this architecture is that the scheduler is part of a cluster: when the database is under heavy load we scale the cluster down to zero instances until things calm down, and when there are lots of messages in the queue we scale it up until the backlog decreases sufficiently.
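To give a feel for how small that code change is, here’s a sketch of the same kind of task reading from SQS instead of from the file system, using boto3 with long polling. The queue URL and the write_message_to_db helper are assumptions for illustration, not our actual code:

```python
import json

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/agent-messages"  # hypothetical


def write_message_to_db(message: dict) -> None:
    """Hypothetical helper: persist one agent message to the database."""


def drain_queue_once() -> None:
    """One pass of the scheduler: receive a batch from SQS and write it to the DB."""
    response = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=10,  # SQS batch maximum
        WaitTimeSeconds=20,      # long polling to avoid busy-waiting
    )
    for msg in response.get("Messages", []):
        write_message_to_db(json.loads(msg["Body"]))
        # Delete only after a successful write; if the write fails, the message
        # becomes visible again and is retried. Because the DB write is
        # idempotent, the at-least-once delivery of standard SQS is harmless.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```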
This simple architecture allows us to achieve much higher reliability. Since the messages are written to SQS by an API Gateway, the only way we can lose them is if something goes wrong on AWS’s side; if something fails on our end, we can be sure the messages will sit safely in the queue until we fix things. When the database is under heavy load we can wait until things clear up and then write the messages, increasing the reliability of our database and avoiding having to drop messages. In effect, this service has the same service level agreement (SLA) as AWS itself. Additionally, our human users are no longer affected by spikes of non-human traffic. In fact, after deploying the service we saw a 40% reduction in CPU usage on the application servers handling human traffic.
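The scaling behaviour described above can be driven by something quite simple. Here’s a rough sketch that checks the queue backlog and the database load and adjusts the size of the consumer cluster accordingly. The use of ECS, the thresholds, and the db_is_overloaded check are all assumptions for illustration, not a description of our actual setup:

```python
import boto3

sqs = boto3.client("sqs")
ecs = boto3.client("ecs")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/agent-messages"  # hypothetical
CLUSTER, SERVICE = "schedulers", "agent-message-consumer"                      # hypothetical
SCALE_UP_THRESHOLD = 10_000  # messages in the backlog; illustrative value only


def db_is_overloaded() -> bool:
    """Hypothetical placeholder: wire this up to real database load metrics."""
    return False


def adjust_consumer_count() -> None:
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL, AttributeNames=["ApproximateNumberOfMessages"]
    )
    backlog = int(attrs["Attributes"]["ApproximateNumberOfMessages"])

    if db_is_overloaded():
        desired = 0                    # pause consumption; SQS holds the messages
    elif backlog > SCALE_UP_THRESHOLD:
        desired = 4                    # illustrative maximum
    else:
        desired = 1

    ecs.update_service(cluster=CLUSTER, service=SERVICE, desiredCount=desired)
```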
Since changes to our codebase were minimal, development time was quite short. Another bonus was that we experienced very few errors during rollout. The deployment process was so smooth that within a couple of release cycles all our clusters had the new microservice in place, and no major bug fixes or rollbacks were necessary.
So that was the story of how we broke off a piece of our monolith in order to greatly increase our system’s reliability. I hope you enjoyed it!