# EPU 8: The Sound of Silence 👂
## What happened?
There haven't been any new updates from us at Enterspeed in the last couple of months. There hasn't been anything except the sound of silence.
But fear not; there's no darkness here.
We have been busy optimising and tweaking our processing layer, making it run like a smooth, well-oiled machine.
You see, we experienced an incident with a backlogged processing queue at the beginning of January. This incident caused us to drop everything and optimise our processing layer with multiple small and large improvements.
This update shows what we've been doing at Enterspeed for the last couple of months and what we have done to avoid similar incidents.
## Fair processing
Processing is usually very straightforward: One source entity is ingested, and a transformed view is ready for the Delivery API in a few seconds.
And this works very well. At other times, however, processing becomes more complicated.
Examples of more complicated scenarios include a schema deployment or a re-ingest of all source entities.
Processing in these scenarios can result in many thousands of processing jobs, so processing time is not as instant as in the 1:1 source-entity-to-view scenario.
That is by design, and we don't claim or even want to scale to instant processing times when the volumes become large. So this pre-processing model comes with some latency tradeoffs. But the problem was that one customer could affect other customers, and nobody likes noisy neighbours.
This is, of course, not something we can accept, and we have therefore implemented what we call a "fair processing queue". Without going into too much technical detail, we can now regulate how much processing time is available for different classes of tenants and different types of processing jobs.
Overall, this removes the noisy neighbour problem: for every tenant, ingesting a source entity still results in a generated view a few seconds later.
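To make the idea concrete, here is a minimal sketch of one way a fair processing queue can work: a weighted round-robin over per-lane queues, where a lane is a combination of tenant class and job type. The lane names and weights are hypothetical, and this is not Enterspeed's actual implementation, which also has to work across distributed queue infrastructure.

```python
from collections import defaultdict, deque

# Hypothetical weights: how many jobs each (tenant class, job type) lane
# may dequeue per scheduling round. The real classes and numbers used by
# Enterspeed are not public.
LANE_WEIGHTS = {
    ("enterprise", "ingest"): 8,
    ("enterprise", "bulk"): 4,
    ("standard", "ingest"): 4,
    ("standard", "bulk"): 2,
}

class FairProcessingQueue:
    """Weighted round-robin over per-lane queues, so one tenant's bulk
    re-ingest cannot starve another tenant's single-entity ingests."""

    def __init__(self, weights):
        self.weights = weights
        self.lanes = defaultdict(deque)

    def enqueue(self, tenant_class, job_type, job):
        self.lanes[(tenant_class, job_type)].append(job)

    def drain_round(self):
        """One scheduling round: each lane yields at most `weight` jobs."""
        for lane, weight in self.weights.items():
            queue = self.lanes[lane]
            for _ in range(min(weight, len(queue))):
                yield lane, queue.popleft()

queue = FairProcessingQueue(LANE_WEIGHTS)
# A bulk re-ingest floods one lane with 10,000 jobs...
for i in range(10_000):
    queue.enqueue("standard", "bulk", f"re-ingest-{i}")
# ...but a single ingest from another tenant still gets served this round.
queue.enqueue("enterprise", "ingest", "page-42")
print([job for _, job in queue.drain_round()])
# ['page-42', 're-ingest-0', 're-ingest-1']
```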
The more complicated processing scenarios still create roughly the same number of jobs. We want to underline that the fair processing queue lets multiple customers share the same resources fairly.
That fair sharing, from both a cost and a sustainability perspective, was the design requirement for the fair processing queue concept we wanted to introduce.
So, we are thrilled with the design and performance of the fair processing queue.
## No more over-processing
Another issue in those dark hours at the beginning of January was what we internally call "over-processing".
In some cases, updating one source entity resulted in a spike in processing jobs.
One of the key features of Enterspeed is that developers can worry less about cache invalidation and leave it to Enterspeed.
In the inner details of this feature, we found two bugs that triggered the re-processing of far too many source entities. That over-processing, combined with the shared queue, was the root cause of the January incident.
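For readers unfamiliar with dependency-based invalidation, here is a minimal sketch of the general technique, assuming each generated view records which source entities it read while being transformed. The view and entity names are invented for the example, and this is not Enterspeed's actual implementation.

```python
# Each generated view records the source entities it read while being
# transformed. All names here are invented for the example.
view_dependencies = {
    "productPage-1": {"product-1", "price-1"},
    "productPage-2": {"product-2", "price-2"},
    "productList":   {"product-1", "product-2"},
}

def views_to_reprocess(updated_source_id):
    """Invalidate exactly the views that read the updated source entity.

    The over-processing failure mode is returning a superset of this set
    (e.g. every view built from the same schema), which multiplies the
    number of processing jobs for a single update."""
    return {
        view_id
        for view_id, sources in view_dependencies.items()
        if updated_source_id in sources
    }

print(views_to_reprocess("product-1"))  # {'productPage-1', 'productList'}
```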
Of course, we fixed the two bugs, but we also added a feature to the new fair processing queue: job de-duplication.
We can now detect duplicate jobs within the same batch of processing jobs and drop them, reducing the overall number of jobs.
This saves time and results in a win both for costs and sustainability.
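As an illustration, in-batch de-duplication can be as simple as keeping only the newest job per logical key. The job shape below (tenant, entity, job type) is hypothetical; Enterspeed hasn't published its actual job format.

```python
def deduplicate_batch(jobs):
    """Keep only the newest job per logical key, so ten updates to the
    same entity within one batch are processed once.

    Assumes jobs arrive in chronological order; later entries overwrite
    earlier ones for the same key."""
    latest = {}
    for job in jobs:
        key = (job["tenant_id"], job["entity_id"], job["type"])
        latest[key] = job
    return list(latest.values())

batch = [
    {"tenant_id": "t1", "entity_id": "product-1", "type": "process", "rev": 1},
    {"tenant_id": "t1", "entity_id": "product-1", "type": "process", "rev": 2},
    {"tenant_id": "t2", "entity_id": "page-9", "type": "process", "rev": 1},
]
print(len(deduplicate_batch(batch)))  # 2 jobs instead of 3
```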
## Death by a thousand paper cuts
The last thing we want to mention is the small optimisations.
When working with hundreds of thousands of processing jobs - and we saw millions at peak - all the small inefficiencies start to add up to large numbers.
And this is one of the cool parts of working with really large numbers: the small optimisations make a real difference.
One of the really interesting things was an unexpected behaviour from a specific IoC container registration that caused a spike in CPU usage.
We ended up removing all applicable IoC container registrations of IEnumerable<> to optimise CPU usage.
We also made several changes to how we query data from our data store.
So if you want to learn more about continuation tokens and point reads in Cosmos DB, don't hesitate to get in touch.
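For the curious, here is roughly what those two techniques look like. Enterspeed runs on .NET, but the same operations exist in the azure-cosmos Python SDK; all account, container, and field names below are invented for the example.

```python
from azure.cosmos import CosmosClient

# Account, database, container and field names are made up for the example.
client = CosmosClient("https://<account>.documents.azure.com", credential="<key>")
container = client.get_database_client("views").get_container_client("generated-views")

# Point read: fetch a single document by id + partition key for ~1 RU,
# instead of issuing a SQL query that costs several RUs for the same document.
view = container.read_item(item="productPage-1", partition_key="tenant-1")

# Paged query: walk a large result set page by page and keep the
# continuation token, instead of buffering everything in memory.
pages = container.query_items(
    query="SELECT * FROM c WHERE c.tenantId = @tenant",
    parameters=[{"name": "@tenant", "value": "tenant-1"}],
    partition_key="tenant-1",
).by_page()

first_page = list(next(pages))
token = pages.continuation_token  # resume later with .by_page(token)
```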
Another tweak to our data store was to move more of the Cosmos DB containers to tenant-specific containers for our Enterprise tier.
Overall, we are very pleased with the changes, optimisations and tweaks we have introduced.
As with all software, we continue learning from our customers - you, our excellent technology partners - as the platform is being used for more than we imagined.
Until next time 👋