At Optibus, we’ve created a cloud-based platform that solves the very challenging problem of planning and scheduling public transportation (mass transit). Our platform is used to plan citywide networks of routes and timetables, and then schedule the movement of every vehicle and driver during every minute of the day. This can result in thousands of daily service trips, vehicles, and drivers being scheduled. Among the capabilities we have developed to solve these challenges are distributed optimization algorithms that require immense computational power. These algorithms are triggered ad hoc whenever such complex optimization tasks are requested.
In our case, such tasks are triggered when a transportation planner sets up their rules and preferences, applies their expertise, and requests an optimized plan. This happens hundreds or thousands of times a day, but as you can imagine, the usage pattern is very volatile and unpredictable. That makes capacity planning very difficult: the demand for resources fluctuates rapidly, producing many scattered bursts of CPU demand.
The classic scale-out approach using industry standards such as Kubernetes is not well suited to a burst of computational power that is required for a very short period and follows no constant or predictable usage pattern. The reasons:
On top of that, Kubernetes was not built to manage instant increases and decreases of tens or hundreds of pods that are needed for only a few tens of seconds. It is better suited to typical scaling requirements that rely on metrics such as CPU load, memory load, or task-queue backlog. Those metrics are spread across multiple machines serving thousands of parallel requests, so they don’t tend to show a sudden, sharp spike (for the math fans, the load behaves like a uniformly continuous function), because the load each new request adds is amortized. In our case, however, a single new task is the equivalent of 10,000 new parallel requests arriving within a few seconds, and that requires a different kind of scaling logic.
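To make the timing mismatch concrete, here is a minimal simulation sketch of a metric-driven autoscaler in the spirit of Kubernetes’ Horizontal Pod Autoscaler. All of the numbers (scrape interval, pod start-up time, burst length) are illustrative assumptions, not measurements from our system:

```python
import math

# Illustrative assumptions, not measurements:
SCRAPE_INTERVAL_S = 15   # how often the autoscaler samples metrics
POD_STARTUP_S = 40       # time until a new pod (and possibly node) is ready
TARGET_CPU = 0.7         # desired average CPU utilization per pod
BURST_DURATION_S = 30    # our kind of burst: huge load, gone in ~30 seconds

def desired_replicas(current: int, avg_cpu: float) -> int:
    # The core HPA idea: scale proportionally to observed average load.
    return max(1, math.ceil(current * avg_cpu / TARGET_CPU))

replicas = 2
for t in range(0, 91, SCRAPE_INTERVAL_S):
    # The burst saturates CPU while it lasts, then the load vanishes.
    avg_cpu = 1.0 if t < BURST_DURATION_S else 0.05
    target = desired_replicas(replicas, avg_cpu)
    ready_at = t + POD_STARTUP_S if target > replicas else t
    print(f"t={t:2d}s cpu={avg_cpu:.2f} replicas {replicas} -> {target} "
          f"(ready at ~{ready_at}s)")
    replicas = target
# By the time the extra pods are ready (~55s and later), the 30-second
# burst is already over -- the scaler only ever chases the spike.
```

The exact numbers don’t matter; any reactive, metric-averaged scaler will lag behind a burst that is over within a few tens of seconds.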
This is why we chose to use AWS Lambda (Function as a Service), applying a producer-consumer pattern that works very well for this case. The advantages of using Lambda for this purpose include:
We have structured the task so that there is one data set, and the computational work can be done on small chunks and aggregated back into a single result (a scatter-gather pattern), as sketched below.
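A minimal sketch of that scatter-gather step, assuming a worker Lambda (called `optimize-chunk` here, a hypothetical name) that takes one chunk and returns its partial result; the real payloads, chunking strategy, and aggregation logic are of course more involved:

```python
import json
from concurrent.futures import ThreadPoolExecutor

import boto3

lambda_client = boto3.client("lambda")

def invoke_chunk(chunk: list) -> dict:
    """Scatter: synchronously invoke one worker Lambda on one chunk."""
    response = lambda_client.invoke(
        FunctionName="optimize-chunk",     # hypothetical worker function
        InvocationType="RequestResponse",  # block until the partial result
        Payload=json.dumps({"chunk": chunk}),
    )
    return json.loads(response["Payload"].read())

def scatter_gather(dataset: list, chunk_size: int = 100) -> list:
    chunks = [dataset[i:i + chunk_size]
              for i in range(0, len(dataset), chunk_size)]
    # Scatter: one Lambda per chunk. They spin up in parallel within
    # seconds, which is exactly the burst behavior we need.
    with ThreadPoolExecutor(max_workers=max(1, len(chunks))) as pool:
        partials = list(pool.map(invoke_chunk, chunks))
    # Gather: aggregate the partial results into a single result.
    return [item for partial in partials for item in partial["results"]]
```

Synchronous `RequestResponse` invocations keep the producer simple; for longer-running chunks, asynchronous `Event` invocations with partial results written to S3 or a queue would be the natural variant.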
The outline of the steps goes like this:
Using this flow, we pay for exactly the computational time we need and take advantage of AWS Lambda’s sub-second “cold start” (spin-up time).
One thing to notice is that this pattern stops being effective when execution times grow long: once a task runs for minutes rather than seconds, the overhead of spawning new nodes (“cold start”) and pods, and the corresponding teardown, becomes a minor part of the overall process. At that point AWS Lambda’s pricing works against you, since it is significantly more expensive than EC2 instances, especially once spot instances are taken into account.
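As a back-of-the-envelope illustration of that break-even point, here is a hypothetical cost comparison. All prices are assumptions loosely based on public us-east-1 list prices, which vary by region and change over time:

```python
# Illustrative assumptions, loosely based on us-east-1 list prices;
# check current AWS pricing before drawing real conclusions.
LAMBDA_GB_SECOND = 0.0000166667   # $ per GB-second of Lambda compute
LAMBDA_PER_REQUEST = 0.0000002    # $ per Lambda invocation
EC2_ON_DEMAND_HOURLY = 0.096      # $ per hour, m5.large (2 vCPU, 8 GiB)
EC2_SPOT_DISCOUNT = 0.65          # assumed ~65% spot discount

def lambda_cost(invocations: int, seconds_each: float, memory_gb: float) -> float:
    per_call = seconds_each * memory_gb * LAMBDA_GB_SECOND + LAMBDA_PER_REQUEST
    return invocations * per_call

def ec2_cost(total_compute_seconds: float, spot: bool = False) -> float:
    discount = EC2_SPOT_DISCOUNT if spot else 0.0
    return total_compute_seconds / 3600 * EC2_ON_DEMAND_HOURLY * (1 - discount)

# A short burst: 500 chunks x 20 seconds x 1.5 GB each.
print(f"Lambda:   ${lambda_cost(500, 20, 1.5):.2f}")      # ~$0.25
print(f"EC2 spot: ${ec2_cost(500 * 20, spot=True):.2f}")  # ~$0.09, but only if
# those instances were already running -- for a 20-second task they cannot be
# started in time. Make each task 10x longer and the gap is ~$2.50 vs ~$0.93,
# while instance start-up becomes a small fraction of the runtime.
```

For sub-minute bursts, Lambda’s premium buys instant elasticity; as task duration grows, the EC2 price advantage dominates and the trade-off flips.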