Optibus creates a continuous plan for mass transit: city-wide planning and scheduling, determining where every vehicle and driver should go, while ensuring that the service is both cost-effective and on-time. If you’re familiar with scheduling problems (the trade name for these problems is vehicle and crew scheduling, aka blocking and runcutting, as well as rostering), you probably know that Optibus deals with NP-hard problems. We tackle them by applying sophisticated distributed optimization algorithms and AI (to predict, among other things, the likelihood of on-time service). Addressing these NP-hard problems is compute intensive, so one of the challenges we face is how to harness and parallelize cloud computing resources in a smart way that gives us high distribution capabilities with minimal cost overhead.
In this post I’ll discuss why we use Function as a Service (FaaS) on top of cloud-based resources. It’s true that in the cloud you can always add resources when they are required and release them when they are not, yet it isn’t that simple in burst-based, compute-intensive situations such as ours, and this is where FaaS comes into play. Users of Optibus (typically transit schedulers and planners) use our system to quickly create transit plans and schedules. When they do, our backend immediately kicks off several “burst-based” compute-intensive operations, all running in parallel, distributing the data across many different nodes.
Let’s start with an example: a scenario where we need to go over a huge list of strings and apply some parsing logic, and where this scenario happens in response to a sporadic event:
We could distribute the parsing logic across as many nodes as possible, and give each node a very small portion of the list to maximize distribution (theoretically even a single string). In this case each node will parse some strings and we’ll aggregate the results afterwards. This is a fairly simple MapReduce problem.
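As a rough illustration, here is a minimal Python sketch of that map/reduce shape. The parse_string logic and the chunk size are placeholders, not our actual parsing code:

```
from concurrent.futures import ProcessPoolExecutor

def parse_string(s):
    # Placeholder parsing logic; in reality this is whatever
    # per-string work needs to be distributed.
    return s.strip().lower()

def parse_chunk(chunk):
    # "Map" step: each node (here, a worker process) parses its slice.
    return [parse_string(s) for s in chunk]

def parse_all(strings, chunk_size=1000):
    # Split the huge list into small portions to maximize distribution.
    chunks = [strings[i:i + chunk_size]
              for i in range(0, len(strings), chunk_size)]
    with ProcessPoolExecutor() as pool:
        per_chunk = pool.map(parse_chunk, chunks)
    # "Reduce" step: aggregate the per-chunk results afterwards.
    return [parsed for chunk in per_chunk for parsed in chunk]
```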
This simple hypothetical scenario happens in Optibus in a more complex form, where instead of scanning a huge list of strings, we need to optimize the transportation of an entire city (think of it as a huge list of transportation-related demand with supply constraints). As in the strings example, we decompose the problem into many sub-problems (this part is non-trivial by itself and worth its own blog post, sometime in the future). We then optimize each sub-problem separately (here’s an example of depots) and combine the results into an aggregated optimization problem. This process repeats itself many times with different parameters until convergence.
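In the same spirit, the overall loop looks roughly like the sketch below. Every callable here (decompose, optimize_subproblem, aggregate, and so on) is a stand-in for machinery that is far more involved in practice:

```
def optimize_city(problem, params, decompose, optimize_subproblem,
                  aggregate, update_params, converged, max_rounds=50):
    # Hypothetical outline of the decompose/optimize/aggregate loop.
    solution = None
    for _ in range(max_rounds):
        if solution is not None and converged(solution, params):
            break
        # Split the city-wide problem into many sub-problems
        # (e.g. per depot); this decomposition is non-trivial.
        subproblems = decompose(problem, params)
        # Each sub-problem is optimized independently and in parallel;
        # this is exactly the part we fan out to FaaS.
        partials = [optimize_subproblem(sp) for sp in subproblems]
        # Combine partial results into an aggregated solution,
        # then adjust parameters and repeat until convergence.
        solution = aggregate(partials)
        params = update_params(params, solution)
    return solution
```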
There is an inherent tradeoff between performance and efficiency, controlled by the number of nodes we use. Keeping a huge number of nodes running all the time is only efficient if they are utilized most of the time; with too many nodes up, it is very rare that they operate at anything close to 100% capacity. The result is too many idle nodes waiting to execute jobs, which is an expensive outcome. A second alternative is to spawn nodes on demand, only when a job requires them. However, in this case a node takes 30-60 seconds to become available, and you pay both for execution time and for the time the node spends booting (idle). So spawning 1,000 nodes results in 500-1,000 minutes of wasted cost and an extra latency of 30-60 seconds until the job starts processing.
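To make the spawn-time math concrete:

```
nodes = 1000
boot_seconds = (30, 60)  # typical time from spawn until a node is available

# Paid-for idle boot time across all nodes, in minutes.
wasted_minutes = [nodes * s / 60 for s in boot_seconds]
print(wasted_minutes)  # [500.0, 1000.0]
```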
The third option is to use Function as a Service (FaaS) to gain maximum distribution for burst jobs, and this is what we do at Optibus. A FaaS offering such as AWS Lambda lets you deploy a single function, execute it, and pay only for the portion of time it ran. The “spawn” time is not something you pay for or even control; it is up to the cloud provider to keep containers available to execute the function. Since the provider shares resources across all of its accounts, it is a lot easier for it to utilize the infrastructure much better and keep nodes available for burst operations, amortizing resource usage across many different accounts.
What’s most interesting in the mass transit world is the massive amount of work it takes to return optimized results, reducing costs for the mass transit provider and allowing for a better service for passengers. We’re talking about billions of options that need to be considered and optimized, while still returning results really quickly. That’s why one of our first and foremost design goals was to have optimizations run in seconds or minutes (and they do). This is also where we use FaaS, so that we can improve performance and scale on demand.
Using FaaS for burst jobs comes with its fair share of complications. You need to make sure each invocation ends within the time limit of the FaaS provider (e.g. 5 minutes for AWS Lambda), and you need to serialize and deserialize data smartly and efficiently so as not to incur too much constant overhead. You also need to make sure the data the functions require can be accessed with high concurrency, since there will be many parallel invocations. This constant overhead is the bottleneck for distribution performance (Amdahl’s law). Another non-trivial issue is losing JIT (just-in-time compilation) optimizations in languages that support JIT: each invocation can start on a fresh container that never underwent a “warmup” process, which again adds to the inefficiency (which you pay for). One way to deal with this is to build pre-warmed containers during the build process, snapshot their memory state to disk, and have the function’s init code start by loading these snapshots as the entry point.
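To give a feel for the plumbing involved, here is a hedged sketch of fanning work out to AWS Lambda with boto3. The function name and payload shape are hypothetical, and real code would also need retries, chunk sizing tuned to the time limit, and a more compact serialization format than JSON:

```
import json
from concurrent.futures import ThreadPoolExecutor

import boto3

lambda_client = boto3.client("lambda")

def invoke_chunk(chunk):
    # Each invocation must finish within the provider's time limit,
    # so chunks are kept small. JSON is used here for simplicity; a
    # binary format would reduce the constant (de)serialization overhead.
    response = lambda_client.invoke(
        FunctionName="parse-chunk",        # hypothetical function name
        InvocationType="RequestResponse",
        Payload=json.dumps({"strings": chunk}).encode(),
    )
    return json.loads(response["Payload"].read())

def fan_out(strings, chunk_size=500, max_concurrency=100):
    chunks = [strings[i:i + chunk_size]
              for i in range(0, len(strings), chunk_size)]
    # Many parallel invocations: the data each function needs must be
    # accessible at this level of concurrency.
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        results = list(pool.map(invoke_chunk, chunks))
    return [item for r in results for item in r]
```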
Basically, the “inefficiency” moves from idle nodes and spawn time to the constant overhead of getting the data to the function and having the function run efficiently (JITed), so this is where you should focus your efforts. By minimizing it you can theoretically use hundreds or thousands of concurrent functions, based on what the underlying provider gives you.
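Amdahl’s law makes this point quantitatively: if a fraction s of each job is serial constant overhead (serialization, data transfer, cold starts), the speedup from N concurrent functions is capped at 1 / (s + (1 − s) / N). A quick illustration with made-up numbers:

```
def amdahl_speedup(serial_fraction, n):
    # Maximum speedup when a fraction of the work cannot be parallelized.
    return 1 / (serial_fraction + (1 - serial_fraction) / n)

# With 5% constant overhead, 1,000 concurrent functions give at most
# ~20x speedup, so shrinking that overhead matters more than adding functions.
print(amdahl_speedup(0.05, 1000))  # ~19.6
print(amdahl_speedup(0.01, 1000))  # ~91.0
```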
When we were developing this mechanism we evaluated several alternatives for our FaaS platform. We chose Binaris’ platform because, first and foremost, it is much faster than the alternatives and more cost effective. Moreover, we needed support for the PyPy Python interpreter, which Binaris supports out of the box. By combining the two we were able to improve our performance and cost utilization by 20x in critical aspects of our system.
Happy scaling!