The culprit was the 8MB of memory allocated to each pure POSIX thread. Every concurrently executing thread was assigned its own 8MB, so 400 threads required roughly 3.2GB, and we ran out of memory far before we ran out of CPU resources. We attempted to fix the issue by using setrlimit(RLIMIT_DATA) and setrlimit(RLIMIT_STACK), but to no avail. Setting the stack size through the POSIX threads API was inconsistent in its effects: sometimes the stack size was modified, only to be raised back to the 8MB limit. The kernel appeared to be re-adjusting the value to 8MB automatically, for reasons that we have yet to discover. Any theories or explanations regarding this behavior would be greatly appreciated.
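To make the arithmetic concrete, here is a minimal Python sketch. It is not the C++ code used in these tests. The helper names reserved_stack_bytes and run_with_stack_size are hypothetical, and the 256KB stack used in the usage note is only an illustrative value. threading.stack_size is the standard-library call for requesting a stack size for threads created afterwards. Whether a given platform honors the request is exactly the kind of behavior we found inconsistent.

```python
import threading

MIB = 1024 * 1024


def reserved_stack_bytes(thread_count, stack_bytes):
    """Memory reserved for thread stacks alone, ignoring heap and other data."""
    return thread_count * stack_bytes


def run_with_stack_size(stack_bytes, target):
    """Run target in one new thread created with the requested stack size.

    threading.stack_size() may raise ValueError for an invalid size or
    RuntimeError if the platform does not allow changing the stack size.
    """
    previous = threading.stack_size(stack_bytes)  # affects threads created from now on
    try:
        worker = threading.Thread(target=target)
        worker.start()
        worker.join()
    finally:
        threading.stack_size(previous)  # restore the earlier setting
```

Calling reserved_stack_bytes(400, 8 * MIB) gives 3,355,443,200 bytes, which is the 400 x 8MB figure above. A call such as run_with_stack_size(256 * 1024, some_function) asks for a much smaller stack for that one thread.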
The pure POSIX threads approach ran out of memory far short of our goal of scaling to 100K concurrent web service requests, while the two MODEST-based approaches seemed to be scaling quite nicely.
The graph shows several other things that are important to note.
The first thing that you may notice from the graph is that MODEST operations and fibers have a performance advantage over the pure POSIX thread-based approach when things are going as planned: the speed-up is roughly 100%. This is largely because the kernel does not have to schedule as many threads as it does in the pure POSIX thread-based approach, which reduces the time spent context switching.
The second thing that you may notice is that the fiber approach seems to be slower than the operation-based approach. What the graph does not display, however, is how long each task waits between making progress. The pure POSIX thread-based and fiber-based approaches are truly concurrent, meaning that work is completed across all web service requests at an even pace. The operation-based approach is not truly concurrent: only 4 operations are running at any given time, each one to completion, while all other operations wait for a processor to become available.
So, while throughput stays roughly the same for operations vs. fibers, latency suffers in the operation-based approach. The fiber-based approach (blue line) is slightly slower than the operation-based approach (green line) because the fiber scheduler runs each time a fiber completes one iteration (out of a total of 50 iterations) of a work unit. The operation-based approach never has to run a fiber scheduler.
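The difference between the two scheduling styles can be sketched with a toy model in Python. This is not MODEST's scheduler: it assumes a single processor rather than 4, uses generators as stand-ins for fibers, and every name in it (work_unit, run_as_fibers, run_as_operations, first_progress) is hypothetical.

```python
from collections import deque


def work_unit(task_id, iterations):
    """A task that hands back control after every iteration, like a fiber."""
    for step in range(iterations):
        yield task_id, step


def run_as_fibers(task_count, iterations):
    """Round-robin: the scheduler runs after each iteration of each task."""
    ready = deque(work_unit(t, iterations) for t in range(task_count))
    order = []
    scheduler_runs = 0
    while ready:
        fiber = ready.popleft()
        scheduler_runs += 1  # cost paid on every pass, including the final one
        try:
            order.append(next(fiber))
            ready.append(fiber)  # not finished yet, so queue it again
        except StopIteration:
            pass  # fiber finished, drop it
    return order, scheduler_runs


def run_as_operations(task_count, iterations):
    """Run to completion: each task finishes before the next one starts."""
    order = []
    for t in range(task_count):
        order.extend(work_unit(t, iterations))
    return order


def first_progress(order, task_id):
    """How many iterations ran before task_id made any progress at all."""
    return order.index((task_id, 0))
```

With 3 tasks of 50 iterations each, both styles perform the same 150 iterations of work. Under the fiber model the last task makes its first progress after only 2 other iterations, but the scheduler runs 153 times. Under the run-to-completion model the last task waits for 100 iterations before it starts, and no scheduler runs at all.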
Dave Longley, Digital Bazaar’s CTO, deserves most (if not all) of the credit for doing a tremendous job researching, designing and developing the MODEST Web Services Processor. Without his attention to detail, excellent design, and penchant for solving the hard problems, the technology that will carry Digital Bazaar into the future would not exist.
David Lehn, Digital Bazaar’s Master of Ceremonies and Long-term Technology Strategy, was responsible for performing the scalability research and convincing us to run these tests in the first place. When we discounted his desire to perform scalability testing, writing it off as an academic endeavor, he didn’t listen to us and performed the scalability testing anyway. Through his tireless efforts, he has thoroughly convinced us that there is great value in performing this sort of scalability testing during development. What can we say, we love Dave Lehn.
This concludes part one of this two-part series. The next post on MODEST will explain how we scale past the pure POSIX thread-based approach and finally achieve good throughput on 100K+ concurrent requests, as shown in the graph above.