- The web-service doesn’t perform
- We have a performance problem we’re investigating
Sound familiar? Is the web-service under test really the performance bottleneck? In my experience there are myriad other considerations: load-balancers, network, application container configuration, scaling the client, external dependencies. My proposition is that the performance bottleneck isn’t always related to the resource under test!
First, let me set the stage by describing the technical components in our “Inventory LIVE!” project:
- 3 types of independently testable load-balanced web-services
- 12 Apache Tomcat JVMs
- An Oracle Coherence data-grid, consisting of 26 x 16G machines
- A total of 90 JVMs exposing rich JMX metrics and telemetry
- HornetQ internal messaging, abstracted from our clients
- Streaming over HTTP, using REST, JAXB, FastInfoset, XML and Jersey
- 200 million domain objects
One core goal of the Inventory LIVE! project was to deliver a ‘snapshot’ of our 200 million domain objects to our clients in a short time period. Once the functional testing effort was out of the way, we focused squarely on the complexity of performance testing. I was personally thrilled and excited about our new and unique approach to shipping such an inordinate amount of data to our clients!
With our Coherence grid fully loaded at 200 million ‘mock’ domain objects (we use a mock data loader), we launched multiple clients against the web-service to test the performance and timeliness of our HTTP streaming approach. An hour into the test, I was puzzled and had A LOT of questions for the team. Why had we only streamed 2 million offers so far? Why did it seem so much slower at scale? After all, I had successfully streamed about 100 thousand domain objects in a really short time period, but testing at scale didn’t produce comparable results.
After a few test runs with different data set sizes and further indifferent results, we attempted to locate bottlenecks. We engaged in both peer-review and profiling (we use YourKit Java Profiler) in order to identify root causes. Unfortunately the peer-review and profiling served only to reinforce the quality and craftsmanship of my engineering counterparts. So what precisely was slowing us down?
The Tomcat instances housing the web-services were under average load during our HTTP streaming test. CPU utilization was around 20-30% while memory allocation and garbage collection appeared healthy. We were convinced that our web-service JVMs could do more: there was a lot more CPU horsepower to leverage. We also focused our attention on monitoring the Coherence data-grid resource utilization. At the peak of the performance test, the data-grid CPU utilization was around 5% max. Still inconclusive results! We scratched our heads: where was our bottleneck?
After lengthy debate, it became apparent that the next logical step was to scale out the client layer. Rather than streaming millions of domain objects to multiple clients on a single machine, we decided to utilize multiple clients across multiple machines. The thinking was that we might be saturating the network for a single client.
Our Network Operations Team recently launched an internal cloud, ‘Cloudzilla’. Cloudzilla puts provisioning in the hands of both our quality assurance engineers and software engineers, and it made deploying multiple clients across five machines a ten-minute operation.
With our clients now scaled across multiple machines, I was extremely eager to observe the results of our next performance test. As I let the scaled-out test loose, I was shocked and surprised to see the results! We hit 50-70% CPU utilization on all Tomcat instances (much more than the previous 20-30%), and observed the same steady 5% CPU utilization across our Coherence data-grid. About an hour into the performance test, we’d streamed 100 million domain objects!
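To put the two runs side by side, here is a rough back-of-the-envelope sketch in Python. It assumes both measurements were taken at exactly the one-hour mark and that throughput stays constant for the whole snapshot; both are simplifications, since the figures above are approximate. The function and variable names are made up for illustration.

```python
def projected_hours(total_objects, objects_streamed, elapsed_hours):
    """Estimate hours to stream everything, assuming a constant rate."""
    rate_per_hour = objects_streamed / elapsed_hours
    return total_objects / rate_per_hour


TOTAL_OBJECTS = 200_000_000  # size of the full snapshot

# Figures from the two test runs, both treated as "about one hour" (assumption).
single_machine_hours = projected_hours(TOTAL_OBJECTS, 2_000_000, 1.0)
scaled_out_hours = projected_hours(TOTAL_OBJECTS, 100_000_000, 1.0)
speedup = 100_000_000 / 2_000_000

print(single_machine_hours)  # 100.0
print(scaled_out_hours)      # 2.0
print(speedup)               # 50.0
```

Under those assumptions, the full snapshot goes from roughly 100 hours to roughly 2 hours, a 50x difference that came entirely from changing the client side of the test.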
The silver bullet for us was scaling out at the client layer, which is to say that the web-service was not the bottleneck! The benefits of scaling out at the client layer included:
- Clients now had access to more memory and CPU
- Clients now had access to more network I/O
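One way to spot a client-side limit earlier is to have the test client time itself. The sketch below is not what we ran; it is a minimal, hypothetical Python illustration that separates the time a client spends waiting on the stream from the time it spends processing what it received. The names `profile_stream_consumer` and `process` are invented for this example, and an in-memory buffer stands in for the HTTP response.

```python
import io
import time


def profile_stream_consumer(stream, process, chunk_size=64 * 1024):
    """Read `stream` to exhaustion, timing reads and processing separately."""
    read_seconds = 0.0
    process_seconds = 0.0
    total_bytes = 0
    while True:
        started = time.perf_counter()
        chunk = stream.read(chunk_size)
        after_read = time.perf_counter()
        read_seconds += after_read - started
        if not chunk:
            break  # end of stream
        process(chunk)  # e.g. unmarshal domain objects (hypothetical)
        process_seconds += time.perf_counter() - after_read
        total_bytes += len(chunk)
    return {
        "bytes": total_bytes,
        "read_seconds": read_seconds,
        "process_seconds": process_seconds,
    }


# Demo with an in-memory stream standing in for a real HTTP response body.
seen = []
fake_response = io.BytesIO(b"x" * 1_000_000)
stats = profile_stream_consumer(fake_response, lambda chunk: seen.append(len(chunk)))
print(stats["bytes"])  # 1000000
```

If `process_seconds` dominates, the client is spending its time on its own CPU work, and giving it more machines is likely to help. If `read_seconds` dominates, the client is waiting on something upstream, which could be the network or the service, so this measurement alone will not tell you which.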
The team also decided to further scale out the web-service layer, which yielded an additional performance boost, but scaling out the client layer provided gains on a different order of magnitude.
The Inventory LIVE! project has been, and continues to be, a great journey. The scale of the data, and the complexity of performance testing and monitoring 90 JVMs, has made my Inventory LIVE! learning curve exponential. I probably have more performance testing experiences than fit in this single post!
Comments (1)
Cloudzilla was launched by the Linux Sysops team