Performance Monitoring of Injector Node Recommendations
As the lead performance engineer at SparkPost, I get asked to simulate a lot of different customer injection profiles to look for possible performance problems with a given approach. While the specifics of each of our customers can differ enormously, the number of hardware problem areas that we need to worry about is relatively small. With that in mind, here are a handful of tips and tricks that I use to find performance problems on my injector nodes. Injector node is the name we give to any server that is doing REST or SMTP injections into a SparkPost environment. The key here is to find the performance bottleneck and focus on that problem.
Something like low throughput is just a symptom of a different problem altogether. The key is to find the bottleneck and eliminate it. The first thing to consider is what to monitor. The second is how closely to monitor it. Monitoring is not free: if you are interested in CPU utilization, the act of monitoring CPU utilization burns CPU. If you aren’t careful about how closely you monitor something, you run the risk of the monitoring becoming your bottleneck. Over time, I’ve settled on sampling every 60 seconds during all my performance tests, mostly because that is often enough to give me decent graphs while doing very little damage to the overall performance of the product.
The big areas of concern for generic performance monitoring are CPU, memory, disk I/O, and network. These four areas represent the cornerstones of the performance of a given environment during an injection. The following tools are for Linux environments; you will need the sysstat and procps packages installed to use the commands listed below.
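As a rough illustration of the 60-second cadence, here is a minimal shell sketch. The helper name sample_every is hypothetical and is not part of sysstat or procps; it simply runs a command a fixed number of times with a pause in between and stamps each sample with the time.

```shell
# Hypothetical helper: run a command COUNT times, INTERVAL seconds apart,
# printing a timestamp line before each sample.
# Usage: sample_every INTERVAL COUNT COMMAND [ARGS...]
sample_every() {
    se_interval=$1
    se_count=$2
    shift 2
    se_i=0
    while [ "$se_i" -lt "$se_count" ]; do
        printf '=== %s ===\n' "$(date '+%Y-%m-%d %H:%M:%S')"
        "$@"
        se_i=$((se_i + 1))
        # Sleep between samples, but not after the last one.
        if [ "$se_i" -lt "$se_count" ]; then
            sleep "$se_interval"
        fi
    done
    return 0
}
```

For example, `sample_every 60 60 free -k >> mem.log` would capture an hour of memory samples. mpstat, iostat, and sar accept an interval and a count directly (for example `mpstat -P ALL 60 60`), so a wrapper like this is mainly useful for commands that do not.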
For CPU, there is a great Linux tool that provides a sufficient amount of information: the mpstat command, which tells you a lot about the CPU’s activities. I recommend the -P ALL option because it lets you see what each individual CPU is doing. On more than one occasion this has revealed that a program wasn’t multithreading. What you’ll see in that case is a single CPU running at near 100% utilization while the rest of them are basically doing nothing. If you just run a plain mpstat you can’t see this, because the totals are averaged across all the available CPUs. So when you are looking for the performance bottleneck, make sure that you are maximizing the use of your CPUs.
$ mpstat -P ALL
Linux 2.6.32-573.3.1.el6.x86_64 (ws)  02/02/2016  _x86_64_  (16 CPU)
02:43:28 PM  CPU   %usr  %nice   %sys %iowait   %irq  %soft %steal %guest  %idle
02:43:28 PM  all   0.03   0.00   0.03    0.03   0.00   0.00   0.00   0.00  99.91
02:43:28 PM    0   0.03   0.00   0.02    0.02   0.00   0.00   0.00   0.00  99.93
02:43:28 PM    1   0.01   0.00   0.01    0.01   0.00   0.00   0.00   0.00  99.96
02:43:28 PM    2   0.00   0.00   0.00    0.01   0.00   0.00   0.00   0.00  99.98
02:43:28 PM    3   0.01   0.00   0.01    0.38   0.00   0.00   0.00   0.00  99.61
02:43:28 PM    4   0.01   0.00   0.06    0.04   0.00   0.00   0.00   0.00  99.89
02:43:28 PM    5   0.01   0.00   0.02    0.01   0.00   0.00   0.00   0.00  99.96
02:43:28 PM    6   0.00   0.00   0.00    0.01   0.00   0.00   0.00   0.00  99.99
02:43:28 PM    7   0.00   0.00   0.00    0.01   0.00   0.00   0.00   0.00  99.99
02:43:28 PM    8   0.02   0.00   0.02    0.01   0.00   0.00   0.00   0.00  99.95
02:43:28 PM    9   0.11   0.00   0.08    0.01   0.00   0.00   0.00   0.00  99.80
02:43:28 PM   10   0.01   0.00   0.02    0.01   0.00   0.00   0.00   0.00  99.97
02:43:28 PM   11   0.08   0.00   0.06    0.03   0.00   0.01   0.00   0.00  99.83
02:43:28 PM   12   0.01   0.00   0.01    0.01   0.00   0.00   0.00   0.00  99.97
02:43:28 PM   13   0.16   0.00   0.08    0.01   0.00   0.00   0.00   0.00  99.76
02:43:28 PM   14   0.01   0.00   0.02    0.00   0.00   0.00   0.00   0.00  99.96
02:43:28 PM   15   0.02   0.00   0.01    0.00   0.00   0.00   0.00   0.00  99.97
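The single-busy-CPU pattern can also be checked mechanically. The following is a sketch only: hot_cpus is a hypothetical helper name, the default threshold of 90 is an arbitrary assumption, and the parsing assumes a header row containing CPU and %idle, as in the sample output above, plus a locale that prints a period as the decimal separator.

```shell
# Hypothetical helper: read mpstat -P ALL output on stdin and list the
# individual CPUs whose busy percentage (100 - %idle) is at or above a
# threshold. The default threshold of 90 is an assumption for illustration.
hot_cpus() {
    hc_threshold=${1:-90}
    awk -v thr="$hc_threshold" '
        # Skip the summary block that mpstat prints when given a count.
        $1 == "Average:" { next }
        # Header row: remember which columns hold the CPU id and %idle.
        /%idle/ {
            for (i = 1; i <= NF; i++) {
                if ($i == "CPU") c = i
                if ($i == "%idle") d = i
            }
            next
        }
        # Data rows for individual CPUs (the "all" row is not numeric).
        c && d && $c ~ /^[0-9]+$/ {
            busy = 100 - $d
            if (busy >= thr) printf "cpu %s busy %.2f\n", $c, busy
        }
    '
}
```

A possible invocation is `mpstat -P ALL 60 1 | hot_cpus 90`. Passing an interval matters here: without one, mpstat reports averages since system startup rather than what the CPUs are doing right now. One line of output while the overall average stays low suggests the workload is not spreading across CPUs.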
Another area to look for bottlenecks is memory utilization. Sometimes the injection tools we use make dynamic memory errors, and it helps to monitor memory on your environment so you can detect when something like this has happened. Depending on the language used for your injector, this may or may not be an issue. I use the free command for this kind of check. It’s program independent, so you can easily detect generic leaks in your injector environment with it. The free command accepts -k, -m, and -g to switch between kilobytes, megabytes, and gigabytes.
$ free -k
             total       used       free     shared    buffers     cached
Mem:      32869504    6523148   26346356       1316     434112    5206944
-/+ buffers/cache:     882092   31987412
Swap:     16777212          0   16777212
This gives you a high-level overview of what’s happening with memory on the system. pidstat can also provide process-level monitoring if you aren’t sure which process is using up your memory during an injection. Generally, though, it is easier to run a plain top and look at memory utilization there to find the guilty party in a memory leak investigation. At that point, you’ll switch to the profiling tool of your choice, rather than depending on external tools, so you can get the specific details of the problem.
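To watch for a leak over a long run, it can help to reduce each free sample to a single number. The sketch below assumes the older free layout shown above, with its "-/+ buffers/cache:" line; newer procps releases print a different layout (buff/cache and available columns), for which this parser would print nothing. The name app_mem_used is hypothetical.

```shell
# Hypothetical helper: read free output on stdin and print the memory used
# by applications (excluding buffers and cache) plus that figure as a
# percentage of total memory. Assumes the older "-/+ buffers/cache:" line.
app_mem_used() {
    awk '
        $1 == "-/+" {
            used = $3
            avail = $4
            if (used + avail > 0)
                printf "%d %.1f\n", used, 100 * used / (used + avail)
        }
    '
}
```

Run as `free -k | app_mem_used` once per sample interval and logged alongside a timestamp, a first column that keeps climbing during an otherwise steady injection is a hint that something is leaking.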
Another area where your injector can suffer performance problems is disk I/O. If your injector program makes lots of inefficient calls to disk, it can end up slowing itself down without you even realizing it. Database calls can also cause problems on this front, especially if your database is doing some garbage collection during your run that you are unaware of. The easiest way to detect this is to use the iostat command. I usually use iostat -m -t when I’m monitoring.
$ iostat -m -t
Linux 2.6.32-573.3.1.el6.x86_64 (ws)  02/02/2016  _x86_64_  (16 CPU)
02/02/2016 03:07:42 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.03    0.00    0.03    0.03    0.00   99.91
Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda               0.84         0.00         0.01       4943      28208
dm-0              2.91         0.00         0.01        782      27231
dm-1              0.00         0.00         0.00          1          0
dm-2              0.13         0.00         0.00       4144        976
The transactions per second (tps) value is actually the elusive IOPS number that you’ve been looking for. It’s a more generic way to talk about how much the disk is being utilized. Once again, depending on how you have your environment set up, you may be overutilizing a single disk doing reads and writes for your injector. RAM drives take work off the disk entirely, and RAID setups spread the work out between multiple disks; both are easy ways to increase your efficiency.
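To spot a single overworked disk quickly, the tps column can be reduced to the busiest device. This is a sketch under assumptions: busiest_device is a hypothetical name, and the parser expects a header row starting with "Device" that contains a tps column, as in the sample output above.

```shell
# Hypothetical helper: read iostat output on stdin and print the device
# with the highest tps value, followed by that value.
busiest_device() {
    awk '
        # Header row of the device table: find the tps column.
        $1 ~ /^Device/ {
            for (i = 1; i <= NF; i++) if ($i == "tps") t = i
            in_tbl = 1
            next
        }
        # A blank line ends the device table.
        in_tbl && NF == 0 { in_tbl = 0 }
        # Track the largest tps seen so far.
        in_tbl && t && $t + 0 > max { max = $t + 0; dev = $1 }
        END { if (dev != "") printf "%s %.2f\n", dev, max }
    '
}
```

With the sample output above, `iostat -m -t | busiest_device` would print dm-0 with 2.91. Keep in mind that an iostat call without an interval reports averages since boot, so numbers taken this way describe the machine's history rather than the current test run.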
The final generic area to monitor during your injections is networking. Sometimes you check everywhere else and everything appears to be working normally, but throughput is still suffering. The tool to use here is sar, which shows the status of the NIC you are using to communicate with the outside world.
$ sar -n DEV 1 1
Linux 2.6.32-573.3.1.el6.x86_64 (ws)  02/02/2016  _x86_64_  (16 CPU)
03:14:51 PM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s
03:14:52 PM        lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00
03:14:52 PM      eth0      6.00      1.00      0.85      0.18      0.00      0.00      3.00
03:14:52 PM      eth1      0.00      0.00      0.00      0.00      0.00      0.00      0.00
Your focus should be the eth interfaces. Often you’ll see one that’s busy while any others are mostly idle. The tx and rx column prefixes stand for transmitting and receiving respectively, so the key here is to look at those numbers for a bottleneck. For example, the graph below is a combined total of both values showing a network bottleneck.
The combined throughput maxed out the capacity of the environment for the entire period of the hour-long test run. By monitoring networking, I was able to see that and point to it as the bottleneck that we needed to focus on.
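A combined receive-plus-transmit figure like the one described above can be derived from the sar output directly. The following is a minimal sketch, not the method used to produce the original graph: iface_throughput is a hypothetical name, and the parser assumes a header row containing IFACE, rxkB/s, and txkB/s as in the sample output.

```shell
# Hypothetical helper: read sar -n DEV output on stdin and print, for each
# interface other than loopback, the combined rxkB/s + txkB/s.
iface_throughput() {
    awk '
        # Skip the summary block so each interface is reported once per sample.
        $1 == "Average:" { next }
        # Header row: locate the interface, receive, and transmit columns.
        /IFACE/ {
            for (i = 1; i <= NF; i++) {
                if ($i == "IFACE") f = i
                if ($i == "rxkB/s") r = i
                if ($i == "txkB/s") t = i
            }
            next
        }
        # Data rows: add receive and transmit rates together.
        f && r && t && NF >= t && $f != "lo" {
            printf "%s %.2f\n", $f, $r + $t
        }
    '
}
```

For example, `sar -n DEV 60 1 | iface_throughput` gives one combined number per interface for the last minute. Comparing that number with the capacity of the link (roughly 120,000 kB/s for a 1 Gbit/s NIC, before protocol overhead) shows how close the interface is to saturation.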
Using all the above tools is an easy way to gain insight into what’s happening on your system and what you should focus on as you try to optimize the performance of your injector. Sometimes something as simple as putting your disks in RAID 10, for example, can pay massive dividends in database behavior. This allows you to extend the life of existing hardware or focus future purchases specifically on the bottleneck your software is facing today.