Performance-Regression Pitfalls Every Project Should Avoid
possible” and gather information at specific points in time. This allows us to use a bisection algorithm when doing root-cause analysis. The more often we can collect data, the easier the root-cause process will be. Our tests run roughly six times per day, so the window for a regressing change to occur is roughly four hours.

Our test process requires that the system be rebuilt for every test. This ensures completely reproducible results.

Figure 2 shows a typical test process from start to finish.

You’ll notice that this is a shotgun approach to validating the software performance of the ecosystem. While a more specific or targeted approach might be preferred, there just isn’t enough computing power or manpower in the world to test every permutation of software in a stack. It’s better to think of the testing process as analogous to creating a video: taking a picture at frequent-enough intervals to view performance as a process in motion. If issues arise, they can be revisited with a slow-motion camera to capture additional detail.

Figure 2: This procedure generates a rolling set of results that are captured multiple times a day (see Figure 3).

[Figure 3 chart: test results over time, in MiB/s, September through July.]

Figure 3: Results for tests of a specific workload show the effects of specific updates (large dots) on performance. The red dot indicates the point at which the results dropped below the computed regression threshold.
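The article doesn’t spell out how the regression threshold in Figure 3 is computed. One common approach, shown in the minimal Python sketch below, is to flag any result that falls more than a few standard deviations below a trailing window of recent passing results; the window contents, the factor k, and the sample numbers are illustrative assumptions rather than the authors’ actual method.

import statistics

def regression_threshold(history, k=3.0):
    # Lower bound below which a new result is flagged as a regression.
    # `history` holds recent passing results (MiB/s, higher is better).
    return statistics.fmean(history) - k * statistics.stdev(history)

recent = [238.1, 240.4, 236.9, 239.7, 241.2, 237.5]
new_result = 212.0
if new_result < regression_threshold(recent):
    print("regression: result fell below the computed threshold")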
KEYS TO SUCCESSFUL TESTING

Start with a strategy.
A test is only as effective as its design. Before capturing any data, think through your strategy. Start by understanding exactly what you’re trying to measure and work backward. For example, when testing memory bandwidth, ensure you don’t involve the caches, or you’ll get misleading data.
Figure out which characteristics of the system you’re trying to measure. Identify which elements need to remain static between runs and which can (and should) change between runs. A good plan means good results.

Be careful with data capture.
The purpose of continuous performance-regression testing is not just to catch regressions but also to assist in identifying the root causes. If you don’t collect the right information during test cycles, you can burn a lot of time just trying to reproduce a previous result. It’s a good idea to capture extra git hashes or other versioning information to increase granularity during debug. You won’t use 90% of the data you collect, but bits are cheap, and when you do need them, they can save time and frustration.
We use this technique when regressing the performance of open-source projects. We run our tests every eight to 10 hours instead of against every single commit. This saves quite a few compute cycles. During this testing process, we capture commit hashes and versions of every library on the system so that we can fully reproduce the software stack if needed. The debug process for a regression then becomes executing a git bisect against the software being tested.
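A minimal sketch of that kind of per-run capture is shown below, assuming a Debian-style system and a local git checkout of the project under test; the output file name, the result fields, and the dpkg-query call are illustrative choices rather than the authors’ tooling.

import json
import platform
import subprocess
from datetime import datetime, timezone

def capture_run_metadata(repo_path, result):
    # Bundle a test result with enough versioning data to rebuild the stack.
    git_hash = subprocess.check_output(
        ["git", "-C", repo_path, "rev-parse", "HEAD"], text=True
    ).strip()
    # Installed package versions; swap dpkg-query for rpm -qa on RPM systems.
    packages = subprocess.check_output(
        ["dpkg-query", "-W", "-f", "${Package} ${Version}\n"], text=True
    ).splitlines()
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "commit": git_hash,
        "kernel": platform.release(),
        "packages": packages,
        "result": result,
    }

with open("run-metadata.json", "w") as fh:
    json.dump(capture_run_metadata(".", {"bandwidth_mib_s": 238.1}), fh, indent=2)

The commit hashes recorded this way become the good/bad endpoints handed to git bisect when a regression does appear.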
Make sure you’re measuring what you intend to measure. If you’re capturing data incorrectly, you may end up debugging an unrelated issue. For example, consider a memory bandwidth test (like Stream). If the block size of memory written during the test is smaller than the cache, the test will partially evaluate cache performance rather than raw memory bandwidth. Every workload has important configuration requirements; make sure you’re doing due diligence and being intentional with data capture.
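As a guard against that pitfall, a quick pre-flight check can compare the test’s buffer size with the last-level cache reported by the machine. The sketch below is one way to do that on Linux, reading CPU 0’s cache sizes from sysfs; the 4x safety margin and the example buffer size are assumptions, not values from the article.

from pathlib import Path

def last_level_cache_bytes():
    # Largest cache reported for cpu0 in Linux sysfs, in bytes.
    sizes = []
    for node in Path("/sys/devices/system/cpu/cpu0/cache").glob("index*/size"):
        text = node.read_text().strip()          # e.g. "36864K"
        unit = {"K": 1024, "M": 1024 * 1024}.get(text[-1], 1)
        digits = text[:-1] if text[-1] in "KM" else text
        sizes.append(int(digits) * unit)
    return max(sizes) if sizes else 0

def check_working_set(buffer_bytes, margin=4):
    llc = last_level_cache_bytes()
    if llc and buffer_bytes < margin * llc:
        raise ValueError("buffer is too close to the last-level cache size; "
                         "the run would partly measure cache, not memory bandwidth")

check_working_set(buffer_bytes=256 * 1024 * 1024)   # 256 MiB working set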
Decouple data acquisition and analysis.
Design the test strategically, but be sure the data can be leveraged beyond the initial capture. In other words, keep the raw data. Capturing and maintaining raw data will support a richer level of analysis after the fact.
In one workload, we discovered that things ran twice as fast on Ubuntu as on CentOS. We had the complete kernel configuration and software settings available as raw data, and we developed a series of studies by diffing the configurations of each OS distribution, pulled from the systems at test time. We then took those studies to bare-metal systems to verify our hypotheses. By shifting this analysis off-system onto the raw data and automating it, we saved days of engineering effort and system time. Multiply those savings across hundreds of workloads, and you impact your schedule in a very positive way.
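The same kind of study can be scripted directly against the raw captures. Below is a minimal sketch that diffs two saved kernel configuration files offline; the file paths are placeholders for wherever the raw per-run data is archived.

def load_kernel_config(path):
    # Parse a captured kernel .config file into an {option: value} map.
    options = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            options[key] = value
    return options

def diff_configs(a, b):
    # Return only the options that differ between two captured configurations.
    keys = set(a) | set(b)
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}

ubuntu = load_kernel_config("raw/ubuntu/kernel.config")
centos = load_kernel_config("raw/centos/kernel.config")
for option, (u_val, c_val) in sorted(diff_configs(ubuntu, centos).items()):
    print(f"{option}: ubuntu={u_val} centos={c_val}")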
Record and control system configurations.
It’s no secret that the hardware configuration is extremely important when measuring performance. A performance result collected on a laptop will be much different from one collected on a top-end server platform. There are even differences among deployments of the same CPU and platform. The memory configuration and speed will impact memory performance, storage technology will impact I/O performance, and many tuning factors will impact compute performance. It’s all relative to your deployed and tested configuration paired with the workload that you’re measuring.
Only regress and compare against “like” systems. What that means will differ from test program to test program, but I would go so far as to require the same exact model/SKU/version of each major hardware component. We typically will not compare regression results between two systems in which any of the major hardware components differ in these key areas.
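One way to enforce that rule is to record a fingerprint of the major components with every result and refuse to compare runs whose fingerprints differ. The sketch below illustrates the idea; the fields and the component strings are made-up examples, not a prescribed schema.

from dataclasses import dataclass

@dataclass(frozen=True)
class HardwareFingerprint:
    # Major components that must match before two results are compared.
    cpu_model: str
    memory_config: str      # e.g. "8x32GB DDR4-3200, 2 DIMMs per channel"
    storage_model: str
    platform_sku: str

def comparable(a: HardwareFingerprint, b: HardwareFingerprint) -> bool:
    # Frozen dataclasses compare field by field, so any mismatch in a major
    # component marks the systems as not "like" for regression purposes.
    return a == b

baseline = HardwareFingerprint("Xeon Gold 6338", "16x32GB DDR4-3200", "P5510 3.84TB", "R750")
candidate = HardwareFingerprint("Xeon Gold 6338", "8x32GB DDR4-3200", "P5510 3.84TB", "R750")
assert not comparable(baseline, candidate)   # memory configuration differs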
Create a standardized test format and language.
Of course, you can’t debug an issue if you can’t reproduce the result. This includes not just the test tool and system configuration but also the method of returning and interpreting the results. Methods for parsing results can vary wildly among test libraries. Communicating workload flags can become a game of telephone in which bits of information are lost as the message passes from engineer to engineer.
It’s common to see something like this communicated between engineers doing performance debug:

fio --filename=devicename --direct=1 --rw=randread --bs=4k --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=4 --time_based --group_reporting --name=iops-test-job --eta-newline=1 --readonly

This command line is specific and will generate reproducible results. It’s cryptic, however, and doesn’t lend itself to database storage or easy comparison.
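One remedy, sketched below, is to normalize the invocation into a structured record as soon as it is captured; the parsing helper and JSON output shown here are an illustration of the idea rather than an existing tool.

import json
import shlex

def fio_command_to_record(command):
    # Turn a raw fio invocation into a flat, database-friendly record.
    record = {}
    for token in shlex.split(command)[1:]:        # skip the 'fio' executable
        key, _, value = token.lstrip("-").partition("=")
        record[key] = value if value else True    # bare flags become booleans
    return record

cmd = ("fio --filename=devicename --direct=1 --rw=randread --bs=4k "
       "--ioengine=libaio --iodepth=256 --runtime=120 --numjobs=4 --time_based "
       "--group_reporting --name=iops-test-job --eta-newline=1 --readonly")
print(json.dumps(fio_command_to_record(cmd), indent=2, sort_keys=True))

Once the flags live in a record like this, runs can be stored in a database and compared field by field instead of being passed around as an opaque string.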

