
               Performance-Regression Pitfalls Every Project Should Avoid



possible” and gather information at specific points in time. This allows us to use a bisection algorithm when doing root-cause analysis. The more often we can collect data, the easier the root-cause process will be. Our tests run roughly six times per day, so the window for a regressing change to occur is roughly four hours.

Our test process requires that the system be rebuilt for every test. This ensures completely reproducible results.

Figure 2 shows a typical test process from start to finish.

You’ll notice that this is a shotgun approach to validating the software performance of the ecosystem. While a more specific or targeted approach might be preferred, there just isn’t enough computing or manpower in the world to test every permutation of software in a stack. It’s better to think of the testing process as analogous to creating a video — taking a picture at frequent-enough intervals to view performance as a process in motion. If issues arise, they can be revisited with a slow-motion camera to capture additional detail.

Figure 2: This procedure generates a rolling set of results that are captured multiple times a day (see Figure 3).

[Figure 3 chart: workload throughput in MiB/s (roughly 200 to 250), plotted from September 2019 through July 2020.]
Figure 3: Results for tests of a specific workload show the effects of specific updates (large dots) on performance. The red dot indicates the point at which the results dropped below the computed regression threshold.

KEYS TO SUCCESSFUL TESTING
Start with a strategy.
A test is only as effective as its design. Before capturing any data, think through your strategy. Start by understanding exactly what you’re trying to measure and work backward. For example, when testing memory bandwidth, ensure you don’t involve the caches, or you’ll get misleading data.

Figure out which characteristics of the system you’re trying to measure. Identify which elements need to remain static between runs and which can (and should) change between runs. A good plan means good results.
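One lightweight way to pin that plan down before any data is collected is to record, per workload, exactly what is held constant and what is allowed to vary between runs. The sketch below is purely illustrative; the workload name and field values are assumptions, not part of the article's test suite.

# Hypothetical per-workload test plan, kept alongside the raw results.
test_plan = {
    "workload": "memory-bandwidth",                    # placeholder name
    "measures": "sustained memory bandwidth in MiB/s",
    "held_constant": [
        "kernel version", "BIOS settings", "DIMM population",
        "compiler and flags", "benchmark working-set size",
    ],
    "allowed_to_vary": [
        "commit under test", "library versions pulled at build time",
    ],
}

def drifted_fields(run_metadata: dict, baseline: dict, plan: dict) -> list:
    """Return the held-constant fields whose values differ from the baseline run."""
    return [key for key in plan["held_constant"]
            if run_metadata.get(key) != baseline.get(key)]

A run is only comparable if drifted_fields() comes back empty; anything listed under "allowed_to_vary" is what a regression can then be attributed to.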
Be careful with data capture.
The purpose of continuous performance-regression testing is not just to catch regressions but also to assist in identifying the root causes. If you don’t collect the right information during test cycles, you can burn a lot of time just trying to reproduce a previous result. It’s a good idea to capture extra git hashes or other versioning information to increase granularity during debug. You won’t use 90% of the data you collect, but bits are cheap — and when you do need them, they can save time and frustration.

We use this technique when regressing the performance of open-source projects. We run our tests every eight to 10 hours instead of against every single commit. This saves quite a few compute cycles. During this testing process, we capture commit hashes and versions of every library on the system so that we can fully reproduce the software stack if needed. The debug process for a regression then becomes executing a git bisect against the software being tested.
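As a sketch of how that bisect can be driven automatically, the script below could be handed to git bisect run. The build command, benchmark invocation, output format, and threshold are hypothetical stand-ins for whatever your own harness uses; the good and bad commit hashes come from the versioning data captured at test time.

#!/usr/bin/env python3
"""Hypothetical predicate script for `git bisect run`.

    git bisect start <bad-commit> <good-commit>
    git bisect run ./bisect_check.py

Exit code 0 marks a commit good, 1 marks it bad, and 125 tells
git bisect to skip a commit that cannot be built or measured.
"""
import re
import subprocess
import sys

BASELINE_MIBS = 240.0          # throughput from the last known-good run (assumed)
REGRESSION_THRESHOLD = 0.95    # flag anything more than 5% below baseline

def run(cmd):
    return subprocess.run(cmd, shell=True, capture_output=True, text=True)

# Rebuild the project at the commit git bisect checked out (placeholder command).
if run("make -j$(nproc)").returncode != 0:
    sys.exit(125)              # this commit won't build: ask git bisect to skip it

# Run the same benchmark the periodic harness runs (placeholder command and output).
bench = run("./run_benchmark --format=plain")
if bench.returncode != 0:
    sys.exit(125)

match = re.search(r"throughput:\s*([\d.]+)\s*MiB/s", bench.stdout)
if not match:
    sys.exit(125)

measured = float(match.group(1))
print(f"measured {measured:.1f} MiB/s vs baseline {BASELINE_MIBS:.1f}")
sys.exit(0 if measured >= BASELINE_MIBS * REGRESSION_THRESHOLD else 1)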
Make sure you’re measuring what you intend to measure. If you’re capturing data incorrectly, you may end up debugging an unrelated issue. For example, consider a memory bandwidth test (like Stream). If the block size of memory written during the test is smaller than the cache, the test will partially evaluate cache performance rather than raw memory bandwidth. Every workload has important configuration requirements; make sure you’re doing due diligence and being intentional with data capture.
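A cheap sanity check along those lines is to compare the benchmark's working set against the last-level cache before trusting the number. The sketch below assumes a Linux system that exposes cache sizes under /sys, and it uses the common STREAM rule of thumb that the arrays should total at least four times the largest cache; the STREAM_ARRAY_SIZE value is an assumed build-time setting, not one from the article.

#!/usr/bin/env python3
"""Sanity-check that a STREAM-style working set actually exceeds the caches."""
from pathlib import Path

def last_level_cache_bytes() -> int:
    """Return the largest cache size reported under /sys for cpu0 (Linux only)."""
    largest = 0
    for size_file in Path("/sys/devices/system/cpu/cpu0/cache").glob("index*/size"):
        text = size_file.read_text().strip()          # e.g. "30720K"
        units = {"K": 1024, "M": 1024**2, "G": 1024**3}
        if text[-1] in units:
            largest = max(largest, int(text[:-1]) * units[text[-1]])
        else:
            largest = max(largest, int(text))
    return largest

# STREAM allocates three arrays of 8-byte doubles; the length is a build-time knob.
STREAM_ARRAY_SIZE = 10_000_000            # assumed -DSTREAM_ARRAY_SIZE value
working_set = 3 * 8 * STREAM_ARRAY_SIZE

llc = last_level_cache_bytes()
print(f"working set: {working_set / 2**20:.1f} MiB, last-level cache: {llc / 2**20:.1f} MiB")
if llc and working_set < 4 * llc:
    print("WARNING: result will partly measure cache, not raw memory bandwidth")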
Decouple data acquisition and analysis.
Design the test strategically, but be sure the data can be leveraged beyond the initial capture. In other words, keep the raw data. Capturing and maintaining raw data will support a richer level of analysis after the fact.

In one workload, we discovered that things ran twice as fast on Ubuntu as on CentOS. We had the complete kernel configuration and software settings available as raw data, and we developed a series of studies by diffing the configurations of each OS distribution as pulled from the systems at test time. We then took those studies to bare-metal systems to verify our hypotheses about the differences. By shifting this analysis to the raw data off-system and automating it, we saved days of engineering effort and system time. Multiply those savings across hundreds of workloads, and you impact your schedule in a very positive way.
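A minimal sketch of that kind of off-system analysis, assuming the raw capture includes each run's kernel and software settings as flat KEY=VALUE files (the file paths here are hypothetical):

#!/usr/bin/env python3
"""Diff two captured configuration snapshots offline, without touching the test systems."""

def load_config(path: str) -> dict:
    """Parse KEY=VALUE lines, skipping blanks and comments, into a dict."""
    settings = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition("=")
            settings[key] = value
    return settings

# Hypothetical paths to the raw data pulled from the systems at test time.
ubuntu = load_config("raw/ubuntu-20.04/kernel-config.txt")
centos = load_config("raw/centos-7/kernel-config.txt")

for key in sorted(ubuntu.keys() | centos.keys()):
    left, right = ubuntu.get(key, "<unset>"), centos.get(key, "<unset>")
    if left != right:
        print(f"{key}: ubuntu={left} centos={right}")

Each mismatched setting becomes a candidate study to replay on bare metal, which is the workflow the Ubuntu-versus-CentOS investigation followed.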
Record and control system configurations.
It’s no secret that the hardware configuration is extremely important when measuring performance. A performance result collected on a laptop will be much different from one collected on a top-end server platform. There are even differences among deployments of the same CPU and platform. The memory configuration and speed will impact memory performance, storage technology will impact I/O performance, and many tuning factors will impact compute performance. It’s all relative to your deployed and tested configuration paired with the workload that you’re measuring.

Only regress and compare against “like” systems. What that means will differ from test program to test program, but I would go so far as to require the same exact model/SKU/version of each major hardware component. We typically will not compare regression results between two systems in which any of the major hardware components differ in these key areas.
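One way to enforce that rule in an automated pipeline is to attach a hardware fingerprint to every result and refuse to regress across mismatched fingerprints. The fields and component names below are illustrative placeholders, not a complete or recommended list.

#!/usr/bin/env python3
"""Refuse to compare results from systems whose major components differ."""
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class HardwareFingerprint:
    # Illustrative set of "major components"; extend to whatever your program requires.
    platform_model: str
    cpu_sku: str
    dimm_config: str        # e.g. "12x16GB DDR4-2933"
    storage_model: str
    nic_model: str

def comparable(a: HardwareFingerprint, b: HardwareFingerprint) -> bool:
    """Only 'like' systems may be regressed against each other."""
    return a == b

reference = HardwareFingerprint("VendorX 2U ModelA", "CPU-SKU-1234", "12x16GB DDR4-2933",
                                "NVMe ModelB 1.92TB", "25GbE ModelC")
candidate = HardwareFingerprint("VendorX 2U ModelA", "CPU-SKU-1234", "12x16GB DDR4-2666",
                                "NVMe ModelB 1.92TB", "25GbE ModelC")

if not comparable(reference, candidate):
    mismatches = {key: (value, asdict(candidate)[key])
                  for key, value in asdict(reference).items()
                  if asdict(candidate)[key] != value}
    print("refusing to regress; mismatched components:", mismatches)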
Create a standardized test format and language.
Of course, you can’t debug an issue if you can’t reproduce the result. This includes not just the test tool and system configuration but also the method of returning and interpreting the results. Methods for parsing results can vary wildly among test libraries. Communicating workload flags can become a game of telephone in which bits of information are lost as the message passes from engineer to engineer.

It’s common to see something like this communicated between engineers doing performance debug:

fio --filename=devicename --direct=1 --rw=randread --bs=4k --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=4 --time_based --group_reporting --name=iops-test-job --eta-newline=1 --readonly

This command line is specific and will generate reproducible results. It’s cryptic, however, and doesn’t lend itself to database storage or easy comparison.
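One remedy along the lines the article suggests is to keep the workload definition in a structured, storable form and generate the command line from it. The sketch below expresses the same fio job as a record; the schema is invented for illustration, not a standard format.

#!/usr/bin/env python3
"""Store workload definitions as structured data and render the CLI from them."""
import json

# The fio job above, expressed as a record that can live in a database
# and be diffed or queried directly.
job = {
    "tool": "fio",
    "name": "iops-test-job",
    "params": {
        "filename": "devicename",
        "direct": 1,
        "rw": "randread",
        "bs": "4k",
        "ioengine": "libaio",
        "iodepth": 256,
        "runtime": 120,
        "numjobs": 4,
        "time_based": None,        # flags that carry no value
        "group_reporting": None,
        "eta-newline": 1,
        "readonly": None,
    },
}

def render(job: dict) -> str:
    """Rebuild a reproducible command line from the stored record."""
    args = [job["tool"], f"--name={job['name']}"]
    for key, value in job["params"].items():
        args.append(f"--{key}" if value is None else f"--{key}={value}")
    return " ".join(args)

print(render(job))                 # the command line engineers actually run
print(json.dumps(job, indent=2))   # the storable, comparable representation

Stored this way, two engineers' runs can be compared field by field in a database query instead of by eyeballing a flag string pasted into chat.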

