
A Framework for Accurate Benchmarking at Scale

We are excited to introduce our framework for benchmarking Python data analysis libraries.

At Mirai Solutions, we understand that choosing the right data analysis library can significantly impact project performance and efficiency. Our benchmarking project provides a framework to run reliable, accurate and reproducible benchmarks at scale through Python and Linux kernel tuning. The framework makes it possible to organize and measure reproducible performance comparisons between popular libraries like pandas and Polars in a stable and objective fashion, helping teams make informed decisions based on their specific use cases.

Running and measuring benchmarks is based on pyperf, executed in Conda environments for reproducibility and to manage dependencies beyond Python (e.g., to control the use of BLAS implementations like OpenBLAS or MKL). In particular, benchmarks can be run across several independent, fully reproducible Conda environments with locked dependencies, making it possible to assess the performance of alternative libraries and library versions.

System tuning for accuracy and stability

Accurate benchmarking requires more than running code and timing it: the hardware resources must also be shielded from excessive overhead caused by other processes running outside the benchmark measurements. Besides using statistical measures over repeated executions, the benchmarking framework targets several kernel-level settings to tune the system for stable and accurate benchmarking, so that the raw performance of calculations can be assessed reliably.

Although pyperf provides some form of system tuning (via pyperf system tune), our framework targets kernel-level settings directly and explicitly, relying on pyperf to orchestrate the repeated execution and to collect measurement data and statistics.

Our approach to system tuning incorporates several key kernel-level techniques:

  • Disabling simultaneous multithreading (SMT): eliminates/minimizes resource sharing (execution units, caches, memory bandwidth, branch predictors, …) and scheduling variability.
  • Disabling address space randomization: ensures program memory layouts remain identical between runs, eliminating variability in code and data placement that can affect caching, branch prediction, and timing.
  • Disabling Intel P-state turbo mode: locks the CPU frequency to a fixed, stable level, preventing dynamic clock boosts that can vary with temperature, power, and workload.
  • Disabling processor frequency boosting: prevents the CPU from dynamically changing its clock speed based on load or thermal conditions, ensuring a fixed, predictable frequency that eliminates run-to-run performance variation.
  • Setting scaling governor to performance: forces the processor to run at its maximum fixed frequency, preventing power-saving frequency scaling that can introduce timing variability.
  • Defining a dedicated control group: isolates CPU, memory, and I/O resources used by the benchmark process from other system processes, preventing interference and contention (see also Benchmark CPUs below).

Some settings are system-/architecture-specific and are only tuned if supported, with the original settings always restored upon completion.
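
The following is a minimal Python sketch of the underlying mechanism (the framework itself performs the tuning from the benchmark.sh shell script): each setting is exposed as a plain sysfs/procfs file, so tuning amounts to saving the current value, writing the desired one, and restoring the original afterwards. The paths and values are those reported in the example output further below; the helper functions are purely illustrative.

from pathlib import Path

# Desired values for benchmarking; paths as reported in the framework's output below.
SETTINGS = {
    "/sys/devices/system/cpu/smt/control": "off",          # disable SMT
    "/proc/sys/kernel/randomize_va_space": "0",            # disable address space randomization
    "/sys/devices/system/cpu/intel_pstate/no_turbo": "1",  # disable Intel P-state turbo mode
}

def tune(settings):
    """Apply the settings (requires root) and return the original values for later restore."""
    original = {}
    for path, value in settings.items():
        f = Path(path)
        if not f.exists():  # skip settings not supported on this system/architecture
            continue
        original[path] = f.read_text().strip()
        f.write_text(value)
    return original

def restore(original):
    """Write back the values captured before tuning."""
    for path, value in original.items():
        Path(path).write_text(value)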

Benchmark CPUs

A key aspect of accuracy and stability is defining a proper set of logical CPUs to run benchmarks on. This choice depends on the specific machine architecture, and one would usually want to

  • select a single logical CPU per core to avoid hyper-threading;
  • only consider performance cores (P-cores) on Intel architectures that have both P-cores and efficiency cores (E-cores).

In this context, the logical CPUs relevant for benchmarking can be defined by the following characteristics:

  • Smallest CPU number within each core; i.e., avoid hyper-threaded logical CPUs sharing the same core
  • CPUs with the largest maximum MHz; i.e., handle scenarios with different types of cores such as Intel P-cores and E-cores (where the latter run at a lower frequency)
  • CPUs that are actually online

The framework tries to infer relevant CPUs with the characteristics above. However, this may not be the best choice for the specific machine at hand. Therefore, the user should specify the benchmark CPUs explicitly when calling the framework to match the target processor and architecture (see Running benchmarks).
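
As an illustration of these criteria, the following rough sketch infers candidate CPUs from the standard Linux sysfs topology and cpufreq information. It is not the framework's actual inference logic, and multi-socket systems would additionally need to account for the physical package; it merely shows how the characteristics above translate into code.

from pathlib import Path

def candidate_benchmark_cpus():
    """One online logical CPU per core, restricted to the cores with the highest max frequency."""
    # Online CPUs are listed as ranges, e.g. "0-3,8-11".
    online = Path("/sys/devices/system/cpu/online").read_text().strip()
    cpus = []
    for part in online.split(","):
        lo, _, hi = part.partition("-")
        cpus.extend(range(int(lo), int(hi or lo) + 1))

    # Collect core id and maximum frequency (kHz) for each online CPU.
    info = {}
    for cpu in cpus:
        base = Path(f"/sys/devices/system/cpu/cpu{cpu}")
        core = int((base / "topology/core_id").read_text())
        max_khz = int((base / "cpufreq/cpuinfo_max_freq").read_text())
        info[cpu] = (core, max_khz)

    # Keep only cores running at the largest maximum frequency (e.g. skip Intel E-cores) ...
    top_khz = max(khz for _, khz in info.values())
    # ... and the smallest CPU number per core (avoid hyper-threaded siblings).
    per_core = {}
    for cpu, (core, khz) in sorted(info.items()):
        if khz == top_khz:
            per_core.setdefault(core, cpu)
    return sorted(per_core.values())

print(",".join(str(cpu) for cpu in candidate_benchmark_cpus()))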

Running benchmarks

Our framework is available on GitHub at https://github.com/miraisolutions/benchmarking. From the project checkout, the entry point benchmark.sh runs against a root directory, where files benches/env_<ENV_NAME>/bench_<BENCH_NAME>.py define multiple benchmarks across different environments (see below for full details about the expected structure):

./benchmark.sh -d <ROOT_DIR> --cpus <i>,<j>,...

Since the Linux kernel tuning modifies operating system settings that are only accessible with elevated permissions via sudo, the script prompts for the user’s password.

Benchmark progress and summary results are reported in the terminal, while full benchmark results in pyperf JSON format are produced in <ROOT_DIR> as benchmark_results/env_<ENV_NAME>/bench_<BENCH_NAME>.json. These JSON files can be used to visualize basic statistics with pyperf (e.g. via pyperf stats bench_<BENCH_NAME>.json), but can also be processed and analyzed further to produce accurate and extensive comparison reports.
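
For example, pyperf’s Python API can be used to load the result files and compare the same benchmark across environments. The sketch below assumes the results of the cashflow example shown later in this post; the exact analysis and reporting is up to the user.

import pyperf

# Load the same benchmark from two environments (paths follow the layout described above).
pandas_bench = pyperf.Benchmark.load(
    "cashflow/benchmark_results/env_pandas/bench_loss_stats_by_segment.json"
)
polars_bench = pyperf.Benchmark.load(
    "cashflow/benchmark_results/env_polars/bench_loss_stats_by_segment.json"
)

for env, bench in [("pandas", pandas_bench), ("polars", polars_bench)]:
    print(f"{env} {bench.get_name()}: {bench.mean():.3f} s +- {bench.stdev():.3f} s")

print(f"speed-up: {pandas_bench.mean() / polars_bench.mean():.1f}x")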

Argument --cpus is key to accurate benchmarking, as it specifies the comma-separated list of logical CPU numbers to run the benchmark on (see Benchmark CPUs above). The entry point has additional command-line arguments to consider only specific environments/benchmarks, and to control pyperf’s iterations if needed. See ./benchmark.sh --help for details.

Benchmarks definition and structure

Several benchmarks can be defined and run across different environments, under a directory benches/ with the following structure:

<ROOT_DIR>
└── benches
    ├── env_<ENV_NAME>
    │   ├── bench_<BENCH_NAME>.py
    │   ├── bench_<...>.py
    │   └── conda
    │       └── conda-linux-64.lock
    └── env_<...>

  • The same filename bench_<BENCH_NAME>.py can appear under different env_<ENV_NAME>, allowing the same benchmark to be run and compared across different environments.
  • Several benchmarks can be defined for a given environment, provided they all rely on the dependencies/libraries the environment specifies.
  • Each benchmark file is executed via python bench_<BENCH_NAME>.py (in the corresponding environment), and should define a pyperf.Runner used to benchmark the relevant code (see the sketch after this list).
  • Each environment is defined by an explicit specification/lock file conda/conda-linux-64.lock, so it can be created in a fully reproducible manner. File conda-linux-64.lock contains the list of resolved platform-specific dependencies, and can be obtained via conda list --explicit --md5 (see ‘Building identical conda environments’), or managed by conda-lock.
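
A minimal, hypothetical benchmark file could look as follows; the computation is a mere placeholder, whereas an actual benchmark would exercise the library the environment provides.

# benches/env_<ENV_NAME>/bench_example.py (hypothetical)
import pyperf

def workload():
    # Placeholder computation; a real benchmark would call the library code under test.
    return sum(i * i for i in range(100_000))

runner = pyperf.Runner()
runner.bench_func("example", workload)
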
Example: Simulated cash flow data

An example is included in the project under cashflow/, where benchmarks for computing statistics on simulated cash flow data are defined across alternative pandas and Polars implementations.
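
To give a flavor of what such alternative implementations look like, the same kind of grouped statistic can be expressed in both libraries as below. Note that the column names and aggregations are made up for illustration and are not taken from the actual cashflow benchmarks.

import pandas as pd
import polars as pl

data = {"legal_entity": ["A", "B", "A", "B"], "loss": [10.0, 20.0, 30.0, 40.0]}

# pandas implementation
pandas_stats = pd.DataFrame(data).groupby("legal_entity")["loss"].agg(["mean", "sum"])

# Polars implementation
polars_stats = pl.DataFrame(data).group_by("legal_entity").agg(
    pl.col("loss").mean().alias("mean"),
    pl.col("loss").sum().alias("sum"),
)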

The following shows the execution output from the framework, including details about the applied system tuning:

$ ./benchmark.sh -d cashflow --cpus 0,1,2,3

  🛠 Parsed command-line arguments
   CPUs: 0 1 2 3
   Root directory: cashflow

  🛠 CPUs for benchmarking: 0 1 2 3

  🛠 Tune system for benchmarking
  [sudo] password for xxx: *

  🛠 Disabling simultaneous multithreading (SMT)
   'on'=>'off' (/sys/devices/system/cpu/smt/control)

  🛠 Disabling address space randomization
   '2'=>'0' (/proc/sys/kernel/randomize_va_space)

  🛠 Disabling Intel P-state turbo mode
   '0'=>'1' (/sys/devices/system/cpu/intel_pstate/no_turbo)

  🛠 Setting up dedicated benchmark cgroup
   Detected cgroups v2 (unified hierarchy)
   Creating benchmark cgroup (/sys/fs/cgroup/benchmark)
   Setting cpuset with CPUs 0 1 2 3 (/sys/fs/cgroup/benchmark/cpuset.cpus)
   Setting as partition root for exclusive CPU access (/sys/fs/cgroup/benchmark/cpuset.cpus.partition)
   Setting memory node 0 (/sys/fs/cgroup/benchmark/cpuset.mems)

  🛠 Setting scaling governor for CPUs 0 1 2 3
   CPU 0: 'powersave'=>'performance' (/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor)
   CPU 1: 'powersave'=>'performance' (/sys/devices/system/cpu/cpu1/cpufreq/scaling_governor)
   CPU 2: 'powersave'=>'performance' (/sys/devices/system/cpu/cpu2/cpufreq/scaling_governor)
   CPU 3: 'powersave'=>'performance' (/sys/devices/system/cpu/cpu3/cpufreq/scaling_governor)

  🛠 Attaching current shell PID 7927 to the benchmark cgroup
   cgroups v2: /sys/fs/cgroup/benchmark/cgroup.procs
   cgroup for PID 7927: 0::/benchmark

  🛠 System tuning completed

  🚀 Running benchmarks
   Benchmarks directory: cashflow/benches
   Found benchmark environments: pandas polars

  🚀 Benchmark environment: pandas
   Found benches: loss_stats_by_legal_entity loss_stats_by_segment
   Creating benchmark environment: pandas

  🚀 Running benchmark loss_stats_by_legal_entity for environment pandas
   Bench script path: cashflow/benches/env_pandas/bench_loss_stats_by_legal_entity.py
   Bench results path: cashflow/benchmark_results/env_pandas/bench_loss_stats_by_legal_entity.json
  ...........
  loss_stats_by_legal_entity: Mean +- std dev: 1.47 sec +- 0.01 sec

  🚀 Running benchmark loss_stats_by_segment for environment pandas
   Bench script path: cashflow/benches/env_pandas/bench_loss_stats_by_segment.py
   Bench results path: cashflow/benchmark_results/env_pandas/bench_loss_stats_by_segment.json
  ...........
  loss_stats_by_segment: Mean +- std dev: 1.13 sec +- 0.01 sec

  🚀 Benchmark environment: polars
   Found benches: loss_stats_by_legal_entity loss_stats_by_segment
   Creating benchmark environment: polars

  🚀 Running benchmark loss_stats_by_legal_entity for environment polars
   Bench script path: cashflow/benches/env_polars/bench_loss_stats_by_legal_entity.py
   Bench results path: cashflow/benchmark_results/env_polars/bench_loss_stats_by_legal_entity.json
  ...........
  loss_stats_by_legal_entity: Mean +- std dev: 275 ms +- 10 ms

  🚀 Running benchmark loss_stats_by_segment for environment polars
   Bench script path: cashflow/benches/env_polars/bench_loss_stats_by_segment.py
   Bench results path: cashflow/benchmark_results/env_polars/bench_loss_stats_by_segment.json
  ...........
  loss_stats_by_segment: Mean +- std dev: 270 ms +- 11 ms

  🧹 Revert system tuning changes
  [sudo] password for xxx: *

  🧹 Restoring initial scaling governor state
   CPU 3: 'powersave' (/sys/devices/system/cpu/cpu3/cpufreq/scaling_governor)
   CPU 2: 'powersave' (/sys/devices/system/cpu/cpu2/cpufreq/scaling_governor)
   CPU 1: 'powersave' (/sys/devices/system/cpu/cpu1/cpufreq/scaling_governor)
   CPU 0: 'powersave' (/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor)

  🧹 Restoring initial simultaneous multithreading (SMT) state
   'on' (/sys/devices/system/cpu/smt/control)

  🧹 Restoring initial address space randomization state
   '2' (/proc/sys/kernel/randomize_va_space)

  🧹 Restoring initial Intel p-state turbo mode state
   '0' (/sys/devices/system/cpu/intel_pstate/no_turbo)

  🧹 Cleaning up benchmark cgroup
   Moving PID 7927 (current shell) out of benchmark cgroup
   Removing benchmark cgroup (/sys/fs/cgroup/benchmark)

  🧹 Reverting system tuning changes completed

  🚀 Benchmarks completed successfully
  

Benefits of our approach

We believe accurate benchmarking at scale is key for teams and individuals to make informed decisions about data analysis libraries and corresponding calculation approaches for different analytical workloads, to assess alternative implementations, and to monitor the effect of alternative library versions. Our framework enables this by making it easy to define a number of different calculations across alternative implementations, for arbitrary fully reproducible, isolated environments. Running the main benchmark.sh entry point orchestrates the creation of Conda environments and the repeated execution of the defined calculations, with detailed benchmark measurements collected for further analysis and assessment.

The approach to benchmarking used in our framework also has an explicit focus on key system tuning aspects. This helps the user understand the importance of such configuration for producing stable and accurate results across different machines and architectures.

Finally, it is important to note that precise and statistically sound reporting of benchmarking results is crucial for trustworthy and interpretable performance analyses. Hoefler & Belli (2015) [doi:10.1145/2807591.2807644] show that many studies neglect rigorous statistical methods, which can lead to misleading conclusions. In this regard, accurate statistical reporting is enabled by the detailed measurements recorded as JSON files in the benchmark_results folder, which include metadata that are relevant when reporting results (such as CPU model, platform/kernel version, compiler version, clock accuracy, etc.).
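
As a sketch, such metadata can also be retrieved programmatically via pyperf, e.g. for inclusion in a report (the file path refers to the cashflow example above):

import pyperf

bench = pyperf.Benchmark.load(
    "cashflow/benchmark_results/env_polars/bench_loss_stats_by_segment.json"
)
for key, value in sorted(bench.get_metadata().items()):
    print(f"{key}: {value}")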