Statistical Methods for Reliable Benchmarks
Benchmarking is critical for performance-sensitive code. Yet most developers approach it with surprisingly crude methods: run some code, measure the time, and compare the average against that of another implementation. This approach is fundamentally flawed, and the numbers it produces can be actively misleading.
The good news is that there are simple statistical techniques that give us a much better understanding of how code actually performs. These techniques apply to every language, but for this post I will focus on Dart. I have written a package called benchmark_harness_plus that implements everything discussed here.
The Problem with Averages
Consider a simple benchmark that runs 10 times:
Run 1: 5.0 us
Run 2: 5.1 us
Run 3: 4.9 us
Run 4: 5.0 us
Run 5: 5.2 us
Run 6: 4.8 us
Run 7: 5.0 us
Run 8: 5.1 us
Run 9: 4.9 us
Run 10: 50.0 us <- GC pause
The mean (average) is 9.5 us. But does this represent typical performance? Absolutely not. Nine out of ten runs completed in about 5 us. The mean is nearly double the actual typical performance because a single garbage collection pause skewed everything.
This is not a contrived example. GC pauses, OS scheduling, CPU throttling, and background processes constantly interfere with measurements. In real benchmarks, outliers are the norm, not the exception.
The Solution: Median
The median is the middle value when samples are sorted. For the data above:
Sorted: [4.8, 4.9, 4.9, 5.0, 5.0, 5.0, 5.1, 5.1, 5.2, 50.0]
Median: 5.0 us (average of the two middle values)
The median correctly reports 5.0 us, completely ignoring the outlier. This is why benchmark_harness_plus uses median as the primary comparison metric.
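To make this concrete, here is a minimal Dart sketch that computes both statistics for the ten samples above (standalone helpers for illustration, not part of the package):

double mean(List<double> xs) => xs.reduce((a, b) => a + b) / xs.length;

double median(List<double> xs) {
  final sorted = [...xs]..sort();
  final mid = sorted.length ~/ 2;
  // Even count: average the two middle values. Odd count: take the middle one.
  return sorted.length.isEven ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid];
}

void main() {
  final samples = [5.0, 5.1, 4.9, 5.0, 5.2, 4.8, 5.0, 5.1, 4.9, 50.0];
  print('mean:   ${mean(samples)} us');   // 9.5 us, dragged up by the GC pause
  print('median: ${median(samples)} us'); // 5.0 us, the typical run
}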
When to look at mean vs median:
The relationship between mean and median tells you about your data distribution:
- Mean ≈ Median: Symmetric distribution, no significant outliers
- Mean > Median: High outliers present (common in benchmarks, typically caused by GC pauses and OS scheduling)
- Mean < Median: Low outliers present (rare, might indicate measurement issues)
When you still need the mean
As editor_of_the_beast pointed out to me, referencing Marc Brooker's post Two Places the Mean Isn't Useless, the mean remains essential for capacity planning and throughput calculations. If you want to know how many requests per second your system can handle, you need the mean latency, outliers and all. Those GC pauses consume real time and affect actual throughput.
Little's Law (L = λ × W) only works with means, not medians or percentiles. If you need to calculate how many concurrent connections you can sustain, or how much buffer space you need, the mean is irreplaceable.
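As a quick worked example with made-up numbers: at λ = 200 requests per second and a mean latency of W = 0.05 seconds, Little's Law gives L = 200 × 0.05 = 10 requests in flight on average, so that is the concurrency you have to provision for; plugging in a lower median latency would understate it.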
The distinction is this: for comparing which implementation is faster under typical conditions, use the median. For calculating system capacity where every millisecond counts toward the total, use the mean.
But how do I know if I can trust the results?
This is the question most benchmarking tools fail to answer. You get a number, but is it reliable? Could the next run produce something completely different?
The answer is the Coefficient of Variation (CV%).
CV% expresses the standard deviation as a percentage of the mean:
CV% = (standard deviation / mean) * 100
This normalizes variance across different scales. A standard deviation of 1.0 means very different things for a measurement of 10 us versus 1000 us. CV% makes them comparable.
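A minimal sketch of the calculation (using the sample standard deviation; these helpers are for illustration and are not the package's API):

import 'dart:math' as math;

double mean(List<double> xs) => xs.reduce((a, b) => a + b) / xs.length;

// Sample standard deviation (divides by n - 1).
double stdDev(List<double> xs) {
  final m = mean(xs);
  final sumSq = xs.map((x) => (x - m) * (x - m)).reduce((a, b) => a + b);
  return math.sqrt(sumSq / (xs.length - 1));
}

double cvPercent(List<double> xs) => stdDev(xs) / mean(xs) * 100;

void main() {
  final samples = [5.0, 5.1, 4.9, 5.0, 5.2, 4.8, 5.0, 5.1, 4.9, 50.0];
  // Roughly 150%: the single GC pause makes this measurement mostly noise.
  print('CV%: ${cvPercent(samples).toStringAsFixed(1)}');
}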
Trust thresholds:
| CV% | Reliability | What it means |
|---|---|---|
| < 10% | Excellent | Highly reliable. Trust exact ratios. |
| 10-20% | Good | Rankings are reliable. Ratios are approximate. |
| 20-50% | Moderate | Directional only. You know which is faster, but not by how much. |
| > 50% | Poor | Unreliable. The measurement is mostly noise. |
When benchmark_harness_plus reports CV% > 50%, it warns you explicitly. You should not trust those numbers.
The Complete Picture
Here is what proper benchmark output looks like:
Variant | median | mean | fastest | stddev | cv% | vs base
--------------------------------------------------------------------------------
growable | 1.24 | 1.31 | 1.05 | 0.15 | 11.5 | -
fixed-length | 0.52 | 0.53 | 0.50 | 0.02 | 3.8 | 2.38x
generate | 0.89 | 0.91 | 0.85 | 0.04 | 4.4 | 1.39x
(times in microseconds per operation)
How to read this:
- Check CV% first. All values are under 20%, so these measurements are reliable.
- Compare medians. fixed-length (0.52 us) is fastest, growable (1.24 us) is slowest.
- Look at mean vs median. The growable variant has mean (1.31) > median (1.24), suggesting some high outliers. The others are close, indicating symmetric distributions.
- Check the ratios. fixed-length is 2.38x faster than growable. Because both have good CV%, this ratio is trustworthy.
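If you want to recompute the ratio column yourself, my reading of the table is that it is the baseline's median divided by the variant's median:

void main() {
  const baselineMedian = 1.24; // growable, us per operation
  const variantMedian = 0.52;  // fixed-length, us per operation
  print('${(baselineMedian / variantMedian).toStringAsFixed(2)}x'); // 2.38x
}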
What benchmark_harness_plus does differently
The standard benchmark_harness package reports a single mean value. benchmark_harness_plus implements several statistical best practices:
1. Multiple Samples
Instead of one measurement, the package collects multiple independent samples (default: 10). Each sample times many iterations of the code, then records the average time per operation. This gives us enough data points to compute meaningful statistics.
2. Proper Warmup
Before any measurements, each variant runs through a warmup phase (default: 500 iterations). This allows:
- The Dart VM to JIT-compile hot paths
- CPU caches to warm up
- Lazy initialization to complete
Warmup results are discarded entirely.
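Put together, the warmup and sampling phases look roughly like this (a simplified sketch of the idea, not the package's actual implementation):

// Warm up, then collect independent samples; each sample times many
// iterations and records the average time per operation.
List<double> measure(
  void Function() run, {
  int warmupIterations = 500,
  int iterations = 1000,
  int samples = 10,
}) {
  // Warmup: let the JIT compile hot paths; these results are discarded.
  for (var i = 0; i < warmupIterations; i++) {
    run();
  }

  final perOpMicros = <double>[];
  for (var s = 0; s < samples; s++) {
    final sw = Stopwatch()..start();
    for (var i = 0; i < iterations; i++) {
      run();
    }
    sw.stop();
    perOpMicros.add(sw.elapsedMicroseconds / iterations);
  }
  return perOpMicros;
}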
3. Randomized Ordering
By default, the order of variants is randomized for each sample. This reduces systematic bias from:
- CPU frequency scaling
- Thermal throttling
- Memory pressure changes over time
If variant A always runs before variant B, the second variant might consistently benefit from (or suffer from) the state left by the first.
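In sketch form, randomized ordering is nothing more than reshuffling the variant list before each sample (again, illustrative code rather than the package's internals):

import 'dart:math';

void main() {
  final variants = ['growable', 'fixed-length', 'generate'];
  final rng = Random();
  for (var sample = 0; sample < 10; sample++) {
    // A fresh shuffle per sample spreads slow drift (thermal throttling,
    // frequency scaling) across variants instead of always hitting the same one.
    final order = [...variants]..shuffle(rng);
    print('sample $sample: $order');
  }
}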
4. Reliability Assessment
Every result includes CV%, and the package provides a reliability property that categorizes results as excellent, good, moderate, or poor. You no longer have to guess whether your numbers are meaningful.
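The thresholds from the table above map directly onto a small classifier. A sketch of the idea (the enum here is my own; the package's reliability property may differ in naming):

enum Reliability { excellent, good, moderate, poor }

Reliability classify(double cvPercent) {
  if (cvPercent < 10) return Reliability.excellent;
  if (cvPercent < 20) return Reliability.good;
  if (cvPercent < 50) return Reliability.moderate;
  return Reliability.poor;
}

void main() {
  print(classify(3.8));  // Reliability.excellent
  print(classify(11.5)); // Reliability.good
  print(classify(65.0)); // Reliability.poor
}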
Usage
import 'package:benchmark_harness_plus/benchmark_harness_plus.dart';

void main() {
  final benchmark = Benchmark(
    title: 'List Creation',
    variants: [
      BenchmarkVariant(
        name: 'growable',
        run: () {
          final list = <int>[];
          for (var i = 0; i < 100; i++) {
            list.add(i);
          }
        },
      ),
      BenchmarkVariant(
        name: 'fixed-length',
        run: () {
          final list = List<int>.filled(100, 0);
          for (var i = 0; i < 100; i++) {
            list[i] = i;
          }
        },
      ),
    ],
  );

  final results = benchmark.run(log: print);
  printResults(results, baselineName: 'growable');
}
The package includes three configuration presets:
BenchmarkConfig.quick // Fast feedback during development
BenchmarkConfig.standard // Normal benchmarking (default)
BenchmarkConfig.thorough // Important performance decisions
You can also create custom configurations:
BenchmarkConfig(
  iterations: 5000,
  samples: 20,
  warmupIterations: 1000,
  randomizeOrder: true,
)
When Measurements Are Unreliable
If you see CV% values above 50%, your measurements are dominated by noise. Common causes:
Sub-microsecond operations. Very fast code is inherently difficult to measure accurately. Timer resolution becomes a limiting factor. Solution: increase iterations so each sample takes at least 10ms.
System interference. Background processes, browser tabs, other applications. Solution: close unnecessary programs, or accept that some variance is unavoidable.
Inconsistent input. If the code under test behaves differently based on input, and you are using random input, variance will be high. Solution: use deterministic test data.
The operation is genuinely variable. Some code has inherently variable performance (cache-dependent algorithms, I/O, network calls). In these cases, high CV% is not a measurement problem; it is telling you something true about the code.
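For the sub-microsecond case in particular, one simple remedy is to calibrate the iteration count from a short pilot run so that each sample comfortably exceeds timer resolution. A sketch of that idea (not a feature of the package):

import 'dart:math' as math;

// Pick an iteration count so that one sample takes at least ~10 ms,
// based on a rough pilot measurement of the operation.
int calibrateIterations(
  void Function() run, {
  int pilotIterations = 1000,
  int targetMicros = 10000,
}) {
  final sw = Stopwatch()..start();
  for (var i = 0; i < pilotIterations; i++) {
    run();
  }
  sw.stop();
  // Guard against a zero reading when the pilot run is below timer resolution.
  final perOpMicros = math.max(sw.elapsedMicroseconds, 1) / pilotIterations;
  return (targetMicros / perOpMicros).ceil();
}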
Summary
The core techniques are simple:
- Use median, not mean. Median ignores outliers.
- Collect multiple samples. One measurement tells you almost nothing.
- Report CV%. Know whether you can trust your results.
- Warm up before measuring. Let the JIT do its work.
- Randomize variant order. Reduce systematic bias.
These principles apply to any language. For Dart, benchmark_harness_plus implements all of them with sensible defaults.
The package is available at pub.dev/packages/benchmark_harness_plus.
Addendum: The Case for the Fastest Time
Bob Nystrom from the Dart language team pointed out that the fastest time has a special property: it is an existence proof. If the machine ran the code that fast once, that represents what the code is actually capable of. Noise from GC, OS scheduling, and other interference can only add time, never subtract it. The minimum filters out that external noise and shows the algorithm's true potential.
This approach works well when comparing pure algorithms where you want to isolate the code's performance from system interference. For more complex cases involving throughput or real-world conditions, the noise is part of what you are measuring and should not be filtered out.
I have added a "fastest" column to benchmark_harness_plus (as of version 1.1.0) so this metric is now visible alongside median and mean.
Different Metrics for Different Questions
What has become clear from these discussions is that different metrics answer different questions:
- Fastest (minimum): "How fast can this code run?" An existence proof of capability. Best for comparing pure algorithms where you want to isolate the code from system noise.
- Median: "How fast does this code typically run?" Robust against outliers. Best for understanding typical performance under normal conditions.
- Mean (average): "What is the total time cost?" Essential for capacity planning and throughput calculations where every millisecond counts toward the total.
There seems to be a gap in how we talk about benchmarking. We use the same word for very different activities: comparing algorithm efficiency, measuring system throughput, profiling latency distributions, and capacity planning. Each requires different statistical treatment, yet we often reach for the same crude tools.
Perhaps what we need is a clearer taxonomy of benchmarking types, with explicit guidance on which metrics matter for each. The fastest time, the median, and the mean are all valuable, but they answer fundamentally different questions. Knowing which question you are asking is the first step to getting a meaningful answer.
On GC Triggering
An earlier version of this package attempted to trigger garbage collection between variants by allocating and discarding memory. Vyacheslav Egorov from the Dart Compiler team pointed out that this is counterproductive: the GC is a complicated state machine driven by heuristics, and allocations can cause it to start concurrent marking, introducing more noise rather than reducing it.
The GC triggering logic has been removed as of version 1.2.0. A better approach for Dart 3.11+ is to use the dart:developer NativeRuntime API to record timeline events and check whether any GC occurred during the benchmark run, making GC visibility part of the report rather than trying to prevent it.