Benchmarking and profiling without lying to yourself

Production problem

Performance discussions become expensive when teams confuse numbers with evidence. A benchmark reports a 20 percent win, but the production service does not improve. A profiler shows a hot function, but the real issue is lock contention or off-CPU waiting. A regression slips into main because the benchmark measured the wrong input shape or because someone “optimized” dead work the compiler already removed.

This chapter is about measurement discipline. The previous chapters covered representation and cost modeling. Here the question is different: how do you gather evidence that is strong enough to change code, justify complexity, or reject a supposed optimization? The answer requires benchmark design, profiler literacy, and a refusal to let attractive charts substitute for causal reasoning.

Performance work in modern C++ is especially vulnerable to self-deception because the language exposes many powerful local transformations. You can change container type, ownership shape, inlining boundaries, allocator strategy, range pipelines, coroutine structure, and type-erasure choices. Some of those changes matter. Many do not. Measurement is how you tell the difference.

Choose the right instrument for the question

Not every performance question should start with a microbenchmark.

If you are deciding between two data representations for a tight loop, a controlled benchmark may be exactly right. If you are debugging why request latency spikes under burst load, profiling a realistic system or gathering production traces is more appropriate. If you suspect lock contention, scheduler behavior and blocking time matter more than a standalone throughput loop. If a regression appears only in end-to-end service traffic, a synthetic isolated benchmark may actively mislead.

Use a simple hierarchy:

Use microbenchmarks for narrowly scoped, well-isolated questions.
Use profilers for discovering where time or samples actually go in a process.
Use production-like load tests for interactions among queues, threads, I/O, caches, and contention.
Use production observability to confirm the change matters in the environment that pays the bill.

Confusing these layers is one of the fastest ways to waste weeks.

A benchmark must state its claim

A trustworthy benchmark starts with a sentence, not with code. “Compare lookup latency of sorted contiguous storage versus hash lookup for a read-mostly route table of 1k, 10k, and 100k entries with realistic key lengths.” That is a claim. “Benchmark containers” is not.

The sentence should identify:

The operation under test.
The data size and distribution.
The mutation versus read ratio.
The machine or environment assumptions that matter.
The decision the benchmark is supposed to inform.

If you cannot write that sentence, you do not yet know what the benchmark means.

This requirement matters because performance is workload-shaped. A benchmark over random integer keys may tell you nothing about a production router that uses string views with strong prefix locality. A benchmark over uniform hash hits may hide the collision behavior of real skewed keys. A benchmark that rebuilds a container every iteration may punish a design whose production cost is dominated by lookup after a one-time build.

Benchmark the whole relevant operation

One common lie is measuring a convenient fragment instead of the decision boundary that matters. For example, a parsing pipeline is “optimized” by measuring only token conversion after the input is already resident in cache and after allocations are pre-reserved. A container comparison measures lookup while excluding construction, sorting, deduplication, and memory reclamation even though the production workload rebuilds the structure frequently.

A benchmark does not need to be huge. It does need to include the costs the design imposes in reality. If an API choice forces allocation, copying, hashing, or validation before the line you currently measure, those costs belong in the measurement unless you can justify excluding them.

This is where the cost model from Chapter 16 should inform the benchmark. Measure the actual boundary where the costs accumulate. Otherwise the result is technically correct and operationally useless.

In a real project, the benchmarkable surfaces are often less obvious than vector<int> but more instructive. Consider the example project in examples/web-api/:

json::serialize_array() (json.cppm) iterates a range and builds a JSON array by repeated string concatenation. Benchmarking this function across varying collection sizes (10, 100, 1000 tasks) would reveal whether the concatenation strategy or the to_json() cost per element dominates, and whether pre-reserving the result string matters.
TaskRepository::find_by_id() (repository.cppm) performs a linear scan with std::ranges::find under a shared_lock. A benchmark comparing this against an unordered_map<TaskId, Task> alternative would need to include the locking cost and test at realistic repository sizes – not just the raw find.
Router::to_handler() (router.cppm) captures the route table by value and returns a lambda that performs a linear scan on every request dispatch. Benchmarking route dispatch at 5, 50, and 500 registered routes would show whether the linear scan remains acceptable or whether a sorted-vector or trie approach is justified. The benchmark must include the full dispatch path – matching method, comparing patterns, and invoking the handler – not just the loop.

These are the kinds of benchmark claims worth writing in plain language before writing code: “Compare route dispatch latency with linear scan versus sorted vector at 5, 50, and 500 routes with realistic HTTP method and path distributions.”

Control for compiler and harness artifacts

C++ can produce especially misleading microbenchmarks because the optimizer is extremely willing to remove, fold, hoist, and vectorize code that is not anchored to observable behavior. Benchmark harnesses exist partly to prevent this, but they do not eliminate the need for skepticism.

At minimum:

Ensure results are consumed in a way the compiler cannot elide.
Separate one-time setup from per-iteration work deliberately.
Warm up enough to avoid measuring first-touch effects accidentally.
Control data initialization so each iteration exercises the intended branch and cache behavior.
Inspect generated code when a surprising result appears.

If a benchmark claims that a complex operation takes almost no time, assume the optimizer removed work until proven otherwise. If a benchmark shows enormous variance, assume the environment is unstable or the workload is underspecified until proven otherwise.

Flawed benchmark: dead code elimination

This is the single most common microbenchmark lie. The compiler sees that a result is never used and removes the computation entirely:

// BROKEN: the compiler may eliminate the entire loop because
// 'total' is never observed.
static void BM_bad_dce(benchmark::State& state) {
    std::vector<double> data(1'000'000, 1.0);
    for (auto _ : state) {
        double total = 0.0;
        for (double d : data)
            total += d * d;
        // total is dead.  Optimizer removes the loop.
        // Benchmark reports ~0 ns/iteration.
    }
}

// FIXED: benchmark::DoNotOptimize prevents the compiler from
// proving the result is unused.
static void BM_good_dce(benchmark::State& state) {
    std::vector<double> data(1'000'000, 1.0);
    for (auto _ : state) {
        double total = 0.0;
        for (double d : data)
            total += d * d;
        benchmark::DoNotOptimize(total);
    }
}

benchmark::DoNotOptimize is not magic. On most implementations it acts as an opaque read of the value (often an inline asm that the compiler treats as potentially observing the variable). Use it on the final result, not on every intermediate step, or you risk inhibiting legitimate optimizations the production code would also benefit from. If you are unsure whether DCE is affecting your results, compile with -S and inspect the assembly.

Flawed benchmark: measuring setup instead of work

// BROKEN: construction cost dominates. The benchmark is
// measuring vector allocation and initialization, not lookup.
static void BM_bad_lookup(benchmark::State& state) {
    for (auto _ : state) {
        std::vector<int> v(1'000'000);
        std::iota(v.begin(), v.end(), 0);
        auto it = std::lower_bound(v.begin(), v.end(), 500'000);
        benchmark::DoNotOptimize(it);
    }
}

// FIXED: setup goes outside the timing loop.
static void BM_good_lookup(benchmark::State& state) {
    std::vector<int> v(1'000'000);
    std::iota(v.begin(), v.end(), 0);
    for (auto _ : state) {
        auto it = std::lower_bound(v.begin(), v.end(), 500'000);
        benchmark::DoNotOptimize(it);
    }
}

Flawed benchmark: wrong baseline

Comparing two designs against an unfair baseline is subtler and more dangerous:

// MISLEADING: comparing hash lookup against linear scan.
// Concludes "hash map is 100x faster" -- but the real alternative
// in production is sorted vector with binary search, which may
// be within 2x and uses half the memory.
static void BM_linear_scan(benchmark::State& state) {
    std::vector<std::pair<int,int>> data(100'000);
    // ... fill with random kv pairs, unsorted ...
    for (auto _ : state) {
        auto it = std::find_if(data.begin(), data.end(),
            [](const auto& p) { return p.first == 42; });
        benchmark::DoNotOptimize(it);
    }
}

The right baseline is the realistic alternative, not the worst possible option. Always state what the benchmark is comparing against and why that alternative is the one the team would actually choose.

Flawed benchmark: warm cache illusion

// MISLEADING: data fits in L2 cache and is hot from the previous
// iteration.  Production accesses the same structure after
// processing unrelated data that evicts it from cache.
static void BM_warm_cache(benchmark::State& state) {
    std::vector<int> v(1'000);  // ~4 KB, fits in L1
    std::iota(v.begin(), v.end(), 0);
    for (auto _ : state) {
        int sum = 0;
        for (int x : v) sum += x;
        benchmark::DoNotOptimize(sum);
    }
    // Reports ~50 ns.  In production, with cache-cold data,
    // the same operation takes 10-50x longer.
}

If the production access pattern encounters cold data, either make the working set large enough to exceed cache, or explicitly flush cache lines between iterations (platform-specific and fragile, but sometimes necessary for honest results).

Google Benchmark pitfalls

Google Benchmark (benchmark::) is widely used and generally solid, but several recurring mistakes deserve mention:

Forgetting benchmark::ClobberMemory(). DoNotOptimize prevents dead stores of a value, but it does not force the compiler to assume memory has changed. If your benchmark modifies a data structure in place, the compiler may hoist reads above writes across iterations. Use benchmark::ClobberMemory() after mutations to force a reload:

static void BM_modify(benchmark::State& state) {
    std::vector<int> v(10'000, 0);
    for (auto _ : state) {
        for (auto& x : v) x += 1;
        benchmark::ClobberMemory();
        // Without ClobberMemory, the compiler could theoretically
        // observe that v is never read and eliminate the writes,
        // or combine multiple iterations into one.
    }
}

Not using state.SetItemsProcessed(). Without it, the output shows only time per iteration, making it hard to compare benchmarks that process different batch sizes. Always call state.SetItemsProcessed(state.iterations() * num_items) so the output includes a throughput column.
Ignoring state.PauseTiming() / state.ResumeTiming() overhead. These calls use clock reads that themselves take 20-100 ns on many platforms. If the operation you are measuring takes less than a microsecond, the pause/resume overhead dominates. For sub-microsecond work, keep setup outside the loop entirely or amortize it across many iterations.
Benchmarking only one size. Use ->Range(8, 1 << 20) or ->DenseRange() to test across sizes. A design that wins at 1K elements may lose at 1M. Performance is not a scalar.

Use a serious harness when possible. The exact library is less important than the discipline: stable repetition, clear setup boundaries, and explicit prevention of dead-code elimination. When a benchmark is intentionally partial because the repository does not standardize on a harness, say so and document the omitted scaffolding.

Measuring noise instead of signal

Even with a correct benchmark, environmental noise can dominate the result. Frequency scaling, thermal throttling, background processes, NUMA effects, and interrupt coalescing all inject variance that has nothing to do with your code change.

Practical defenses:

Pin CPU frequency during benchmarks (cpupower frequency-set -g performance on Linux, or disable turbo boost). A benchmark that runs at 4.5 GHz on one iteration and 3.2 GHz on the next is measuring the governor, not your code.
Isolate cores (isolcpus kernel parameter or taskset / numactl) to prevent scheduler interference.
Run multiple trials and report median, not mean. Median is robust to occasional spikes from interrupts or page faults. Mean is dragged by outliers.
Require statistical significance before declaring a win. A 3% improvement with 5% coefficient of variation is noise. Google Benchmark supports --benchmark_repetitions=N and reports stddev; use it.
Compare on the same machine, same boot, same binary when possible. Cross-machine comparisons require careful normalization and are generally less trustworthy.

If a benchmark result changes by more than 1-2% between identical runs on an idle machine, the benchmark setup needs fixing before the result means anything.

Distribution matters more than a single number

Average runtime is a weak summary. Many production systems care about percentiles, variance, and worst-case behavior under skew. A representation that improves mean throughput while making tail latency worse under bursty allocation or lock contention may still be a regression. Likewise, a benchmark that reports only “nanoseconds per iteration” can hide bimodal behavior caused by rehash, page faults, branch predictor flips, or occasional large allocations.

Read performance numbers the way you would read availability or latency telemetry. Ask about spread, not just center. Ask whether outliers are noise, environment instability, or real behavior from the design. Ask whether the benchmark shape forces rare expensive events often enough to matter.

For CI regression control, this means using thresholds and trend analysis carefully. A noisy benchmark can create false alarms that train teams to ignore the signal. A too-forgiving threshold can let meaningful regressions accumulate. Stable benchmark design is usually more valuable than elaborate reporting.

Profilers answer different questions

A profiler is not a slower benchmark. It is a sampling or instrumentation tool for understanding where time, allocations, cache misses, or waits occur in a real process. Use it when you do not yet know where the bottleneck is, or when a microbenchmark result needs validation against full-system behavior.

Different profilers reveal different failure classes:

CPU sampling profilers answer where active CPU time is spent.
Allocation profilers answer which paths allocate and retain memory.
Hardware-counter-aware tools answer where cache misses, branch mispredicts, or stalled cycles cluster.
Concurrency and tracing tools answer where threads block, wait, or contend.

Do not ask one tool to answer a question it cannot see. A CPU profiler will not explain why threads are mostly idle waiting on a lock. An allocation flame graph will not tell you whether a faster allocator would matter if traversal cost still dominates. A wall-clock trace may show a slow request without distinguishing CPU work from scheduler delay.

On Linux, that may mean combining perf, allocator profiling, and tracing. On Windows, it may mean ETW-based tools, Visual Studio Profiler, or Windows Performance Analyzer. On macOS, Instruments fills a similar role. The tool choice is secondary to the habit: pair the question with the instrument that can actually answer it.

Correlate benchmarks with profiles

Benchmarking and profiling should constrain each other.

If a microbenchmark says a change should help because it reduces allocations, the profiler in a realistic process should show fewer allocations or less time in allocation-heavy paths. If a profile says a loop is hot because of cache misses in a pointer-rich traversal, a benchmark should isolate that traversal shape and test alternatives. If the two disagree, do not average them into comfort. Investigate the mismatch.

Common causes of mismatch include:

The benchmark data shape does not match production.
The benchmark isolated a cost that is drowned out end to end.
The profiler points at a symptom rather than the root cause.
The measured change affected code size, inlining, or branch behavior in the full binary differently than in isolation.

Good performance work narrows these gaps. Bad performance work ignores them.

Beware “representative” inputs that are not

Teams often sabotage measurement by using tidy synthetic inputs. Keys are uniformly random. Messages are the same size. Queues are never bursty. Hash tables never experience realistic load factors. Parsers never see malformed or adversarial data. These inputs are easier to generate and easier to stabilize. They are also often wrong.

Representative input does not mean copying production traffic blindly. It means preserving the properties that drive cost: size distribution, skew, repetition, mutation ratio, working-set size, and failure-path frequency. For a cache, that may mean a Zipf-like access pattern rather than uniform keys. For a parser, it may mean a realistic mix of short and long fields plus a small rate of malformed records. For a scheduler or queue, it may mean burst patterns rather than a flat arrival rate.

When data privacy or operational constraints prevent real traces, at least synthesize distributions intentionally. A benchmark over unrealistic inputs is not neutral. It actively trains the team on the wrong problem.

Performance claims must survive code review

Treat performance changes as reviewable design work, not as heroic experiments. A credible change should come with a compact evidence package:

The performance question being answered.
The benchmark or profile setup.
The workload assumptions.
The before-and-after result, including variance or percentile data when relevant.
The tradeoffs introduced: code complexity, memory footprint, API restrictions, portability, or maintenance cost.

This forces a useful discipline. It prevents “seems faster on my machine” from entering the codebase as institutional memory. It also creates artifacts future reviewers can re-run when compilers, standard libraries, or workload shape changes alter the answer.

Regression control is an engineering system, not a dashboard

It is tempting to add a benchmark job to CI and call performance solved. In practice, regression control works only when the measured benchmarks are stable, cheap enough to run at the right frequency, and tied to code paths the team actually cares about. A flaky nightly benchmark suite that no one trusts is not safety. It is ritual.

A practical setup usually includes a small set of highly stable microbenchmarks for known hot kernels, a separate heavier performance workflow for broader load tests, and production observability that tracks latency, throughput, CPU time, and memory effects after release. The layers differ in cost and fidelity. You need all three because no single layer is enough.

What honest measurement looks like

Honest measurement is modest. It does not promise universal truths from one benchmark, confuse profile heat with immediate blame, or assume an optimization matters just because it is visible in assembly. It ties a number to a workload, a question, and a decision.

Hardware changes, compilers improve, standard library implementations shift, and production traffic evolves. The habit worth building is not attachment to one profiler or framework. It is the refusal to make performance claims without evidence that matches the decision being made.

Takeaways

Pick the measurement tool that matches the question: benchmark, profile, load test, or production telemetry.
Write the benchmark claim in plain language before writing benchmark code.
Measure the full relevant operation, not the most convenient fragment.
Treat optimizer artifacts, harness mistakes, and unrealistic inputs as default suspects.
Look at variance and percentiles, not only means.
Require performance changes to carry a reviewable evidence package.

Keyboard shortcuts

Modern CPP in Practice