Skip to content

Reference

Every flag, marker, fixture, CLI command, and public function. The CLI and Python API below are rendered live from the source; the pytest surface (flags, marker, fixture, blob schema) is curated here. For the narrative versions see Quickstart, Choosing a metric, Grouping by dims, and Compare & gate CI.

pytest command-line flags

The plugin adds these to any pytest run (alongside pytest-benchmark's own flags). This table is generated from the plugin's own --help text, so it can't drift from the code:

Flag Default What
--benchmark-memory off Record peak memory (a memray pass) for every benchmark() call, not just the benchmark_memory fixture — no test changes. Off by default; the fixture is always measured, with or without this flag.
--benchmark-memory-repeats=N Force a fixed number of memray passes per benchmark, suite-wide; the reported peak is the min across them. Overridden per-test by @pytest.mark.benchmem(repeats=N). Default: adaptive — run passes until the min floor settles (≥2, cap 10). Set this for a fixed, reproducible count (e.g. CI gating against a saved baseline).
--benchmark-memory-warmup=N 1 Untracked dry-runs of the action before measuring, suite-wide, to shed one-time costs (lazy imports, first-touch caches) so the measured passes aren't inflated by cold start. Overridden per-test by @pytest.mark.benchmem(warmup=N). Default: 1; set 0 to disable.
--benchmark-memory-max-time=SECONDS Wall-clock budget for the adaptive memory passes (the analogue of --benchmark-max-time): caps how long adaptive sampling spends per benchmark. Ignored when --benchmark-memory-repeats forces a fixed count. Default: no time bound — the pass cap alone bounds it.
--benchmark-memory-compare=REF off Compare this run's peak memory against a prior saved run (a pytest-benchmark storage ref like 0001, or the latest if no value is given); folds base and delta-peak columns into the table.
--benchmark-memory-compare-fail=FIELD:THRESHOLD Fail the session on a memory regression, e.g. peak:10%, peak:5MiB, allocations:5% (repeatable). Fields: peak, allocated, allocations, rss (rss needs isolated runs). Implies --benchmark-memory-compare.
--benchmark-memory-profile=DIR Save the memray profile (.bin) into DIR — render later with memray flamegraph (or tree/summary). Scope follows the gate: WITH --benchmark-memory-compare-fail only the regressing ids, otherwise EVERY measured benchmark. Off by default (disk cost).
--benchmark-memory-profile-native off Capture native (C/C++/Rust) stacks in the kept profile, so the flamegraph attributes memory inside extension code (polars/numpy/solver bindings) instead of one opaque ??? at ??? bucket. Only affects --benchmark-memory-profile runs; opt-in (slower, bigger .bin). Per-test override: @pytest.mark.benchmem(profile_native=True). Off by default.
--benchmark-memory-table combined Layout for the memory metrics: combined (default) folds them into pytest-benchmark's timing table; split prints a separate memory table.
--benchmark-memory-columns=peak,allocated,allocations,rss Which memory metrics the table shows, comma-separated and in order: peak, allocated, allocations, rss (rss only shows for isolated runs). Default: peak only.
--benchmark-memory-stats=min,mean,max With repeats > 1, the stats each shown metric spreads into: min, mean, max, median, stddev. A single pass stays one column. Default: min,mean,max.

Timing regressions still use pytest-benchmark's own --benchmark-compare / --benchmark-compare-fail; the --benchmark-memory-compare* flags are the memory mirror. Their baseline comes from pytest-benchmark's storage (.benchmarks/) — save one first with --benchmark-save=NAME or --benchmark-autosave, or the gate finds nothing and passes. See Gate CI on a regression.

The benchmem marker

@pytest.mark.benchmem(repeats=3)
def test_build(benchmark_memory):
    ...
Kwarg Default What
repeats auto force a fixed N memray passes for this test (default: adaptive — see below). Every pass is kept (the blob stores the whole series); the headline peak is the minimum across them, and --stat reports any other. Overrides the suite-wide --benchmark-memory-repeats.
warmup 1 untracked dry-runs of the action before measuring, to shed one-time costs (lazy imports, first-touch caches). 0 disables. Overrides the suite-wide --benchmark-memory-warmup.
isolate False run each memray pass in a fresh process and also record whole-process resident memory as the rss metric — the physical/OOM-relevant peak memray's logical heap can't give. Per-test only (no suite-wide flag): rss is a whole-job capacity number, meaningful only for build+operate benchmarks, so you mark the specific ones. Needs a top-level, picklable benchmarked function (see the whole-job warning below).
profile_native False on the --benchmark-memory-profile path, capture native (C/C++/Rust) stacks in the kept .bin, so a flamegraph attributes extension-code memory (polars/numpy/solver bindings) instead of one opaque ??? at ??? bucket. Opt-in (slower, bigger .bin). Overrides the suite-wide --benchmark-memory-profile-native.
max_peak fail the test if the headline peak exceeds this absolute ceiling. A size string ("100MiB", units B/KiB/MiB/GiB) or a bare int (bytes).
max_allocated as max_peak, on allocated (total bytes).
max_allocations as above, on the allocations count — a bare number (no unit).

Isolated rss measures the whole job — build the state inside the callable

The rss metric (isolate=True) runs the action in a fresh, empty process. Two consequences:

The build must happen inside the measured callable, and the callable must be a top-level, picklable function. The child starts with nothing, so it must construct whatever it operates on; and spawn serializes the call with standard pickle, so a lambda or closure is rejected (we don't use cloudpickle) — pass a module-level function plus lightweight args.

# ✅ ships only the spec (~bytes); the child builds + writes cold = the whole job's RSS
benchmark_memory(build_and_write, spec, n)

# ❌ a lambda/closure — rejected; std pickle can't serialize it (even build-inside)
benchmark_memory(lambda: write(build(spec, n)))

# ❌ a top-level partial over a *pre-built* model pickles fine, but ships the model and
#    measures *deserializing* it, not building it — the build never re-runs in the child
model = build(spec, n)
benchmark_memory(partial(write, model))

You can't isolate a single sub-phase. Since the child must build before it can operate, isolated rss is a build-plus-operate capacity number by construction, never a per-phase figure (e.g. write-only). For per-phase memory, use the in-process peak metric, which can measure a write given an already-built model. So the rule is two-part: use a top-level function (no lambdas), and don't pass heavy pre-built state — build it inside.

Absolute ceilings — max_peak / max_allocated / max_allocations

@pytest.mark.benchmem(max_peak="100MiB", max_allocations=5000)
def test_build(benchmark_memory):
    benchmark_memory(build_model, 1000)

A baseline-free guardrail: the test fails if the measured metric exceeds the ceiling (test_build: peak 117 MiB exceeds max_peak 100 MiB). Thresholds are absolute only — there's no saved run to take a percent of; for relative gating against a prior run use --benchmark-memory-compare-fail or benchmem compare --fail-on. A ceiling is a worst-case budget, so with repeats > 1 (including adaptive sampling) the gate reads the worst pass — not the headline min — and fails if any pass breaches it; the two coincide for a single pass. The ceiling is enforced wherever memory is measured — the benchmark_memory fixture and the --benchmark-memory patch — but a plain benchmark() call without --benchmark-memory measures no memory, so the marker is a no-op there.

Scope: the benchmarked action only

This gates the benchmarked action only (the isolated call pytest-benchmem measures), not the whole test. For a whole-test limit or leak check, that's pytest-memray's limit_memory / limit_leaks — see the README's "With pytest-memray".

How many passes? By default pytest-benchmem samples adaptively — after an untracked warmup run, it runs the memray pass until the min floor settles (≥2 passes; capped at 10, or a --benchmark-memory-max-time budget). Deterministic code settles in ~3 passes; noisy code runs more. Set repeats=N (marker) or --benchmark-memory-repeats=N (suite) to force a fixed, reproducible count — what CI gating against a saved baseline wants. Full rationale and the noisy-workload guidance are in the guide: Repeats & adaptive sampling.

The benchmark_memory fixture

Depends on pytest-benchmark's benchmark fixture; measures peak in a separate untimed pass, then times via pytest-benchmark.

Order — memory first (cold), then timing

Every call form runs the memray pass first, then pytest-benchmark's timing (calibration + all rounds). This matters: timing runs the function thousands of times, which grows and fragments the allocator's arenas — so measuring memory after timing would report the warm plateau, not the fresh-process floor the headline min is meant to be. Memory-first measures the cold cost (the warmup pass still sheds the one-time cold-start within it); timing then runs cleanly, with no memray hooks active. This holds for __call__, pedantic, and the --benchmark-memory patch alike. The standalone measure_peak / measure_memory have no timing phase at all; warmup=0 skips the warmup, repeats=N forces a fixed count.

Times then measures function(*args, **kwargs):

benchmark_memory(sorted, data)

Explicit control, like pytest-benchmark's pedantic plus a memory pass:

benchmark_memory.pedantic(target, args=(), kwargs=None, setup=None,
                          rounds=1, warmup_rounds=0, iterations=1)
  • setup — a callable run untracked before each measured call; if it returns (args, kwargs), those supply the call's arguments. Used for both the timed rounds and each (adaptive) memory sample — one setup rebuilds fresh state for both — so a stateful action's memory samples stay independent. The same applies to benchmark.pedantic(setup=…) under --benchmark-memory: no extra changes.
  • rounds, warmup_rounds, iterations — as in pytest-benchmark.

Mostly memory, little timing? There's no memory-only switch — the entry rides pytest-benchmark's timing. To trim it: --benchmark-min-rounds=1 --benchmark-max-time=0 (no test changes), or pedantic(rounds=1, warmup_rounds=0) for a single call. For pure memory outside pytest, use measure_peak / measure_memory.

Attributes (available after a call):

Attribute What
extra_info pytest-benchmark's per-benchmark dict. Set scalars here to attach analysis dims; the memory blob lands here under the benchmem key.
peak_bytes peak memory (bytes) from the last call, or None before any call.
result the full MemoryResult from the last call, or None.

The extra_info.benchmem blob

Each measured benchmark stores this dict under extra_info["benchmem"] — three flat per-repeat series, one entry per memray pass. Every reported number (headline peak = min, any --stat) derives from these on read:

Key What
peak_bytes per-repeat high-water of live bytes — the peak metric (headline = min)
allocations per-repeat allocation count — the allocations metric
total_bytes per-repeat total bytes allocated — the allocated metric (churn peak hides)
rss_bytes per-repeat whole-process resident high-water (ru_maxrss) — the rss metric. Only present under isolate=True (each pass in a fresh process); absent otherwise.
{"peak_bytes": [800000, 805000], "allocations": [12, 12], "total_bytes": [800000, 805000]}

See Choosing a metric for when to reach for each, and --stat for distributions.

CLI — benchmem

Installed with pytest-benchmem[plot]. The full command tree and every option, captured live from the typer app as it actually renders in a terminal:

benchmem --help
Usage: benchmem [OPTIONS] COMMAND [ARGS]...

pytest-benchmem — plot and compare benchmark runs.

╭─ Options ────────────────────────────────────────────────────────────────────────────────────────╮
--install-completion Install completion for the current shell.
--show-completion Show completion for the current shell, to copy it or customize the
installation.
--help Show this message and exit.
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ───────────────────────────────────────────────────────────────────────────────────────╮
plot Render an interactive plotly view from one or more pytest-benchmark runs.
compare Print a per-id table for one run, or compare two or more (and optionally gate CI).
sweep Run a benchmark suite across several installed versions of a package.
flamegraph Render a kept memory profile in one step — resolve the ``.bin`` for a test and run
memray.
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
benchmem plot --help
Usage: benchmem plot [OPTIONS] RUNS...

Render an interactive plotly view from one or more pytest-benchmark runs.

╭─ Arguments ──────────────────────────────────────────────────────────────────────────────────────╮
* runs... PATH pytest-benchmark JSON file(s). [required]
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────╮
--columns [time|peak|allocated|allocation Metric to plot: time | peak |
s|rss] allocated | allocations | rss
(rss = isolated runs only). One
per figure (a plot has a single
value axis) — same flag as
`compare`; the spread shows as
whiskers via --band.
[default: time]
--view TEXT compare | scatter | sweep |
scaling (default: by count).
--facet TEXT Dim to facet by.
--pivot TEXT Comparison axis for --view
compare/scatter: fold a single
run along this dim instead of
across run-files (param:NAME or a
bare extra_info name); its values
become the compared series. Like
--group-by but it sets what's
*compared*, not how rows cluster.
Mutually exclusive with multiple
runs.
--x TEXT scaling: dim for the x-axis.
--clip FLOAT Clamp the colour scale.
--where TEXT Filter rows by dim: KEY=VALUE
(repeatable, AND-combined).
--free-axes [x|y|both] Free facet axes: x | y | both
(needs --facet).
--band [auto|minmax|none] scaling: spread whiskers on
memory metrics — auto | minmax |
none.
[default: auto]
--label -l TEXT Series label per run, in order
(repeat). Default: stem.
--output -o PATH HTML out.
--open --no-open [default: no-open]
--help Show this message and exit.
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
benchmem compare --help
Usage: benchmem compare [OPTIONS] RUNS...

Print a per-id table for one run, or compare two or more (and optionally gate CI).

╭─ Arguments ──────────────────────────────────────────────────────────────────────────────────────╮
* runs... PATH One or more pytest-benchmark runs, oldest → newest. One prints a plain
table; two or more compare (a sweep is N).
[required]
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────╮
--columns TEXT Comma list of metrics: time | peak | allocated | allocations | rss (rss
= isolated runs only; e.g. peak or time,peak,rss). Default: time,peak.
Each is shown across every --stat; a metric absent from every run is
dropped.
--group-by TEXT Group rows into sub-tables: fullname | name | func | group | module |
class | param:NAME (comma-composable).
[default: fullname]
--stat TEXT Which stat column(s) per metric: min | max | mean | median | stddev, or
all (the default) for the full spread side by side.
--sort TEXT Row order: name (id) | value (largest in the last run) | change.
[default: name]
--pivot TEXT Comparison axis: fold a single run along this dim instead of across
run-files — param:NAME or a bare extra_info name. Rows differing only in
it pair up and its values become the compared series. Like --group-by
but it sets what's *compared*, not how rows cluster. Mutually exclusive
with multiple runs.
--format TEXT Output format: table (rich terminal, default) | md (GitHub-flavored
markdown to stdout — redirect to a file or pipe into a PR comment /
$GITHUB_STEP_SUMMARY).
[default: table]
--csv PATH Also write the raw (unscaled) comparison to this CSV file.
--fail-on TEXT Exit non-zero on a regression of the first run vs the last (or, with
--pivot, the first dim value vs the last). FIELD:THRESHOLD, repeatable —
e.g. --fail-on peak:10% --fail-on peak:5MiB --fail-on rss:10% (rss gates
only isolated runs).
--help Show this message and exit.
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
benchmem sweep --help
Usage: benchmem sweep [OPTIONS] PACKAGE VERSIONS...

Run a benchmark suite across several installed versions of a package.

Provisions one fresh uv venv per version, runs 'pytest <suite> --benchmark-only'
in each writing <out>/<version>.json, then prints the next step. --memory adds
the memory pass; forward any other pytest flag with --pytest-arg, e.g.
benchmem sweep mypkg 1.2.0 1.3.0 --suite benchmarks/ --memory --pytest-arg=-k.

╭─ Arguments ──────────────────────────────────────────────────────────────────────────────────────╮
* package TEXT Package under test; each plain version installs `<package>==<v>`.
[required]
* versions... TEXT Versions or pip specs to sweep, e.g. 1.2.0 1.3.0
git+https://github.com/me/pkg@main.
[required]
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────╮
* --suite PATH Benchmark suite (dir or file) to run in each version's
venv.
[required]
--out PATH Directory for the per-version JSON runs.
[default: .benchmarks/sweep]
--memory --no-memory Add --benchmark-memory to each pytest run.
[default: no-memory]
--pytest-arg TEXT Arg forwarded to pytest, one token each, repeatable
(e.g. --pytest-arg=-k).
--pin TEXT Extra pip spec installed alongside (repeatable).
--as-of TEXT YYYY-MM-DD for uv --exclude-newer (reproducible
resolve).
--import-check TEXT Module asserted to resolve to the venv (isolation
preflight).
--copy-dir PATH Directory copied into each venv's cwd (the suite
imports from here).
--help Show this message and exit.
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
benchmem flamegraph --help
Usage: benchmem flamegraph [OPTIONS] PROFILE_DIR [TEST_ID]

Render a kept memory profile in one step — resolve the ``.bin`` for a test and run memray.

Closes the "regressed → *where*?" loop after ``--benchmark-memory-profile``: instead of
finding the right ``.bin`` and remembering the memray subcommand, point at the profile dir
and name the test (or ``--worst peak`` to auto-pick the heaviest). Defaults to an HTML
flamegraph written next to the ``.bin``; ``--report tree|summary|stats`` prints to the
terminal instead.

╭─ Arguments ──────────────────────────────────────────────────────────────────────────────────────╮
* profile_dir PATH Directory of kept .bin profiles (--benchmark-memory-profile).
[required]
[test_id] TEXT Test id (exact, or a unique substring) to render; omit with --worst.
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────╮
--worst TEXT Auto-pick the heaviest: peak | allocated | allocations
--report TEXT memray reporter: flamegraph | table | tree | summary | stats
[default: flamegraph]
--native --no-native Require the profile to carry native traces (captured via
--benchmark-memory-profile-native); error if it doesn't.
[default: no-native]
--output -o PATH HTML out path (default: next to the .bin).
--open --no-open Open the rendered HTML. [default: no-open]
--force -f Overwrite an existing render.
--help Show this message and exit.
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯

Public Python API

Light to import — pytest_benchmem re-exports only the engine and the readers; pytest_benchmem.plotting pulls plotly and pytest_benchmem.sweep shells to uv, so import those submodules directly.

Engine

measure_peak

measure_peak(
    action: Action, repeats: int | None = None
) -> int

Run action() under memray.Tracker and return peak bytes.

The bare one-liner for a REPL or notebook; :func:measure_memory returns the full result (allocation count, spread). repeats behaves as there — None (default) samples adaptively, an int forces a fixed pass count.

Parameters:

Name Type Description Default
action Action

The zero-argument callable to measure.

required
repeats int | None

Fixed pass count, or None to sample adaptively.

None

Returns:

Type Description
int

Peak bytes (the headline peak = min across passes, after warmup).

measure_memory

measure_memory(
    action: Action,
    repeats: int | None = None,
    *,
    warmup: int = _DEFAULT_WARMUP,
    isolate: bool = False,
    max_time: float | None = None,
    min_passes: int = _ADAPTIVE_MIN_PASSES,
    max_passes: int = _ADAPTIVE_MAX_PASSES,
    patience: int = _ADAPTIVE_PATIENCE,
    keep_bin: Path | None = None,
    native: bool = False,
    setup: Action | None = None,
) -> MemoryResult

Run action() under memray.Tracker → :class:MemoryResult, one pass per repeat.

warmup untracked dry-runs run first to shed one-time costs; then each measured pass gets a fresh tracker. The headline is the min across passes (see :class:MemoryResult); every pass's :class:Measurement is kept for spread stats.

With isolate=True each measured pass runs in a fresh spawned process (each warming itself), and that child's whole-process resident high-water (ru_maxrss) is recorded as :attr:Measurement.rss_bytes — a physical-memory reading attributable to the action, which an in-process pass can't give. The action (and setup) must be picklable (a top-level callable, not a lambda/closure); keep_bin is ignored in this mode.

Two modes, by repeats:

  • repeats=N (an int) — run exactly N passes. Fixed and reproducible; what CI gating and saved-baseline comparisons want.
  • repeats=None (default) — sample adaptively: keep running passes until the min stops moving (no new low for patience passes), bounded by min_passes (≥2), max_passes, and an optional max_time budget. Deterministic code settles in a few passes; noisy code runs more.

Parameters:

Name Type Description Default
action Action

The zero-argument callable to measure.

required
repeats int | None

Fixed pass count, or None to sample adaptively.

None
warmup int

Untracked dry-runs (setup + action) before measuring; 0 disables.

_DEFAULT_WARMUP
isolate bool

Run each pass in a fresh spawned process and record its ru_maxrss as :attr:Measurement.rss_bytes. Requires a picklable action/setup.

False
max_time float | None

Wall-clock budget (seconds) for adaptive sampling; None = no time bound.

None
min_passes int

Minimum passes when sampling adaptively.

_ADAPTIVE_MIN_PASSES
max_passes int

Hard ceiling on passes when sampling adaptively.

_ADAPTIVE_MAX_PASSES
patience int

Stop adaptive sampling after this many consecutive passes with no new min.

_ADAPTIVE_PATIENCE
keep_bin Path | None

If set, the first pass's profile .bin is retained here (for a later :func:render_flamegraph); the rest still go to temp dirs and are discarded.

None
native bool

Capture native (C/C++/Rust) stacks in the kept .bin so a flamegraph can attribute memory inside extension code instead of an opaque native bucket. Costs runtime and disk; only meaningful with keep_bin (ignored otherwise).

False
setup Action | None

Optional zero-arg callable run untracked before each pass (and each warmup run) — its allocations are not measured. Use it to rebuild fresh state so a stateful action (one that caches on or mutates a carried-over object) gives independent samples instead of a decaying/accumulating series. Mirrors pytest-benchmark's pedantic(setup=...).

None

Returns:

Name Type Description
A MemoryResult

class:MemoryResult over every measured pass (warmup runs are not retained).

MemoryResult dataclass

A memory measurement across repeats passes, derived from the per-repeat samples.

The per-repeat :attr:samples are the single source of truth — that's all the blob stores (the series); everything else is derived from them on read.

The headline :attr:peak_bytes is the minimum peak across passes — the fresh-process floor, unbiased by the in-process warm plateau (repeated runs fragment/grow arenas and allocate more) that a central stat would report. :attr:allocations / :attr:total_bytes come from that same min-peak run (a coherent snapshot); :attr:peak_bytes_max is the worst peak, so the spread is visible. A warm-plateau / steady-state read is available via the mean / median --stat. A single pass collapses all of these to its own values.

repeats property

repeats: int

How many passes were measured.

representative property

representative: Measurement

The min-peak run — the one the headline peak/allocations/total_bytes come from.

peak_bytes property

peak_bytes: int

The headline peak — the minimum high-water across passes (the fresh-process floor).

peak_bytes_max property

peak_bytes_max: int

The worst peak across repeats (equals :attr:peak_bytes with one repeat).

allocations property

allocations: int

Allocation count from the representative (min-peak) run.

total_bytes property

total_bytes: int

Total bytes allocated by the representative (min-peak) run.

rss_bytes property

rss_bytes: int | None

Headline whole-process RSS — the minimum ru_maxrss across isolated passes (the cold floor, like :attr:peak_bytes), or None if memory wasn't measured in isolation (in-process has no attributable process-global RSS).

series

series(field: str) -> list[Any]

The per-repeat values of one series field (SERIES_FIELDS or optional).

as_dict

as_dict() -> dict[str, Any]

The JSON blob stored under pytest-benchmark extra_info["benchmem"].

The three core per-repeat series, flat, plus any :data:OPTIONAL_SERIES_FIELDS that were measured (all-or-nothing per result). No denormalized scalars and no repeats (it's len of any series). Everything else derives on read.

from_blob classmethod

from_blob(blob: Mapping[str, Any]) -> MemoryResult

Rebuild from a blob's per-repeat series. Core columns are required; any :data:OPTIONAL_SERIES_FIELDS are read when present (else left None).

Measurement dataclass

One repeat's raw numbers — memray's peak high-water, allocation count, and total bytes allocated (cumulative churn, incl. temporaries GC later frees), plus an optional whole-process resident high-water.

rss_bytes is getrusage's ru_maxrss from an isolated pass (a fresh child process); None in-process, where a process-global RSS isn't attributable to the action.

Readers & loader

from_pytest_benchmark reads timing (seconds, from stats); memory_from_pytest_benchmark reads memory (bytes, from extra_info.benchmem). load_samples is the unified reader; load_long_df stacks runs into the tidy frame the plots pivot. discover_runs collects saved runs from pytest-benchmark's .benchmarks/ storage, so you can hand the readers a directory instead of listing files.

from_pytest_benchmark

from_pytest_benchmark(
    path: str | Path, *, metric: str = "min"
) -> tuple[str, list[Sample], str]

Read timing out of a pytest-benchmark file → (label, samples, "s").

Dims come from each benchmark's parametrize params and extra_info, plus the structural node.* dims (see :func:_node_dims).

Parameters:

Name Type Description Default
path str | Path

A pytest-benchmark JSON file.

required
metric str

Which pytest-benchmark stat to read (min / median / …).

'min'

Returns:

Type Description
str

(label, samples, unit) — the run label, one :class:Sample per benchmark,

list[Sample]

and the unit ("s").

memory_from_pytest_benchmark

memory_from_pytest_benchmark(
    path: str | Path,
    *,
    field: str = "peak_bytes",
    reduce: Callable[[list[float]], float] | None = None,
) -> tuple[str, list[Sample], str]

Read memory out of a pytest-benchmark file → (label, samples, unit).

The benchmark_memory fixture stores each run's memory blob under extra_info["benchmem"] (a flat per-repeat series per field), keyed by the same benchmark id pytest-benchmark uses. Benchmarks lacking the blob (timing-only tests) are skipped. Dims come from parametrize params and extra_info, plus the structural node.* dims (see :func:_node_dims).

Parameters:

Name Type Description Default
path str | Path

A pytest-benchmark JSON file.

required
field str

Which series to read — peak_bytes (unit B), allocations (count), or total_bytes.

'peak_bytes'
reduce Callable[[list[float]], float] | None

Reduce the per-repeat series to one scalar. Default (None) derives the headline (peak = min, allocations/total_bytes = the min-peak run); pass a callable for a distribution stat over the series instead.

None

Returns:

Type Description
str

(label, samples, unit) — the run label, one :class:Sample per benchmark

list[Sample]

with the blob, and the unit (B or count).

load_samples

load_samples(
    path: str | Path,
    *,
    metric: Metric = "time",
    stat: str | None = None,
) -> tuple[str, list[Sample], str]

Read one pytest-benchmark file for the chosen metric(label, samples, unit).

The unified reader over :func:from_pytest_benchmark (timing) and :func:memory_from_pytest_benchmark (memory).

Parameters:

Name Type Description Default
path str | Path

A pytest-benchmark JSON file.

required
metric Metric

Which metric to read (time / peak / allocated / allocations).

'time'
stat str | None

Distribution stat over the metric's per-repeat series (min / max / mean / median / stddev); None reads the headline scalar. For time it selects the pytest-benchmark stat (default min).

None

Returns:

Type Description
tuple[str, list[Sample], str]

(label, samples, unit) — the run label, its samples, and the metric's unit.

load_long_df

load_long_df(
    runs: str | Path | Sequence[str | Path],
    *,
    metric: Metric = "time",
    stat: str | None = None,
    labels: Sequence[str] | None = None,
    pivot: str | None = None,
) -> tuple[pd.DataFrame, str]

Stack pytest-benchmark files (one path or a sequence) into one long frame → (df, unit).

One row per (run, id) for the chosen metric. Columns: snapshot (the series axis — see below), id (the pairing key), value, then one column per dim key seen (missing dims are NaN). Every plot view and the compare table pivots this frame, pairing rows on id and laying snapshot values side by side.

The series axis is just a dim. By default it's the run-file (snapshot = each file's label), which is why a run-file is a comparison axis: compare a.json b.json ranks one file against another. pivot re-points that axis at a real data dim instead — its values become snapshot and it's lifted out of each row's identity (dropped from the dims and stripped from the id) so rows differing only in it pair up. That lets one combined run be A/B'd along a config dim (--pivot param:semantics) exactly as two files are A/B'd today — the run-file is an external series axis, pivot promotes an internal dim to the same role. The two are mutually exclusive (an A/B view has one series axis), so pivot with more than one run is an error.

Parameters:

Name Type Description Default
runs str | Path | Sequence[str | Path]

One path or a sequence of pytest-benchmark JSON files.

required
metric Metric

Which metric to read (time / peak / allocated / allocations).

'time'
stat str | None

Distribution stat over the per-repeat series; None reads the headline scalar.

None
labels Sequence[str] | None

Overrides the snapshot label per run (one per path, same order), decoupling the display name from the filename; defaults to each file's stem.

None
pivot str | None

Use this dim as the series axis instead of the run-file (param:NAME or a bare extra_info name). Requires a single run.

None

Returns:

Type Description
tuple[DataFrame, str]

(df, unit) — the long-form frame and the metric's unit.

discover_runs

discover_runs(
    root: str | Path = ".benchmarks",
) -> list[Path]

Return pytest-benchmark JSON files under root (for CLI suggestions).

Parameters:

Name Type Description Default
root str | Path

Directory to search (default: pytest-benchmark's .benchmarks store).

'.benchmarks'

Returns:

Type Description
list[Path]

The JSON file paths found under root.

Sample

Bases: NamedTuple

One measured result: an opaque id, a value, and analysis dims.

Plotting — pytest_benchmem.plotting

Every plot_* returns (figure, n_ids). snapshots is a list of run JSON paths; labels names the series per run (defaults to the file stems) — the API behind plot's -l/--label. plot_compare's sort is "absolute" (native units) or "relative" (percent).

plot_scaling

plot_scaling(
    snapshots: Snapshots,
    *,
    metric: Metric = "time",
    x: str | None = None,
    color: str | None = None,
    facet: str | None = None,
    log: bool | Literal["auto"] = "auto",
    band: Literal["auto", "minmax", "none"] = "auto",
    where: Mapping[str, str] | None = None,
    free_axes: FreeAxes | None = None,
    labels: Sequence[str] | None = None,
) -> tuple[Figure, int]

Cost vs a numeric dim, coloured/faceted by other dims.

x/color/facet default to inference from the dims (the lone numeric dim → x); pass them to override.

Parameters:

Name Type Description Default
snapshots Snapshots

Run JSON path(s).

required
metric Metric

Which metric to plot (time / peak / allocated / allocations).

'time'
x str | None

Dim for the x-axis (default: the lone numeric dim).

None
color str | None

Dim to colour by (default: inferred).

None
facet str | None

Dim to split into subplots (default: inferred).

None
log bool | Literal['auto']

"auto" log-scales when x is numeric and strictly positive; or force with a bool.

'auto'
band Literal['auto', 'minmax', 'none']

Spread whiskers (minmax of each point's per-pass series) on memory metrics — "auto" shows them where there's spread, "minmax" forces them on, "none" off. The line stays the headline (the min floor); whiskers reach up to the worst pass. Ignored for time.

'auto'
where Mapping[str, str] | None

Keep only rows matching these dim=value pairs.

None
free_axes FreeAxes | None

"x" / "y" / "both" — unmatch a faceted axis from the shared default ("x" for incommensurable sweeps, "y" when facets have different cost scales, e.g. per function).

None
labels Sequence[str] | None

Names the snapshot in the title (default: file stem).

None

Returns:

Type Description
tuple[Figure, int]

(figure, n_ids) — the plotly figure and the number of ids plotted.

plot_scatter

plot_scatter(
    snapshots: Snapshots,
    *,
    metric: Metric = "time",
    facet: str | None = None,
    clip: float | None = None,
    where: Mapping[str, str] | None = None,
    free_axes: FreeAxes | None = None,
    labels: Sequence[str] | None = None,
    pivot: str | None = None,
) -> tuple[Figure, int]

Baseline cost (log-x) vs candidate/baseline ratio (log-y).

Top-right = slow and slower (the regressed corner). The first series is the baseline; with 3+, the rest animate. Colour encodes the absolute Δ. The series axis is the run-file by default; pivot re-points it at a data dim, folding a single run so its dim-values are the series (the first being the baseline) instead of the files (see :func:load_long_df).

Parameters:

Name Type Description Default
snapshots Snapshots

Run JSON path(s); the first is the baseline, extras animate (one run when pivot is set — the dim's values play that role).

required
metric Metric

Which metric to plot (time / peak / allocated / allocations).

'time'
facet str | None

Dim to split into subplots.

None
clip float | None

Clamp the colour scale (default p95).

None
where Mapping[str, str] | None

Keep only rows matching these dim=value pairs.

None
free_axes FreeAxes | None

Give each facet its own axes instead of sharing.

None
labels Sequence[str] | None

Series names per run (default: file stems). Ignored when pivot is set.

None
pivot str | None

Use this dim as the series axis instead of the run-file (param:NAME or a bare extra_info name); requires a single run.

None

Returns:

Type Description
tuple[Figure, int]

(figure, n_ids) — the plotly figure and the number of ids plotted.

plot_compare

plot_compare(
    snapshots: Snapshots,
    *,
    metric: Metric = "time",
    sort: SortMode = "absolute",
    facet: str | None = None,
    clip: float | None = None,
    where: Mapping[str, str] | None = None,
    free_axes: FreeAxes | None = None,
    labels: Sequence[str] | None = None,
    pivot: str | None = None,
) -> tuple[Figure, int]

Bar chart of per-id delta, sorted by the chosen Δ (biggest regressions on top).

The first two series are compared; the first is the baseline. The series axis is the run-file by default (the first two files); pivot re-points it at a data dim, folding a single run so its first two dim-values become the A and B series instead (see :func:load_long_df).

Parameters:

Name Type Description Default
snapshots Snapshots

Run JSON path(s); only the first two are used (one run when pivot is set).

required
metric Metric

Which metric to plot (time / peak / allocated / allocations).

'time'
sort SortMode

absolute plots b - a in the native unit; relative plots percent change.

'absolute'
facet str | None

Dim to split into subplots.

None
clip float | None

Clamp the colour scale (default symmetric p95).

None
where Mapping[str, str] | None

Keep only rows matching these dim=value pairs.

None
free_axes FreeAxes | None

Give each facet its own axes instead of sharing.

None
labels Sequence[str] | None

Series names for the two runs (default: file stems). Ignored when pivot is set (the series are the dim's values).

None
pivot str | None

Use this dim as the series axis instead of the run-file (param:NAME or a bare extra_info name); requires a single run.

None

Returns:

Type Description
tuple[Figure, int]

(figure, n_ids) — the plotly figure and the number of ids plotted.

plot_sweep

plot_sweep(
    snapshots: Snapshots,
    *,
    metric: Metric = "time",
    clip: float | None = None,
    where: Mapping[str, str] | None = None,
    labels: Sequence[str] | None = None,
) -> tuple[Figure, int]

Heatmap of per-id fold-change (log2 ratio) vs the first snapshot.

Parameters:

Name Type Description Default
snapshots Snapshots

Run JSON paths; columns in order, the first is the reference.

required
metric Metric

Which metric to plot (time / peak / allocated / allocations).

'time'
clip float | None

Clamp the colour scale.

None
where Mapping[str, str] | None

Keep only rows matching these dim=value pairs.

None
labels Sequence[str] | None

Column (version) names (default: file stems).

None

Returns:

Type Description
tuple[Figure, int]

(figure, n_ids) — the plotly figure and the number of ids plotted.

Sweeps — pytest_benchmem.sweep

See Cross-version sweeps for the narrative, the Venv object, and the provision parameters.

sweep

sweep(
    versions: Sequence[str],
    run: Callable[[Venv], None],
    **provision_kwargs: object,
) -> list[str]

Provision a venv per version and call run(venv) in each.

run does whatever the consumer needs (invoke pytest / a memory command with venv.python and cwd=venv.cwd). Returns the list of versions that failed to provision.