PyCon US 2026

Breaking the Speed Limit

Statistician’s workflow. Fast models.

Wenxin Jiang, Jian Yin | Biostatistics, CityU HK

Target first. Speed later.Preserve statistical results.
Simulate for control.Define baseline behavior.
Python orchestrates execution.Match tools to bottlenecks.
Why this talk

Speed follows validation

Specify. Execute. Verify.

Target First

Define core statistic.

Executable Reference

Build matchable baseline.

Conditional Speed

Answer identical question.

Speed only counts when the math is preserved.

Simulation as testing

Simulation creates an answer key

Real-world data doesn't have ground truth.

01
Statistical target What to estimate?
02
Simulation scenarios Edge cases. Nulls.
03
Reference implementation Executable baseline.
validate before timing
04
Validation checks Prove equivalence.
05
Bottleneck shape Identify the limit.
06
Acceleration Apply smallest tool.
Programmer bridge

Simulation turns statistical behavior into tests

Expected behavior is statistical.

How simulation maps to software testing
Unit tests
Fixed-seed equivalence.
Property tests
Null invariants. Calibration.
Stress tests
Outliers. High dimension.
Load tests
Scale samples. Scale features.
Regression tests
Acceptance criteria pass.
Why Python

Python helps us bridge the gap between math and machines.

Keep it readable. Move hotspots to optimized engines.

NumPy referenceExecutable target.
Validation checksRecovery. Calibration. Equivalence.
Hotspot shapeLoops. Dense algebra. Repetition.
NumPyDense algebra. BLAS-backed.
NumbaCPU numerical loops. No temporaries.
Threads / workersRepeated shared array work.
JAX / GPUBatched array programs.
Why these two examples

Two workloads, two bottleneck shapes

k-means. Iterative fitting.

Assign. Update. Repeat.

Sequential state dependence.

distance loops temporaries compiled kernels

Permutations. Resampling inference.

Shuffle. Compute. Repeat.

Independent. High-dimensional limits.

shared arrays batching W @ X
Workload 1

k-means iterative pressure.

Same validation. Vary scale.

Method visualization

K-means movement.

Form clusters. Find centers.

k-means animation poster frame

Initialize. Then update.

1. Fixed Start. Three initial centroids.
2. One Lloyd run
Assign Nearest centroid.
Update Assigned mean.
Repeat until stable.
3. Stop. Assignments stop changing.

Updates depend on previous state.

Labels are just context.

Why k-means works here

K-means is a useful test.

Exposes broad iterative pressures.

Assignment-update shape.

assign Nearest centroid.
update Recompute centroids.
repeat Stop when stable.

Repeated distance work.

N × K × d

samples × clusters × dimensions.

Sequential state visibility.

Validation contract.

Same data.
Same initialization.
Same stop.
Same objective.

Harder scenarios.

Overlap.
Imbalance.
Outliers.
High dimension.

Bottleneck shapes.

Loops.
Algebra.
Arrays.

Why it generalizes.

Broad iterative pressure.

Pressure: N × K × d per iteration.
Simulation result before speed result

Did k-means recover groups?

Simulation provides the truth.

ARI recovery score. 1 = perfect. 0 ≈ random baseline. Can be negative for worse-than-chance agreement.
Scenario stressors. Separation. Dimension. Outliers. Imbalance.
Before timing. Speed fails without recovery.
large k-means ARI recovery heatmap by scenario
Scenario difficulty summary.
Reference gate

Compare against reference.

Time only after matching.

Same data.
Same initialization.
Same stop.
Readable reference.Executable oracle.
Numba. NumPy. JAX.Must preserve results.
Compare inertia + labels.Stay below tolerance.

main check

3.1e-14

Max inertia diff.

Label agreement: ARI vs reference = 1.0.

tolerance

1e-8

Skip memory-risk rows explicitly.

Implementation shape

Three ways to compute.

Same validation. Three expressions.

NumPy

Dense matrix algebra.

x2 = sumsq(X)[:, None]
c2 = sumsq(C)[None, :]

dist2 = x2 + c2 - 2 * (X @ C.T)

labels = argmin(dist2, axis=1)

Shape: BLAS-backed algebra.

Numba

Keep explicit loops.

@njit(parallel=True)
for i in prange(n):
    best = inf

    for k in range(K):
        dist = sqdist(X[i], C[k])
        if dist < best:
            best = dist
            labels[i] = k

Shape: CPU loop-heavy.

JAX / GPU

Array expression on device.

@jax.jit
def assign(X, C):
    x2 = sumsq(X)[:, None]
    c2 = sumsq(C)[None, :]

    dist2 = x2 + c2 - 2 * (X @ C.T)
    return jnp.argmin(dist2, axis=1)

Shape: Batch for accelerators.

Match expression to bottleneck shape.
Scale tier

Engine depends on shape.

Read shapes, not rankings.

Scalar-heavy loops.

N=100k, d=10, K=5

Numba
0.09s
NumPy
0.26s
JAX / GPU
0.42s

Numba fits loop-heavy work.

Dense distance algebra.

N=1M, d=256, K=50

Numba
65s
NumPy
32s
JAX / GPU
37s

NumPy fits dense algebra.

Massive regular batches.

large, regular array work

CPU paths
base
JAX / GPU
faster

JAX GPU fits large regular batches.

Shape determines the tool.
Workload 2

Permutations. Resampling inference pressure.

Expensive with many features.

Method visualization

Permutation test scaling.

Shuffle labels. Build null.

permutation animation poster frame

One resampling loop.

  1. Shuffle labels.
  2. Compute group differences.
  3. Repeat for null.

Independent repeats create work.

Code shape

Same math. New shape.

Rewrite the repetition.

Reference loop.

for r in range(R):
    labels_r = same_permutation_stream[r]
    stat[r, :] = group_difference(X, labels_r)

Optimized path.

for W_batch in contrast_batches(
        same_permutation_stream):
    T = W_batch @ X
    counts += abs(T) >= abs(T_obs)
W_batch: batch_R × n X: n × p T: batch_R × p

Stream avoids R × p matrix.

Correctness gate 1

Equivalence: loop vs. matrix.

Verify before timing.

Same inputs.

Same data.
Same stream.
Same definition.

Reference.

Loop formulation.
One by one.

Optimized path.

Matrix formulation.
Batched computation.

validation result

PASS

validation grid

18 × 5

workloads × seeds

0.0 max |p diff|
9.4e-16 max |stat diff|
1e-6 validation tolerance

Acceptance criteria

P-values agree exactly.

Stat differences are roundoff dust.

Optimized path approved.

Same result. Faster code.
CORRECTNESS GATE 2

Null calibration check.

Type-I error near 0.05.

Why calibrate?Checks inference behavior.
Observed check.Error estimate ≈ 0.051.
Before scaling.Acceptable for scaling.
calibrated
0.051Type-I error estimate.
Calibration ruler
feature-level binomial band 0.036-0.064
0.0250.0350.0450.0550.0650.075
Dashed: Nominal alpha 0.050. Dot: Observed mean 0.051.

Band is per-feature binomial variation, not CI for the mean.

Calibration checks inference.

Equivalence: code. Calibration: inference.
LOCAL SCALE SIGNAL

When does runtime bend?

p and R reveal bottlenecks.

Same statistic.

Same stream and definition.

equivalence passed

Local runtime.

MacBook. Batched NumPy. n=500.

warm median

Scale question.

When does runtime explode?

local runtime scaling
Local permutation runtime scaling signal by p features and R permutations

MacBook Air local tier · NumPy batched matrix · n=500 · warm median. Shape signal. Not leaderboard.

Local runtime motivates server/GPU.
DECISION MAP

When do GPUs help?

Requires massive R × p.

ScopeSpeedup: CPU vs. GPU e2e.
Explicit OOM cells.Unavailable for GPUs.
GPU permutation decision map by p features and R permutations
Validation and batching unlock GPUs.

Transfer included; compile excluded. Not a leaderboard.

PARALLELISM

Parallelism helps. Then flattens.

Measure time and marginal gain.

Total time drops. Parallelism helped early. runtime
Returns flatten. Gains shrink later. marginal gain
Memory matters. Workers consume memory. memory
CPU parallelism sweep showing total time and diminishing marginal runtime gains
Capture early speedup.

Execution model matters next.

EXECUTION MODEL

No-GIL changes the process-vs-thread tradeoff.

CPU-bound Python permutation/resampling work over shared in-memory data.

Runtime scaling for standard GIL threads and free-threaded threads across ThreadPoolExecutor workers
Standard GIL builds ~1.02× 1 → 16 workers py314 stays flat
CPython 3.14t / free-threaded 4.77× 1 → 16 workers threads scale this workload
Execution-model payoff 2.43× faster 32% less peak RSS ThreadPool vs ProcessPool

The claim is not that Python is automatically faster; it is that free-threaded CPython makes threads a plausible shared-data parallel strategy for this workload shape.

Connecting the tools

What each tool solved.

Match tool to shape.

Numba

Bottleneck:
CPU loops.

Solved: K-means assignment.

NumPy / BLAS

Bottleneck:
dense algebra.

Solved: Distance identity.

Threads / workers

Bottleneck:
shared data repeats.

Solved: Permutation workers.

JAX / GPU

Bottleneck:
device-resident batches.

Solved: Streamed W @ X.

Same statistic. Different tool.
DECISION GUIDE

Evidence to tool.

Follow bottlenecks, not the leaderboard.

1Simulate
2Validate
3Find the bottleneck
4Pick the suitable tool

Hot loop

SignalLoop dominates.
LimitPython overhead.
MoveNumba.
k-means loops

Matrix algebra

SignalMatrix operations.
LimitDense kernel.
MoveNumPy / BLAS.
distance identity

Repeat work

SignalRepeats over data.
LimitScheduling / memory.
MoveThreads or workers.
permutation workers

Large streamed batch

SignalBatches fit device.
LimitRegular array work.
MoveJAX / GPU.
GPU batch
Validate first. Move bottlenecks.
AI coding tools

AI scales the workflow. We own the claims.

AI automation.

  • Implementation variants.
  • Scenario runners.
  • Metadata capture.
  • Plot regeneration.
  • Result manifests.
AI assistant automating experiment tasks

Scientific judgment.

  • Statistical target.
  • Data assumptions.
  • Validation criteria.
  • Hard-case interpretation.
  • Scientific claims.
Statistician making scientific decisions

AI automates. Humans claim.

Close

Make the statistic testable.
Then make the bottleneck fast.

Clear enough to validate. Fast enough to scale.

Thank you

Questions?

Slides. Reproducible artifacts.

github.com/LucaJiang/FastStatisticalModels4Python

QR code for slides and reproducible artifacts repository

Slides. Reproducible artifacts.

Backup: Evidence scope.

Evidence map. Different questions.

Why three environments?

MacBook = validation.

Equivalence. Calibration. Local shape.

Server = scale. Parallelism.

Shape scaling. Memory sweeps.

GPU = accelerator pipeline.

Break-even evidence. Streamed reduction.

Detailed metrics in repository.

Backup: Local validation.

Validation inventory.

Code. Inference. Runtime shape.

Local permutation validation inventory with equivalence, null calibration, and runtime scaling
Backup: K-means loops.

Numba wins loop shapes.

Q&A scope

Question: When does Numba win?

Local tier: MacBook runs.

Answer: When explicit loops dominate.

k-means shape stress
Loops favor Numba. Algebra favors BLAS.
Backup: Statistical power.

Power vs. effect size.

Do we detect strong signals?

permutation power curve showing stronger simulated signals are easier to detect
Strong signals yield high power.
Backup: batch_R=8192.
BACKUP: GPU TUNING

Backup: what batch_R tuning showed

Larger batches amortized overhead; 8192 was fastest only within this measured grid.

A100 batch_R tuning with full end-to-end time and CPU over A100 speedup

Timing improved most at small batch_R; after 512-1024 the curve mostly flattened. 8192 is the fastest accepted value measured here.

Definition batch_R = permutation rows processed in one A100 batch.
Trend Small batches were slower; larger batches reduced stream overhead.
Diminishing returns Most of the gain appeared by 512-1024; later points were nearly flat.
Claim boundary 8192 was fastest in this measured grid. Larger values were not tested here.
batch_R tunes pipeline overhead; it does not change the statistical question.
Backup: AI workflow.

What the agent automated.

Experiment orchestration.

Runners. Scenarios. Manifests.

Visualization surface.

Regenerated figures from CSVs.

Repository surface.

Commands. Summaries. QA notes.

AI automated. Humans claimed.
Backup: standard GIL vs free-threaded

Scaling evidence.

Speedup versus each interpreter's own one-worker baseline

Compare scaling, not base speed.

ThreadPoolExecutor workers

Build 1 worker 16 workers Ratio
py314 standard GIL 0.584s 0.574s 1.02×
CPython 3.14t free-threaded 0.565s 0.119s 4.77×

CPython 3.14 answers the standard-GIL baseline; CPython 3.14t answers the free-threaded execution-model question.

ThreadPool vs ProcessPool runtime
py314 ProcessPool0.571s 3.14t ThreadPool0.235s
2.43× faster
ThreadPool vs ProcessPool peak RSS
py314 ProcessPool1.017 GiB 3.14t ThreadPool0.692 GiB
32% lower

ThreadPool shares dataset.