Breaking the Speed Limit

Fast statistical models with Python 3.14, Numba, and JAX

From a statistician's workflow: validate first, then accelerate the bottleneck.

Wenxin Jiang  ·  Jian Yin

PyCon US 2026

Make the statistical task testable; then accelerate the measured bottleneck.

1. Simulate
2. Validate
3. Find bottleneck
4. Pick tools

Simulation Makes Correctness Testable

Real data often lacks known ground truth for method behavior. Simulation creates a controlled setting to test power, calibration, and agreement.

01 Define Target: what to estimate, recover, or preserve
02 Build Scenarios: null, alternative, and hard or extreme cases
03 Establish Reference: readable Python/NumPy implementation
04 Verify Behavior: power, calibration, agreement
05 Analyze Workload: data flow, loops, algebra, repetition, memory
06 Select Tool: the simplest tool suited to the task

Rule: Speedups are trusted only after validation passes.
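The first three steps can be sketched as follows. This is a minimal illustration, assuming a two-group mean-shift scenario; the names `simulate_two_group` and `reference_stat` and the effect sizes are hypothetical, not taken from the talk.

```python
import numpy as np

def simulate_two_group(n, p, effect, seed):
    """Simulate n samples x p features with a known mean shift in feature 0."""
    rng = np.random.default_rng(seed)
    labels = np.repeat([0, 1], n // 2)      # known group labels (simulated truth)
    X = rng.standard_normal((labels.size, p))
    X[labels == 1, 0] += effect             # only feature 0 carries signal
    return X, labels

def reference_stat(X, labels):
    """Readable reference: per-feature difference in group means."""
    return X[labels == 1].mean(axis=0) - X[labels == 0].mean(axis=0)

# Scenario grid: null, alternative, and an extreme case (illustrative values).
scenarios = {"null": 0.0, "alternative": 0.5, "extreme": 5.0}
for name, effect in scenarios.items():
    X, labels = simulate_two_group(n=200, p=10, effect=effect, seed=0)
    print(name, round(float(reference_stat(X, labels)[0]), 3))
```

Because the effect is planted, any faster implementation can later be checked against both the known truth and this reference.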

Decisions Before Timing

Question: Decision before measuring speed
Statistical target: estimator, test statistic, or stopping criterion
Simulated truth: labels, effect size, or nominal alpha
Acceptance criteria: exact match, numerical tolerance, or calibration
Failure mode: code bug, method breakdown, or out-of-memory
Timing scope: compile, transfer, allocation
Cost drivers: samples (n), features (p), or repeats (R)
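One way to make the timing scope explicit is to report the first call separately from later calls. The helper below is a hypothetical sketch using only the standard library. With asynchronously dispatched frameworks such as JAX, the timed call would also need to wait for the result (for example via `block_until_ready()`), which this sketch omits.

```python
import time

def time_cold_and_warm(fn, *args, repeats=5):
    """Return (cold_seconds, best_warm_seconds) for fn(*args)."""
    t0 = time.perf_counter()
    fn(*args)                       # first call: includes any compile or allocation cost
    cold = time.perf_counter() - t0
    warm = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        warm.append(time.perf_counter() - t0)
    return cold, min(warm)

cold, warm = time_cold_and_warm(sum, range(100_000))
print(f"cold={cold:.6f}s warm={warm:.6f}s")
```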

Validation Checks Passed

Speedup results were verified against these criteria:

K-means agreement: max relative inertia difference well below the tolerance.
Permutation equivalence: same test statistic, same p-value definition, same resampling stream; max |p diff| < tolerance.
Statistic difference: max |stat diff| < tolerance across the validation grid.
Null calibration: estimated type-I error within MC uncertainty of nominal α = 0.05.
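The null calibration check can be sketched with the binomial standard error of a rejection rate. The function name, the z = 3 band, and the example counts are illustrative assumptions, not the thresholds or results used in the talk.

```python
import math

def within_mc_uncertainty(rejections, n_sims, alpha=0.05, z=3.0):
    """Compare the estimated type-I error with nominal alpha.

    Uses the binomial standard error sqrt(alpha * (1 - alpha) / n_sims).
    The z = 3 band is an illustrative choice.
    """
    est = rejections / n_sims
    se = math.sqrt(alpha * (1 - alpha) / n_sims)
    return est, se, abs(est - alpha) <= z * se

# Made-up counts for illustration only, not measured results.
est, se, ok = within_mc_uncertainty(rejections=108, n_sims=2000)
print(f"type-I error = {est:.4f}, MC s.e. = {se:.4f}, calibrated = {ok}")
```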

AI for Scale, Statistician for Science

AI Automates Execution
  • Algorithmic implementation
  • Scenario-grid orchestration
  • Automated metadata capture
  • Visual synthesis of results
  • Curation of research artifacts
Figure: AI assistant automating experiment tasks

Statistician Owns Inference
  • Defining estimands & objectives
  • Validating distributional assumptions
  • Setting validation criteria
  • Establishing rigorous benchmarks
  • Deciphering edge cases & outliers
  • Authoring scientific conclusions
Figure: Statistician making scientific decisions


Two Workloads, Two Bottleneck Patterns

K-means = Iterative Fitting

Figure: k-means iterative fitting
Sequential state · Pairwise metrics · Memory allocations

Permutation = Resampling Inference

Figure: permutation-test workflow schematic
Shared-memory concurrency · Vectorized resampling · Linear algebra
Validation criteria:
K-means: match cluster assignments by fixing seeds and convergence thresholds.
Permutation: same W, same statistic, same p-value formula.
Validate the Statistic, Then Accelerate. Speed only counts when the statistical target is preserved.
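The permutation criterion can be illustrated with a small equivalence check. This sketch assumes W is an R x n matrix whose rows are permuted signed group weights, that the statistic is a difference in group means, and that the p-value is (1 + exceedances) / (1 + R). All three are assumptions for illustration, since the slides do not spell them out.

```python
import numpy as np

def group_weights(labels):
    """Signed weights so that w @ X is the difference in group means."""
    n1 = labels.sum()
    n0 = labels.size - n1
    return np.where(labels == 1, 1.0 / n1, -1.0 / n0)

def matrix_pvalues(X, labels, n_perm=999, seed=0):
    """Vectorized path: all permuted statistics from one product W @ X."""
    rng = np.random.default_rng(seed)
    w = group_weights(labels)
    observed = w @ X
    W = np.stack([rng.permutation(w) for _ in range(n_perm)])  # shape (R, n)
    exceed = (np.abs(W @ X) >= np.abs(observed)).sum(axis=0)
    return (1 + exceed) / (1 + n_perm), W

def loop_pvalues(X, labels, W):
    """Reference path: same W, statistic recomputed one permutation at a time."""
    observed = X[labels == 1].mean(axis=0) - X[labels == 0].mean(axis=0)
    exceed = np.zeros(X.shape[1], dtype=np.int64)
    for row in W:
        perm = row > 0                      # permuted group membership
        stat = X[perm].mean(axis=0) - X[~perm].mean(axis=0)
        exceed += np.abs(stat) >= np.abs(observed)
    return (1 + exceed) / (1 + W.shape[0])

rng = np.random.default_rng(1)
labels = np.repeat([0, 1], 20)
X = rng.standard_normal((40, 5))
X[labels == 1, 0] += 2.0                    # planted effect in feature 0
p_fast, W = matrix_pvalues(X, labels)
p_ref = loop_pvalues(X, labels, W)
print("max |p diff| =", np.max(np.abs(p_fast - p_ref)))
```

Reusing the same W in both paths is what makes the comparison a test of the implementation rather than of the random draws.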

Performance Depends on Workload Structure

K-means: Workload Shape Determines the Tool

  • Scalar-heavy loops: Numba excels via LLVM-compiled execution.
  • Dense algebra: NumPy dominates via optimized BLAS kernels.
  • Massive data batches: JAX/GPU wins on SIMT throughput.
Figure: server k-means evidence that implementation choice depends on workload structure
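To make the first two workload shapes concrete, here is a hedged sketch of one k-means assignment step written both ways. The explicit loop is the kind of code a JIT compiler such as Numba targets; the dense version uses the identity ||x - c||² = ||x||² - 2 x·c + ||c||² so that the bulk of the work is one matrix product. Function names and sizes are illustrative.

```python
import numpy as np

def assign_loop(X, C):
    """Reference: explicit loops over points and centroids."""
    labels = np.empty(X.shape[0], dtype=np.int64)
    inertia = 0.0
    for i, x in enumerate(X):
        best, best_d = 0, np.inf
        for k, c in enumerate(C):
            d = float(((x - c) ** 2).sum())
            if d < best_d:
                best, best_d = k, d
        labels[i] = best
        inertia += best_d
    return labels, inertia

def assign_dense(X, C):
    """Dense algebra: squared distances via the expansion around X @ C.T."""
    d2 = (X**2).sum(axis=1)[:, None] - 2.0 * (X @ C.T) + (C**2).sum(axis=1)[None, :]
    d2 = np.maximum(d2, 0.0)                # clip tiny negatives from cancellation
    labels = d2.argmin(axis=1)
    return labels, float(d2[np.arange(X.shape[0]), labels].sum())

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 8))
C = X[rng.choice(500, size=4, replace=False)]
labels_loop, inertia_loop = assign_loop(X, C)
labels_dense, inertia_dense = assign_dense(X, C)
rel_diff = abs(inertia_loop - inertia_dense) / inertia_loop
print("labels agree:", np.array_equal(labels_loop, labels_dense), "rel. inertia diff:", rel_diff)
```

The two paths round differently, which is why agreement is judged by a relative inertia tolerance rather than bitwise equality.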

Permutation: GPU Scaling via Vectorized Batching

  • Infrastructure Requirement: High-concurrency batching and fused reductions
  • Observed Speedup: up to 8.54× for n=5k, p=500k, R=5k, comparing the A100 streamed end-to-end path against a matched CPU matrix baseline.
  • Operational Boundaries: Explicit OOM profiling for high-dimensional tensors
Figure: GPU permutation decision map
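A rough sketch of why streaming matters at this scale: a full R x p matrix of permuted statistics for R=5k and p=500k would need roughly 18.6 GiB in float64, so one option is to process features in blocks and keep only the exceedance counts. The NumPy version below is a CPU illustration of that idea under assumed names and chunk size, not the talk's GPU implementation.

```python
import numpy as np

def null_matrix_gib(R, p, itemsize=8):
    """Memory needed to hold the full R x p matrix of permuted statistics."""
    return R * p * itemsize / 2**30

def streamed_exceedances(W, X, observed, chunk=1024):
    """Count |W @ X| >= |observed| per feature, one feature block at a time."""
    p = X.shape[1]
    exceed = np.empty(p, dtype=np.int64)
    for start in range(0, p, chunk):
        stop = min(start + chunk, p)
        null = W @ X[:, start:stop]         # only an R x chunk block is alive
        exceed[start:stop] = (np.abs(null) >= np.abs(observed[start:stop])).sum(axis=0)
    return exceed

print(f"{null_matrix_gib(5_000, 500_000):.1f} GiB for the unstreamed result")
```

The block reduces each chunk to counts before moving on, so peak memory scales with the chunk size rather than with p.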

Diagnostic & Intervention

Diagnostic Matrix: Profiling Signature vs. Solution

Validation Failure: Audit target & numerical stability
Scalar-Heavy Loops: JIT Compilation (Numba)
Dense Matrix Operations: Vectorized BLAS (NumPy)
Embarrassingly Parallel Tasks: Shared-Memory Threading
Massive Tensor Batches: SIMT Acceleration (JAX/GPU)
Bandwidth-Bound Workload: Fused & Streamed Reductions

Validated Tool Mapping

Aligning identified computational bottlenecks with targeted software frameworks, independent of hardware benchmarking.

Tool · Bottleneck · Workload
Numba · Scalar-Heavy CPU Loops · K-means Assignment & Update
NumPy / BLAS · Dense Matrix Algebra · Distance Identity & W @ X (CPU)
Thread Pools · Concurrent Shared-Memory Tasks · Permutation Worker Sweeps
JAX / GPU · Massive Tensor Batching · Streamed W @ X (Post-Validation)
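The thread-pool row can be sketched as follows: each worker draws its own block of permutations from an independent random stream and returns exceedance counts, which are then summed. Function names, the worker count, and the p-value formula are assumptions for illustration. How much threads actually help depends on the interpreter build (for example a free-threaded build) and on how much of the work releases the GIL.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def sweep(seed_seq, w, X, observed, n_perm):
    """One worker: draw its own permutations and count exceedances."""
    rng = np.random.default_rng(seed_seq)
    W = np.stack([rng.permutation(w) for _ in range(n_perm)])
    return (np.abs(W @ X) >= np.abs(observed)).sum(axis=0)

def threaded_pvalues(X, w, n_perm=1000, n_workers=4, seed=0):
    observed = w @ X
    # Independent child streams, so results do not depend on thread scheduling.
    streams = np.random.SeedSequence(seed).spawn(n_workers)
    per_worker = n_perm // n_workers
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(sweep, s, w, X, observed, per_worker) for s in streams]
        exceed = sum(f.result() for f in futures)
    total = per_worker * n_workers          # permutations actually drawn
    return (1 + exceed) / (1 + total)

rng = np.random.default_rng(2)
X = rng.standard_normal((40, 6))
w = np.where(np.arange(40) < 20, 1 / 20, -1 / 20)   # signed group weights
p = threaded_pvalues(X, w)
print(p)
```

Spawning one seed stream per worker keeps the run reproducible, which is what allows the threaded path to be validated against a single-threaded one.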

Standardized Reporting

Validation Criteria: Explicit statistical targets and acceptance thresholds
Compute Environment: Local prototyping vs. server-grade CPU / GPU
Execution State: Transparent reporting of cold starts, warm runs, and JIT compilation
Memory Overhead: Accounting for data transfer, allocation, and garbage collection
Failure Modes: Explicit documentation of OOM events and hardware incompatibilities
Tool Parsimony: Deploying the simplest framework that preserves the target statistic
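One possible way to capture these fields is a small record serialized next to each benchmark run. The class name, field names, and values below are hypothetical placeholders, not a format prescribed by the talk.

```python
from dataclasses import asdict, dataclass, field
from typing import Optional
import json
import platform

@dataclass
class RunReport:
    workload: str
    tool: str
    acceptance_criterion: str
    validation_passed: bool
    environment: str
    cold_seconds: float
    warm_seconds: float
    includes_transfer: bool
    failure_mode: Optional[str] = None      # e.g. "OOM" when a run fails
    python_version: str = field(default_factory=platform.python_version)

report = RunReport(
    workload="permutation test",
    tool="NumPy",
    acceptance_criterion="max |p diff| < tolerance",
    validation_passed=True,
    environment="local prototype",
    cold_seconds=0.0,                       # placeholder, not a measurement
    warm_seconds=0.0,                       # placeholder, not a measurement
    includes_transfer=False,
)
print(json.dumps(asdict(report), indent=2))
```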