what to estimate, recover, preserve
Breaking the Speed Limit
Fast statistical models with Python 3.14, Numba, and JAX
From a statistician's workflow: validate first, then accelerate the bottleneck.
Make the statistical task testable; then accelerate the measured bottleneck.
Simulation Makes Correctness Testable
Real data rarely comes with known ground truth for judging how a method behaves. Simulation creates a controlled setting in which power, calibration, and agreement can be tested.
null, alternative, hard or extreme cases
readable Python/NumPy implementation
power, calibration, agreement
data flow, loops, algebra, repetition, memory
suitable for your task
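The workflow above (simulate scenarios, run a readable reference, check power and calibration) can be sketched in NumPy. This is a minimal illustration, not the deck's own code: the two-sample permutation test, the function names `perm_pvalue` and `rejection_rate`, and all sample sizes and effect sizes are assumptions chosen for the example.

```python
import numpy as np

def perm_pvalue(x, y, n_perm, rng):
    """Two-sided permutation p-value for a difference in means (illustrative)."""
    pooled = np.concatenate([x, y])
    n_x = x.size
    observed = abs(x.mean() - y.mean())
    # Each row is an independent shuffle of the pooled sample.
    shuffled = rng.permuted(np.tile(pooled, (n_perm, 1)), axis=1)
    diffs = np.abs(shuffled[:, :n_x].mean(axis=1) - shuffled[:, n_x:].mean(axis=1))
    # The +1 terms count the observed statistic among the permutations.
    return (1 + np.sum(diffs >= observed)) / (1 + n_perm)

def rejection_rate(effect, n_sim=200, n=20, n_perm=199, alpha=0.05, seed=0):
    """Share of simulated datasets in which the test rejects at level alpha."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sim):
        x = rng.normal(0.0, 1.0, n)
        y = rng.normal(effect, 1.0, n)  # simulated truth: known mean shift
        rejections += int(perm_pvalue(x, y, n_perm, rng) <= alpha)
    return rejections / n_sim

null_rate = rejection_rate(effect=0.0)  # calibration: should sit near alpha
power = rejection_rate(effect=1.0)      # power: should be well above alpha
print(f"null rejection rate={null_rate:.3f}, power={power:.3f}")
```

Because the truth is simulated, both numbers have a known target, so any accelerated version can be held to the same check before its timing is trusted.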
Decisions Before Timing
| Question | Decision before measuring speed |
|---|---|
| Statistical target | estimator, test statistic, or stopping criterion |
| Simulated truth | labels, effect size, or nominal alpha |
| Acceptance criteria | exact match, numerical tolerance, or calibration |
| Failure mode | code bug, method breakdown, or out-of-memory |
| Timing scope | compile, transfer, or allocation |
| Cost drivers | samples (n), features (p), or repeats (R) |
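The timing-scope row can be made concrete with a small harness that reports the first call separately from steady-state calls. This is a sketch using only the standard library; `time_call` and `workload` are hypothetical names, and the workload is a stand-in for whatever function is being measured.

```python
import statistics
import time

def time_call(fn, *args, repeats=5):
    """Return (first_call_seconds, median_steady_state_seconds)."""
    start = time.perf_counter()
    fn(*args)
    first = time.perf_counter() - start  # may include one-off costs such as compilation

    steady = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        steady.append(time.perf_counter() - start)
    return first, statistics.median(steady)

def workload(n):
    # Stand-in for the function under test.
    return sum(i * i for i in range(n))

first, steady = time_call(workload, 100_000)
print(f"first call: {first:.6f}s, steady-state median: {steady:.6f}s")
```

With a JIT-compiled function the first call typically carries the compile cost, so the two numbers answer different questions. When timing JAX, the result should also be forced (for example with `block_until_ready()`) inside the timed region, because dispatch is asynchronous.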
Validation Checks Passed
Speedup results verified using these criteria:
AI for Scale, Statistician for Science
- Algorithmic implementation
- Scenario-grid orchestration
- Automated metadata capture
- Visual synthesis of results
- Curation of research artifacts
AI assistant automating experiment tasks
- Defining estimands & objectives
- Validating distributional assumptions
- Setting validation criteria
- Establishing rigorous benchmarks
- Deciphering edge cases & outliers
- Authoring scientific conclusions
Statistician making scientific decisions
Two Workloads, Two Bottleneck Patterns
K-means = Iterative Fitting
Permutation = Resampling Inference
Performance Depends on Workload Structure
K-means: Workload Shape Determines the Tool
- Scalar-heavy loops: Numba excels via LLVM-compiled execution.
- Dense algebra: NumPy dominates via optimized BLAS kernels.
- Massive data batches: JAX/GPU wins on SIMT throughput.
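To illustrate the first two workload shapes, the k-means assignment step can be written both ways. This is a sketch under assumptions: `assign_loop` and `assign_dense` are hypothetical names, and the toy data are made up for the agreement check. The loop form is the scalar-heavy shape; the dense form rewrites squared distance as ‖x‖² − 2x·c + ‖c‖² so that the cross term becomes one matrix product.

```python
import numpy as np

def assign_loop(X, centers):
    """Readable reference: scalar loops over points, clusters, and features."""
    n, p = X.shape
    k = centers.shape[0]
    labels = np.empty(n, dtype=np.int64)
    for i in range(n):
        best, best_d = 0, np.inf
        for j in range(k):
            d = 0.0
            for f in range(p):
                diff = X[i, f] - centers[j, f]
                d += diff * diff
            if d < best_d:
                best, best_d = j, d
        labels[i] = best
    return labels

def assign_dense(X, centers):
    """Same step as dense algebra; the matrix product is handed to BLAS."""
    cross = X @ centers.T                        # (n, k)
    x_sq = (X * X).sum(axis=1)[:, None]          # (n, 1)
    c_sq = (centers * centers).sum(axis=1)[None, :]  # (1, k)
    return (x_sq - 2.0 * cross + c_sq).argmin(axis=1)

# Simulated truth: well-separated clusters, so the correct labels are known.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
truth = rng.integers(0, 3, size=150)
X = centers[truth] + rng.normal(0.0, 0.5, size=(150, 2))

assert np.array_equal(assign_loop(X, centers), truth)
assert np.array_equal(assign_dense(X, centers), truth)
```

The two versions agree on labels here, but the dense form computes distances in a different floating-point order, so for distances themselves a numerical tolerance is the appropriate acceptance criterion rather than an exact match.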
Permutation: GPU Scaling via Vectorized Batching
- Infrastructure Requirement: High-concurrency batching and fused reductions
- Observed Speedup: up to 8.54× at n=5k, p=500k, R=5k, comparing an A100 streamed end-to-end path against a matched CPU matrix baseline.
- Operational Boundaries: Explicit OOM profiling for high-dimensional tensors
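The batching idea can be sketched on the CPU in NumPy; the same structure would carry over to an accelerator array library. This is an illustration under assumptions, not the benchmarked implementation: `streamed_perm_pvalues`, the difference-in-means statistic, and the toy sizes are all hypothetical. Permutations are processed in batches so that only a `(batch_size, p)` block of statistics exists at any time, and each batch is reduced to per-feature counts before the next one is generated.

```python
import numpy as np

def streamed_perm_pvalues(X, group, n_perm, batch_size, seed=0):
    """Per-feature two-sided permutation p-values for a difference in means.

    X: (n, p) data matrix; group: (n,) array of 0/1 labels.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    g = group.astype(X.dtype)
    n1 = g.sum()
    n0 = n - n1
    total = X.sum(axis=0)

    def mean_diff(G):
        # G: (b, n) rows of 0/1 labels; one matmul gives all group-1 sums.
        s1 = G @ X
        return s1 / n1 - (total - s1) / n0

    observed = np.abs(mean_diff(g[None, :]))[0]
    exceed = np.zeros(p, dtype=np.int64)
    done = 0
    while done < n_perm:
        b = min(batch_size, n_perm - done)
        # Each row is an independent shuffle of the label vector.
        G = rng.permuted(np.tile(g, (b, 1)), axis=1)
        stats = np.abs(mean_diff(G))               # (b, p) block only
        exceed += (stats >= observed).sum(axis=0)  # reduce within the batch
        done += b
    return (1 + exceed) / (1 + n_perm)

# Toy check with simulated truth: only feature 0 carries a real shift.
rng = np.random.default_rng(1)
group = np.repeat([0, 1], 20)
X = rng.normal(size=(40, 6))
X[group == 1, 0] += 3.0
pvals = streamed_perm_pvalues(X, group, n_perm=499, batch_size=100)
print(pvals)
```

Peak memory for the statistics scales with `batch_size * p` rather than `R * p`, which is the quantity to profile when looking for the out-of-memory boundary.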
Diagnostic & Intervention
Diagnostic Matrix: Profiling Signature vs. Solution
Validated Tool Mapping
Aligning identified computational bottlenecks with targeted software frameworks, independent of hardware benchmarking.