PyCon US 2026

Breaking the Speed Limit

Statistician’s workflow. Fast models.

Wenxin Jiang, Jian Yin | Biostatistics, CityU HK

Target first. Speed later.Preserve statistical results.

Simulate for control.Define baseline behavior.

Python orchestrates execution.Match tools to bottlenecks.

Welcome everyone. I am Wenxin, And this is my colleague Jian. We are PhD students at City University of Hong Kong. Our research focuses on developing statistical methods and software for large-scale genetic data analysis. But, fortunately, we are not here to talk about genetics or mathematics. Today, we are discussing how to safely push the speed limits of statistical models in Python. In fields like genetics, speed is absolutely critical. The human genome has millions of variants. When we test their relationships to a disease, we execute billions of calculations. Without optimization, a single experiment can take months. If we can speed up the code by 10x, we can get results in several days. So, we must write fast code. But there is a strict rule: we cannot sacrifice statistical validity. If an optimized engine slightly changes the math, we are simply producing the wrong scientific conclusion faster. Today, we will explore how to prevent that.

Why this talk

Speed follows validation

Specify. Execute. Verify.

Target First

Define core statistic.

Executable Reference

Build matchable baseline.

Conditional Speed

Answer identical question.

Speed only counts when the math is preserved.

Simulation as testing

Simulation creates an answer key

Real-world data doesn't have ground truth.

01

Statistical target What to estimate?

02

Simulation scenarios Edge cases. Nulls.

03

Reference implementation Executable baseline.

validate before timing

04

Validation checks Prove equivalence.

05

Bottleneck shape Identify the limit.

06

Acceleration Apply smallest tool.

→

Because real-world data does not have ground truth, if we simply run our code on it and we cannot check if the results are correct. Instead, we use simulation to create synthetic datasets where we know the underlying parameters and relationships. By designing simulation scenarios, we can ensure that our implementation behaves correctly across a range of inputs. This allows us to test whether our accelerated code is producing the correct results compared to the known truth and to our baseline implementation. This process of validating before timing ensures that we are optimizing code that is already correct, rather than optimizing code that may have bugs. Also, by using simulation, we can identify the bottleneck shapes in our code, which informs us on how to apply the appropriate optimization techniques.

Programmer bridge

Simulation turns statistical behavior into tests

Expected behavior is statistical.

How simulation maps to software testing

Unit tests

Fixed-seed equivalence.

Property tests

Null invariants. Calibration.

Stress tests

Outliers. High dimension.

Load tests

Scale samples. Scale features.

Regression tests

Acceptance criteria pass.

I believe most of you are more familiar with software testing. In software engineering, we write various of tests to ensure that our code behaves as expected. This is exactly what we do with simulation in statistics. For example, a unit test in software engineering checks if a specific function produces the expected output for a given input. In statistics, despite the randomness, we can set a fixed random seed and check if our implementation produces the same results as our reference implementation. We also have property tests, which check if certain properties hold under specific conditions. In statistics, this could be checking if our method is invariant under null conditions or if it is well-calibrated. Stress tests in software engineering check how the system behaves under extreme conditions, which is analogous to testing our statistical methods with outliers or in high-dimensional settings. Load tests check how the system performs under heavy load, which corresponds to scaling up the sample sizes in our data. Finally, regression tests ensure that new changes do not break existing functionality, which is similar to ensuring that our optimized code still passes all the acceptance criteria we established with our reference implementation.

Why Python

Python helps us bridge the gap between math and machines.

Keep it readable. Move hotspots to optimized engines.

NumPy referenceExecutable target.

→

Validation checksRecovery. Calibration. Equivalence.

→

Hotspot shapeLoops. Dense algebra. Repetition.

NumPyDense algebra. BLAS-backed.

NumbaCPU numerical loops. No temporaries.

Threads / workersRepeated shared array work.

JAX / GPUBatched array programs.

Python helps us a lot in this process. It allows us to write readable code that closely follows our mathematical formulas. It also has a rich ecosystem of libraries and tools that we can use to optimize specific bottlenecks without having to rewrite our entire codebase. Typically, we start with a NumPy implementation as our reference. We then run our validation checks to ensure that this implementation is correct. We profile the code to identify the hotspots, which are the parts of the code that take the most time. These hotspots often have specific computational shapes, such as numerical loops, dense linear algebra operations, or repeated work on shared arrays. Once we have identified the hotspots in our code, we can choose the appropriate optimization tool. For example, numpy with BLAS, numba, jax, or multi-threading. We will demonstrate this using two workloads: K-means clustering and permutation testing.

Why these two examples

Two workloads, two bottleneck shapes

k-means. Iterative fitting.

Assign. Update. Repeat.

Sequential state dependence.

distance loops temporaries compiled kernels

Permutations. Resampling inference.

Shuffle. Compute. Repeat.

Independent. High-dimensional limits.

shared arrays batching W @ X

Workload 1

k-means iterative pressure.

Same validation. Vary scale.

Method visualization

K-means movement.

Form clusters. Find centers.

Initialize. Then update.

1. Fixed Start. Three initial centroids.

2. One Lloyd run

Assign Nearest centroid.

→

Update Assigned mean.

Repeat until stable.

3. Stop. Assignments stop changing.

Updates depend on previous state.

Labels are just context.

We use the famous iris dataset to demonstrate the k-means algorithm. Each point in the scatter plot represents a flower, where x and y are two measurements of the flower. We want to cluster these points into 3 groups, which ideally correspond to the three species of iris in the dataset. The purpose of K-means is to minimize the objective function value, also known as inertia in K-means, which related to the sum of distances from each point to its assigned centroid. K-means involves several steps: First, we initialize the centroids, which can be done in various ways, but for this example, we will use a fixed initialization to ensure that our results are reproducible. Then we perform one run of algorithm, which consists of two main steps: the assignment step, where we assign each data point to the nearest centroid, and the update step, where we recompute the centroids as the mean of the assigned points. We can see the color of points change as they get reassigned to different clusters, and the centroids move as they get updated. We repeat these steps until convergence, which typically happens when the assignments no longer change. The key point here is that each update depends on the previous state of the centroids and assignments, creating a sequential dependency that can be a bottleneck in terms of performance.

Why k-means works here

K-means is a useful test.

Exposes broad iterative pressures.

Assignment-update shape.

assign Nearest centroid.

→

update Recompute centroids.

→

repeat Stop when stable.

Repeated distance work.

N × K × d

samples × clusters × dimensions.

↻

Sequential state visibility.

Validation contract.

Same data.
Same initialization.
Same stop.
Same objective.

Harder scenarios.

Overlap.
Imbalance.
Outliers.
High dimension.

Bottleneck shapes.

Loops.
Algebra.
Arrays.

Why it generalizes.

Broad iterative pressure.

Pressure: N × K × d per iteration.

K-means is a useful test case for our framework because it exposes broad iterative pressures that are common in many statistical algorithms. The assignment-update shape of K-means creates a bottleneck due to the repeated distance calculations, which have a computational complexity of N × K × d, where N is the number of samples, K is the number of clusters, and d is the data dimensions. We can validate our implementation by ensuring that we are using the same data, the same initialization, the same stopping criteria, and the same objective function. We can also test our implementation under harder scenarios, such as when there is overlap between clusters, when there is imbalance in cluster sizes, when there are outliers, or when we are working in high-dimensional spaces. By identifying the bottleneck shapes in our code, such as loops, algebraic operations, or array manipulations, we can apply the appropriate optimization techniques. The iterative nature of K-means means that these pressures are repeated across iterations, making it a good example to demonstrate our framework for validating and optimizing statistical code.

Simulation result before speed result

Did k-means recover groups?

Simulation provides the truth.

ARI recovery score. 1 = perfect. 0 ≈ random baseline. Can be negative for worse-than-chance agreement.

Scenario stressors. Separation. Dimension. Outliers. Imbalance.

Before timing. Speed fails without recovery.

large k-means ARI recovery heatmap by scenario

Scenario difficulty summary.

Before optimizing, we must ask: did K-means really recover the true clusters? Therefore, we need to validate the method first. In our simulations, we use the Adjusted Rand Index (ARI) as a recovery score, where 1 indicates perfect recovery and 0 indicates a random baseline. We test our implementation under various scenario stressors, such as separation between clusters which related to the overlap of clusters, dimensionality of the data, presence of outliers, and imbalance in cluster sizes. The key point is that we should not accelerate our code until we are satisfied with its performance. In the process of algorithm development, we usually detect what we called "failure modes" where the method does not perform well. In this plot, we find that the method performs poorly under certain scenarios. Although we can optimize our algorithm's performance with statistics, to simplify this presentation, we just skip that steps and assume its ready for optimization. We will directly move on to the validation checks that we use to ensure that our optimized code produces the same results as our reference implementation.

Reference gate

Compare against reference.

Time only after matching.

Same data.

Same initialization.

Same stop.

↓

Readable reference.Executable oracle.

↓

Numba. NumPy. JAX.Must preserve results.

↓

Compare inertia + labels.Stay below tolerance.

main check

3.1e-14

Max inertia diff.

Label agreement: ARI vs reference = 1.0.

tolerance

1e-8

Skip memory-risk rows explicitly.

As previously mentioned, we use same data, same initialization, same stopping criteria, and same objective function to validate our implementation. We already have a reference implementation in NumPy that serves as our executable oracle. When we implement our optimized versions using Numba, JAX, or other tools, we need to ensure that they preserve the results of the reference numpy implementation. To compare the results, we look at both the objective function value and the predicted labels. We set a tolerance level for the difference in objective function, which is 1e-8 (1 times 10 to the minus 8). In our tests, we observed a maximum difference of 3.1e-14 (3.1 times 10 to the minus 14), which is well below our tolerance threshold. Additionally, we check that the label agreement, measured by ARI, is equal to 1.0, indicating perfect agreement between our accelerated implementation and the reference implementation. By passing these checks, we can be confident that our optimized implementations are correct.

Implementation shape

Three ways to compute.

Same validation. Three expressions.

NumPy

Dense matrix algebra.

x2 = sumsq(X)[:, None]
c2 = sumsq(C)[None, :]

dist2 = x2 + c2 - 2 * (X @ C.T)

labels = argmin(dist2, axis=1)

Shape: BLAS-backed algebra.

Numba

Keep explicit loops.

@njit(parallel=True)
for i in prange(n):
    best = inf

    for k in range(K):
        dist = sqdist(X[i], C[k])
        if dist < best:
            best = dist
            labels[i] = k

Shape: CPU loop-heavy.

JAX / GPU

Array expression on device.

@jax.jit
def assign(X, C):
    x2 = sumsq(X)[:, None]
    c2 = sumsq(C)[None, :]

    dist2 = x2 + c2 - 2 * (X @ C.T)
    return jnp.argmin(dist2, axis=1)

Shape: Batch for accelerators.

Match expression to bottleneck shape.

Here are the details of the three implementations for the assignment step of K-means. The first implementation uses NumPy to express the distance calculations as dense matrix algebra. We compute the squared distances using the formula that involves the sum of squares of the data points and centroids, and then we use matrix multiplication to compute the cross term. This approach can be efficiently executed using BLAS libraries, which are optimized for performance on CPUs. The second implementation uses Numba to write loops that compute the distances and find the nearest centroid for each data point. This approach allows us to parallelize the computation across CPU cores, which can be beneficial for smaller datasets or when the distance calculations are not easily expressed as matrix operations. The third implementation uses JAX to write a similar expression to the NumPy version, but it is compiled to run on a GPU for potentially faster execution. JAX allows us to write code that looks like NumPy, but it can be executed on different hardware backends, including GPUs and TPUs. Each of these implementations has its own advantages and trade-offs depending on the specific bottleneck shape.

Scale tier

Engine depends on shape.

Read shapes, not rankings.

Scalar-heavy loops.

N=100k, d=10, K=5

Numba

0.09s

NumPy

0.26s

JAX / GPU

0.42s

Numba fits loop-heavy work.

Dense distance algebra.

N=1M, d=256, K=50

Numba

65s

NumPy

32s

JAX / GPU

37s

NumPy fits dense algebra.

Massive regular batches.

large, regular array work

CPU paths

base

JAX / GPU

faster

JAX GPU fits large regular batches.

Shape determines the tool.

So, which engine is the best choice? It depends on your bottleneck's shape. In the first row, K and d are small. The work is scalar-loop heavy: many simple point-to-centroid comparisons. Here Numba is a natural fit because we can keep the algorithm close to explicit loops, remove Python interpreter overhead, and avoid building a large dense distance matrix. In the second row, the shape changes. With larger dimension and more clusters, the distance computation becomes dense algebra. The BLAS libraries that back NumPy are highly optimized for this kind of work. That is why NumPy can beat numba here. In the third row, the work is large, regular, and batched enough for the accelerator path. JAX on GPU can help, but only because there is enough device-friendly work to transfer data, batching, and pipeline overhead. It is not a free GPU win. So the conclusion is not “Numba is best,” “NumPy is best,” or “GPU is best.” The conclusion is: validate the result first, identify the bottleneck shape, then choose the suitable tool that fits that shape.

Workload 2

Permutations. Resampling inference pressure.

Expensive with many features.

Method visualization

Permutation test scaling.

Shuffle labels. Build null.

One resampling loop.

Shuffle labels.
Compute group differences.
Repeat for null.

Independent repeats create work.

In permutation tests, we repeatedly shuffle the labels of our data and compute the group differences to build a null distribution. This process involves three main steps. Here's our data. Each row is a sample, and each column is a feature. We have two groups A and B, and we want to test if there are significant differences between these groups for each feature. First, we shuffle the labels while keeping the data matrix X unchanged. Second, we compute the group differences for each feature, which gives us a statistic for every feature. Third, we repeat this process many times to build the null distribution. The repeats are independent, but they are not free in terms of computational cost. With R permutations and p features, the amount of work scales as R times p, which can become a performance problem when there are many features or many permutations. The W @ X notation on the figure represents the later implementation view, where we can batch the permutation using matrix operations. But, the underlying statistical question remains the same.

Code shape

Same math. New shape.

Rewrite the repetition.

Reference loop.

for r in range(R):
    labels_r = same_permutation_stream[r]
    stat[r, :] = group_difference(X, labels_r)

Optimized path.

for W_batch in contrast_batches(
        same_permutation_stream):
    T = W_batch @ X
    counts += abs(T) >= abs(T_obs)

W_batch: batch_R × n X: n × p T: batch_R × p

Stream avoids R × p matrix.

Correctness gate 1

Equivalence: loop vs. matrix.

Verify before timing.

Same inputs.

Same data.
Same stream.
Same definition.

Reference.

Loop formulation.
One by one.

Optimized path.

Matrix formulation.
Batched computation.

validation result

PASS

validation grid

18 × 5

workloads × seeds

0.0 max |p diff|

9.4e-16 max |stat diff|

1e-6 validation tolerance

Acceptance criteria

✓

P-values agree exactly.

✓

Stat differences are roundoff dust.

✓

Optimized path approved.

Same result. Faster code.

CORRECTNESS GATE 2

Null calibration check.

Type-I error near 0.05.

Why calibrate?Checks inference behavior.

Observed check.Error estimate ≈ 0.051.

Before scaling.Acceptable for scaling.

calibrated

0.051Type-I error estimate.

Calibration ruler

feature-level binomial band 0.036-0.064

0.0250.0350.0450.0550.0650.075

Dashed: Nominal alpha 0.050. Dot: Observed mean 0.051.

Band is per-feature binomial variation, not CI for the mean.

Calibration checks inference.

Equivalence: code. Calibration: inference.

LOCAL SCALE SIGNAL

When does runtime bend?

p and R reveal bottlenecks.

Same statistic.

Same stream and definition.

equivalence passed

Local runtime.

MacBook. Batched NumPy. n=500.

warm median

Scale question.

When does runtime explode?

local runtime scaling

Local permutation runtime scaling signal by p features and R permutations

MacBook Air local tier · NumPy batched matrix · n=500 · warm median. Shape signal. Not leaderboard.

Local runtime motivates server/GPU.

DECISION MAP

When do GPUs help?

Requires massive R × p.

ScopeSpeedup: CPU vs. GPU e2e.

Explicit OOM cells.Unavailable for GPUs.

GPU permutation decision map by p features and R permutations

Validation and batching unlock GPUs.

Transfer included; compile excluded. Not a leaderboard.

PARALLELISM

Parallelism helps. Then flattens.

Measure time and marginal gain.

Total time drops. Parallelism helped early. runtime

Returns flatten. Gains shrink later. marginal gain

Memory matters. Workers consume memory. memory

CPU parallelism sweep showing total time and diminishing marginal runtime gains

Capture early speedup.

Execution model matters next.

EXECUTION MODEL

No-GIL changes the process-vs-thread tradeoff.

CPU-bound Python permutation/resampling work over shared in-memory data.

Runtime scaling for standard GIL threads and free-threaded threads across ThreadPoolExecutor workers

Standard GIL builds ~1.02× 1 → 16 workers py314 stays flat

CPython 3.14t / free-threaded 4.77× 1 → 16 workers threads scale this workload

Execution-model payoff 2.43× faster 32% less peak RSS ThreadPool vs ProcessPool

The claim is not that Python is automatically faster; it is that free-threaded CPython makes threads a plausible shared-data parallel strategy for this workload shape.

Before choosing Numba, NumPy, or JAX, we can also choose how the work is executed: processes, threads, or accelerator kernels. The workload here is permutation-style resampling: many mostly independent Python tasks running over data that we would like to keep shared in memory. With the standard GIL build, adding ThreadPoolExecutor workers does not give meaningful CPU parallelism. From 1 to 16 workers, the runtime is essentially flat, no improvement. With the free-threaded CPython build, the same thread-based design scales to about 4.77× from 1 to 16 workers. That is the important evidence: the difference is not Python version, it is whether Python threads can execute this CPU-bound work in parallel. This does not mean every Python program gets faster. It works here because the workload has the right shape: mostly independent CPU-bound Python work, shared read-heavy data, and limited cross-thread coordination. That is why the payoff is not only runtime. In the pool comparison, the ThreadPool path is about 2.43× faster than the ProcessPool path and uses about 32% less peak RSS, because it can keep the large in-memory data inside one process instead of paying process, serialization, IPC, and memory overhead. So the takeaway is: free-threaded CPython changes the process-versus-thread tradeoff for this workload shape. Threads become a real option when memory sharing matters, the work is independent enough, and the statistical validation still passes.

Connecting the tools

What each tool solved.

Match tool to shape.

Numba

Bottleneck:
CPU loops.

Solved: K-means assignment.

NumPy / BLAS

Bottleneck:
dense algebra.

Solved: Distance identity.

Threads / workers

Bottleneck:
shared data repeats.

Solved: Permutation workers.

JAX / GPU

Bottleneck:
device-resident batches.

Solved: Streamed W @ X.

Same statistic. Different tool.

DECISION GUIDE

Evidence to tool.

Follow bottlenecks, not the leaderboard.

1Simulate

2Validate

3Find the bottleneck

4Pick the suitable tool

Hot loop

SignalLoop dominates.

LimitPython overhead.

MoveNumba.

k-means loops

Matrix algebra

SignalMatrix operations.

LimitDense kernel.

MoveNumPy / BLAS.

distance identity

Repeat work

SignalRepeats over data.

LimitScheduling / memory.

MoveThreads or workers.

permutation workers

Large streamed batch

SignalBatches fit device.

LimitRegular array work.

MoveJAX / GPU.

GPU batch

Validate first. Move bottlenecks.

This brings us to the final decision guide. The workflow is systematic: simulate, validate, identify the bottleneck, and then choose the suitable tool that preserves the checked result. On a personal note, I think choosing a computational engine is a lot like choosing transportation. I traveled here to LA from Hong Kong. For that distance, I took an airplane; nobody expected me to swim across the Pacific. Then, from LA to Long Beach, I took the Metro. Interestingly, my AI routing app confidently recommended a two-and-a-half-hour bus ride instead. That was a useful reminder: automation can suggest a route, but we still have to understand the constraints. And finally, from my hotel to this venue, I walked about 5 min. For that short local trip, walking had almost zero overhead. Calling a car would not have made it better. Computational bottlenecks work the same way. You do not choose the vehicle first; you look at the shape of the trip. If the bottleneck is a tight local numerical loop—many small point-to-centroid comparisons—Numba is the right vehicle. It keeps the algorithm explicit but removes Python loop overhead. If the bottleneck becomes dense matrix algebra, NumPy is the express train, utilizing optimized linear algebra underneath. If the bottleneck is repeated CPU work over shared arrays, then threads or Python 3.14 can help. And if the workload is massive, regular, and batched enough to amortize transfer overhead, that is when JAX on an A100 becomes the long-haul flight. So the rule is not: always use Numba, always use NumPy, or always use a GPU. The rule is: validate first, understand the shape of the journey, then move the bottleneck with the suitable tool that fits.

AI coding tools

AI scales the workflow. We own the claims.

AI automation.

Implementation variants.
Scenario runners.
Metadata capture.
Plot regeneration.
Result manifests.

AI assistant automating experiment tasks

Scientific judgment.

Statistical target.
Data assumptions.
Validation criteria.
Hard-case interpretation.
Scientific claims.

Statistician making scientific decisions

AI automates. Humans claim.

Close

Make the statistic testable.
Then make the bottleneck fast.

Clear enough to validate. Fast enough to scale.

Thank you

Questions?

Slides. Reproducible artifacts.

github.com/LucaJiang/FastStatisticalModels4Python

QR code for slides and reproducible artifacts repository

Slides. Reproducible artifacts.

Backup: Evidence scope.

Evidence map. Different questions.

Why three environments?

MacBook = validation.

Equivalence. Calibration. Local shape.

Server = scale. Parallelism.

Shape scaling. Memory sweeps.

GPU = accelerator pipeline.

Break-even evidence. Streamed reduction.

Detailed metrics in repository.

Backup: Local validation.

Validation inventory.

Code. Inference. Runtime shape.

Backup: K-means loops.

Numba wins loop shapes.

Q&A scope

Question: When does Numba win?

Local tier: MacBook runs.

Answer: When explicit loops dominate.

Loops favor Numba. Algebra favors BLAS.

Backup: Statistical power.

Power vs. effect size.

Do we detect strong signals?

permutation power curve showing stronger simulated signals are easier to detect

Strong signals yield high power.

Backup: batch_R=8192.

BACKUP: GPU TUNING

Backup: what batch_R tuning showed

Larger batches amortized overhead; 8192 was fastest only within this measured grid.

A100 batch_R tuning with full end-to-end time and CPU over A100 speedup

Timing improved most at small batch_R; after 512-1024 the curve mostly flattened. 8192 is the fastest accepted value measured here.

Definition batch_R = permutation rows processed in one A100 batch.

Trend Small batches were slower; larger batches reduced stream overhead.

Diminishing returns Most of the gain appeared by 512-1024; later points were nearly flat.

Claim boundary 8192 was fastest in this measured grid. Larger values were not tested here.

batch_R tunes pipeline overhead; it does not change the statistical question.

This backup slide is about the trend, not a proof of optimality. batch_R is the number of permutation rows processed in one A100 batch. In this committed sweep, small batches were slower because the streamed path paid more scheduling and loop overhead. Increasing batch_R reduced that overhead, but the gains flattened after roughly 512-1024 and the curve is not perfectly monotone. 8192 was the fastest accepted value in the measured 128-8192 grid, so it was used as the chosen setting for the later A100 decision map. Larger batch_R values were not part of this committed sweep; they could be faster, slower, or fail because of memory or autotuning. batch_R changes scheduling, not the permutation statistic or p-value definition. Full end-to-end timing remains the criterion: compile excluded, transfer included, kernel-only excluded.

Backup: AI workflow.

What the agent automated.

Experiment orchestration.

Runners. Scenarios. Manifests.

Visualization surface.

Regenerated figures from CSVs.

Repository surface.

Commands. Summaries. QA notes.

AI automated. Humans claimed.

Backup: standard GIL vs free-threaded

Scaling evidence.

Speedup versus each interpreter's own one-worker baseline

Compare scaling, not base speed.

ThreadPoolExecutor workers

Build	1 worker	16 workers	Ratio
py314 standard GIL	0.584s	0.574s	1.02×
CPython 3.14t free-threaded	0.565s	0.119s	4.77×

CPython 3.14 answers the standard-GIL baseline; CPython 3.14t answers the free-threaded execution-model question.

ThreadPool vs ProcessPool runtime

py314 ProcessPool0.571s 3.14t ThreadPool0.235s

2.43× faster

ThreadPool vs ProcessPool peak RSS

py314 ProcessPool1.017 GiB 3.14t ThreadPool0.692 GiB

32% lower

ThreadPool shares dataset.

Breaking the Speed Limit

Speed follows validation

Target First

Executable Reference

Conditional Speed

Simulation creates an answer key

Simulation turns statistical behavior into tests

Python helps us bridge the gap between math and machines.

Two workloads, two bottleneck shapes

k-means. Iterative fitting.

Permutations. Resampling inference.

k-means iterative pressure.

K-means movement.

Initialize. Then update.

K-means is a useful test.

Assignment-update shape.

Validation contract.

Harder scenarios.

Bottleneck shapes.

Why it generalizes.

Did k-means recover groups?

Compare against reference.

Three ways to compute.

NumPy

Numba

JAX / GPU

Engine depends on shape.

Scalar-heavy loops.

Dense distance algebra.

Massive regular batches.

Permutations. Resampling inference pressure.

Permutation test scaling.

One resampling loop.

Same math. New shape.

Reference loop.

Optimized path.

Equivalence: loop vs. matrix.

Same inputs.

Reference.

Optimized path.

Acceptance criteria

Null calibration check.

When does runtime bend?

Same statistic.

Local runtime.

Scale question.

When do GPUs help?

Parallelism helps. Then flattens.

No-GIL changes the process-vs-thread tradeoff.

What each tool solved.

Numba

NumPy / BLAS

Threads / workers

JAX / GPU

Evidence to tool.

Hot loop

Matrix algebra

Repeat work

Large streamed batch

AI scales the workflow. We own the claims.

AI automation.

Scientific judgment.

Make the statistic testable.Then make the bottleneck fast.

Questions?

Evidence map. Different questions.

MacBook = validation.

Server = scale. Parallelism.

GPU = accelerator pipeline.

Validation inventory.

Numba wins loop shapes.

Q&A scope

Power vs. effect size.

Backup: what batch_R tuning showed

What the agent automated.

Experiment orchestration.

Visualization surface.

Repository surface.

Scaling evidence.

ThreadPoolExecutor workers

Make the statistic testable.
Then make the bottleneck fast.