FastStatisticalModels4Python

Linux Server A100 Long-Safe Results

This directory contains the curated A100-side output of the long_safe_20260503_190133 server run on BI103202.

Run conditions:

Regenerate the figures and summaries from the repository root:

PYTHONPATH=experiments \
/home/wjiang49/conda_envs/fsm4py312/bin/python -c '
from pathlib import Path
from server.long_safe_orchestrator import plot_and_summarize
plot_and_summarize(
    Path("experiments/results/linux_server_cpu/long_safe_20260503_190133"),
    Path("experiments/results/linux_server_a100/long_safe_20260503_190133"),
)
'

Data Files

Some figures in this directory compare against CPU data from ../../linux_server_cpu/long_safe_20260503_190133.

Figures

figures/kmeans_jax_cold_vs_warm.png

Generated by plot_kmeans_a100() in experiments/server/long_safe_plots.py.

Inputs and transformation:

Conclusion supported by this figure:

A100 warm k-means runtime scales smoothly with N, while cold time includes JAX compilation/first-run overhead. For presentation claims, use warm medians for throughput and show cold time separately as startup cost.
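The cold/warm split above can be measured with a small timing helper. This is a minimal sketch, not the code in long_safe_plots.py: `time_cold_and_warm` and `warm_repeats` are hypothetical names, and the workload is a placeholder. For a JAX function, the callable would also need to wait on its result (for example by calling `.block_until_ready()` on the returned array) so that asynchronous dispatch does not hide the real runtime.

```python
import statistics
import time


def time_cold_and_warm(fn, warm_repeats=5):
    """Return (cold_seconds, warm_median_seconds) for a zero-argument callable.

    The first call is timed on its own because it may include one-time costs
    such as JIT compilation; later calls are summarized by their median.
    """
    start = time.perf_counter()
    fn()
    cold = time.perf_counter() - start

    warm = []
    for _ in range(warm_repeats):
        start = time.perf_counter()
        fn()
        warm.append(time.perf_counter() - start)
    return cold, statistics.median(warm)


if __name__ == "__main__":
    # Placeholder workload, not a benchmark from this run.
    cold, warm_median = time_cold_and_warm(lambda: sum(range(100_000)))
    print(f"cold={cold:.6f}s warm_median={warm_median:.6f}s")
```

Reporting the two numbers separately keeps startup cost from inflating the throughput figure.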

figures/kmeans_cpu_gpu_break_even.png

Generated by plot_kmeans_a100() in experiments/server/long_safe_plots.py.

Inputs and transformation:

Conclusion supported by this figure:

The A100 is not faster for every shape. In this run it is most compelling at larger N and low- to mid-dimensional shapes, reaching about 5.8x over the best CPU baseline at N=5,000,000, d=10, K=20. For d=256, the advantage shrinks and one large point is roughly break-even.
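The speedup quoted above is a ratio against the fastest CPU baseline for the same shape. A minimal sketch of that arithmetic follows; the function names, the backend labels, and the 10% break-even tolerance are assumptions made for illustration, and the timings are placeholders rather than values from this run.

```python
def speedup_vs_best_cpu(cpu_seconds_by_backend, gpu_seconds):
    """Return best-CPU time divided by GPU time; above 1 means the GPU is faster."""
    if gpu_seconds <= 0:
        raise ValueError("gpu_seconds must be positive")
    best_cpu = min(cpu_seconds_by_backend.values())
    return best_cpu / gpu_seconds


def classify(speedup, tolerance=0.1):
    """Label a point as a GPU win, a CPU win, or roughly break-even."""
    if speedup > 1 + tolerance:
        return "gpu_faster"
    if speedup < 1 - tolerance:
        return "cpu_faster"
    return "break_even"


if __name__ == "__main__":
    # Illustrative placeholder timings in seconds, not measurements from this run.
    cpu = {"backend_a": 12.0, "backend_b": 8.0}
    ratio = speedup_vs_best_cpu(cpu, gpu_seconds=2.0)
    print(ratio, classify(ratio))  # 4.0 gpu_faster
```

Comparing against the minimum over CPU backends, rather than a single fixed backend, avoids overstating the GPU advantage.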

figures/permutation_gpu_runtime.png

Generated by plot_permutation_a100() in experiments/server/long_safe_plots.py.

Inputs and transformation:

Conclusion supported by this figure:

GPU permutation runtime increases with feature count, but the curve should be interpreted only within this fixed n/R/batch_R slice. It should not be mixed with the batch sweep or the larger-n scenarios.
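One way to keep the slice fixed is to filter rows before plotting. The sketch below assumes each result row is a dict with fields named `n`, `R`, `batch_R`, and `p`; those field names, the `select_slice` helper, and the row values are hypothetical and may not match the actual data files.

```python
def select_slice(rows, **fixed):
    """Keep only rows whose fields equal every fixed key/value pair."""
    return [r for r in rows if all(r.get(k) == v for k, v in fixed.items())]


if __name__ == "__main__":
    # Placeholder rows with hypothetical field names; timings are omitted on purpose.
    rows = [
        {"n": 5000, "R": 1000, "batch_R": 512, "p": 10000},
        {"n": 5000, "R": 1000, "batch_R": 512, "p": 50000},
        {"n": 5000, "R": 1000, "batch_R": 2048, "p": 50000},  # batch sweep row
        {"n": 20000, "R": 1000, "batch_R": 512, "p": 50000},  # larger-n row
    ]
    kept = select_slice(rows, n=5000, R=1000, batch_R=512)
    print([r["p"] for r in kept])  # [10000, 50000]
```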

figures/permutation_gpu_batch_sweep.png

Generated by plot_permutation_a100() in experiments/server/long_safe_plots.py.

Inputs and transformation:

Conclusion supported by this figure:

Batch size matters, but bigger is not always better. In this run, moderate batches are best; batch_R=2048 is slower than the middle of the sweep.
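Picking the best batch size from a sweep amounts to taking a median per batch_R and choosing the smallest. The following is a sketch under that assumption; `median_by_batch` and `fastest_batch` are hypothetical helpers, and the sample timings are placeholders, not values from this run.

```python
import statistics
from collections import defaultdict


def median_by_batch(samples):
    """Map each batch_R to the median of its timing samples.

    `samples` is an iterable of (batch_R, seconds) pairs.
    """
    grouped = defaultdict(list)
    for batch_r, seconds in samples:
        grouped[batch_r].append(seconds)
    return {b: statistics.median(ts) for b, ts in grouped.items()}


def fastest_batch(samples):
    """Return the batch_R with the smallest median runtime."""
    medians = median_by_batch(samples)
    return min(medians, key=medians.get)


if __name__ == "__main__":
    # Placeholder timings in seconds, chosen only to show a non-monotonic sweep.
    sweep = [(128, 3.0), (128, 3.2), (512, 2.0), (512, 2.2), (2048, 2.6), (2048, 2.8)]
    print(fastest_batch(sweep))  # 512
```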

figures/permutation_cpu_gpu_break_even.png

Generated by plot_permutation_a100() in experiments/server/long_safe_plots.py.

Inputs and transformation:

Conclusion supported by this figure:

For the earlier matched points (R=1,000 and R=10,000 at n=5,000, p=50,000, batch_R=512), the CPU is faster than the A100 in this implementation. This is historical pre-break-even evidence: it predates the streamed-reduction follow-up, the larger batch_R sweep, and the broader shape sweep. It should not be read as the final A100 permutation conclusion.

figures/permutation_matrix_reformulation.png

Generated by plot_permutation_a100() in experiments/server/long_safe_plots.py.

Inputs and transformation:

Conclusion supported by this figure:

The GPU implementation is designed around batched matrix products and streaming counts rather than around materializing a full R x p result matrix.
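The data flow behind that design can be sketched without any GPU code. The example below is a dependency-free illustration, not the repository's implementation: the statistic (the absolute inner product of the response with each feature column), the function name, and the seeding are assumptions. On a GPU, each batch would be computed as a single (batch_R x n) by (n x p) matrix product; here it is written as plain loops so that only the batching and the running counts are on display.

```python
import random


def streamed_permutation_counts(X, y, R, batch_R, seed=0):
    """Count, per feature, how often a permuted statistic reaches the observed one.

    X is a list of n rows with p features each; y is a list of n responses.
    The statistic (an assumption for this sketch) is |sum_i y[i] * X[i][j]|.
    Only one batch of at most batch_R x p statistics exists at any time.
    """
    rng = random.Random(seed)
    n, p = len(X), len(X[0])

    def stats_for(response):
        # One row of the (batch x n) @ (n x p) product.
        return [abs(sum(response[i] * X[i][j] for i in range(n))) for j in range(p)]

    observed = stats_for(y)
    counts = [0] * p
    done = 0
    while done < R:
        size = min(batch_R, R - done)
        batch = [stats_for(rng.sample(y, n)) for _ in range(size)]
        # Fold the batch into the running counts; it is replaced on the next pass.
        for row in batch:
            for j in range(p):
                if row[j] >= observed[j]:
                    counts[j] += 1
        done += size
    return counts


if __name__ == "__main__":
    # Tiny placeholder data, chosen only so the sketch runs quickly.
    X = [[1, 0], [2, 1], [3, 0], [4, 1]]
    y = [1, 2, 3, 4]
    print(streamed_permutation_counts(X, y, R=20, batch_R=8))
```

Peak memory for the statistics is batch_R x p instead of R x p, which is the trade-off the batch sweep explores. With a fixed seed the permutations are drawn in the same order regardless of batch_R, so the counts do not depend on the batch size.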