This directory is the curated A100-side output from the long_safe_20260503_190133 server run on BI103202.
Run conditions:
/home/wjiang49/conda_envs/fsm4py312
env.json: driver 575.57.08, CUDA runtime reported by nvidia-smi as 12.9
experiments/server/long_safe_orchestrator.py
experiments/server/long_safe_plots.py
XLA_PYTHON_CLIENT_PREALLOCATE=false, 55 GiB estimated GPU working-set cap, one A100 child at a time.

Regenerate the figures and summaries from the repository root:
PYTHONPATH=experiments \
/home/wjiang49/conda_envs/fsm4py312/bin/python -c '
from pathlib import Path
from server.long_safe_orchestrator import plot_and_summarize
plot_and_summarize(
Path("experiments/results/linux_server_cpu/long_safe_20260503_190133"),
Path("experiments/results/linux_server_a100/long_safe_20260503_190133"),
)
'
kmeans_jax_gpu.csv: 135 passed A100 k-means scenarios across N, d, K, and seed.
permutation_matrix_gpu.csv: 15 passed A100 permutation scenarios covering batch sweep, feature scaling, permutation scaling, and larger n. The overlapping long-safe matrix point n=5,000, p=50,000, R=5,000, batch_R=512 was de-duplicated so this curated CSV has one row per scenario_id.
env.json: environment and resource capture for the run.

Some figures in this directory compare against CPU data from ../../linux_server_cpu/long_safe_20260503_190133.
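The one-row-per-scenario_id property of the curated permutation CSV can be checked with a short script. The following is a minimal sketch using only the standard library; the helper names and the example ids are hypothetical, and only the scenario_id column name comes from this README.

```python
import csv
from collections import Counter
from pathlib import Path


def duplicate_scenario_ids(rows):
    """Return scenario_id values that appear more than once, sorted."""
    counts = Counter(row["scenario_id"] for row in rows)
    return sorted(sid for sid, n in counts.items() if n > 1)


def check_csv(path):
    """Read a CSV with a scenario_id column and list any duplicate ids."""
    with Path(path).open(newline="") as handle:
        return duplicate_scenario_ids(csv.DictReader(handle))


# Tiny in-memory example; these ids are made up for illustration.
example_rows = [
    {"scenario_id": "feature_scaling_p50000"},
    {"scenario_id": "batch_sweep_b512"},
    {"scenario_id": "feature_scaling_p50000"},
]
example_duplicates = duplicate_scenario_ids(example_rows)
```

An empty list from check_csv("permutation_matrix_gpu.csv") would be consistent with the de-duplication described above.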
figures/kmeans_jax_cold_vs_warm.png
Generated by plot_kmeans_a100() in experiments/server/long_safe_plots.py.
Inputs and transformation:
kmeans_jax_gpu.csv.
validation_status in {pass, check}.
K=20.
d and N; plots median cold_time_s and median warm_median_s over seeds.
N and runtime.

Conclusion supported by this figure:
A100 warm k-means runtime scales smoothly with N, while cold time includes JAX compilation/first-run overhead. For presentation claims, use warm medians for throughput and show cold time separately as startup cost.
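The filter-and-median step listed under this figure's inputs can be sketched as below. This is an illustration, not the code in plot_kmeans_a100(): the function name is hypothetical, the column names are the ones quoted in this README, and the example timings are invented.

```python
from collections import defaultdict
from statistics import median


def cold_warm_medians(rows, k=20):
    """Median cold and warm time per (d, N) over seeds, for a fixed K."""
    groups = defaultdict(lambda: {"cold": [], "warm": []})
    for row in rows:
        # Keep only rows whose validation passed or was flagged "check".
        if row["validation_status"] not in {"pass", "check"}:
            continue
        if int(row["K"]) != k:
            continue
        key = (int(row["d"]), int(row["N"]))
        groups[key]["cold"].append(float(row["cold_time_s"]))
        groups[key]["warm"].append(float(row["warm_median_s"]))
    return {
        key: (median(times["cold"]), median(times["warm"]))
        for key, times in sorted(groups.items())
    }


# Invented rows shaped like csv.DictReader output (all values are strings).
base = {"validation_status": "pass", "K": "20", "d": "10", "N": "1000"}
example_rows = [
    {**base, "cold_time_s": "2.0", "warm_median_s": "0.10"},
    {**base, "cold_time_s": "3.0", "warm_median_s": "0.30"},
    {**base, "cold_time_s": "4.0", "warm_median_s": "0.20"},
    {**base, "validation_status": "fail", "cold_time_s": "99", "warm_median_s": "99"},
    {**base, "K": "50", "cold_time_s": "50", "warm_median_s": "50"},
]
example_medians = cold_warm_medians(example_rows)
```

Keeping the cold and warm medians as separate values mirrors the recommendation above: warm medians for throughput, cold time as startup cost.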
figures/kmeans_cpu_gpu_break_even.png
Generated by plot_kmeans_a100() in experiments/server/long_safe_plots.py.
Inputs and transformation:
kmeans_jax_gpu.csv.
kmeans_cpu_scaling.csv.
N, d, and K=20; CPU uses separation=2.0.
CPU warm time / A100 warm time; values above 1 mean A100 is faster.

Conclusion supported by this figure:
A100 is not uniformly better for every shape. It is most compelling for larger N and lower/mid dimensional shapes in this run, reaching about 5.8x over the best CPU baseline at N=5,000,000, d=10, K=20. For d=256, the advantage shrinks and one large point is roughly break-even.
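The break-even ratio is CPU warm time divided by A100 warm time for matching shapes, taken against the fastest CPU baseline. A minimal sketch follows, assuming both inputs have already been reduced to one warm time per row and that the CPU CSV also exposes a warm_median_s column (an assumption, since this README does not name the CPU columns). The function name and the numbers are illustrative only.

```python
def warm_speedups(gpu_rows, cpu_rows):
    """Map (N, d, K) to best CPU warm time / A100 warm time."""
    best_cpu = {}
    for row in cpu_rows:
        key = (int(row["N"]), int(row["d"]), int(row["K"]))
        t = float(row["warm_median_s"])
        # Several CPU baselines may share a shape; keep the fastest one.
        best_cpu[key] = min(t, best_cpu.get(key, t))
    speedups = {}
    for row in gpu_rows:
        key = (int(row["N"]), int(row["d"]), int(row["K"]))
        if key in best_cpu:  # shapes without a CPU match are skipped
            speedups[key] = best_cpu[key] / float(row["warm_median_s"])
    return speedups


# Invented timings; values above 1 mean the A100 is faster.
cpu_example = [
    {"N": "1000", "d": "10", "K": "20", "warm_median_s": "4.0"},
    {"N": "1000", "d": "10", "K": "20", "warm_median_s": "2.0"},
    {"N": "1000", "d": "256", "K": "20", "warm_median_s": "1.0"},
]
gpu_example = [
    {"N": "1000", "d": "10", "K": "20", "warm_median_s": "0.5"},
    {"N": "1000", "d": "256", "K": "20", "warm_median_s": "1.0"},
    {"N": "9999", "d": "10", "K": "20", "warm_median_s": "0.1"},
]
example_speedups = warm_speedups(gpu_example, cpu_example)
```

A ratio of exactly 1.0, as in the invented d=256 row, corresponds to the break-even case described above.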
figures/permutation_gpu_runtime.png
Generated by plot_permutation_a100() in experiments/server/long_safe_plots.py.
Inputs and transformation:
permutation_matrix_gpu.csv.
n=5,000, R=5,000, batch_R=512.
p values from overlapping long-safe scenario groups are collapsed by median before plotting.
p against median warm runtime on log scales.

Conclusion supported by this figure:
GPU permutation runtime increases with feature count, but the curve is interpreted only within this fixed n/R/batch_R slice. It should not be mixed with the batch sweep or larger-n scenarios.
figures/permutation_gpu_batch_sweep.png
Generated by plot_permutation_a100() in experiments/server/long_safe_plots.py.
Inputs and transformation:
permutation_matrix_gpu.csv.
n=5,000, p=50,000, R=5,000.
batch_R values from overlapping scenario groups are collapsed by median before plotting.
batch_R against median warm runtime as connected tuning points, without a fitted scaling line.

Conclusion supported by this figure:
Batch size matters, but bigger is not always better. In this run, moderate batches are best; batch_R=2048 is slower than the middle of the sweep.
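Collapsing repeated batch_R values by median and then reading off the fastest point can be sketched as follows. The function names are hypothetical, the warm_median_s column name is assumed to carry over from the k-means CSV, and the timings are invented to show a sweep whose minimum is in the middle.

```python
from collections import defaultdict
from statistics import median


def collapse_batch_sweep(rows):
    """Return sorted (batch_R, median warm time) tuning points."""
    by_batch = defaultdict(list)
    for row in rows:
        by_batch[int(row["batch_R"])].append(float(row["warm_median_s"]))
    return sorted((b, median(ts)) for b, ts in by_batch.items())


def best_batch(points):
    """Return the batch_R with the lowest median warm time."""
    return min(points, key=lambda point: point[1])[0]


# Invented sweep: batch_R=512 appears twice, as an overlapping point would.
sweep_rows = [
    {"batch_R": "128", "warm_median_s": "4.0"},
    {"batch_R": "512", "warm_median_s": "2.0"},
    {"batch_R": "512", "warm_median_s": "3.0"},
    {"batch_R": "2048", "warm_median_s": "3.5"},
]
sweep_points = collapse_batch_sweep(sweep_rows)
sweep_best = best_batch(sweep_points)
```

The points are kept as discrete tuning results rather than fitted, matching how the figure connects them without a scaling line.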
figures/permutation_cpu_gpu_break_even.png
Generated by plot_permutation_a100() in experiments/server/long_safe_plots.py.
Inputs and transformation:
permutation_matrix_gpu.csv.
permutation_cpu_scaling.csv.
n=5,000, p=50,000, batch_R=512 by common R.
CPU warm time / A100 warm time.

Conclusion supported by this figure:
For the old matched R=1,000 and R=10,000 points at n=5,000, p=50,000, and batch_R=512, CPU is faster than A100 in this implementation. This is historical pre-break-even evidence: it predates the streamed-reduction follow-up, larger batch_R sweep, and broader shape sweep. It should not be read as the final A100 permutation conclusion.
figures/permutation_matrix_reformulation.png
Generated by plot_permutation_a100() in experiments/server/long_safe_plots.py.
Inputs and transformation:
permutation_matrix_gpu.csv: stream batches of contrast matrices, compute W_batch @ X, and accumulate exceedance counts.

Conclusion supported by this figure:
The GPU implementation is designed around batched matrix products and streaming counts, not materializing a full R x p result matrix.
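The streaming pattern described here can be illustrated with a small NumPy sketch. It is not the repository's JAX implementation: NumPy stands in for JAX, and the function name, the two-sided absolute statistic, the shuffled contrast vector, and the example data are all assumptions made for illustration. What it keeps from the description is the structure: each batch forms W_batch @ X, updates per-feature exceedance counts, and is then discarded, so no R x p matrix is ever held.

```python
import numpy as np


def streamed_exceedance_counts(X, w_obs, R, batch_R, seed=0):
    """Count, per feature, permutations with |stat| >= |observed stat|.

    X is (n, p); w_obs is a length-n contrast vector. Only one
    (batch_R, p) block of permuted statistics exists at any time.
    """
    rng = np.random.default_rng(seed)
    observed = np.abs(w_obs @ X)                # shape (p,)
    counts = np.zeros(X.shape[1], dtype=np.int64)
    done = 0
    while done < R:
        b = min(batch_R, R - done)              # last batch may be smaller
        # Each row of W_batch is an independent shuffle of the contrast.
        W_batch = rng.permuted(np.tile(w_obs, (b, 1)), axis=1)
        stats = np.abs(W_batch @ X)             # shape (b, p)
        counts += (stats >= observed).sum(axis=0)
        done += b
    return counts


# Invented data: 40 samples, 5 features, a strong group shift in feature 0.
data_rng = np.random.default_rng(1)
X_demo = data_rng.normal(size=(40, 5))
X_demo[:20, 0] += 10.0
w_demo = np.concatenate([np.full(20, 1 / 20), np.full(20, -1 / 20)])
R_demo = 200
counts_demo = streamed_exceedance_counts(X_demo, w_demo, R=R_demo, batch_R=64)
# One common permutation p-value estimate; the +1 correction is an assumption.
p_values_demo = (counts_demo + 1) / (R_demo + 1)
```

Peak memory for the permuted statistics scales with batch_R x p rather than R x p, which is why batch_R is a tuning parameter in the sweep figure.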