FastStatisticalModels4Python

Linux Server A100 Long-Safe Results

This directory is the curated A100-side output from the long_safe_20260503_190133 server run on BI103202.

Run conditions:

Python environment: /home/wjiang49/conda_envs/fsm4py312
GPU: NVIDIA A100 80GB PCIe
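Driver/CUDA as captured in env.json: driver 575.57.08, CUDA version reported by nvidia-smi as 12.9 (nvidia-smi reports the highest CUDA version the driver supports, which can differ from the CUDA runtime installed in the Python environment)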
Scheduler: experiments/server/long_safe_orchestrator.py
Plotting code: experiments/server/long_safe_plots.py
GPU policy: XLA_PYTHON_CLIENT_PREALLOCATE=false, a 55 GiB cap on the estimated GPU working set, and one A100 child process at a time.
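
The GPU policy above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code in long_safe_orchestrator.py: the function names, the job tuple layout, and the choice to skip over-cap jobs are all hypothetical. Only the XLA_PYTHON_CLIENT_PREALLOCATE setting and the 55 GiB figure come from the run conditions.

```python
import os
import subprocess
import sys

CAP_GIB = 55.0  # cap on the *estimated* GPU working set, from the run conditions


def child_env():
    """Copy the parent environment and disable JAX/XLA GPU preallocation."""
    env = dict(os.environ)
    env["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"
    return env


def run_one_at_a_time(jobs, cap_gib=CAP_GIB):
    """jobs: list of (command, estimated_gib) pairs. Returns one status per job.

    Skipping over-cap jobs is an assumption of this sketch, not a documented
    behaviour of the orchestrator.
    """
    statuses = []
    for command, estimated_gib in jobs:
        if estimated_gib > cap_gib:
            statuses.append("skipped")
            continue
        # subprocess.run blocks until the child exits, so at most one
        # GPU child is alive at any time.
        proc = subprocess.run(command, env=child_env())
        statuses.append("ok" if proc.returncode == 0 else "failed")
    return statuses


# Hypothetical child that only checks the flag it inherited.
check_flag = [
    sys.executable,
    "-c",
    "import os, sys; "
    "ok = os.environ.get('XLA_PYTHON_CLIENT_PREALLOCATE') == 'false'; "
    "sys.exit(0 if ok else 1)",
]
statuses = run_one_at_a_time([(check_flag, 12.0), (check_flag, 60.0)])
print(statuses)
```

Setting the variable in the child's environment before it imports JAX matters because the allocator reads it at backend initialisation; changing it afterwards has no effect.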

Regenerate the figures and summaries from the repository root:

PYTHONPATH=experiments \
/home/wjiang49/conda_envs/fsm4py312/bin/python -c '
from pathlib import Path
from server.long_safe_orchestrator import plot_and_summarize
plot_and_summarize(
    Path("experiments/results/linux_server_cpu/long_safe_20260503_190133"),
    Path("experiments/results/linux_server_a100/long_safe_20260503_190133"),
)
'

Data Files

kmeans_jax_gpu.csv: 135 passed A100 k-means scenarios across N, d, K, and seed.
permutation_matrix_gpu.csv: 15 passed A100 permutation scenarios covering batch sweep, feature scaling, permutation scaling, and larger n. The overlapping long-safe matrix point n=5,000, p=50,000, R=5,000, batch_R=512 was de-duplicated so this curated CSV has one row per scenario_id.
env.json: environment and resource capture for the run.
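
A quick way to confirm the one-row-per-scenario_id property of a curated CSV is sketched below. The scenario_id column is named in this README; the other column names and all sample values are made up for illustration.

```python
import csv
import io
from collections import Counter


def duplicate_scenario_ids(rows):
    """Return the scenario_id values that appear on more than one row."""
    counts = Counter(row["scenario_id"] for row in rows)
    return sorted(sid for sid, n in counts.items() if n > 1)


# Illustrative stand-in for a CSV file; ids and timings are invented.
sample = io.StringIO(
    "scenario_id,batch_R,warm_median_s\n"
    "feature_p50000,512,1.0\n"
    "batch_512,512,1.1\n"
    "feature_p50000,512,1.0\n"
)
rows = list(csv.DictReader(sample))
duplicates = duplicate_scenario_ids(rows)
print(duplicates)
```

For the real file, replace the StringIO object with `open("permutation_matrix_gpu.csv", newline="")`; an empty list means the CSV is already de-duplicated.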

Some figures in this directory compare against CPU data from ../../linux_server_cpu/long_safe_20260503_190133.

Figures

`figures/kmeans_jax_cold_vs_warm.png`

Generated by plot_kmeans_a100() in experiments/server/long_safe_plots.py.

Inputs and transformation:

Reads kmeans_jax_gpu.csv.
Keeps validation_status in {pass, check}.
Uses the clean comparison slice K=20.
Groups by d and N; plots median cold_time_s and median warm_median_s over seeds.
Uses log scales for N and runtime.
Points are semi-transparent medians; dashed lines are log-log power-law fits for series with at least three points, with the fitted slope shown in each panel legend.
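
The transformation above can be approximated with the standard library alone. This is a minimal sketch, not the code in plot_kmeans_a100(): the helper names are hypothetical, the column names N, d, and K are assumed to match the CSV header, and the sample rows are synthetic.

```python
import math
from collections import defaultdict
from statistics import median


def median_by_shape(rows, value_key, k=20):
    """Median of value_key over seeds, per (d, N), for the K=k slice."""
    groups = defaultdict(list)
    for r in rows:
        if r["validation_status"] in {"pass", "check"} and int(r["K"]) == k:
            groups[(int(r["d"]), int(r["N"]))].append(float(r[value_key]))
    return {shape: median(vals) for shape, vals in groups.items()}


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x); None below three points."""
    if len(xs) < 3:
        return None
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    sxx = sum((a - mx) ** 2 for a in lx)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    return sxy / sxx


# Synthetic rows: warm time proportional to N, three seeds per shape.
rows = [
    {"validation_status": "pass", "K": "20", "d": "10", "N": str(n),
     "warm_median_s": str(n * 1e-6 * (1 + 0.1 * (seed - 1)))}
    for n in (1_000, 10_000, 100_000)
    for seed in (0, 1, 2)
]
rows.append({"validation_status": "fail", "K": "20", "d": "10",
             "N": "1000", "warm_median_s": "99"})  # dropped by the filter

warm = median_by_shape(rows, "warm_median_s")
ns = sorted(n for (d, n) in warm if d == 10)
slope = loglog_slope(ns, [warm[(10, n)] for n in ns])
print(slope)
```

A slope near 1 on the log-log axes corresponds to runtime growing linearly with N; the synthetic data here is constructed to have exactly that behaviour.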

Conclusion supported by this figure:

A100 warm k-means runtime scales smoothly with N, while cold time includes JAX compilation/first-run overhead. For presentation claims, use warm medians for throughput and show cold time separately as startup cost.

`figures/kmeans_cpu_gpu_break_even.png`

Generated by plot_kmeans_a100() in experiments/server/long_safe_plots.py.

Inputs and transformation:

Reads A100 kmeans_jax_gpu.csv.
Reads CPU kmeans_cpu_scaling.csv.
Matches scenarios by N and d within the K=20 slice; CPU rows are restricted to separation=2.0.
For each matched shape, selects the fastest CPU implementation median and compares it to A100 warm median.
Right panel plots CPU warm time / A100 warm time; values above 1 mean A100 is faster.
The runtime panel uses semi-transparent median points and dashed log-log power-law fits for series with at least three points; the ratio panel keeps only measured points/lines.
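
The matching and ratio steps above amount to the following. This is a sketch under assumptions: the function name and the dictionary layout are hypothetical, the implementation labels are placeholders, and the timings are invented rather than taken from the run.

```python
def speedup_over_best_cpu(cpu_medians, gpu_medians):
    """cpu_medians: {(implementation, N, d): seconds}
    gpu_medians: {(N, d): seconds}
    Returns {(N, d): best CPU time / GPU time} for shapes present in both.
    """
    best_cpu = {}
    for (_impl, n, d), seconds in cpu_medians.items():
        shape = (n, d)
        # Keep the fastest CPU implementation for each shape.
        if shape not in best_cpu or seconds < best_cpu[shape]:
            best_cpu[shape] = seconds
    shared = best_cpu.keys() & gpu_medians.keys()
    return {shape: best_cpu[shape] / gpu_medians[shape] for shape in shared}


# Invented timings, for illustration only.
cpu = {("impl_a", 1000, 10): 2.0, ("impl_b", 1000, 10): 1.0,
       ("impl_a", 1000, 256): 4.0}
gpu = {(1000, 10): 0.5, (5000, 10): 0.7}

ratios = speedup_over_best_cpu(cpu, gpu)
print(ratios)
```

Comparing against the fastest CPU implementation per shape, rather than a single fixed baseline, keeps the ratio conservative: a value above 1 means the A100 beat every CPU variant measured for that shape.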

Conclusion supported by this figure:

The A100 is not faster for every shape. In this run it is most compelling at larger N and low- to mid-dimensional shapes, reaching about 5.8x over the best CPU baseline at N=5,000,000, d=10, K=20. For d=256 the advantage shrinks, and one large point is roughly break-even.

`figures/permutation_gpu_runtime.png`

Generated by plot_permutation_a100() in experiments/server/long_safe_plots.py.

Inputs and transformation:

Reads permutation_matrix_gpu.csv.
Uses the feature scaling slice n=5,000, R=5,000, batch_R=512.
Repeated p values from overlapping long-safe scenario groups are collapsed by median before plotting.
Plots p against median warm runtime on log scales.
Points are semi-transparent medians; the dashed line is a log-log power-law fit with the fitted slope shown in the legend.

Conclusion supported by this figure:

GPU permutation runtime increases with feature count, but the curve should be interpreted only within this fixed n/R/batch_R slice. It should not be mixed with the batch sweep or the larger-n scenarios.

`figures/permutation_gpu_batch_sweep.png`

Generated by plot_permutation_a100() in experiments/server/long_safe_plots.py.

Inputs and transformation:

Reads permutation_matrix_gpu.csv.
Uses the batch sweep slice n=5,000, p=50,000, R=5,000.
Repeated batch_R values from overlapping scenario groups are collapsed by median before plotting.
Plots batch_R against median warm runtime as connected tuning points, without a fitted scaling line.
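
Collapsing repeated batch_R values by median and reading off the fastest setting can be sketched as below. The helper name is hypothetical and the timings are invented; they are not measurements from this run.

```python
from collections import defaultdict
from statistics import median


def collapse_by_median(pairs):
    """pairs: iterable of (key, seconds). Returns {key: median seconds}."""
    groups = defaultdict(list)
    for key, seconds in pairs:
        groups[key].append(seconds)
    return {key: median(vals) for key, vals in sorted(groups.items())}


# Invented (batch_R, warm seconds) pairs; batch_R=512 appears twice,
# as it would when two scenario groups overlap.
sweep = [(128, 4.0), (512, 2.0), (512, 3.0), (2048, 3.5)]

curve = collapse_by_median(sweep)
best_batch = min(curve, key=curve.get)
print(curve, best_batch)
```

Because the sweep is a tuning curve rather than a scaling law, the sketch reports the minimum directly instead of fitting a line through the points.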

Conclusion supported by this figure:

Batch size matters, but bigger is not always better. In this run, moderate batches are best; batch_R=2048 is slower than the middle of the sweep.

`figures/permutation_cpu_gpu_break_even.png`

Generated by plot_permutation_a100() in experiments/server/long_safe_plots.py.

Inputs and transformation:

Reads A100 permutation_matrix_gpu.csv.
Reads CPU permutation_cpu_scaling.csv.
Matches the clean comparison slice n=5,000, p=50,000, batch_R=512 by common R.
Plots CPU and A100 warm medians plus the ratio CPU warm time / A100 warm time.
The runtime panel uses semi-transparent median points and dashed log-log power-law fits only when a series has at least three matched points; the ratio panel keeps only measured points/lines.

Conclusion supported by this figure:

For the earlier matched points at R=1,000 and R=10,000 (n=5,000, p=50,000, batch_R=512), CPU is faster than A100 in this implementation. This is historical, pre-break-even evidence: it predates the streamed-reduction follow-up, the larger batch_R sweep, and the broader shape sweep, and it should not be read as the final A100 permutation conclusion.

`figures/permutation_matrix_reformulation.png`

Generated by plot_permutation_a100() in experiments/server/long_safe_plots.py.

Inputs and transformation:

This is an explanatory schematic generated directly by matplotlib.
It summarizes the algorithmic reformulation used by permutation_matrix_gpu.csv: stream batches of contrast matrices, compute W_batch @ X, and accumulate exceedance counts.
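
The reformulation in the schematic can be illustrated with a small NumPy sketch. It is not the repository's JAX implementation: the function name, the use of a permuted +1/-1 contrast vector for each row of W_batch, the unnormalised statistic |w @ X|, and the (count + 1) / (R + 1) p-value convention are all assumptions made for the example.

```python
import numpy as np


def streamed_exceedance_counts(X, w_obs, R, batch_R, seed=0):
    """Count, per feature, how many permuted statistics reach the observed one.

    X: (n, p) data matrix; w_obs: (n,) observed contrast vector.
    Only a (batch_R, p) block is held in memory at a time.
    """
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    t_obs = np.abs(w_obs @ X)               # observed statistic, shape (p,)
    counts = np.zeros(p, dtype=np.int64)
    done = 0
    while done < R:
        b = min(batch_R, R - done)          # last batch may be smaller
        # Each row is an independent random permutation of the contrast.
        W_batch = np.stack([rng.permutation(w_obs) for _ in range(b)])
        T = np.abs(W_batch @ X)             # (b, p) block of permuted statistics
        counts += (T >= t_obs).sum(axis=0)  # accumulate, then discard the block
        done += b
    return counts


# Tiny synthetic example: feature 0 is perfectly aligned with the contrast.
n, p, R = 40, 4, 50
w_obs = np.concatenate([np.ones(n // 2), -np.ones(n // 2)])
X = np.random.default_rng(1).normal(size=(n, p))
X[:, 0] = 100.0 * w_obs

counts = streamed_exceedance_counts(X, w_obs, R=R, batch_R=16)
p_values = (counts + 1) / (R + 1)
print(counts, p_values)
```

Peak memory for the statistics is batch_R x p rather than R x p, which is why batch_R becomes a tuning parameter: larger batches give bigger matrix products but also a bigger temporary block.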

Conclusion supported by this figure:

The GPU implementation is designed around batched matrix products and streaming counts, not materializing a full R x p result matrix.
