FastStatisticalModels4Python

Linux Server CPU Long-Safe Results

This directory contains the curated CPU-side output of the long_safe_20260503_190133 server run on BI103202.

Run conditions:

Python environment: /home/wjiang49/conda_envs/fsm4py312
Machine: 512 logical CPU threads, about 3.0 TiB RAM
Scheduler: experiments/server/long_safe_orchestrator.py
Plotting code: experiments/server/long_safe_plots.py
CPU policy used during the final run: dynamic total CPU parallelism targeting up to 80% load headroom, capped at 128 requested threads/workers, with host memory guardrails.
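As a rough illustration of the CPU policy above, the sketch below computes a thread/worker budget from the current load. It is not the code in long_safe_orchestrator.py: the function name cpu_budget, its parameters, and the per-worker memory estimate are all hypothetical, and reading "80%" as a ceiling on total load is one interpretation of the policy.

```python
def cpu_budget(logical_cpus, load_1min, free_mem_gib, per_worker_gib,
               target_fraction=0.80, hard_cap=128):
    """Return how many threads/workers could be requested right now.

    Hypothetical helper: every name and default here is an assumption,
    not the orchestrator's actual interface.
    """
    # Workers that would raise the 1-minute load to the target fraction.
    by_load = int(logical_cpus * target_fraction - load_1min)
    # Memory guardrail: workers that fit in the currently free host memory.
    by_mem = int(free_mem_gib // per_worker_gib)
    # Always allow one worker; never exceed the hard cap.
    return max(1, min(hard_cap, by_load, by_mem))


# Idle 512-thread host with ample memory: the 128 cap is the binding limit.
print(cpu_budget(512, load_1min=0.0, free_mem_gib=3000.0, per_worker_gib=4.0))
```

On Linux, os.cpu_count() and os.getloadavg()[0] could supply the first two arguments; how the real run measured load and memory is not stated here.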

Regenerate the figures and summaries from the repository root:

PYTHONPATH=experiments \
/home/wjiang49/conda_envs/fsm4py312/bin/python -c '
from pathlib import Path
from server.long_safe_orchestrator import plot_and_summarize
plot_and_summarize(
    Path("experiments/results/linux_server_cpu/long_safe_20260503_190133"),
    Path("experiments/results/linux_server_a100/long_safe_20260503_190133"),
)
'

Data Files

kmeans_cpu_scaling.csv: 540 passed k-means CPU scenarios across N, d, K, separation, seed, and implementation.
kmeans_numba_thread_sweep.csv: 8 passed Numba thread sweep scenarios for N=1,000,000, d=64, K=20.
permutation_cpu_scaling.csv: 105 passed CPU permutation scenarios, plus 3 timeouts, all at the largest shape (n=50,000, p=50,000, R=100,000).
permutation_worker_sweep.csv: 8 passed ThreadPool worker sweep scenarios for n=5,000, p=10,000, R=10,000.
permutation_calibration_server_subset.csv: 2 null calibration scenarios.
env.json: environment and resource capture for the run.

Figures

`figures/kmeans_cpu_runtime.png`

Generated by plot_kmeans_cpu() in experiments/server/long_safe_plots.py.

Inputs and transformation:

Reads kmeans_cpu_scaling.csv.
Keeps validation_status in {pass, check}.
Uses the clean comparison slice K=20 and separation=2.0.
Groups by implementation, d, and N; plots median warm_median_s over seeds.
Uses log scales for both N and runtime.
Points are semi-transparent medians; dashed lines are log-log power-law fits for series with at least three points, with the fitted slope shown in each panel legend.
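The filter-and-group steps above can be sketched with the standard library alone. This is an illustration, not the body of plot_kmeans_cpu(): validation_status and warm_median_s are named in this README, but the other column names (implementation, N, d, K, separation, seed) are assumed, and the sample rows are invented.

```python
import csv
import io
from collections import defaultdict
from statistics import median


def median_runtime_by_group(rows):
    """Median warm_median_s over seeds per (implementation, d, N)."""
    groups = defaultdict(list)
    for row in rows:
        # Keep only validated rows.
        if row["validation_status"] not in {"pass", "check"}:
            continue
        # Restrict to the comparison slice K=20, separation=2.0.
        if int(row["K"]) != 20 or float(row["separation"]) != 2.0:
            continue
        key = (row["implementation"], int(row["d"]), int(row["N"]))
        groups[key].append(float(row["warm_median_s"]))
    return {key: median(values) for key, values in groups.items()}


# Invented rows for illustration only; not taken from kmeans_cpu_scaling.csv.
SAMPLE = """implementation,N,d,K,separation,seed,validation_status,warm_median_s
numba,1000,10,20,2.0,0,pass,0.10
numba,1000,10,20,2.0,1,pass,0.30
numba,1000,10,20,2.0,2,check,0.20
numba,1000,10,20,2.0,3,fail,9.99
numba,1000,10,50,2.0,0,pass,5.00
"""
result = median_runtime_by_group(csv.DictReader(io.StringIO(SAMPLE)))
print(result)
```

The failed row and the K=50 row are dropped before the median is taken, so one slow or invalid seed does not shift the plotted point.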

Conclusion supported by this figure:

CPU k-means scaling is strongly shape-dependent. Numba is clearly faster at low dimension (d=10), while NumPy matmul becomes competitive at larger d, especially where BLAS handles the dense matrix work efficiently.

`figures/kmeans_numba_threads.png`

Generated by plot_kmeans_cpu() in experiments/server/long_safe_plots.py.

Inputs and transformation:

Reads kmeans_numba_thread_sweep.csv.
Groups by threads; plots median warm runtime and speedup relative to 1 thread.
Shape is fixed at N=1,000,000, d=64, K=20.
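The speedup series is the 1-thread median runtime divided by the median runtime at each thread count. A minimal sketch, using a hypothetical helper name and invented timings rather than values from kmeans_numba_thread_sweep.csv:

```python
from statistics import median


def speedup_vs_single(runtimes_by_threads):
    """Map thread count -> (median runtime, speedup relative to 1 thread)."""
    medians = {t: median(v) for t, v in runtimes_by_threads.items()}
    base = medians[1]  # the sweep must include a 1-thread measurement
    return {t: (m, base / m) for t, m in sorted(medians.items())}


# Invented timings in seconds, for illustration only.
sweep = {1: [8.0, 8.0], 4: [2.0, 2.0], 16: [1.0, 1.0]}
table = speedup_vs_single(sweep)
for threads, (runtime_s, speedup) in table.items():
    print(threads, runtime_s, speedup)
```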

Conclusion supported by this figure:

Numba scaling improves substantially up to about 32 to 64 threads in this experiment. The 128-thread point is slower than the 32- and 64-thread points, so using all server threads is not automatically better.

`figures/kmeans_memory_scaling.png`

Generated by plot_kmeans_cpu() in experiments/server/long_safe_plots.py.

Inputs and transformation:

Reads kmeans_cpu_scaling.csv.
Uses estimated_host_gib as the x-axis and host_peak_mem_mb as the y-axis.
Groups by implementation and estimated host working set; plots median RSS at scenario end.
Points are semi-transparent medians; dashed lines are log-log power-law fits for series with at least three points, used as a visual memory-growth diagnostic.
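A log-log power-law fit of the kind described above is an ordinary least-squares line through (log x, log y); its slope is the growth exponent. The sketch below is a generic implementation under that assumption, not the fitting code in long_safe_plots.py, and fit_power_law is a hypothetical name.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**b by least squares on log10 axes; return (a, b).

    Returns None for fewer than three points, mirroring the rule that
    only series with at least three points get a fitted line.
    Assumes all values are positive and the x values are not all equal.
    """
    if len(xs) < 3:
        return None
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    slope = sxy / sxx                      # exponent b
    intercept = mean_y - slope * mean_x    # log10(a)
    return 10 ** intercept, slope


# Synthetic data following y = 2 * x exactly, so the slope should be near 1.
print(fit_power_law([1, 10, 100], [2, 20, 200]))
```

A slope near 1 on the memory plot would mean observed RSS grows roughly in proportion to the estimated working set.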

Conclusion supported by this figure:

Observed RSS generally increases with the estimated working set. The plot is a safety diagnostic, not a precise allocator trace, because each scenario records child RSS near the end of the run rather than peak sampled memory.

`figures/permutation_cpu_runtime.png`

Generated by plot_permutation_cpu() in experiments/server/long_safe_plots.py.

Inputs and transformation:

Reads permutation_cpu_scaling.csv.
Keeps validation_status in {pass, check} and batch_R=512.
Groups by n, p, and R; plots median warm runtime for selected large shapes.
Uses log scales for R and runtime.
Points are semi-transparent medians; dashed lines are log-log power-law fits for series with at least three points, with the fitted slope shown in the legend.

Conclusion supported by this figure:

CPU permutation runtime grows quickly with both feature count and permutation count. The three timeout rows all occur at n=50,000, p=50,000, R=100,000, so that corner is too expensive as a CPU baseline under the 45-minute per-scenario cap.
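A per-scenario wall-clock cap like the one mentioned above can be enforced by running each scenario in a child process with a timeout. The sketch below assumes that design; run_with_cap and the "fail" status label are hypothetical, and the orchestrator's actual mechanism is not described in this README.

```python
import subprocess
import sys
import time


def run_with_cap(cmd, cap_seconds):
    """Run cmd as a child process; return (status, elapsed_seconds)."""
    start = time.perf_counter()
    try:
        proc = subprocess.run(cmd, timeout=cap_seconds, capture_output=True)
        status = "pass" if proc.returncode == 0 else "fail"
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired.
        status = "timeout"
    return status, time.perf_counter() - start


# A trivial child that finishes immediately, run under a 45-minute cap.
status, elapsed = run_with_cap([sys.executable, "-c", "pass"], cap_seconds=45 * 60)
print(status)
```

Recording a timeout as its own status, rather than discarding the row, is what lets the three capped scenarios appear in permutation_cpu_scaling.csv.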

`figures/permutation_worker_sweep.png`

Generated by plot_permutation_cpu() in experiments/server/long_safe_plots.py.

Inputs and transformation:

Reads permutation_worker_sweep.csv.
Groups by workers; plots median warm runtime.
Fixed shape: n=5,000, p=10,000, R=10,000, batch_R=512.

Conclusion supported by this figure:

ThreadPool runtime does not improve monotonically with worker count. The 8-worker point is the fastest in this run; higher worker counts add overhead and do not improve runtime reliably.

`figures/process_vs_thread_memory.png`

Generated by plot_permutation_cpu() in experiments/server/long_safe_plots.py.

Inputs and transformation:

Reads permutation_worker_sweep.csv.
Groups by workers; plots median child RSS together with speedup relative to 1 worker.

Conclusion supported by this figure:

The memory/runtime trade-off is non-monotonic. More workers can increase RSS without giving proportional speedup, so worker count should be tuned rather than maximized.
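One simple way to tune rather than maximize is to pick the smallest worker count whose median runtime is within a tolerance of the best observed. This selection rule is a suggestion, not something the run itself applied; the helper name, the 5% default, and the timings below are invented for illustration.

```python
def smallest_near_best(median_runtime_by_workers, tolerance=0.05):
    """Smallest worker count within `tolerance` of the fastest median runtime."""
    best = min(median_runtime_by_workers.values())
    threshold = best * (1 + tolerance)
    candidates = [w for w, t in median_runtime_by_workers.items() if t <= threshold]
    return min(candidates)


# Invented median runtimes in seconds; not from permutation_worker_sweep.csv.
sweep = {1: 40.0, 2: 22.0, 4: 12.0, 8: 10.0, 16: 10.3, 32: 11.5}
print(smallest_near_best(sweep))
```

Preferring the smaller count among near-ties tends to limit RSS growth, since extra workers that buy no speedup still cost memory.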
