FastStatisticalModels4Python

Linux Server CPU Long-Safe Results

This directory contains the curated CPU-side results from the long_safe_20260503_190133 server run on BI103202.

Run conditions:

Regenerate the figures and summaries from the repository root:

PYTHONPATH=experiments \
/home/wjiang49/conda_envs/fsm4py312/bin/python -c '
from pathlib import Path
from server.long_safe_orchestrator import plot_and_summarize
plot_and_summarize(
    Path("experiments/results/linux_server_cpu/long_safe_20260503_190133"),
    Path("experiments/results/linux_server_a100/long_safe_20260503_190133"),
)
'

Data Files

Figures

figures/kmeans_cpu_runtime.png

Generated by plot_kmeans_cpu() in experiments/server/long_safe_plots.py.

Inputs and transformation:

Conclusion supported by this figure:

CPU k-means runtime scaling is strongly shape-dependent. Numba is clearly faster at low dimension (d=10), while the NumPy matmul path becomes competitive at larger d, especially in configurations where BLAS handles the dense matrix work efficiently.
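A NumPy matmul k-means assignment step is presumably built on the expansion ||x − c||² = ||x||² − 2 x·c + ||c||². That expansion turns the dominant cost into one (n×d)·(d×k) product that BLAS can run efficiently. At d=10 that product is thin, which would be consistent with Numba winning there. The sketch below is an assumption about how such a step can look, not the repository's implementation. `assign_matmul` and `assign_direct` are hypothetical names.

```python
import numpy as np


def assign_matmul(X: np.ndarray, C: np.ndarray) -> np.ndarray:
    """Nearest-centroid labels via the BLAS-friendly distance expansion (hypothetical helper)."""
    # ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2. The ||x||^2 term is the same for every
    # centroid in a row, so it cannot change the argmin and is dropped.
    cross = X @ C.T                     # (n, k) dense product, dispatched to BLAS
    c_sq = np.einsum("kd,kd->k", C, C)  # (k,) squared centroid norms
    return np.argmin(c_sq[None, :] - 2.0 * cross, axis=1)


def assign_direct(X: np.ndarray, C: np.ndarray) -> np.ndarray:
    """Reference version that materializes every difference vector, shape (n, k, d)."""
    diff = X[:, None, :] - C[None, :, :]
    return np.argmin(np.einsum("nkd,nkd->nk", diff, diff), axis=1)


rng = np.random.default_rng(0)
X = rng.standard_normal((1_000, 10))
C = rng.standard_normal((8, 10))
labels = assign_matmul(X, C)
```

The direct version needs an n×k×d temporary. The expanded form needs only the n×k product, so its cost is dominated by a single matrix multiplication.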

figures/kmeans_numba_threads.png

Generated by plot_kmeans_cpu() in experiments/server/long_safe_plots.py.

Inputs and transformation:

Conclusion supported by this figure:

Numba runtime improves substantially up to about 32 to 64 threads in this experiment. The 128-thread point is slower than both the 32- and 64-thread points, so using all server threads is not automatically better.
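A thread sweep like the one behind this figure can be timed with a small harness. This is a hypothetical sketch: `sweep_threads` and `fastest` are illustrative names, not functions from this repository. The harness makes one warm-up call per thread count so that JIT compilation is not counted, then keeps the best of several repeats.

```python
import time
from collections.abc import Callable, Iterable


def sweep_threads(
    set_threads: Callable[[int], None],
    workload: Callable[[], object],
    thread_counts: Iterable[int],
    repeats: int = 3,
) -> dict[int, float]:
    """Return the best-of-`repeats` wall time for each thread count (hypothetical helper)."""
    timings: dict[int, float] = {}
    for n in thread_counts:
        set_threads(n)
        workload()  # warm-up call: excludes JIT compilation and first-touch effects
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            workload()
            best = min(best, time.perf_counter() - start)
        timings[n] = best
    return timings


def fastest(timings: dict[int, float]) -> int:
    """Thread count with the lowest measured time."""
    return min(timings, key=timings.__getitem__)
```

With Numba, `set_threads` would be `numba.set_num_threads`, and `workload` would be a call into an `@numba.njit(parallel=True)` kernel. `numba.set_num_threads` cannot exceed `numba.config.NUMBA_NUM_THREADS`, so that value bounds the sweep.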

figures/kmeans_memory_scaling.png

Generated by plot_kmeans_cpu() in experiments/server/long_safe_plots.py.

Inputs and transformation:

Conclusion supported by this figure:

Observed RSS generally increases with the estimated working set. The plot is a safety diagnostic, not a precise allocator trace, because each scenario records child RSS near the end of the run rather than peak sampled memory.
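For a safety diagnostic of this kind, a working-set estimate can be as simple as summing the dominant array sizes. The formula below is a hypothetical approximation (float64 data, an n×k distance matrix, int64 labels), not necessarily the estimate the plot uses. If a peak figure is wanted instead of an end-of-run sample, Linux offers `resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss`. It reports, in KiB, the peak RSS of the largest child that has already been waited for. Because it is a running maximum over all reaped children, a per-scenario value needs a fresh parent process per scenario. `kmeans_working_set_bytes`, `child_peak_rss_bytes` and `run_and_measure` are illustrative names.

```python
import resource
import subprocess
import sys


def kmeans_working_set_bytes(n: int, d: int, k: int, itemsize: int = 8) -> int:
    """Rough k-means working set in bytes (hypothetical formula, float64 by default)."""
    data = n * d * itemsize           # input matrix X
    centroids = 2 * k * d * itemsize  # current and updated centroids
    distances = n * k * itemsize      # (n, k) distance or cross-product matrix
    labels = n * 8                    # int64 cluster labels
    return data + centroids + distances + labels


def child_peak_rss_bytes() -> int:
    """Largest peak RSS among reaped children; Linux reports ru_maxrss in KiB."""
    return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss * 1024


def run_and_measure(cmd: list[str]) -> tuple[int, int]:
    """Run one scenario as a child process, then return (exit code, peak child RSS)."""
    proc = subprocess.run(cmd, check=False)  # run() waits, so the child is reaped
    return proc.returncode, child_peak_rss_bytes()


estimate = kmeans_working_set_bytes(n=1_000, d=10, k=8)  # 153_280 bytes
code, peak = run_and_measure([sys.executable, "-c", "pass"])
```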

figures/permutation_cpu_runtime.png

Generated by plot_permutation_cpu() in experiments/server/long_safe_plots.py.

Inputs and transformation:

Conclusion supported by this figure:

CPU permutation runtime grows quickly with both feature count and permutation count. The three timeout rows all occur at n=50,000, p=50,000, R=100,000, so that corner of the grid is too expensive to serve as a CPU baseline under the 45-minute per-scenario cap.
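This README does not document the benchmark's test statistic. Assume, purely for illustration, a per-feature difference in group means. Under that assumption each permutation costs O(n·p) and a full run costs O(R·n·p), which is consistent with runtime growing in both p and R. At the timeout corner that product is 5×10⁴ × 5×10⁴ × 10⁵ = 2.5×10¹⁴, before constant factors. `permutation_pvalues` is a hypothetical name.

```python
import numpy as np


def permutation_pvalues(X: np.ndarray, groups: np.ndarray, n_perm: int, seed: int = 0) -> np.ndarray:
    """Per-feature two-sided permutation p-values for a difference in group means.

    Hypothetical sketch: X has shape (n, p) and groups is a boolean mask of length n.
    """
    rng = np.random.default_rng(seed)
    g = groups.astype(np.float64)
    n1 = g.sum()
    n0 = g.size - n1

    def stat(mask: np.ndarray) -> np.ndarray:
        # mean(group 1) - mean(group 0) for every feature, via two (n,) @ (n, p) products
        return (mask @ X) / n1 - ((1.0 - mask) @ X) / n0

    observed = np.abs(stat(g))
    exceed = np.zeros(X.shape[1])
    for _ in range(n_perm):  # O(n * p) per permutation, O(R * n * p) overall
        exceed += np.abs(stat(rng.permutation(g))) >= observed
    # add-one correction keeps p-values strictly positive
    return (exceed + 1.0) / (n_perm + 1.0)
```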

figures/permutation_worker_sweep.png

Generated by plot_permutation_cpu() in experiments/server/long_safe_plots.py.

Inputs and transformation:

Conclusion supported by this figure:

ThreadPool runtime does not improve monotonically with worker count. The 8-worker point is the fastest in this run; higher worker counts add overhead without reliably improving runtime.
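One way to spread R permutations over a thread pool is to give each worker a near-equal chunk and its own random stream. This is a hypothetical sketch using `concurrent.futures.ThreadPoolExecutor`; the benchmark's "ThreadPool" may be a different pool class. The statistic is an assumed per-feature difference in group means. Threads can only overlap where NumPy releases the GIL, mainly inside BLAS calls. If BLAS is itself multithreaded, workers × BLAS threads can oversubscribe the cores. That is one plausible contributor to the non-monotonic curve, not something this run isolates.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def _mean_diff(mask: np.ndarray, X: np.ndarray, n1: float, n0: float) -> np.ndarray:
    # mean(group 1) - mean(group 0) per feature; BLAS-backed matmuls typically release the GIL
    return (mask @ X) / n1 - ((1.0 - mask) @ X) / n0


def _count_exceedances(X, g, observed, n_perm, seed_seq):
    """One worker's share: count permutations whose |statistic| reaches the observed one."""
    rng = np.random.default_rng(seed_seq)  # independent stream per worker
    n1 = g.sum()
    n0 = g.size - n1
    exceed = np.zeros(X.shape[1])
    for _ in range(n_perm):
        exceed += np.abs(_mean_diff(rng.permutation(g), X, n1, n0)) >= observed
    return exceed


def threaded_pvalues(X, groups, n_perm, workers, seed=0):
    """Split n_perm permutations across a thread pool (hypothetical sketch)."""
    g = groups.astype(np.float64)
    n1 = g.sum()
    n0 = g.size - n1
    observed = np.abs(_mean_diff(g, X, n1, n0))
    # Near-equal chunks: the first n_perm % workers chunks get one extra permutation.
    chunks = [n_perm // workers + (i < n_perm % workers) for i in range(workers)]
    seeds = np.random.SeedSequence(seed).spawn(workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(_count_exceedances, X, g, observed, c, s) for c, s in zip(chunks, seeds)]
        exceed = sum(f.result() for f in futures)
    return (exceed + 1.0) / (n_perm + 1.0)
```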

figures/process_vs_thread_memory.png

Generated by plot_permutation_cpu() in experiments/server/long_safe_plots.py.

Inputs and transformation:

Conclusion supported by this figure:

The memory/runtime trade-off is non-monotonic: more workers can increase RSS without a proportional speedup, so the worker count should be tuned rather than maximized.
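"Tuned rather than maximized" can be made concrete with a selection rule over measured (workers, runtime, RSS) points. The rule below is a hypothetical policy, not something the orchestrator implements. It keeps only configurations under an RSS budget, then picks the smallest worker count whose runtime is within a small slack of the fastest one. `Measurement` and `choose_workers` are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    workers: int
    runtime_s: float
    rss_bytes: int


def choose_workers(ms: list[Measurement], rss_budget_bytes: int, slack: float = 0.05) -> int:
    """Smallest worker count within `slack` of the fastest runtime that fits the RSS budget."""
    feasible = [m for m in ms if m.rss_bytes <= rss_budget_bytes]
    if not feasible:
        raise ValueError("no measured configuration fits the memory budget")
    best = min(m.runtime_s for m in feasible)
    good = [m for m in feasible if m.runtime_s <= best * (1.0 + slack)]
    return min(good, key=lambda m: m.workers).workers
```

Preferring the smallest near-optimal count spends less memory for essentially the same runtime. The slack absorbs run-to-run timing noise.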