This directory is the curated CPU-side output from the long_safe_20260503_190133 server run on BI103202.
Run conditions:
- /home/wjiang49/conda_envs/fsm4py312
- experiments/server/long_safe_orchestrator.py
- experiments/server/long_safe_plots.py

Regenerate the figures and summaries from the repository root:
PYTHONPATH=experiments \
/home/wjiang49/conda_envs/fsm4py312/bin/python -c '
from pathlib import Path
from server.long_safe_orchestrator import plot_and_summarize
plot_and_summarize(
Path("experiments/results/linux_server_cpu/long_safe_20260503_190133"),
Path("experiments/results/linux_server_a100/long_safe_20260503_190133"),
)
'

- kmeans_cpu_scaling.csv: 540 passed k-means CPU scenarios across N, d, K, separation, seed, and implementation.
- kmeans_numba_thread_sweep.csv: 8 passed Numba thread sweep scenarios for N=1,000,000, d=64, K=20.
- permutation_cpu_scaling.csv: 105 passed CPU permutation scenarios and 3 timeouts at the largest n=50,000, p=50,000, R=100,000 shape.
- permutation_worker_sweep.csv: 8 passed ThreadPool worker sweep scenarios for n=5,000, p=10,000, R=10,000.
- permutation_calibration_server_subset.csv: 2 null calibration scenarios.
- env.json: environment and resource capture for the run.

figures/kmeans_cpu_runtime.png

Generated by plot_kmeans_cpu() in experiments/server/long_safe_plots.py.
Inputs and transformation:
- Reads kmeans_cpu_scaling.csv.
- Keeps rows with validation_status in {pass, check}.
- Restricts to K=20 and separation=2.0.
- Groups by implementation, d, and N; plots median warm_median_s over seeds.
- Axes: N and runtime.

Conclusion supported by this figure:
CPU k-means scaling is strongly shape-dependent. Numba is clearly better for low-dimensional d=10, while NumPy matmul becomes competitive at larger d, especially where BLAS handles the dense matrix work efficiently.
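
The filtering and aggregation listed above can be sketched with the standard library alone. This is an illustrative stand-in, not the code of plot_kmeans_cpu(): the helper names `load_rows` and `median_warm_runtime` are hypothetical, and the column names K, separation, implementation, d, and N are assumed to match the CSV header exactly.

```python
import csv
from collections import defaultdict
from statistics import median


def load_rows(path):
    """Read a results CSV into a list of dicts (all values are strings)."""
    with open(path, newline="") as fh:
        return list(csv.DictReader(fh))


def median_warm_runtime(rows, k=20, separation=2.0):
    """Median warm_median_s over seeds for each (implementation, d, N).

    Assumed column names: validation_status, K, separation,
    implementation, d, N, warm_median_s.
    """
    groups = defaultdict(list)
    for row in rows:
        # Keep only validated scenarios.
        if row["validation_status"] not in {"pass", "check"}:
            continue
        # Restrict to the K and separation slice shown in the figure.
        if int(float(row["K"])) != k or float(row["separation"]) != separation:
            continue
        key = (row["implementation"], int(row["d"]), int(row["N"]))
        groups[key].append(float(row["warm_median_s"]))
    # One median per group; seeds are the only remaining source of variation.
    return {key: median(vals) for key, vals in sorted(groups.items())}
```

Calling `median_warm_runtime(load_rows(path))` on kmeans_cpu_scaling.csv would return one runtime per implementation, d, and N, which is the quantity the figure plots against N.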

figures/kmeans_numba_threads.png

Generated by plot_kmeans_cpu() in experiments/server/long_safe_plots.py.
Inputs and transformation:
- Reads kmeans_numba_thread_sweep.csv.
- Groups by threads; plots median warm runtime and speedup relative to 1 thread.
- Shape: N=1,000,000, d=64, K=20.

Conclusion supported by this figure:
Numba scaling improves substantially up to about 32 to 64 threads in this experiment. The 128-thread point is slower than 32/64, so using all server threads is not automatically better.
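
Speedup relative to 1 thread is the 1-thread median runtime divided by the median runtime at each thread count. The sketch below shows that arithmetic; the function name `thread_speedup` is hypothetical, and it assumes the sweep CSV exposes `threads` and `warm_median_s` columns and contains at least one 1-thread row.

```python
from collections import defaultdict
from statistics import median


def thread_speedup(rows):
    """Map thread count -> (median runtime in s, speedup vs. 1 thread).

    Assumed column names: threads, warm_median_s.
    """
    by_threads = defaultdict(list)
    for row in rows:
        by_threads[int(row["threads"])].append(float(row["warm_median_s"]))
    medians = {t: median(vals) for t, vals in by_threads.items()}
    baseline = medians[1]  # raises KeyError if the sweep has no 1-thread row
    # Speedup above 1.0 means faster than the single-thread baseline.
    return {t: (medians[t], baseline / medians[t]) for t in sorted(medians)}
```

A speedup that drops when the thread count rises, as reported for the 128-thread point, shows up here as a smaller second tuple element than at the preceding thread count.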

figures/kmeans_memory_scaling.png

Generated by plot_kmeans_cpu() in experiments/server/long_safe_plots.py.
Inputs and transformation:
- Reads kmeans_cpu_scaling.csv.
- Uses estimated_host_gib as the x-axis and host_peak_mem_mb as the y-axis.
- Groups by implementation and estimated host working set; plots median RSS at scenario end.

Conclusion supported by this figure:
Observed RSS generally increases with the estimated working set. The plot is a safety diagnostic, not a precise allocator trace, because each scenario records child RSS near the end of the run rather than peak sampled memory.
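
One way to read this diagnostic numerically is to put both axes in the same unit and take their ratio. The sketch below does that; `rss_vs_estimate` is a hypothetical helper, and it assumes that host_peak_mem_mb is expressed in MiB (1024 MiB per GiB), which the source does not confirm.

```python
from collections import defaultdict
from statistics import median

MIB_PER_GIB = 1024.0  # assumption: the "mb" column is in MiB, not 10**6 bytes


def rss_vs_estimate(rows):
    """Map (implementation, estimated GiB) -> (observed GiB, observed/estimated).

    Assumed column names: implementation, estimated_host_gib, host_peak_mem_mb.
    """
    groups = defaultdict(list)
    for row in rows:
        key = (row["implementation"], float(row["estimated_host_gib"]))
        groups[key].append(float(row["host_peak_mem_mb"]))
    out = {}
    for (impl, est_gib), vals in sorted(groups.items()):
        observed_gib = median(vals) / MIB_PER_GIB
        # Guard against a zero estimate rather than dividing by it.
        ratio = observed_gib / est_gib if est_gib > 0 else float("nan")
        out[(impl, est_gib)] = (observed_gib, ratio)
    return out
```

Because each scenario records RSS near the end of the run, a ratio below 1 does not by itself show that the estimate was too high.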

figures/permutation_cpu_runtime.png

Generated by plot_permutation_cpu() in experiments/server/long_safe_plots.py.
Inputs and transformation:
- Reads permutation_cpu_scaling.csv.
- Keeps rows with validation_status in {pass, check} and batch_R=512.
- Groups by n, p, and R; plots median warm runtime for selected large shapes.
- Axes: R and runtime.

Conclusion supported by this figure:
CPU permutation runtime grows quickly with both feature count and permutation count. The three timeout rows all occur at n=50,000, p=50,000, R=100,000, so that corner is too expensive as a CPU baseline under the 45-minute per-scenario cap.

figures/permutation_worker_sweep.png

Generated by plot_permutation_cpu() in experiments/server/long_safe_plots.py.
Inputs and transformation:
- Reads permutation_worker_sweep.csv.
- Groups by workers; plots median warm runtime.
- Shape: n=5,000, p=10,000, R=10,000, batch_R=512.

Conclusion supported by this figure:
ThreadPool workers do not scale monotonically. The 8-worker point is the best in this run; higher worker counts add overhead and do not improve runtime reliably.

figures/process_vs_thread_memory.png

Generated by plot_permutation_cpu() in experiments/server/long_safe_plots.py.
Inputs and transformation:
- Reads permutation_worker_sweep.csv.
- Groups by workers; plots median child RSS together with speedup relative to 1 worker.

Conclusion supported by this figure:
The memory/runtime trade-off is non-monotonic. More workers can increase RSS without giving proportional speedup, so worker count should be tuned rather than maximized.
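
Tuning the worker count from the sweep can be sketched as a small table of runtime, RSS, speedup, and parallel efficiency (speedup divided by workers) per worker count. The helper names `worker_tradeoff` and `fastest_worker_count` are hypothetical, and the column names `warm_median_s` and `host_peak_mem_mb` are assumed to be shared with the k-means CSVs; the real sweep file may name them differently.

```python
from collections import defaultdict
from statistics import median


def worker_tradeoff(rows):
    """Per-worker-count summary, sorted by worker count.

    Assumed column names: workers, warm_median_s, host_peak_mem_mb.
    """
    runtime = defaultdict(list)
    rss = defaultdict(list)
    for row in rows:
        w = int(row["workers"])
        runtime[w].append(float(row["warm_median_s"]))
        rss[w].append(float(row["host_peak_mem_mb"]))
    # median() raises StatisticsError if the sweep has no 1-worker row.
    baseline = median(runtime[1])
    table = []
    for w in sorted(runtime):
        t = median(runtime[w])
        speedup = baseline / t
        table.append({
            "workers": w,
            "runtime_s": t,
            "rss_mb": median(rss[w]),
            "speedup": speedup,
            "efficiency": speedup / w,  # 1.0 would be ideal linear scaling
        })
    return table


def fastest_worker_count(table):
    """Worker count with the lowest median runtime in the sweep."""
    return min(table, key=lambda r: r["runtime_s"])["workers"]
```

Picking the minimum-runtime row rather than the largest worker count reflects the conclusion above; if memory is the tighter constraint, the `rss_mb` column can be used to break ties between similar runtimes.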