Date: April 28th, 2026 11:50 PM
Author: https://imgur.com/a/o2g8xYK
Some CEOdood just poasted this
"so I ran some more math, and newer Xeon machines (Granite Rapids etc.) with 1.5TB of RAM (MRDIMM - more throughput than DDR5) can handle more concurrent users on MoE models like Kimi 2.6 and GLM 5.1 with longer context windows (>200k), though at reduced tok/sec/user, for about 1/4-1/5 of the price of B200x8 or 2x(H100x8).
That means that, assuming I can prove out dedicated kernels for AMX and AVX-512, some of my "batching" theories hold: headless coding agents that don't have a human following them and just run in the background can be executed on said models MUCH more cheaply.
Running a first test on Granite Rapids CPUs (only managed to get 16 CPUs - GCP won't give me quota).
You would think, with the number of models being downloaded, most clouds would benefit from a caching proxy for them the same way they do for Linux packages..."
(http://www.autoadmit.com/thread.php?thread_id=5861511&forum_id=2#49850253)
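The "1/4-1/5 of the price" claim only pays off if the CPU box's throughput doesn't fall by more than the price does. A quick back-of-envelope sketch of that trade-off (every number here is a made-up placeholder, not a benchmark, just to show the arithmetic):

```python
# Back-of-envelope: when does a cheaper-but-slower big-RAM Xeon box beat
# a GPU node on cost per token for headless (latency-insensitive) agents?
# All hourly costs and throughputs below are hypothetical placeholders.

def cost_per_mtok(hourly_cost_usd, agg_tokens_per_sec):
    """USD per million generated tokens at full utilization."""
    tokens_per_hour = agg_tokens_per_sec * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000

# Hypothetical: an 8-GPU node vs a Xeon box at ~1/4.5 the hourly price,
# per the post's "1/4-1/5 of the price" claim, with the CPU box pushing
# far fewer aggregate tokens/sec across its batched users.
gpu = cost_per_mtok(hourly_cost_usd=90.0, agg_tokens_per_sec=4000)
cpu = cost_per_mtok(hourly_cost_usd=20.0, agg_tokens_per_sec=1200)

print(f"GPU node: ${gpu:.2f}/Mtok")  # 6.25 with these numbers
print(f"CPU box:  ${cpu:.2f}/Mtok")  # ~4.63 with these numbers
# With these made-up figures the CPU box wins on $/token even at ~3x
# lower aggregate throughput -- but only for jobs where nobody is
# waiting on per-user latency, which is exactly the headless-agent case.
```

The crossover is simple: the CPU box is cheaper per token whenever its throughput penalty is smaller than its price discount, which is why per-user tok/sec can drop and the economics still work for background jobs.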