My Whatsapp group with BIGTECH bros is >> xo these days
Date: April 28th, 2026 11:49 PM
Author: https://imgur.com/a/o2g8xYK
It's invite-only, everything is AI related, and it's 180
(http://www.autoadmit.com/thread.php?thread_id=5861511&forum_id=2,#49850251)
Date: April 28th, 2026 11:50 PM
Author: https://imgur.com/a/o2g8xYK
Some CEOdood just poasted this
"so I ran some more math, and newer Xeon machines (Grand Rapids etc) with 1.5TB of RAM (MRDIMM - more throughput than DDR5) can deal with more concurrent users on MoE models like Kimi 2.6 and GLM 5.1 with longer context windows (>200k) though with reduced tok/sec/user for about 1/4-1/5 of the price of B200x8 or 2x(H100x8).
That means some of my "batching" theories (assuming I can prove out dedicated kernels on AMX and AVX-512) imply that headless coding agents that don't have a human following them and run in the background can be executed on said models in a MUCH cheaper way.
Running a first test on Grand Rapids CPUs (only managed to get 16 CPUs - GCP won't give me quota).
You would think that with the number of models being downloaded, most clouds would benefit from a proxy for those, the same way they do for Linux packages..."
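A rough back-of-envelope version of the comparison he is gesturing at; every figure below is an arbitrary placeholder to plug real measurements into, not a benchmark, and the helper name is invented for this sketch.

# For headless agents the number that matters is aggregate cost per token,
# not tok/sec per user. All inputs here are made-up placeholders.
def usd_per_million_tokens(hourly_price_usd: float,
                           concurrent_users: int,
                           tok_per_sec_per_user: float) -> float:
    aggregate_tok_per_sec = concurrent_users * tok_per_sec_per_user
    tokens_per_hour = aggregate_tok_per_sec * 3600
    return hourly_price_usd / (tokens_per_hour / 1_000_000)

# Hypothetical big-RAM Xeon box vs hypothetical 8-GPU node:
print(usd_per_million_tokens(hourly_price_usd=6.0, concurrent_users=40, tok_per_sec_per_user=5))
print(usd_per_million_tokens(hourly_price_usd=30.0, concurrent_users=40, tok_per_sec_per_user=40))

The quoted 1/4-1/5 figure is about hardware price; whether the CPU box actually wins on cost per token depends entirely on the measured aggregate throughput you feed into a calculation like this.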
(http://www.autoadmit.com/thread.php?thread_id=5861511&forum_id=2,#49850253)
Date: April 28th, 2026 11:50 PM
Author: Juan Eighty
You posted this exact thread last week
Provide updated screenshots or gtfo
(http://www.autoadmit.com/thread.php?thread_id=5861511&forum_id=2,#49850254)
Date: April 29th, 2026 12:48 AM
Author: https://imgur.com/a/o2g8xYK
I can't tell what these people are saying a lot of the time and redacting shit is tedious
https://i.imgur.com/F6Nwlsl.png
(http://www.autoadmit.com/thread.php?thread_id=5861511&forum_id=2,#49850321)
Date: April 29th, 2026 12:37 AM
Author: https://imgur.com/a/o2g8xYK
I have a couple of "code parser in the sampling loop" + "code-parser-controlled KV cache endpoints" experiments that are pretty promising, also in the context of CPU inference. I don't really have the time or the knowledge to explore them properly, but if someone here is interested, I feel this can unlock some pretty interesting ideas for keeping smaller models on track and, more generally, for exploiting the massive power that comes with controlling the inference loop end to end. I did a couple of experiments and my ideas do work (very messy repo: https://github.com/REDACTED), if someone is interested in a chat. (I think small local models + a custom inference harness + tools optimized on trace analysis / GEPA-like/autoresearch loops are potentially competitive with much much larger models for a fair amount of the "lower level but token consuming" tasks in the context of coding.)
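Since the repo link is redacted, here is a toy reconstruction, under my own assumptions, of the "code parser in the sampling loop" part (the bullets below cover the rest): the harness checks each model-proposed line with the stdlib parser and silently drops lines that break syntax, so the context the model conditions on only ever contains code that parses. The model is a stub callable and every name is invented for this sketch, not taken from the redacted repo.

# Toy "parser in the sampling loop" harness. propose_line is any callable
# that maps the current transcript to one more line of code.
import codeop
from typing import Callable, List

def generate_with_parser(propose_line: Callable[[str], str],
                         prompt: str,
                         max_lines: int = 20,
                         max_retries: int = 3) -> str:
    accepted: List[str] = []
    for _ in range(max_lines):
        for _ in range(max_retries):
            line = propose_line("\n".join([prompt] + accepted))
            source = "\n".join(accepted + [line])
            try:
                # Returns a code object if the source is complete, None if it
                # is a valid prefix of a program, raises SyntaxError if junk.
                codeop.compile_command(source, symbol="exec")
            except SyntaxError:
                continue  # resample; the model never sees its own mistake
            accepted.append(line)
            break
    return "\n".join(accepted)

Dropping the bad line and regenerating from a clean transcript is the mechanical version of the "gaslighting" bullet below; prefilling deterministic continuations would slot into the same loop before calling propose_line.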
Roughly, the idea is that if you have an AST parser + compiler/interpreter in the loop, you can do the following things:
- flag syntax errors (obv) + easily gaslight the model into thinking it wrote them correctly
- autocomplete a set of patterns to save on inference compute by just prefilling the completion
- rewind the cache / inference pointer (not sure what the proper name is) to semantically meaningful points (say, function starts)
- insert // docstrings that help the model along for trickier APIs
- for certain types of functions / unit tests, run the unit tests right after the code is generated, in parallel with inference, and easily rewind if they fail (see the sketch after this list)
- on CPU, where you are memory constrained (I guess it's the same on GPUs but I haven't done GPGPU stuff since 2012), you can run speculative loops / split the code writing across multiple cores and then merge
All of these things mostly benefit from being tailored to your own codebase / being RL-optimized to your codebase through transcript analysis, so model providers, who are more reliant on batching similar workloads and are often "too fast" for this kind of trickery, won't really benefit from it.
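A minimal sketch of the rewind-to-a-checkpoint and test-while-you-generate bullets from the list above; a plain token list stands in for the real KV cache and the model/test runner are stub callables, so every name here is a placeholder rather than anything from the redacted repo.

# Rewindable context: mark a semantic checkpoint (say a function start),
# run the unit tests off-thread, and truncate back to the checkpoint on failure.
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

class RewindableContext:
    def __init__(self) -> None:
        self.tokens: List[str] = []
        self.checkpoints: Dict[str, int] = {}

    def mark(self, name: str) -> None:
        # Record a semantically meaningful point, e.g. the start of a function.
        self.checkpoints[name] = len(self.tokens)

    def rewind(self, name: str) -> None:
        # Truncating the token list is the toy analogue of truncating a KV
        # cache back to an earlier sequence length.
        self.tokens = self.tokens[: self.checkpoints[name]]

def generate_function(ctx: RewindableContext,
                      draft_fn: Callable[[List[str]], List[str]],
                      run_tests: Callable[[List[str]], bool],
                      max_attempts: int = 3) -> bool:
    ctx.mark("fn_start")
    with ThreadPoolExecutor(max_workers=1) as pool:
        for _ in range(max_attempts):
            body = draft_fn(ctx.tokens)                 # "inference"
            ctx.tokens.extend(body)
            tests = pool.submit(run_tests, ctx.tokens)  # tests run off-thread
            # a real harness would keep generating the next chunk here while
            # the tests run; this toy just waits for the verdict
            if tests.result():
                return True
            ctx.rewind("fn_start")                      # failed: discard the draft
    return False

In a real serving stack, rewind() would presumably map to truncating the cached prefix on the server side, which I take to be what the "code parser controlled KV cache endpoints" phrase refers to.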
(http://www.autoadmit.com/thread.php?thread_id=5861511&forum_id=2,#49850311)