GateGPT: 56k tokens per second Transformer (KV cache) on FPGA at 80 MHz (twitter.com)
cadamsdotcom 3 hours ago [-]
Transformers scale poorly vs. context window size and parameter count.

Which means they can be really impressive when those N’s are small!

I’m but a pundit in this area, so I don’t know much. But one wonders if there’s a future in burning larger models into FPGAs: whether big enough FPGAs exist (or can be built), and whether locating specialized compute right next to the memory it needs can speed things up.

It would likely need a lot of algorithm parallelism work that’d also translate back to CPUs/GPUs.

genxy 3 hours ago [-]
The context window is 16 characters. Talking about tokens per second is meaningless.
dominotw 3 hours ago [-]
It's not meaningless. There could be use cases like spell correction.
genxy 2 hours ago [-]
It is only interesting as an academic exercise in EDA design, just like microGPT. For something with n^2 complexity, advertising perf is clickbait.
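To put a rough number on the n^2 point, here is a minimal sketch of how attention work grows with context length when decoding with a KV cache. The width `D_MODEL = 16` is a hypothetical placeholder, not a figure from the project, and the count ignores the projections and the MLP.

```python
def attention_macs_per_token(position: int, d_model: int) -> int:
    """Multiply-accumulates for one layer's attention at a 1-based position.

    With a KV cache, the new query is scored against `position` cached keys
    (position * d_model MACs), then the resulting weights are applied to
    `position` cached values (the same amount again).
    """
    return 2 * position * d_model


def attention_macs_for_sequence(n: int, d_model: int) -> int:
    """Total attention MACs to generate n tokens; the sum grows as n^2."""
    return sum(attention_macs_per_token(t, d_model) for t in range(1, n + 1))


D_MODEL = 16  # hypothetical width, for illustration only

for n in (16, 1024):
    print(n, attention_macs_for_sequence(n, D_MODEL))
```

Per-token cost is linear in the position, so the total over a sequence is quadratic. At a 16-token context that quadratic term is still tiny, which is why throughput measured there says little about longer contexts.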
amelius 4 hours ago [-]
See also:

https://rits.shanghai.nyu.edu/ai/karpathys-microgpt-on-fpga-...

TL;DR: The CPU implementation was 71x faster than the FPGA.

Note: model has only 4192 parameters.

hedgehog 3 hours ago [-]
That post is uninteresting, both because it misses the point and because it's not clear a human was even involved to perceive a point to miss. Sure, with an unlimited transistor budget, power budget, and a design clocked at 4GHz fabbed on 5nm, one of the best CPU design teams in the world can make a thing that is straight-line faster than a one-person project running at 80MHz on a 20-year-old 65nm FPGA. Any other answer would be extremely surprising.

Now, there are a bunch of interesting things about this project. Seeing the example of a tiny transformer running on an FPGA is informative, as is the fact that it was apparently a pretty quick project for one person + robot assistance. There are probably some transferable lessons for anyone else doing robo-FPGA development.

https://github.com/fguzman82/gateGPT/tree/main/
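For scale, a back-of-the-envelope sketch using only the figures quoted in this thread: 80 MHz and 56k tokens/s from the title, and the 4 GHz CPU clock mentioned above. The function names are made up for illustration.

```python
FPGA_CLOCK_HZ = 80e6      # 80 MHz, from the post title
FPGA_TOKENS_PER_S = 56e3  # 56k tokens/s, from the post title
CPU_CLOCK_HZ = 4e9        # 4 GHz, the CPU clock assumed in the comment above


def cycles_per_token(clock_hz: float, tokens_per_s: float) -> float:
    """Clock cycles spent per generated token at a given throughput."""
    return clock_hz / tokens_per_s


def clock_ratio(fast_hz: float, slow_hz: float) -> float:
    """How many times faster one clock ticks than the other."""
    return fast_hz / slow_hz


print(round(cycles_per_token(FPGA_CLOCK_HZ, FPGA_TOKENS_PER_S)))
print(clock_ratio(CPU_CLOCK_HZ, FPGA_CLOCK_HZ))
```

Clock rate alone accounts for a 50x gap before any difference in work done per cycle, so a raw wall-clock comparison mostly measures the process node and the clock.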

cyanydeez 3 hours ago [-]
Yeah, then there's prompt loading too.

But anyone who can fit QWEN-3.6 35B with a sustained ~30 tokens/s and ~100k context with cache could print money as a hardware vendor.

upboundspiral 51 minutes ago [-]
With llama.cpp and offloading non-active experts (from the MoE architecture) to CPU RAM, you can easily run QWEN-3.6 35B at 50 tok/s on 8-12 GB of VRAM. KV cache is a few GB, experts are ~3-5 GB (assuming a q8 quant from Unsloth, for example).

You can scroll through r/localllama and find tons of people getting usable speeds out of Qwen 35B.

24 tok / second on an ancient 1080ti

https://old.reddit.com/r/LocalLLaMA/comments/1tcc7h5/24_toks...

100 tok / second on a 4070

https://old.reddit.com/r/LocalLLaMA/comments/1tjh7az/110_tok...
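A minimal sketch of where a "few GB" KV cache figure can come from, assuming standard attention on every layer. The layer count, KV head count, and head size below are placeholder values, not the real Qwen configuration; plug in the numbers from the model card.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_len: int,
    bytes_per_value: int = 2,  # 2 bytes assumes an fp16/bf16 cache
) -> int:
    """Approximate KV cache size for one sequence.

    The leading factor of 2 covers one key tensor and one value tensor
    per layer.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


# Hypothetical configuration, for illustration only.
size = kv_cache_bytes(n_layers=40, n_kv_heads=4, head_dim=128, context_len=100_000)
print(f"{size / 2**30:.2f} GiB")
```

The cache grows linearly with context length, so halving the context or quantizing the cache to 1 byte per value halves the footprint.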

wmf 3 hours ago [-]
That just sounds like a 3090.
cyanydeez 1 hours ago [-]
Not at the VRAM sizes that control how much context you can load; also, GPUs aren't as efficient as direct inference.