This is a living document in which I track efficiency progress at the bottom of the article.
“I fed a picture of Donald Duck to my computer and asked it to identify said duck. I then asked my computer to draw Donald Duck’s girlfriend in SVG format on my shitty decade-old GTX 1080. It ran into an infinite loop since the number of polka dots kept exceeding its context window. So finally I decided to generate a textured 3D mesh from Donald Duck, and it did so within 5 minutes without much error; then I approved my computer sending it to my 3D printer.”
This comment is not something I would have made 10 years ago, nor would I have expected it 5 years ago, or even 3 years ago, once LLM quantization techniques stopped being theoretical. Nor did I think it was a comment I would be making 6 months ago. It’s not exactly surprising, except for how quickly I was dragged into it, within the past few months, with my limited resources.
If we split the comment into sentences and the year each became believable:
“… asked my computer to identify said duck”1 – 2021
“… draw Donald Duck’s girlfriend in SVG format” – 2022
“… on my decade-old GTX 1080” – ~2023–2024
“… generate a textured 3D mesh from Donald Duck” – ~2023
“… within 5 minutes” – 2025
Now try saying that to most people in 2000.
Since publishing this post:
- February 2026: “Run Llama 70B on 24GB RTX 3090” https://github.com/xaskasdf/ntransformer (though this won’t run on a GTX 1080, since CUDA support for it was dropped)
- March 2026: “inference framework for 1-bit LLMs (e.g., BitNet b1.58). It offers a suite of optimized kernels that support fast and lossless inference of 1.58-bit models on CPU and GPU” https://github.com/microsoft/BitNet
- “bitnet.cpp achieves speedups of 1.37x to 5.07x on ARM CPUs, with larger models experiencing greater performance gains. Additionally, it reduces energy consumption by 55.4% to 70.0%, further boosting overall efficiency. On x86 CPUs, speedups range from 2.37x to 6.17x with energy reductions between 71.9% to 82.2%. Furthermore, bitnet.cpp can run a 100B BitNet b1.58 model on a single CPU, achieving speeds comparable to human reading (5-7 tokens per second), significantly enhancing the potential for running LLMs on local devices.”
- April 2026: “1-bit Bonsai” https://github.com/PrismML-Eng/Bonsai-demo/tree/main
- “On raw benchmark averages, 1-bit Bonsai 8B remains competitive with leading 8B-class models, but it does so at just 1.15 GB memory footprint, roughly 12-14x smaller than its peers. […] On an RTX 4090, it reaches 368 tokens per second […] 1-bit Bonsai 8B uses substantially less energy than its 16-bit full-precision counterparts, delivering roughly 4-5x better energy efficiency. […] these gains come primarily from the reduced memory footprint of 1-bit models, not yet from fully exploiting the 1-bit structure of the weights during inference. In other words, Bonsai already delivers substantial advantages on hardware that was not built for this class of model. […] In linear layers such as MLPs, 1-bit weights make it possible to perform inference with little or no multiplication, replacing much of the computation with simple additions.”
- I am skeptical of organizations using their own benchmarks to demonstrate performance gains… but we will see; I can run this on my own hardware, and it supports CUDA 12 for my trusty ol’ GTX 1080. (A sketch of the add-only trick both projects lean on follows below.)
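Before I do, here is a back-of-the-napkin sketch of why those 1-bit claims are plausible at all: the “absmean” ternary quantization described in the BitNet b1.58 paper, plus the multiplication-free linear layer the Bonsai quote alludes to. The function names, the single per-tensor scale, and the plain numpy loop are my own simplifications for illustration, not the actual kernels from either repo (the real ones pack weights into a couple of bits and operate on quantized activations):

```python
import numpy as np

def absmean_ternary_quantize(w: np.ndarray):
    """Quantize a float weight matrix to {-1, 0, +1} plus one scale.

    Mimics the "absmean" scheme from the BitNet b1.58 paper; the name
    and interface here are mine, not bitnet.cpp's actual API.
    """
    scale = np.abs(w).mean() + 1e-8                # single per-tensor scale
    w_q = np.clip(np.round(w / scale), -1, 1).astype(np.int8)
    return w_q, float(scale)

def ternary_linear(x: np.ndarray, w_q: np.ndarray, scale: float) -> np.ndarray:
    """Compute y ~= x @ w.T using only additions and subtractions.

    With every weight in {-1, 0, +1}, each output element is just
    (sum of inputs where w == +1) - (sum of inputs where w == -1),
    rescaled once at the end. No per-weight multiplications.
    """
    out = np.zeros((x.shape[0], w_q.shape[0]), dtype=np.float32)
    for i in range(w_q.shape[0]):                  # one output feature at a time
        out[:, i] = x[:, w_q[i] == 1].sum(axis=1) - x[:, w_q[i] == -1].sum(axis=1)
    return out * scale

# quick self-check against the ordinary float matmul
rng = np.random.default_rng(0)
w = rng.normal(size=(16, 64)).astype(np.float32)
x = rng.normal(size=(2, 64)).astype(np.float32)
w_q, scale = absmean_ternary_quantize(w)
print(ternary_linear(x, w_q, scale))               # rough approximation of x @ w.T
```

Quantizing an already-trained float model this crudely wrecks its accuracy, which is why BitNet-style models are trained with ternary weights from the start; the point of the sketch is only that the inner loop really is nothing but sums and a single rescale, which is where those energy numbers come from.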
1. … in natural language