Large language models (LLMs) have taken over the world of AI, offering vast knowledge in an instant. However, running these models locally typically requires a server or a PC with a powerful accelerator. Microsoft Research has introduced a new “1-bit” LLM with two billion parameters that can run on a CPU.
Microsoft’s 1-bit LLM is trained on a corpus of 4 trillion tokens and offers performance equivalent to leading open-weight, full-precision models of similar size, while being efficient in terms of memory, energy, and latency.
“You can now run 100B parameter models on your local CPU without GPUs. Microsoft finally open-sourced their 1-bit LLM inference framework called bitnet.cpp:
> 6.17x faster inference
> 82.2% less energy on CPUs
> Supports Llama3, Falcon3, and BitNet models”
— Lior⚡ (@LiorOnAI), April 19, 2025
The model demonstrates language understanding, mathematical reasoning, coding proficiency, and conversational ability. It was trained from scratch and supports a maximum sequence length of 4,096 tokens. The activations are quantised to 8-bit integers, and the weights are quantised to ternary values {-1, 0, +1} using absmean quantisation during the forward pass.
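To make that quantisation scheme concrete, here is a minimal illustrative sketch of absmean ternary weight quantisation together with absmax 8-bit activation quantisation. It is not Microsoft’s training code; the per-tensor scaling and the helper names are simplifying assumptions for illustration.

```python
# Illustrative sketch of the quantisation described above: weights mapped to
# {-1, 0, +1} with an absmean scale, activations mapped to 8-bit integers.
# Per-tensor scaling is a simplification; it is not Microsoft's actual code.
import numpy as np

def absmean_ternary(W: np.ndarray, eps: float = 1e-8):
    """Quantise a weight matrix to ternary values {-1, 0, +1}."""
    gamma = np.abs(W).mean() + eps              # absmean scale
    W_q = np.clip(np.round(W / gamma), -1, 1)   # round, then clip to ternary
    return W_q.astype(np.int8), gamma           # dequantise later as W_q * gamma

def absmax_int8(x: np.ndarray, eps: float = 1e-8):
    """Quantise activations to 8-bit integers with an absmax scale."""
    scale = 127.0 / (np.abs(x).max() + eps)
    x_q = np.clip(np.round(x * scale), -128, 127)
    return x_q.astype(np.int8), scale

W = np.random.randn(4, 4)
W_q, gamma = absmean_ternary(W)
print(W_q)  # every entry is -1, 0, or +1
```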
The BitNet b1.58 2B4T model is also open source and requires only 0.4 GB of memory, whereas other similarly sized models typically need between 2 and 5 GB. It can run on a single ARM- or x86-based CPU. The Microsoft Research team is continuing to study the model to better understand how it works and to explore how this technology could serve as an alternative to large language models that require expensive GPUs.
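The 0.4 GB figure follows almost directly from the ternary weight format: a value restricted to {-1, 0, +1} carries about 1.58 bits of information, so roughly two billion such weights fit in about 0.4 GB before activations and the KV cache are counted. A quick back-of-envelope check:

```python
# Back-of-envelope check of the 0.4 GB figure for ~2 billion ternary weights.
import math

params = 2e9
bits_per_weight = math.log2(3)                  # ≈ 1.58 bits for {-1, 0, +1}
weight_gb = params * bits_per_weight / 8 / 1e9  # bits -> bytes -> GB
print(f"{weight_gb:.2f} GB")                    # ≈ 0.40 GB (weights only)
```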
Interested users can download BitNet b1.58 2B4T from Hugging Face and test it for themselves. The model outperforms other similarly sized language models such as LLaMA 3.2 1B, Gemma-3 1B, Qwen2.5 1.5B, SmolLM2 1.7B, and MiniCPM 2B.
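For readers who want to try it, the sketch below loads the model through the standard Hugging Face transformers API and generates a short reply. The repo id is an assumption to be checked against the official model card, and running the model requires a transformers build with BitNet support.

```python
# Minimal sketch: load BitNet b1.58 2B4T from Hugging Face and generate text.
# The repo id below is assumed; confirm it on the official model card, and
# make sure your installed transformers build supports BitNet models.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/BitNet-b1.58-2B-4T"  # assumed Hugging Face repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

prompt = "Explain 1-bit quantisation in one sentence."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```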