Before we get started, I would like to pay my respects and express my gratitude to Jeff Hawkins (https://en.wikipedia.org/wiki/Jeff_Hawkins) for his pioneering work in this field.
Sparse Distributed Representations (SDRs) are a data format based on the theoretical operations of the mammalian neocortex. Unlike the dense, continuous vector embeddings used in standard Deep Learning (e.g., Word2Vec, BERT), SDRs use massive, highly sparse binary vectors to encode semantic meaning.
An SDR is a long binary vector (typically 2,048 bits) in which only a small fraction of the bits (~2%) are active at any time.
- Dense Embedding (Standard): [0.23, -0.91, 0.55, ...] (Continuous, Compact)
- SDR (Neuromorphic): [0, 0, 1, 0, 0, 0, 1, 0, ...] (Binary, Vast, Sparse)
The semantic meaning is distributed across the pattern of active bits.
- No single bit represents "Cat".
- The set of bits ${5, 120, 900, ...}$ represents "Cat".
- If a few bits are flipped due to noise, the semantic meaning remains intact because the overall pattern is still recognizable.
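To make the "pattern of active bits" idea concrete, here is a minimal NumPy sketch; the bit indices and counts are illustrative placeholders, not the project's actual encoder:

```python
import numpy as np

N = 2048  # SDR length used throughout this project

def make_sdr(active_bits, n=N):
    """Build a binary SDR from its set of active bit positions."""
    sdr = np.zeros(n, dtype=np.uint8)
    sdr[list(active_bits)] = 1
    return sdr

# "Cat" is not any single bit; it is this whole set of active positions
cat = make_sdr({5, 120, 900, 1024, 1500, 1999})

# Flip one bit off: the overlap with the original pattern stays high,
# so the concept is still recognizable
noisy_cat = cat.copy()
noisy_cat[120] = 0
overlap = int(np.sum(cat & noisy_cat))
print(f"{overlap} of {int(cat.sum())} bits still match")  # 5 of 6
```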
SDRs possess unique mathematical properties that make them ideal for robust AI architectures:
You can bundle multiple concepts into a single vector using a bitwise OR operation. Unlike dense vectors (where averaging mixes concepts into a blur), SDR unions retain the discrete identity of every component.
- Capacity: Because the space is so vast ($2^{2048}$ possible patterns), millions of unique patterns can coexist without colliding.
- Retrieval: You can mathematically query the union ("Is 'Cat' inside this bundle?") by checking whether the 'Cat' bits are active, as in the sketch below.
SDRs are inherently fault-tolerant.
- Signal Loss: You can turn off 50% of the active bits, and the system can still identify the original concept (the "Attractor Basin").
- Noise Injection: Random noise bits have little effect because they are unlikely to form a meaningful competing pattern by chance.
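The sketch below illustrates the signal-loss claim: half of a concept's active bits are switched off, yet a simple overlap comparison against a small (hypothetical) codebook still identifies the original concept:

```python
import numpy as np

N = 2048
rng = np.random.default_rng(0)

def random_sdr(active=40):
    sdr = np.zeros(N, dtype=np.uint8)
    sdr[rng.choice(N, active, replace=False)] = 1
    return sdr

# Small illustrative codebook of known concepts
codebook = {name: random_sdr() for name in ["cat", "dog", "car", "tree"]}

# Signal loss: switch off half of "cat"'s active bits
damaged = codebook["cat"].copy()
on = np.flatnonzero(damaged)
damaged[rng.choice(on, len(on) // 2, replace=False)] = 0

# The damaged pattern still overlaps its original far more than any other concept
best = max(codebook, key=lambda name: int(np.sum(damaged & codebook[name])))
print(best)  # "cat"
```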
| Feature | Dense Embeddings (Standard Transformers) | Sparse Distributed Representations (SDR) |
|---|---|---|
| Data Type | Float32 / Float16 | Binary (0/1) |
| Dimensions | Low (e.g., 768, 4096) | High (e.g., 2048, 16384) |
| Activity | 100% Dense | ~2% Sparse |
| Combination | Vector Addition (Blurs meaning) | Boolean Union (Preserves distinct items) |
| Noise Tolerance | Low (Values change significantly) | Extremely High |
| Biologically Plausible? | No | Yes |
In this repository, we replace the standard dense embedding layer of an LLM with an SDR Retina. This allows the Transformer to:
- Compute Efficiently: Processing sparse binary vectors is computationally cheaper than processing dense floating-point vectors.
- Compress Context: The Superposition property packs multiple tokens into a single input step (e.g., 4x Context Compression).
- Resist Noise: Accuracy stays high even when inputs are heavily corrupted.
Replacing Dense Embeddings with Sparse Distributed Representations (SDRs) in Large Language Models.
This project demonstrates that Sparse Distributed Representations (SDRs)—the data format used by the biological neocortex—can successfully replace dense vector embeddings in modern Transformers without sacrificing semantic capability.
By grafting a new "SDR Retina" (input layer) onto a generic Qwen 0.5B model, we obtained a model that is:
- Semantically Competent: Capable of general chat and instruction following.
- Mathematically Robust: Solves arithmetic and logic puzzles (via GSM8k training).
- Noise Resilient: Maintains accuracy even with 50% input signal corruption.
- Compression Verified: We have experimentally verified the 4x Lossless Context Compression capability of SDRs (though the current model checkpoint is trained on standard sequential inputs).
Standard Transformers use dense lookup tables (Embeddings) to represent tokens. This project replaces that layer with a mathematical projection:
- Input: Token Indices are converted into Sparse Binary Vectors (Length 2048, ~2% sparsity).
- The Retina: A fixed, random linear projection layer maps these sparse binary patterns into the model's dense hidden space ($d=896$).
- The Body: A standard Qwen 2.5 0.5B Transformer (pretrained weights preserved).
Flow: Token ID → SDR (2048 bits) → Linear Projection → Transformer Block
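Below is a rough PyTorch sketch of this flow. The per-token bit assignment, the number of active bits, and the frozen projection are illustrative assumptions; the actual Retina implementation in this repository may differ:

```python
import torch
import torch.nn as nn

class SDRRetina(nn.Module):
    """Minimal sketch of the SDR input layer described above.

    Each token ID is mapped to a fixed random set of ~2% active bits in a
    2048-dimensional binary vector, and a fixed random linear projection
    then maps that pattern into the Transformer's hidden space (d=896).
    """
    def __init__(self, vocab_size, sdr_dim=2048, active_bits=40, hidden_dim=896):
        super().__init__()
        gen = torch.Generator().manual_seed(0)
        # Fixed random SDR codebook: one sparse binary row per token
        codebook = torch.zeros(vocab_size, sdr_dim)
        for tok in range(vocab_size):
            idx = torch.randperm(sdr_dim, generator=gen)[:active_bits]
            codebook[tok, idx] = 1.0
        self.register_buffer("codebook", codebook)
        # Fixed random projection into the model's dense hidden space
        self.proj = nn.Linear(sdr_dim, hidden_dim, bias=False)
        self.proj.weight.requires_grad_(False)

    def forward(self, token_ids):        # (batch, seq) -> (batch, seq, hidden)
        sdrs = self.codebook[token_ids]  # sparse binary vectors
        return self.proj(sdrs)           # dense hidden states for the Transformer body

# Usage sketch (vocab size taken from whichever tokenizer is loaded):
# retina = SDRRetina(vocab_size=tokenizer.vocab_size)
# hidden = retina(torch.tensor([[1, 2, 3]]))
```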
We injected random noise into the input SDRs (flipping active bits to 0, inactive bits to 1).
- 0% Noise: Perfect accuracy.
- 20% Noise: Perfect semantic accuracy (Model stays stable).
- 50% Noise: Model still retrieves high-probability facts (e.g., "Capital of France is Paris") despite half the input data being destroyed.
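For reference, a sketch of the corruption procedure described above; flipping an equal number of active and inactive bits at the given rate is our reading of the experiment, and the exact parameterization may differ:

```python
import numpy as np

def inject_noise(sdr, noise_level, rng=None):
    """Flip a `noise_level` fraction of the active bits to 0
    and an equal number of inactive bits to 1."""
    if rng is None:
        rng = np.random.default_rng(0)
    corrupted = sdr.copy()
    active = np.flatnonzero(corrupted == 1)
    inactive = np.flatnonzero(corrupted == 0)
    n_flip = int(len(active) * noise_level)
    corrupted[rng.choice(active, n_flip, replace=False)] = 0
    corrupted[rng.choice(inactive, n_flip, replace=False)] = 1
    return corrupted
```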
Initial versions suffered from "Number Blindness" (seeing "25" as a fuzzy mix of "2" and "5").
- Phase 3 Training (GSM8k): Taught the Retina to distinguish multi-digit numbers.
- Result: The model now correctly solves 25 + 25 = 50 and performs multi-step Chain-of-Thought reasoning.
We utilized the Union Property of SDRs to test packing multiple tokens into a single input vector using rotation-based positional encoding.
- Verification: Successfully packed 4 tokens ("The quick brown fox") into 1 SDR.
- Retrieval: Achieved 100% Lossless Retrieval of the original sequence from the single packed vector using mathematical decoding.
- Status: This property is verified in our experiment scripts. A "Turbo" model trained to read these packed inputs directly is planned for V2.
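A minimal sketch of this packing experiment, using bit rotation (np.roll) as the positional encoding and overlap-based nearest-neighbor matching for decoding; the token bit patterns here are random placeholders rather than the project's actual codebook:

```python
import numpy as np

N = 2048
rng = np.random.default_rng(0)

def random_sdr(active=40):
    sdr = np.zeros(N, dtype=np.uint8)
    sdr[rng.choice(N, active, replace=False)] = 1
    return sdr

# Hypothetical codebook for the four tokens in the example sentence
vocab = {w: random_sdr() for w in ["The", "quick", "brown", "fox"]}

def pack(tokens):
    """Rotate each token's SDR by its position, then take the union."""
    packed = np.zeros(N, dtype=np.uint8)
    for pos, tok in enumerate(tokens):
        packed |= np.roll(vocab[tok], pos)
    return packed

def unpack(packed, length):
    """For each position, rotate back and pick the best-matching token by bit overlap."""
    out = []
    for pos in range(length):
        unrotated = np.roll(packed, -pos)
        out.append(max(vocab, key=lambda w: int(np.sum(unrotated & vocab[w]))))
    return out

packed = pack(["The", "quick", "brown", "fox"])
print(unpack(packed, 4))  # ['The', 'quick', 'brown', 'fox']
```

Because each position rotates its token's pattern by a different offset, the four patterns rarely collide inside the single 2048-bit vector, which is what makes lossless retrieval possible in this setting.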
# 1. Clone the repository
git clone https://github.com/your-repo/SDR_Transformer_PoC.git
cd SDR_Transformer_PoC
# 2. Install dependencies
pip install torch transformers datasets accelerate