A high-performance implementation of k-gram language models in modern C++17, demonstrating efficient probabilistic text generation through transition probability matrices.
Author: Shek Lun Leung
Institution: Stockholm University
Focus: Computational Linguistics & High-Performance Computing
This implementation is accompanied by a comprehensive research paper that provides:
- Rigorous mathematical formalization with formal complexity proofs
- Comprehensive algorithmic analysis
- Experimental validation across multiple corpora
- Literature review connecting classical and modern approaches
- Research-level exposition suitable for academic publication
📄 Read the full paper (14 pages)
The paper presents a research-quality implementation, showcasing algorithmic design, systems programming, and high-performance probabilistic modeling expertise.
This program implements a statistical language model that:
- Learns k-gram frequencies and transition probabilities from text corpora
- Generates new text that follows similar statistical patterns
- Demonstrates efficient C++17 systems programming with modern STL usage
- ✅ Modern C++17 - RAII, move semantics, STL algorithms
- ✅ Efficient Sampling - Inverse transform sampling with O(n) complexity
- ✅ Nested Hash Maps - O(1) expected-time probability lookups
- ✅ Zero Raw Pointers - Full STL memory management
- ✅ High-Quality RNG - Mersenne Twister (std::mt19937)
- ✅ Robust Error Handling - Graceful handling of edge cases (empty input, dead ends)
The model computes conditional probabilities using maximum likelihood estimation:
P(c | w) = N(w, c) / N(w)
Where:
wis a k-gram (substring of length k)cis the next characterN(w, c)is the count of character c following k-gram wN(w)is the total count of k-gram w
| Operation | Time Complexity | Space Complexity |
|---|---|---|
| Training | O(n·k) | O(n·k) |
| Generation | O(m·k) | O(1) |
Where n = corpus size, k = k-gram length, m = output length
See the research paper for formal proofs.
- C++17 compatible compiler (g++ 7+, clang++ 5+)
- GNU Make
makeThis creates an executable named slm in the current directory.
./slm <k> <input-file> <output-length>Arguments:
k- Length of k-grams (must be positive integer)input-file- Path to training text fileoutput-length- Number of characters to generate
Generate 500 characters with 5-grams from Shakespeare:
./slm 5 sample.txt 500Short output with 2-grams:
./slm 2 space_poem.txt 200High-order model with 10-grams:
./slm 10 sample.txt 1000.
├── README.md # This file
├── Makefile # Build configuration
├── src/ # Source code
│ ├── main.cpp # Entry point and CLI
│ ├── KGramStats.h # K-gram statistics interface
│ ├── KGramStats.cpp # Probability computation & sampling
│ ├── LanguageModel.h # Language model interface
│ └── LanguageModel.cpp # Training & text generation
├── paper/ # Research paper
│ ├── paper.pdf # Compiled PDF (14 pages)
│ ├── paper.tex # LaTeX source
│ └── PAPER_README.md # Paper documentation
├── sample.txt # Sample training data
├── space_poem.txt # Example corpus
└── .gitignore # Git ignore rules
LanguageModel
├── int k # K-gram size
└── KGramStats stats # Statistical model
KGramStats
├── int k # K-gram size
├── map<string, int> kgramFreq # K-gram frequencies
├── map<string, map<char, int>> transitions # Transition counts
└── mt19937 gen # Random number generator
- Slides a window of size
kover input text - Records each k-gram and its successor character
- Updates frequency and transition maps
- Time: O(n·k), Space: O(V·k) where V = distinct k-grams
- Normalizes transition counts to probabilities
- Uses inverse transform sampling on discrete CDF
- Returns sampled character from P(· | w)
- Time: O(|Σ_w|) where Σ_w = chars following k-gram w
- Samples initial k-gram weighted by frequency
- Iteratively samples next character given current k-gram
- Updates k-gram via sliding window (shift + append)
- Robustness: Automatically handles "dead ends" (unseen contexts) by restarting from a random k-gram
- Time: O(m·k), Space: O(k)
| K-gram Size | Memory Usage | Perplexity | Best Use Case |
|---|---|---|---|
| k = 1 | Low | High | Simple patterns |
| k = 3 | Moderate | Medium | Balanced quality |
| k = 5 | Moderate | Low | Recommended |
| k = 7 | High | Very Low | Large corpora only |
| k = 10 | Very High | Lowest | Overfitting risk |
Note: Higher k-values produce more coherent text but require larger training data to avoid overfitting.
This project demonstrates several competencies valuable for computational research:
- Algorithm Design - Efficient probabilistic sampling algorithms
- Complexity Analysis - Rigorous theoretical bounds with formal proofs
- Systems Programming - Memory-efficient data structures, optimal STL usage
- Software Engineering - Clean architecture, encapsulation, maintainability
- High-Performance Computing (HPC) - Simulation kernels
- Probabilistic Programming - Bayesian inference engines
- Monte Carlo Methods - Stochastic sampling
- Text Analysis - Stylometry, authorship attribution
This work represents an independent deep-dive into statistical language modeling and systems-level optimization, featuring:
- Formal mathematical framework
- Comprehensive experimental validation
- Literature review and related work analysis
- Publication-quality exposition
For full details, see the research paper.
make # Build the slm executable
make clean # Remove compiled files
make rebuild # Clean and build-std=c++17- Enable C++17 features-Wall -Wextra- Enable comprehensive warnings-O3- Optimization level 3 (for production)
- Modern C++ idioms (RAII, move semantics)
- No raw pointers or manual memory management
- STL containers and algorithms preferred
- Const-correctness enforced
Input (k=5, sample.txt):
To be or not to be: that is the question...
Generated:
To be or not to be: that is the question: Whether 'tis nobler in the mind
to suffer The slings and arrows of outrageous fortune, Or to take arms
against a sea of troubles, And by opposing end them? To die: to sleep...
The model captures syntactic patterns, punctuation usage, and stylistic elements from the training corpus.
- Shannon, C.E. (1948) - "A Mathematical Theory of Communication"
- Brown et al. (1992) - "Class-Based N-gram Models of Natural Language"
- Goodman (2001) - "A Bit of Progress in Language Modeling"
- SRILM - Industry-standard toolkit (Stolcke, 2002)
- KenLM - Optimized for speed and memory (Heafield, 2011)
- Grave et al. (2016) - Improving neural models with n-gram caching
- Neural language models (LSTMs, Transformers)
For complete bibliography, see the research paper.
This is primarily an academic/portfolio project. For suggestions or issues:
- Open an issue describing the problem or enhancement
- For major changes, please discuss first
This project is open source and available under the MIT License.
MIT License
Copyright (c) 2026 Shek Lun Leung
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
Shek Lun Leung
Stockholm University
Email: [email protected]
For academic inquiries or collaboration opportunities, feel free to reach out.
This project was developed to investigate the intersection of classical statistical NLP and modern C++ systems programming. It serves as a case study in implementing efficient probabilistic data structures and rigorous complexity analysis for computational research.
This project includes Python bindings using pybind11, allowing you to use the high-performance C++ model directly from Python. This demonstrates the "best of both worlds": C++ performance with Python's ease of use.
pybind11(pip install pybind11)cmake(Install via your package manager, e.g.,brew install cmake)
mkdir -p build && cd build
cmake ..
make
cp my_cpp_model*.so ..
cd ..import my_cpp_model
# Initialize with k=5
model = my_cpp_model.LanguageModel(5)
# Train
model.train("Your training text here...")
# Generate
generated_text = model.generate_text(100)
print(generated_text)A benchmark script is included to compare the C++ implementation against an optimized pure Python equivalent.
python benchmark.pyTypical Results:
- Training Speedup: >2.2x faster than Python
- Generation Speedup: >4.5x faster than Python
⭐ If you find this project useful for learning or research, please consider starring the repository!