K-Gram Language Model: C++17 Implementation

A high-performance implementation of k-gram language models in modern C++17, demonstrating efficient probabilistic text generation through transition probability matrices.

Author: Shek Lun Leung
Institution: Stockholm University
Focus: Computational Linguistics & High-Performance Computing

📚 Research Paper

This implementation is accompanied by a comprehensive research paper that provides:

Rigorous mathematical formalization with formal complexity proofs
Comprehensive algorithmic analysis
Experimental validation across multiple corpora
Literature review connecting classical and modern approaches
Research-level exposition suitable for academic publication

📄 Read the full paper (14 pages)

The paper presents a research-quality implementation, showcasing algorithmic design, systems programming, and high-performance probabilistic modeling expertise.

🎯 Overview

This program implements a statistical language model that:

Learns k-gram frequencies and transition probabilities from text corpora
Generates new text that follows similar statistical patterns
Demonstrates efficient C++17 systems programming with modern STL usage

Key Features

✅ Modern C++17 - RAII, move semantics, STL algorithms
✅ Efficient Sampling - Inverse transform sampling with O(n) complexity
✅ Nested Hash Maps - O(1) expected-time probability lookups
✅ Zero Raw Pointers - Full STL memory management
✅ High-Quality RNG - Mersenne Twister (std::mt19937)
✅ Robust Error Handling - Graceful handling of edge cases (empty input, dead ends)

🧮 Mathematical Model

K-Gram Probability

The model computes conditional probabilities using maximum likelihood estimation:

P(c | w) = N(w, c) / N(w)

Where:

w is a k-gram (substring of length k)
c is the next character
N(w, c) is the count of character c following k-gram w
N(w) is the total count of k-gram w

Complexity Analysis

Operation	Time Complexity	Space Complexity
Training	O(n·k)	O(n·k)
Generation	O(m·k)	O(1)

Where n = corpus size, k = k-gram length, m = output length

See the research paper for formal proofs.

🚀 Quick Start

Prerequisites

C++17 compatible compiler (g++ 7+, clang++ 5+)
GNU Make

Building

make

This creates an executable named slm in the current directory.

Usage

./slm <k> <input-file> <output-length>

Arguments:

k - Length of k-grams (must be positive integer)
input-file - Path to training text file
output-length - Number of characters to generate

Examples

Generate 500 characters with 5-grams from Shakespeare:

./slm 5 sample.txt 500

Short output with 2-grams:

./slm 2 space_poem.txt 200

High-order model with 10-grams:

./slm 10 sample.txt 1000

📁 Project Structure

.
├── README.md                # This file
├── Makefile                 # Build configuration
├── src/                     # Source code
│   ├── main.cpp            # Entry point and CLI
│   ├── KGramStats.h        # K-gram statistics interface
│   ├── KGramStats.cpp      # Probability computation & sampling
│   ├── LanguageModel.h     # Language model interface
│   └── LanguageModel.cpp   # Training & text generation
├── paper/                   # Research paper
│   ├── paper.pdf           # Compiled PDF (14 pages)
│   ├── paper.tex           # LaTeX source
│   └── PAPER_README.md     # Paper documentation
├── sample.txt               # Sample training data
├── space_poem.txt           # Example corpus
└── .gitignore              # Git ignore rules

🏗️ Implementation Details

Class Architecture

LanguageModel
├── int k                    # K-gram size
└── KGramStats stats         # Statistical model

KGramStats
├── int k                                      # K-gram size
├── map<string, int> kgramFreq                # K-gram frequencies
├── map<string, map<char, int>> transitions   # Transition counts
└── mt19937 gen                               # Random number generator

Core Algorithms

Training (KGramStats::addKGramTransition)

Slides a window of size k over input text
Records each k-gram and its successor character
Updates frequency and transition maps
Time: O(n·k), Space: O(V·k) where V = distinct k-grams

Sampling (KGramStats::getNextChar)

Normalizes transition counts to probabilities
Uses inverse transform sampling on discrete CDF
Returns sampled character from P(· | w)
Time: O(|Σ_w|) where Σ_w = chars following k-gram w

Generation (LanguageModel::generateText)

Samples initial k-gram weighted by frequency
Iteratively samples next character given current k-gram
Updates k-gram via sliding window (shift + append)
Robustness: Automatically handles "dead ends" (unseen contexts) by restarting from a random k-gram
Time: O(m·k), Space: O(k)

📊 Performance Characteristics

K-gram Size	Memory Usage	Perplexity	Best Use Case
k = 1	Low	High	Simple patterns
k = 3	Moderate	Medium	Balanced quality
k = 5	Moderate	Low	Recommended
k = 7	High	Very Low	Large corpora only
k = 10	Very High	Lowest	Overfitting risk

Note: Higher k-values produce more coherent text but require larger training data to avoid overfitting.

🔬 Research & Academic Use

This project demonstrates several competencies valuable for computational research:

Technical Skills

Algorithm Design - Efficient probabilistic sampling algorithms
Complexity Analysis - Rigorous theoretical bounds with formal proofs
Systems Programming - Memory-efficient data structures, optimal STL usage
Software Engineering - Clean architecture, encapsulation, maintainability

Applications

High-Performance Computing (HPC) - Simulation kernels
Probabilistic Programming - Bayesian inference engines
Monte Carlo Methods - Stochastic sampling
Text Analysis - Stylometry, authorship attribution

Research Context

This work represents an independent deep-dive into statistical language modeling and systems-level optimization, featuring:

Formal mathematical framework
Comprehensive experimental validation
Literature review and related work analysis
Publication-quality exposition

For full details, see the research paper.

🛠️ Development

Build Targets

make           # Build the slm executable
make clean     # Remove compiled files
make rebuild   # Clean and build

Compiler Flags

-std=c++17 - Enable C++17 features
-Wall -Wextra - Enable comprehensive warnings
-O3 - Optimization level 3 (for production)

Code Style

Modern C++ idioms (RAII, move semantics)
No raw pointers or manual memory management
STL containers and algorithms preferred
Const-correctness enforced

📖 Example Output

Input (k=5, sample.txt):

To be or not to be: that is the question...

Generated:

To be or not to be: that is the question: Whether 'tis nobler in the mind
to suffer The slings and arrows of outrageous fortune, Or to take arms
against a sea of troubles, And by opposing end them? To die: to sleep...

The model captures syntactic patterns, punctuation usage, and stylistic elements from the training corpus.

📚 References & Related Work

Classical Language Models

Shannon, C.E. (1948) - "A Mathematical Theory of Communication"
Brown et al. (1992) - "Class-Based N-gram Models of Natural Language"
Goodman (2001) - "A Bit of Progress in Language Modeling"

Implementation Studies

SRILM - Industry-standard toolkit (Stolcke, 2002)
KenLM - Optimized for speed and memory (Heafield, 2011)

Modern Approaches

Grave et al. (2016) - Improving neural models with n-gram caching
Neural language models (LSTMs, Transformers)

For complete bibliography, see the research paper.

🤝 Contributing

This is primarily an academic/portfolio project. For suggestions or issues:

Open an issue describing the problem or enhancement
For major changes, please discuss first

📄 License

This project is open source and available under the MIT License.

MIT License

Copyright (c) 2026 Shek Lun Leung

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

📧 Contact

Shek Lun Leung
Stockholm University
Email: [email protected]

For academic inquiries or collaboration opportunities, feel free to reach out.

🎓 Acknowledgments

This project was developed to investigate the intersection of classical statistical NLP and modern C++ systems programming. It serves as a case study in implementing efficient probabilistic data structures and rigorous complexity analysis for computational research.

🐍 Python Bindings & Benchmark

This project includes Python bindings using pybind11, allowing you to use the high-performance C++ model directly from Python. This demonstrates the "best of both worlds": C++ performance with Python's ease of use.

Prerequisites

pybind11 (pip install pybind11)
cmake (Install via your package manager, e.g., brew install cmake)

Building the Python Module

mkdir -p build && cd build
cmake ..
make
cp my_cpp_model*.so ..
cd ..

Usage

import my_cpp_model

# Initialize with k=5
model = my_cpp_model.LanguageModel(5)

# Train
model.train("Your training text here...")

# Generate
generated_text = model.generate_text(100)
print(generated_text)

Benchmarking

A benchmark script is included to compare the C++ implementation against an optimized pure Python equivalent.

python benchmark.py

Typical Results:

Training Speedup: >2.2x faster than Python
Generation Speedup: >4.5x faster than Python

⭐ If you find this project useful for learning or research, please consider starring the repository!

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
build		build
paper		paper
src		src
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
Makefile		Makefile
PUBLICATION_CHECKLIST.md		PUBLICATION_CHECKLIST.md
README.md		README.md
benchmark.py		benchmark.py
sample.txt		sample.txt
space_poem.txt		space_poem.txt

Folders and files

Latest commit

History

Repository files navigation

K-Gram Language Model: C++17 Implementation

📚 Research Paper

🎯 Overview

Key Features

🧮 Mathematical Model

K-Gram Probability

Complexity Analysis

🚀 Quick Start

Prerequisites

Building

Usage

Examples

📁 Project Structure

🏗️ Implementation Details

Class Architecture

Core Algorithms

Training (KGramStats::addKGramTransition)

Sampling (KGramStats::getNextChar)

Generation (LanguageModel::generateText)

📊 Performance Characteristics

🔬 Research & Academic Use

Technical Skills

Applications

Research Context

🛠️ Development

Build Targets

Compiler Flags

Code Style

📖 Example Output

📚 References & Related Work

Classical Language Models

Implementation Studies

Modern Approaches

🤝 Contributing

📄 License

📧 Contact

🎓 Acknowledgments

🐍 Python Bindings & Benchmark

Prerequisites

Building the Python Module

Usage

Benchmarking

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages