License: MIT

Word2Vec (NumPy)

Pure-NumPy implementation of Word2Vec training (skip-gram with negative sampling).

This repository is intentionally small and focused: it implements the optimization procedure end-to-end (forward pass, loss, gradients, parameter updates) without using PyTorch/TensorFlow or other ML frameworks.

What It Trains

  • Model: Skip-gram with negative sampling (SGNS)
  • Parameters: input embeddings w_in and output embeddings w_out (both float32)
  • Loss: negative sampling objective (implemented via a numerically stable softplus form)
  • Negative sampling: unigram distribution with exponent 0.75 (configurable)
  • Subsampling: frequent-word subsampling (configurable)
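The loss and sampling pieces above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the code from word2vec.py; function and variable names here are made up for the example:

```python
import numpy as np

def softplus(x):
    # log(1 + exp(x)), computed stably for large |x|
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

def sgns_loss(w_in, w_out, center, context, negatives):
    # Per-pair SGNS objective: -log sigma(v.u_pos) - sum_k log sigma(-v.u_k),
    # rewritten as softplus(-v.u_pos) + sum_k softplus(v.u_k).
    v = w_in[center]
    pos = float(softplus(-(w_out[context] @ v)))
    neg = float(np.sum(softplus(w_out[negatives] @ v)))
    return pos + neg

def unigram_table(counts, exponent=0.75):
    # Negative-sampling distribution: unigram counts ** exponent, normalized.
    p = counts.astype(np.float64) ** exponent
    return p / p.sum()

rng = np.random.default_rng(0)
w_in = rng.normal(scale=0.1, size=(10, 8)).astype(np.float32)
w_out = np.zeros((10, 8), dtype=np.float32)   # zero-initialized output vectors
probs = unigram_table(np.arange(1, 11))
negatives = rng.choice(10, size=5, p=probs)   # toy sketch: may collide with the context word
loss = sgns_loss(w_in, w_out, center=3, context=7, negatives=negatives)
# with w_out all zero, every dot product is 0, so the loss is 6 * log(2)
```

Raising the unigram counts to 0.75 flattens the distribution, so rare words are sampled as negatives more often than their raw frequency would suggest.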

Dataset (NLTK Corpora)

The default training corpus is built from these NLTK corpora:

  • Brown
  • Reuters
  • Gutenberg

main.py loads tokens using nltk.corpus.{brown,reuters,gutenberg}.words() and concatenates them.

Download the corpora, then run training:

source .venv/bin/activate
python3 -c "import nltk; nltk.download('brown'); nltk.download('reuters'); nltk.download('gutenberg')"
python3 main.py

Token preprocessing in word2vec.py:

  • lowercasing + strip()
  • keeps only tokens matching ^[a-z']+$
  • keeps the top max_size - 1 tokens by frequency and maps the rest to <UNK>
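A minimal sketch of that cleaning and vocabulary-capping pipeline (function names are illustrative; see word2vec.py for the actual implementation):

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"^[a-z']+$")

def clean_tokens(tokens):
    # Lowercase and strip each token; keep only letters and apostrophes.
    out = []
    for tok in tokens:
        tok = tok.lower().strip()
        if TOKEN_RE.match(tok):
            out.append(tok)
    return out

def build_vocab(tokens, max_size):
    # Keep the top (max_size - 1) tokens by frequency; everything else -> <UNK>.
    counts = Counter(tokens)
    vocab = {"<UNK>": 0}
    for word, _ in counts.most_common(max_size - 1):
        vocab[word] = len(vocab)
    return vocab

tokens = clean_tokens(["The", "cat", "sat", "on", "the", "mat", ",", "don't"])
vocab = build_vocab(tokens, max_size=4)
ids = [vocab.get(t, vocab["<UNK>"]) for t in tokens]
```

Note that the `,` token is dropped by the regex, while `don't` survives because apostrophes are allowed.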

Setup

This project targets Python >=3.11 (tested on 3.14).

Option A: Install with uv (fast, uses uv.lock)

uv sync

Option B: Install with pip (no uv required)

python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install $(python3 -c "import tomllib, pathlib; d=tomllib.loads(pathlib.Path('pyproject.toml').read_text()); print(' '.join(d['project']['dependencies']))")

Outputs

Training writes artifacts into output/ (not committed to git):

  • output/vectors.txt: learned input embeddings (w_in) in a text format (header: num_words vector_dim)
  • output/loss_history.txt: batch-averaged loss values written line-by-line
  • output/loss_plot.png: plot of loss vs total update steps
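The vectors.txt format can be read back with a few lines of NumPy. This loader is a sketch written against the description above (a `num_words vector_dim` header, then presumably one `word v1 v2 …` line per row), not code from the repository:

```python
import numpy as np

def load_vectors(path):
    # Parse the "num_words vector_dim" header, then one word + floats per line.
    with open(path, encoding="utf-8") as f:
        n, dim = map(int, f.readline().split())
        words = []
        vecs = np.empty((n, dim), dtype=np.float32)
        for i in range(n):
            parts = f.readline().rstrip("\n").split(" ")
            words.append(parts[0])
            vecs[i] = np.asarray(parts[1 : dim + 1], dtype=np.float32)
    return words, vecs
```

Once loaded, nearest neighbors can be found by cosine similarity over the rows of the returned matrix.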

Project Structure

  • task.md: original task statement
  • main.py: builds the dataset (NLTK corpora) and runs training
  • config.py: hyperparameters and output paths
  • word2vec.py:
    • CorpusVocabulary: vocabulary building, token cleaning, subsampling distribution
    • Word2VecTrainer: SGNS training loop (forward/loss/gradients/updates)
    • helpers: negative sampling, sigmoid, softplus, save/plot utilities

Tests

Most of the tests were AI-generated to save time; I have reviewed and run them, and they pass. More information about the tests is in docs/tests.md.

Notes on Performance

There is a separate numba branch that experiments with speeding up the training loop using Numba (reported ~10x speedup). The branch is still under development.
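The kind of inner-loop kernel Numba accelerates can be sketched as below. This is illustrative only, not code from the numba branch; the import falls back to plain Python when Numba is not installed:

```python
import numpy as np

try:
    from numba import njit  # optional dependency
except ImportError:
    def njit(fn):            # fall back to plain Python if Numba is absent
        return fn

@njit
def sgd_pair_update(v, u, grad_coef, lr):
    # In-place SGD step on one (input, output) vector pair; explicit loops
    # like this are where Numba's JIT pays off over interpreted Python.
    for i in range(v.shape[0]):
        gv = grad_coef * u[i]   # gradient w.r.t. the input vector component
        gu = grad_coef * v[i]   # gradient w.r.t. the output vector component
        v[i] -= lr * gv
        u[i] -= lr * gu

v = np.ones(4, dtype=np.float32)
u = np.full(4, 2.0, dtype=np.float32)
sgd_pair_update(v, u, 0.5, 0.1)
```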

Future Improvements

Numba JIT compilation offers a significant performance benefit on the CPU, making the numba branch a candidate for merging once it stabilizes.

References

  • Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. "Efficient Estimation of Word Representations in Vector Space." arXiv:1301.3781 (2013). https://arxiv.org/abs/1301.3781

  • Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. "Distributed Representations of Words and Phrases and their Compositionality." arXiv:1310.4546 (2013). https://arxiv.org/abs/1310.4546
