Pure-NumPy implementation of Word2Vec training (skip-gram with negative sampling).
This repository is intentionally small and focused: it implements the optimization procedure end-to-end (forward pass, loss, gradients, parameter updates) without using PyTorch/TensorFlow or other ML frameworks.
- Model: Skip-gram with negative sampling (SGNS)
- Parameters: input embeddings `w_in` and output embeddings `w_out` (both `float32`)
- Loss: negative sampling objective (implemented via a numerically stable softplus form)
- Negative sampling: unigram distribution with exponent 0.75 (configurable)
- Subsampling: frequent-word subsampling (configurable)
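As an illustrative sketch of the pieces listed above (function names here are hypothetical, not the repository's actual API), the stable softplus loss, the 0.75-power noise distribution, and a frequent-word keep probability could be written as:

```python
import numpy as np

def softplus(x):
    # Numerically stable log(1 + exp(x)) = max(x, 0) + log1p(exp(-|x|))
    x = np.asarray(x, dtype=np.float64)
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

def sgns_loss(w_in, w_out, center, context, negatives):
    # Negative-sampling loss for one (center, context) pair:
    #   softplus(-u_o . v_c) + sum_k softplus(u_k . v_c)
    v = w_in[center]
    pos = softplus(-(w_out[context] @ v))
    neg = softplus(w_out[negatives] @ v).sum()
    return float(pos + neg)

def noise_distribution(counts, power=0.75):
    # Unigram counts raised to the 0.75 power, then normalized
    p = np.asarray(counts, dtype=np.float64) ** power
    return p / p.sum()

def keep_probability(counts, t=1e-5):
    # Frequent-word subsampling: keep word w with prob min(1, sqrt(t / f(w))),
    # where f(w) is the word's relative frequency (Mikolov et al., 2013)
    freq = np.asarray(counts, dtype=np.float64)
    freq = freq / freq.sum()
    return np.minimum(1.0, np.sqrt(t / freq))

rng = np.random.default_rng(0)
w_in = rng.normal(scale=0.1, size=(10, 8)).astype(np.float32)
w_out = rng.normal(scale=0.1, size=(10, 8)).astype(np.float32)
p = noise_distribution([50, 30, 20, 10, 5, 5, 5, 5, 5, 5])
negatives = rng.choice(10, size=5, p=p)
loss = sgns_loss(w_in, w_out, center=0, context=1, negatives=negatives)
```

The softplus form avoids overflow for large negative scores, which the naive `log(1 + exp(x))` does not.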
The default training corpus is built from these NLTK corpora:
- Brown
- Reuters
- Gutenberg
main.py loads tokens using nltk.corpus.{brown,reuters,gutenberg}.words() and concatenates them.
Download the datasets and run training:

```bash
source .venv/bin/activate
python3 -c "import nltk; nltk.download('brown'); nltk.download('reuters'); nltk.download('gutenberg')"
python3 main.py
```

Token preprocessing in `word2vec.py`:

- lowercasing + `strip()`
- keeps only tokens matching `^[a-z']+$`
- keeps the top `max_size - 1` tokens by frequency and maps the rest to `<UNK>`
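The preprocessing steps above can be sketched as follows (a minimal illustration; the function names are hypothetical, not the repository's actual API):

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"^[a-z']+$")

def clean_tokens(tokens):
    # Lowercase and strip each token, keep only lowercase-letter/apostrophe tokens
    out = []
    for t in tokens:
        t = t.lower().strip()
        if TOKEN_RE.match(t):
            out.append(t)
    return out

def build_vocab(tokens, max_size):
    # Keep the (max_size - 1) most frequent tokens; everything else maps to <UNK>
    counts = Counter(tokens)
    keep = [w for w, _ in counts.most_common(max_size - 1)]
    vocab = {w: i for i, w in enumerate(keep)}
    vocab["<UNK>"] = len(vocab)
    return vocab

toks = clean_tokens(["The", "cat", "cat", "sat", "42", "on", "the", "mat!"])
vocab = build_vocab(toks, max_size=4)
ids = [vocab.get(t, vocab["<UNK>"]) for t in toks]
```

Note that `"42"` and `"mat!"` are dropped by the regex filter, and rare words like `"on"` fall outside the top `max_size - 1` and map to `<UNK>`.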
This project targets Python >=3.11 (tested on 3.14).
Install dependencies with uv:

```bash
uv sync
```

or with a plain virtualenv:

```bash
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install $(python -c "import tomllib, pathlib; d=tomllib.loads(pathlib.Path('pyproject.toml').read_text()); print(' '.join(d['project']['dependencies']))")
```

Training writes artifacts into `output/` (not committed to git):

- `output/vectors.txt`: learned input embeddings (`w_in`) in a text format (header: `num_words vector_dim`)
- `output/loss_history.txt`: batch-averaged loss values written line-by-line
- `output/loss_plot.png`: plot of loss vs. total update steps
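Assuming the body of `vectors.txt` follows the common word2vec text layout of one `word v1 v2 ...` row per line after the header (an assumption; check the file for the exact format), a minimal parser could look like:

```python
import numpy as np

def parse_vectors(text):
    # First line: "num_words vector_dim"; each following line is assumed
    # to be "word v1 v2 ..." (the common word2vec text layout)
    lines = text.strip().split("\n")
    num_words, dim = map(int, lines[0].split())
    vecs = {}
    for line in lines[1:]:
        parts = line.split()
        vecs[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    assert len(vecs) == num_words
    assert all(v.shape == (dim,) for v in vecs.values())
    return vecs

sample = "2 3\nking 0.1 0.2 0.3\nqueen 0.4 0.5 0.6\n"
vecs = parse_vectors(sample)
```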
- `task.md`: original task statement
- `main.py`: builds the dataset (NLTK corpora) and runs training
- `config.py`: hyperparameters and output paths
- `word2vec.py`:
  - `CorpusVocabulary`: vocabulary building, token cleaning, subsampling distribution
  - `Word2VecTrainer`: SGNS training loop (forward/loss/gradients/updates)
  - helpers: negative sampling, `sigmoid`, `softplus`, save/plot utilities
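The forward/loss/gradients/updates cycle of the training loop can be sketched as a single SGD step (a simplified illustration under assumed names, not the repository's actual `Word2VecTrainer` code):

```python
import numpy as np

def sigmoid(x):
    # Numerically stable logistic function
    x = np.asarray(x, dtype=np.float64)
    out = np.empty_like(x)
    pos = x >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    ex = np.exp(x[~pos])
    out[~pos] = ex / (1.0 + ex)
    return out

def sgns_step(w_in, w_out, center, context, negatives, lr=0.025):
    # One SGD update for a (center, context) pair with K negatives.
    # For score s_i = u_i . v_c, the gradient wrt s_i is sigmoid(s_i) - label_i,
    # where label is 1 for the true context word and 0 for each negative.
    v = w_in[center].copy()
    idx = np.concatenate(([context], negatives))
    u = w_out[idx]                      # (K+1, dim) output vectors (copies)
    labels = np.zeros(len(idx))
    labels[0] = 1.0
    g = sigmoid(u @ v) - labels         # d(loss)/d(score), shape (K+1,)
    # Note: duplicate negative indices would need np.add.at to accumulate
    w_out[idx] = u - lr * np.outer(g, v)
    w_in[center] = v - lr * (g @ u)

# Tiny demo: repeated updates should raise the positive pair's score
rng = np.random.default_rng(1)
w_in = rng.normal(scale=0.1, size=(6, 4))
w_out = rng.normal(scale=0.1, size=(6, 4))
score_before = float(w_out[1] @ w_in[0])
for _ in range(50):
    sgns_step(w_in, w_out, center=0, context=1, negatives=np.array([2, 3]), lr=0.1)
score_after = float(w_out[1] @ w_in[0])
```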
Most of the tests were AI-generated to save time; I have reviewed and run them, and they pass.
More information about the tests is in docs/tests.md.
There is a separate `numba` branch where I experimented with speeding up the training loop using Numba's JIT on the CPU, with roughly a 10x improvement in my runs. Note that the branch is still under development.
- Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. "Efficient Estimation of Word Representations in Vector Space." arXiv:1301.3781 (2013). https://arxiv.org/abs/1301.3781
- Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. "Distributed Representations of Words and Phrases and their Compositionality." arXiv:1310.4546 (2013). https://arxiv.org/abs/1310.4546