License: MIT

Word2Vec (NumPy)

Pure-NumPy implementation of Word2Vec training (skip-gram with negative sampling).

This repository is intentionally small and focused: it implements the optimization procedure end-to-end (forward pass, loss, gradients, parameter updates) without using PyTorch/TensorFlow or other ML frameworks.

What It Trains

  • Model: Skip-gram with negative sampling (SGNS)
  • Parameters: input embeddings w_in and output embeddings w_out (both float32)
  • Loss: negative sampling objective (implemented via a numerically stable softplus form)
  • Negative sampling: unigram distribution with exponent 0.75 (configurable)
  • Subsampling: frequent-word subsampling (configurable)
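The loss and sampling pieces above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the code from word2vec.py; function and variable names here are made up for the example:

```python
import numpy as np

def softplus(x):
    # log(1 + exp(x)), computed stably for large |x|
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

def sgns_loss(w_in, w_out, center, context, negatives):
    # Per-pair SGNS objective: -log sigma(v.u_pos) - sum_k log sigma(-v.u_k),
    # rewritten as softplus(-v.u_pos) + sum_k softplus(v.u_k).
    v = w_in[center]
    pos = float(softplus(-(w_out[context] @ v)))
    neg = float(np.sum(softplus(w_out[negatives] @ v)))
    return pos + neg

def unigram_table(counts, exponent=0.75):
    # Negative-sampling distribution: unigram counts ** exponent, normalized.
    p = counts.astype(np.float64) ** exponent
    return p / p.sum()

rng = np.random.default_rng(0)
w_in = rng.normal(scale=0.1, size=(10, 8)).astype(np.float32)
w_out = np.zeros((10, 8), dtype=np.float32)   # zero-initialized output vectors
probs = unigram_table(np.arange(1, 11))
negatives = rng.choice(10, size=5, p=probs)   # toy sketch: may collide with the context word
loss = sgns_loss(w_in, w_out, center=3, context=7, negatives=negatives)
# with w_out all zero, every dot product is 0, so the loss is 6 * log(2)
```

Raising the unigram counts to 0.75 flattens the distribution, so rare words are sampled as negatives more often than their raw frequency would suggest.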

Dataset (NLTK Corpora)

The default training corpus is built from these NLTK corpora:

  • Brown
  • Reuters
  • Gutenberg

main.py loads tokens using nltk.corpus.{brown,reuters,gutenberg}.words() and concatenates them.

Download the corpora, then run training:

source .venv/bin/activate
python3 -c "import nltk; nltk.download('brown'); nltk.download('reuters'); nltk.download('gutenberg')"
python3 main.py

Token preprocessing in word2vec.py:

  • lowercasing + strip()
  • keeps only tokens matching ^[a-z']+$
  • keeps the top max_size - 1 tokens by frequency and maps the rest to <UNK>
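A minimal sketch of that cleaning and vocabulary-capping pipeline (function names are illustrative; see word2vec.py for the actual implementation):

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"^[a-z']+$")

def clean_tokens(tokens):
    # Lowercase and strip each token; keep only letters and apostrophes.
    out = []
    for tok in tokens:
        tok = tok.lower().strip()
        if TOKEN_RE.match(tok):
            out.append(tok)
    return out

def build_vocab(tokens, max_size):
    # Keep the top (max_size - 1) tokens by frequency; everything else -> <UNK>.
    counts = Counter(tokens)
    vocab = {"<UNK>": 0}
    for word, _ in counts.most_common(max_size - 1):
        vocab[word] = len(vocab)
    return vocab

tokens = clean_tokens(["The", "cat", "sat", "on", "the", "mat", ",", "don't"])
vocab = build_vocab(tokens, max_size=4)
ids = [vocab.get(t, vocab["<UNK>"]) for t in tokens]
```

Note that the `,` token is dropped by the regex, while `don't` survives because apostrophes are allowed.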

Setup

This project targets Python >=3.11 (tested on 3.14).

Option A: Install with uv (fast, uses uv.lock)

uv sync

Option B: Install with pip (no uv required)

python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install $(python3 -c "import tomllib, pathlib; d=tomllib.loads(pathlib.Path('pyproject.toml').read_text()); print(' '.join(d['project']['dependencies']))")

Outputs

Training writes artifacts into output/ (not committed to git):

  • output/vectors.txt: learned input embeddings (w_in) in a text format (header: num_words vector_dim)
  • output/loss_history.txt: batch-averaged loss values written line-by-line
  • output/loss_plot.png: plot of loss vs total update steps
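The vectors.txt format can be read back with a few lines of NumPy. This loader is a sketch written against the description above (a `num_words vector_dim` header, then presumably one `word v1 v2 …` line per row), not code from the repository:

```python
import numpy as np

def load_vectors(path):
    # Parse the "num_words vector_dim" header, then one word + floats per line.
    with open(path, encoding="utf-8") as f:
        n, dim = map(int, f.readline().split())
        words = []
        vecs = np.empty((n, dim), dtype=np.float32)
        for i in range(n):
            parts = f.readline().rstrip("\n").split(" ")
            words.append(parts[0])
            vecs[i] = np.asarray(parts[1 : dim + 1], dtype=np.float32)
    return words, vecs
```

Once loaded, nearest neighbors can be found by cosine similarity over the rows of the returned matrix.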

Project Structure

  • task.md: original task statement
  • main.py: builds the dataset (NLTK corpora) and runs training
  • config.py: hyperparameters and output paths
  • word2vec.py:
    • CorpusVocabulary: vocabulary building, token cleaning, subsampling distribution
    • Word2VecTrainer: SGNS training loop (forward/loss/gradients/updates)
    • helpers: negative sampling, sigmoid, softplus, save/plot utilities

Tests

Most of the tests were AI-generated to save time; I have reviewed and run them, and they pass. More information about the tests is in docs/tests.md.

Notes on Performance

There is a separate numba branch that experiments with speeding up the training loop using Numba (reported ~10x speedup). The branch is still under development.
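The kind of inner-loop kernel Numba accelerates can be sketched as below. This is illustrative only, not code from the numba branch; the import falls back to plain Python when Numba is not installed:

```python
import numpy as np

try:
    from numba import njit  # optional dependency
except ImportError:
    def njit(fn):            # fall back to plain Python if Numba is absent
        return fn

@njit
def sgd_pair_update(v, u, grad_coef, lr):
    # In-place SGD step on one (input, output) vector pair; explicit loops
    # like this are where Numba's JIT pays off over interpreted Python.
    for i in range(v.shape[0]):
        gv = grad_coef * u[i]   # gradient w.r.t. the input vector component
        gu = grad_coef * v[i]   # gradient w.r.t. the output vector component
        v[i] -= lr * gv
        u[i] -= lr * gu

v = np.ones(4, dtype=np.float32)
u = np.full(4, 2.0, dtype=np.float32)
sgd_pair_update(v, u, 0.5, 0.1)
```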

Future Improvements

Numba JIT compilation offers a significant performance benefit on the CPU, making the numba branch a candidate for merging once it stabilizes.

References

  • Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. "Efficient Estimation of Word Representations in Vector Space." arXiv:1301.3781 (2013). https://arxiv.org/abs/1301.3781

  • Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. "Distributed Representations of Words and Phrases and their Compositionality." arXiv:1310.4546 (2013). https://arxiv.org/abs/1310.4546
