TensorFlow implementation of word2vec from the paper "Distributed Representations of Words and Phrases and their Compositionality". It includes two models, CBOW and Skip-Gram, both of which use negative sampling to speed up training.
- Python 3.6
- TensorFlow 1.8
- numpy
- ...
- dirs
- data: stores the training data and evaluation data
- result: stores the trained word vectors and visualization images
- files
- const.py: stores global constants
- dataset.py: downloads the data, reads files and builds the corpus
- 1_CBOW.py: builds and trains the CBOW model
- 2_Skipgram.py: builds and trains the Skip-Gram model (a minimal sketch of the sampled-loss setup follows this list)
- visualization.py: visualizes the word vectors of the CBOW or Skip-Gram model
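For reference, here is a minimal sketch of how one of these models (Skip-Gram) can be wired up with a sampled loss in the TF 1.x graph API. It uses tf.nn.nce_loss as the sampled objective; all names and values are illustrative, not necessarily what 2_Skipgram.py actually does.

```python
# Minimal Skip-Gram sketch with a sampled (negative-sampling-style) loss.
# Illustrative only: names and values are not the repo's actual code.
import tensorflow as tf

VOCAB_SIZE, EMBED_DIM, NUM_SAMPLED, BATCH_SIZE = 50000, 30, 100, 256

center = tf.placeholder(tf.int32, shape=[BATCH_SIZE])      # center word ids
context = tf.placeholder(tf.int32, shape=[BATCH_SIZE, 1])  # true context ids

# Input embedding matrix: one row per vocabulary word.
embeddings = tf.Variable(
    tf.random_uniform([VOCAB_SIZE, EMBED_DIM], -1.0, 1.0))
# Output weights/biases used by the sampled loss.
out_w = tf.Variable(
    tf.truncated_normal([VOCAB_SIZE, EMBED_DIM], stddev=EMBED_DIM ** -0.5))
out_b = tf.Variable(tf.zeros([VOCAB_SIZE]))

embed = tf.nn.embedding_lookup(embeddings, center)

# The loss samples NUM_SAMPLED negative words per example instead of
# computing a full softmax over the vocabulary, which is what makes
# training fast on a large vocab.
loss = tf.reduce_mean(tf.nn.nce_loss(
    weights=out_w, biases=out_b, labels=context, inputs=embed,
    num_sampled=NUM_SAMPLED, num_classes=VOCAB_SIZE))
train_op = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
```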
- For training:
I use text8 (http://mattmahoney.net/dc/text8.zip). It contains 17005207 words and its vocab size is 253854.
- For evaluation:
Based on question-word.txt (https://storage.googleapis.com/google-code-archive-source/v2/code.google.com/word2vec/source-archive.zip), which contains 8869 semantic cases and 10676 syntactic cases.
I drop the cases containing words that are not in my vocab, which leaves 506 semantic cases and 8946 syntactic cases for evaluation.
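A sketch of that filtering step might look like the following; the function name and details are hypothetical, not the repo's actual API:

```python
# Hypothetical sketch of the vocabulary filter described above: keep only
# analogy questions whose four words all appear in the vocabulary.
def load_analogy_questions(path, vocab):
    kept = []
    with open(path) as f:
        for line in f:
            if line.startswith(':'):  # section header, e.g. ": capital-common-countries"
                continue
            words = line.lower().split()
            if len(words) == 4 and all(w in vocab for w in words):
                kept.append(words)
    return kept
```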
How to train
- train CBOW model: python 1_CBOW.py
- train Skip-Gram model: python 2_Skipgram.py
Fine-tuning params:
So far, the best accuracy I have achieved is about 10%, with the params below (a possible const.py layout is sketched after this list):
- Dim of word vector: 30
- Num of negative samples: 100
- Batch size: 256
- Window size: 4 (2 before, 2 after)
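For concreteness, these settings could be kept in const.py as plain module-level constants; the names below are illustrative, not necessarily the repo's:

```python
# Illustrative constants mirroring the settings above; the real const.py
# may use different names.
EMBED_DIM = 30      # dim of word vector
NUM_SAMPLED = 100   # negative samples per target word
BATCH_SIZE = 256
HALF_WINDOW = 2     # context words on each side, so total window size = 4
```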
I found some regularities in the params (to be written down); they are quite different from the paper's. The reason may be that the paper uses a much bigger corpus, filters low-frequency words, and differs in other details (more experiments are needed).
Comparison between CBOW and Skip-Gram
- CBOW trains faster than Skip-Gram
- On the analogical reasoning tasks, the best accuracy I have achieved so far is about 10% (to be improved).
- Find the top-k most similar words to a center word (a cosine-similarity sketch follows the example below):
e.g.:
Nearest to two: three, five, one, nine, six,
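That lookup can be done with cosine similarity over the trained embedding matrix. A minimal sketch, assuming embeddings, word2id and id2word are outputs of training (these names are assumptions, not the repo's API):

```python
# Sketch: top-k nearest words by cosine similarity. `embeddings` is the
# trained [vocab_size, dim] matrix; `word2id`/`id2word` map words to rows.
import numpy as np

def nearest(word, embeddings, word2id, id2word, k=5):
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed[word2id[word]]
    top = np.argsort(-sims)[1:k + 1]  # index 0 is the query word itself
    return [id2word[i] for i in top]
```

For "two", this should return words like "three" and "five", as in the example above.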
