word2vec

TensorFlow implementation of word2vec from the paper Distributed Representations of Words and Phrases and their Compositionality. It includes 2 models, CBOW and Skip-Gram, both of which use negative sampling to speed up training.
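The negative-sampling objective shared by both models can be sketched in plain NumPy (an illustrative simplification, not this repo's TensorFlow code; all names here are placeholders):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
vocab, dim, k = 1000, 30, 5          # vocab size, vector dim, negatives per pair

in_vecs = rng.normal(scale=0.1, size=(vocab, dim))   # input (center/context) embeddings
out_vecs = rng.normal(scale=0.1, size=(vocab, dim))  # output embeddings

center, target = 3, 17               # one (input word, true output word) pair
neg = rng.integers(0, vocab, size=k) # k randomly sampled negative words

# Negative-sampling loss for this pair:
#   -log sigma(v_o . v_i) - sum_n log sigma(-v_n . v_i)
pos_score = sigmoid(out_vecs[target] @ in_vecs[center])
neg_score = sigmoid(-out_vecs[neg] @ in_vecs[center])
loss = -np.log(pos_score) - np.log(neg_score).sum()
```

Only the true output word and the k sampled words contribute to the loss, which is what makes training much cheaper than a full softmax over the vocabulary.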

Requirements

  • Python 3.6
  • TensorFlow 1.8
  • numpy
  • ...

Explanation of directories and files

  • dirs
    • data: stores training and evaluation data
    • result: stores trained word vectors and visualization images
  • files
    • const.py: stores global constants
    • dataset.py: downloads data, reads files and builds the corpus
    • 1_CBOW.py: builds and trains the CBOW model
    • 2_Skipgram.py: builds and trains the Skip-Gram model
    • visualization.py: visualizes the word vectors of the CBOW or Skip-Gram model

Data

Train

How to train

  • Train the CBOW model: python 1_CBOW.py
  • Train the Skip-Gram model: python 2_Skipgram.py
  • Fine-tune params:
    So far, the best accuracy I have achieved is about 10%, with the params below:

    • Dim of word vector: 30
    • Num of negative samples: 100
    • Batch size: 256
    • Window size: 4 (2 before, 2 after)

    I found some regularities in the params (to be written down); they are quite different from the paper's. The reason may be that the paper uses a large corpus, filters low-frequency words, and differs in other details (more experiments needed).

  • Comparison between CBOW and Skip-Gram

    • CBOW trains faster than Skip-Gram
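The window setting above (2 words before, 2 after) determines how training pairs are extracted. A minimal sketch of the two extraction schemes (illustrative only; the repo's actual logic lives in dataset.py and the training scripts):

```python
def cbow_pairs(tokens, half_win=2):
    """CBOW: predict each target word from its surrounding context window."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - half_win):i] + tokens[i + 1:i + 1 + half_win]
        pairs.append((context, target))
    return pairs

def skipgram_pairs(tokens, half_win=2):
    """Skip-Gram: predict each context word from the center word."""
    return [(t, c) for ctx, t in cbow_pairs(tokens, half_win) for c in ctx]

sent = ["the", "quick", "brown", "fox", "jumps"]
print(cbow_pairs(sent)[2])  # (['the', 'quick', 'fox', 'jumps'], 'brown')
```

This also shows why CBOW is faster: it produces one training example per position, while Skip-Gram expands each position into one example per context word.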

Evaluate

  1. Use the analogical reasoning tasks.
    So far, the best accuracy I have achieved is about 10% (to be improved).
  2. Find the top-k most similar words to a center word, e.g.:

Nearest to two: three, five, one, nine, six
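The top-k lookup above is a cosine-similarity search over the learned vectors. A self-contained sketch (the vocabulary and vectors here are tiny synthetic placeholders, not the repo's trained result):

```python
import numpy as np

def nearest(word, vocab, vecs, k=5):
    """Return the k words whose vectors have the highest cosine similarity
    to `word`'s vector (the word itself excluded)."""
    idx = vocab.index(word)
    unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)  # L2-normalize rows
    sims = unit @ unit[idx]                                    # cosine similarities
    order = np.argsort(-sims)                                  # descending
    return [vocab[i] for i in order if i != idx][:k]

vocab = ["two", "three", "cat"]
vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
print(nearest("two", vocab, vecs, k=2))  # ['three', 'cat']
```

With well-trained vectors, the same lookup produces lists like "Nearest to two: three, five, one, nine, six".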

Visualization

  • Visualize word vectors of the CBOW model: python visualization.py -m=cbow
  • Visualize word vectors of the Skip-Gram model: python visualization.py -m=skipgram
