We use PTT (a popular bulletin board system (BBS) in Taiwan) and Chinese Wikipedia corpora to build count-based and prediction-based word embeddings.
On similarity/relatedness evaluation tasks, these embeddings outperform other publicly available pre-trained Chinese word embeddings.
Download
Chinese_word_embedding_count_based
| Hyperparameter | Setting |
|---|---|
| Frequency weighting | SPPMI (shift k = 10) |
| Window size | 3 |
| Dimensions | 700 |
| Top dimensions removed | 6 |
| Weighting exponent | 0.5 |
| Discover new words | no |
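The count-based pipeline implied by the table above (SPPMI weighting, truncated SVD with a singular-value weighting exponent, and removal of the leading dimensions) can be sketched as follows. This is a minimal illustration on a toy corpus, not the released code: the corpus, vocabulary, and the smaller dimension/drop counts are assumptions for demonstration only (the released vectors use 700 dimensions with the top 6 removed).

```python
import numpy as np

# Toy corpus standing in for the PTT/Wikipedia data (illustrative only).
corpus = [["我", "喜歡", "貓"], ["我", "喜歡", "狗"], ["貓", "追", "狗"]]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# 1. Co-occurrence counts within a symmetric window of 3.
window = 3
C = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                C[idx[w], idx[sent[j]]] += 1

# 2. Shifted positive PMI: max(PMI(w, c) - log k, 0) with shift k = 10.
k = 10
total = C.sum()
pw = C.sum(axis=1, keepdims=True) / total
pc = C.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore"):
    pmi = np.log((C / total) / (pw * pc))
sppmi = np.maximum(np.nan_to_num(pmi, neginf=0.0) - np.log(k), 0.0)

# 3. SVD; weight singular values by exponent 0.5 and drop the leading
#    dimensions (here 1 of a tiny matrix; the release drops 6 of 700).
U, S, _ = np.linalg.svd(sppmi)
drop, dims = 1, 3
emb = U[:, drop:drop + dims] * (S[drop:drop + dims] ** 0.5)
print(emb.shape)
```

Removing the first few SVD dimensions and taking the square root of the singular values are common post-processing choices for count-based embeddings; the table records the specific values used for this release.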
Chinese_word_embedding_CBOW
| Hyperparameter | Setting |
|---|---|
| Window size | 2 |
| Dimensions | 500 |
| Model | CBOW |
| Learning rate | 0.025 |
| Sampling rate | 0.00001 |
| Negative samples | 2 |
| Discover new words | no |
If you use these Chinese word embeddings in your work, please cite this paper:
Ying-Ren Chen (2021). Generate coherent text using semantic embedding, common sense templates and Monte-Carlo tree search methods (Master's thesis, National Tsing Hua University, Hsinchu, Taiwan).
This work is licensed under a
Creative Commons Attribution-ShareAlike 4.0 International License.
