Full Documentation: https://pycantonese.org
PyCantonese is a Python library for Cantonese linguistics and natural language processing (NLP). Currently implemented features (more to come!):
- Accessing and searching corpus data
- Parsing and conversion tools for Jyutping romanization
- Stop words
- Word segmentation
- Part-of-speech tagging
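Once the library is installed, the features listed above can be tried with a short script. This is a minimal sketch, not an authoritative reference: the function names (`segment`, `pos_tag`, `characters_to_jyutping`, `parse_jyutping`, `stop_words`) are assumed from the version 3 documentation and may differ in other releases. The `demo_features` helper and the sample strings are hypothetical, for illustration only.

```python
import importlib.util


def demo_features(text="廣東話容唔容易學？"):
    """Run a few PyCantonese features on `text` and return the results.

    Assumption: the top-level function names below match the installed
    PyCantonese release (they are taken from the version 3 documentation).
    """
    import pycantonese  # imported lazily so this file loads without it

    words = pycantonese.segment(text)  # word segmentation
    return {
        "words": words,
        "tags": pycantonese.pos_tag(words),  # part-of-speech tagging
        "jyutping": pycantonese.characters_to_jyutping(text),
        "parsed": pycantonese.parse_jyutping("gwong2dung1waa2"),
        "stop_word_count": len(pycantonese.stop_words()),
    }


if __name__ == "__main__":
    if importlib.util.find_spec("pycantonese") is None:
        print("pycantonese is not installed; run: pip install --upgrade pycantonese")
    else:
        for name, value in demo_features().items():
            print(name, value)
```

The import sits inside the function so that the script still loads, and reports a helpful message, on a machine where the package is missing.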
To download and install the most recent stable version:
$ pip install --upgrade pycantonese
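To confirm that the install succeeded, one option is to ask Python for the installed version. The sketch below uses only the standard library (`importlib.metadata`, Python 3.8+); the `installed_version` helper is a hypothetical name introduced here, not part of PyCantonese.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("pycantonese")
    if version is None:
        print("pycantonese is not installed")
    else:
        print(f"pycantonese {version} is installed")
```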
Ready for more? Check out the Quickstart page.
If your team would like professional assistance in using PyCantonese, technical consulting and training services are available. Please email Jackson L. Lee.
If you have found PyCantonese useful and would like to offer support, buying me a coffee would go a long way!
- Source code: https://github.com/jacksonllee/pycantonese
- Bug tracker: https://github.com/jacksonllee/pycantonese/issues
- Social media: Facebook and Twitter
PyCantonese is authored and maintained by Jackson L. Lee.
A talk introducing PyCantonese:
Lee, Jackson L. 2015. PyCantonese: Cantonese linguistic research in the age of big data. Talk at the Childhood Bilingualism Research Centre, Chinese University of Hong Kong. September 15, 2015. Notes+slides
MIT License. Please see LICENSE.txt in the GitHub source code for details.
The HKCanCor dataset included in PyCantonese is substantially modified from
its source in terms of format. The original dataset has a CC BY license.
Please see pycantonese/data/hkcancor/README.md
in the GitHub source code for details.
The rime-cantonese data (release 2021.05.16) is
incorporated into PyCantonese for word segmentation and
characters-to-Jyutping conversion.
This data has a CC BY 4.0 license.
Please see pycantonese/data/rime_cantonese/README.md
in the GitHub source code for details.
The PyCantonese logo is the Chinese character 粵 meaning Cantonese, with artistic design by albino.snowman (Instagram handle).
Wonderful resources with a permissive license that have been incorporated into PyCantonese:
- HKCanCor
- rime-cantonese
Individuals who have contributed feedback, bug reports, etc. (in alphabetical order of last names):
- @cathug
- Litong Chen
- Jenny Chim
- @g-traveller
- Rachel Han
- Ryan Lai
- Charles Lam
- Chaak Ming Lau
- Hill Ma
- @richielo
- @rylanchiu
- Stephan Stiller
- Tsz-Him Tsui
- Robin Yuen
Please see CHANGELOG.md.
The latest code under development is available on GitHub at jacksonllee/pycantonese. You need to have Git LFS installed on your system. To obtain this version for experimental features or for development:
$ git clone https://github.com/jacksonllee/pycantonese.git
$ cd pycantonese
$ git lfs pull
$ pip install -r dev-requirements.txt
$ pip install -e .
To run tests and styling checks:
$ pytest -vv --doctest-modules --cov=pycantonese pycantonese docs/source
$ flake8 pycantonese
$ black --check pycantonese
To build the documentation website files:
$ python docs/source/build_docs.py