Korean Text Mining

This program is written in Python and uses KoNLPy as its text mining library. KoNLPy is a Python package for natural language processing (NLP) of the Korean language.

Text Mining Process

Data Collection
- Newspaper: Joongang Ilbo
- Web Scraping: Using Beautiful Soup
Data Preprocessing
- Tokenization
Data Representation
- Word2Vec
- PCA Plot
Data Analysis

Prerequisites

Download and install Python on your workstation
Install the BeautifulSoup library
Install the requests module using pip install requests
Install KoNLPy
Install the related libraries (matplotlib, gensim, nltk)
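
To confirm the libraries above are available before running the scripts, a small check like the following may help. It is only a sketch: the `missing_modules` helper is hypothetical and not part of this repository, and the list uses each library's import name (BeautifulSoup is imported as `bs4`).

```python
from importlib.util import find_spec

# Import names for the libraries listed above. Note that the
# BeautifulSoup package is imported as "bs4".
REQUIRED_MODULES = ["bs4", "requests", "konlpy", "matplotlib", "gensim", "nltk"]


def missing_modules(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are installed.")
```

Any module reported as missing can then be installed with pip before running joongang.py or morph.py.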

Source Code Description

joongang.py - web scraping script
morph.py - morphological analysis and related processing, such as Word2Vec
joongang.txt - output file produced by the scraping script

Results

The following are the words most similar to 금융 ("finance" in Korean), each with its similarity score:

- 등: 0.954851508140564
- 중앙: 0.9336299300193787
- 것: 0.9300297498703003
- 한국: 0.9291011691093445
- 검사: 0.9279869198799133
- 형사: 0.9176149368286133
- 일본: 0.9146363735198975
- 시장: 0.9116146564483643
- 조치: 0.9072641134262085
- 기업: 0.8989660739898682
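
Scores like these are typically cosine similarities between word vectors, which is what gensim's `most_similar` reports. This README does not show the exact call used, so treat that as an assumption. The sketch below illustrates the idea with made-up three-dimensional vectors, not the trained model's vectors; `cosine_similarity` and `rank_by_similarity` are hypothetical helpers written for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, invented for illustration only.
toy_vectors = {
    "금융": [0.9, 0.1, 0.3],
    "시장": [0.8, 0.2, 0.4],
    "기업": [0.1, 0.9, 0.2],
}


def rank_by_similarity(word, vectors, topn=10):
    """Rank all other words by cosine similarity to `word`, highest first."""
    target = vectors[word]
    scores = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


for other, score in rank_by_similarity("금융", toy_vectors):
    print(other, score)
```

A score close to 1 means the two vectors point in nearly the same direction, so the words tend to appear in similar contexts in the corpus.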

Notes

I originally developed the program on macOS and have also run it on Windows. It works well there after changing a few lines of code, mainly the file locations.
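
One way to reduce the per-platform changes to file locations is to build paths with `pathlib` instead of hard-coding them. This is a sketch only: the `data_path` and `read_korean_text` helpers are hypothetical and not part of the repository's code, and it assumes the data file is stored as UTF-8.

```python
from pathlib import Path


def data_path(filename, base_dir=None):
    """Build a path to `filename` inside `base_dir` (default: current directory)."""
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    return base / filename


def read_korean_text(path):
    """Read a text file with an explicit encoding.

    Assumes UTF-8. Passing the encoding explicitly matters because the
    platform default can differ between macOS and Windows (where it is
    often a legacy code page such as cp949).
    """
    return Path(path).read_text(encoding="utf-8")


# The same call produces a valid path on both macOS and Windows.
text_file = data_path("joongang.txt")
print(text_file)
```

With helpers like these, only `base_dir` would need to change between machines, if anything.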
