Lightweight, powerful, and easy-to-use Indonesian text preprocessing library
Installation • Quick Start • Features • Examples • Documentation
- Overview
- Installation
- Quick Start
- Features
- Comprehensive Examples
- Pipeline Configuration Options
- API Documentation
- Development
- Contributing
- License
nahiarhdNLP is a Python library designed specifically for preprocessing Indonesian-language text. It provides a wide range of functions for cleaning, normalizing, and processing text easily and efficiently.
- 🔧 Configurable Pipeline - Build custom text processing workflows
- 🧹 Comprehensive Text Cleaning - Remove HTML, URLs, mentions, hashtags, emojis, and more
- 📝 Text Normalization - Emoji conversion, spell correction, slang normalization
- 🔤 Linguistic Processing - Stemming, stopword removal, tokenization
- 🔄 Text Replacement - Replace emails, links, and mentions with tokens
- 📊 Built-in Datasets - Indonesian stopwords, slang dictionary, emoji mappings
- ⚡ High Performance - Lazy loading and optimized processing
- 🎯 Easy to Use - Simple, intuitive API
pip install nahiarhdNLP
git clone https://github.com/raihanhd12/nahiarhdNLP.git
cd nahiarhdNLP
pip install -e .
- Python >= 3.8
- pandas >= 1.3.0
- sastrawi >= 1.0.1
- rich >= 12.0.0
from nahiarhdNLP.preprocessing import Pipeline
# Create a pipeline with configuration
config = {
"clean_html": True,
"clean_mentions": True,
"remove_urls": True,
"stopword": True
}
pipeline = Pipeline(config)
# Process text
text = "Haii @user!! Cek website kita di https://example.com ya 😊"
result = pipeline.process(text)
print(result)
# Output: "Haii Cek website kita ya 😊"
| Feature | Description | Config Key |
|---|---|---|
| HTML Removal | Remove HTML tags | clean_html |
| URL Removal | Remove complete URLs | remove_urls |
| URL Cleaning | Remove URL protocols only | clean_urls |
| Mention Removal | Remove @mentions | remove_mentions |
| Mention Cleaning | Remove @ but keep username | clean_mentions |
| Hashtag Removal | Remove #hashtags | remove_hashtags |
| Hashtag Cleaning | Remove # but keep tag text | clean_hashtags |
| Emoji Removal | Remove all emojis | remove_emoji |
| Punctuation Removal | Remove punctuation marks | remove_punctuation |
| Number Removal | Remove all numbers | remove_numbers |
| Email Removal | Remove email addresses | remove_emails |
| Phone Removal | Remove phone numbers | remove_phones |
| Currency Removal | Remove currency symbols | remove_currency |
| Special Char Removal | Remove special characters | remove_special_chars |
| Extra Spaces | Normalize whitespace | remove_extra_spaces |
| Repeated Chars | Normalize repeated chars | remove_repeated_chars |
| Whitespace Cleaning | Clean tabs, newlines, etc. | remove_whitespace |
| Feature | Description | Config Key |
|---|---|---|
| Emoji to Text | Convert emojis to text | emoji_to_text |
| Text to Emoji | Convert text to emojis | text_to_emoji |
| Spell Correction (Word) | Correct spelling & slang (single word) | spell_corrector_word |
| Spell Correction (Sentence) | Correct spelling & slang (full sentence) | spell_corrector_sentence |
| Lowercase | Convert to lowercase | remove_lowercase |
| Feature | Description | Config Key |
|---|---|---|
| Stemming | Reduce words to root form | stem |
| Stopword Removal | Remove Indonesian stopwords | stopword |
| Tokenization | Split text into tokens | tokenizer |
| Feature | Description | Config Key |
|---|---|---|
| Email Replacement | Replace emails with <email> | replace_email |
| Link Replacement | Replace URLs with <link> | replace_link |
| User Replacement | Replace mentions with <user> | replace_user |
from nahiarhdNLP.preprocessing import Pipeline
# Configure pipeline
config = {
"clean_html": True,
"clean_mentions": True,
"remove_urls": True
}
pipeline = Pipeline(config)
# Input
text = "Hello <b>World</b>! Mention @user123 and visit https://example.com"
# Process
result = pipeline.process(text)
print(f"Input : {text}")
print(f"Output: {result}")Output:
Input : Hello <b>World</b>! Mention @user123 and visit https://example.com
Output: Hello World! Mention user123 and visit
from nahiarhdNLP.preprocessing import Pipeline
config = {
"clean_html": True,
"clean_mentions": True,
"clean_hashtags": True,
"remove_urls": True,
"remove_emoji": True,
"remove_extra_spaces": True
}
pipeline = Pipeline(config)
# Input - Typical social media post
text = """
Haiii gengs!! 😍😍
Jangan lupa follow @nahiarhdNLP ya!
Cek website kita di https://github.com/nahiarhd
#NLP #IndonesianNLP #TextProcessing 🚀
"""
result = pipeline.process(text)
print("=" * 60)
print("INPUT:")
print(text)
print("=" * 60)
print("OUTPUT:")
print(result)
print("=" * 60)Output:
============================================================
INPUT:
Haiii gengs!! 😍😍
Jangan lupa follow @nahiarhdNLP ya!
Cek website kita di https://github.com/nahiarhd
#NLP #IndonesianNLP #TextProcessing 🚀
============================================================
OUTPUT:
Haiii gengs!! Jangan lupa follow nahiarhdNLP ya! Cek website kita di NLP IndonesianNLP TextProcessing
============================================================
from nahiarhdNLP.preprocessing import Pipeline
# Initial configuration
config = {"clean_html": True, "remove_urls": True}
pipeline = Pipeline(config)
text = "<p>Visit https://example.com for more info</p>"
print(f"Initial Output: {pipeline.process(text)}")
# Output: Visit for more info
# Update configuration
pipeline.update_config({"remove_punctuation": True})
print(f"Updated Output: {pipeline.process(text)}")
# Output: Visit for more info
# Check enabled steps
print(f"Enabled steps: {pipeline.get_enabled_steps()}")
# Output: ['clean_html', 'remove_urls', 'remove_punctuation']
from nahiarhdNLP.preprocessing import Pipeline
# Get all available features
all_features = Pipeline.get_available_steps()
print("All Available Features:")
for feature_name, description in sorted(all_features.items()):
print(f" {feature_name:25} - {description}")
print(f"\nTotal Features: {len(all_features)}")
# Get features organized by category
features_by_category = Pipeline.get_available_steps_by_category()
print("\nFeatures by Category:")
for category, feature_names in features_by_category.items():
print(f"\n{category}:")
for feature_name in feature_names:
description = all_features.get(feature_name, "No description")
print(f" {feature_name:25} - {description}")Output:
All Available Features:
clean_hashtags - Remove # symbol but keep tag text
clean_html - Remove HTML tags from text
clean_mentions - Remove @ symbol but keep username
clean_urls - Remove URL protocols (http://, https://) but keep domain
emoji_to_text - Convert emojis to Indonesian text description
remove_currency - Remove currency symbols
remove_emails - Remove email addresses
remove_emoji - Remove all emoji characters
... (28 features total)
Total Features: 28
Features by Category:
HTML & Tags:
clean_html - Remove HTML tags from text
URLs:
remove_urls - Remove complete URLs from text
clean_urls - Remove URL protocols (http://, https://) but keep domain
... (8 categories total)
from nahiarhdNLP.preprocessing import Pipeline
config = {"clean_html": True}
pipeline = Pipeline(config)
# Test various HTML tags
examples = [
"<p>This is a paragraph</p>",
"<div class='container'>Content here</div>",
"Normal text <b>bold text</b> <i>italic</i>",
"<script>alert('test')</script>Clean text"
]
for text in examples:
result = pipeline.process(text)
print(f"Input : {text}")
print(f"Output: {result}")
print("-" * 60)Output:
Input : <p>This is a paragraph</p>
Output: This is a paragraph
------------------------------------------------------------
Input : <div class='container'>Content here</div>
Output: Content here
------------------------------------------------------------
Input : Normal text <b>bold text</b> <i>italic</i>
Output: Normal text bold text italic
------------------------------------------------------------
Input : <script>alert('test')</script>Clean text
Output: Clean text
------------------------------------------------------------
from nahiarhdNLP.preprocessing import Pipeline
# Remove URLs completely
config_remove = {"remove_urls": True}
pipeline_remove = Pipeline(config_remove)
# Clean URLs (remove protocol only)
config_clean = {"clean_urls": True}
pipeline_clean = Pipeline(config_clean)
text = "Visit https://github.com and http://example.com for more info"
print(f"Original : {text}")
print(f"Remove URLs : {pipeline_remove.process(text)}")
print(f"Clean URLs : {pipeline_clean.process(text)}")Output:
Original : Visit https://github.com and http://example.com for more info
Remove URLs : Visit and for more info
Clean URLs : Visit github.com and example.com for more info
from nahiarhdNLP.preprocessing import Pipeline
text = "Hey @john_doe and @jane! Check out #Python #MachineLearning #AI"
# Remove mentions and hashtags
config_remove = {"remove_mentions": True, "remove_hashtags": True}
pipeline_remove = Pipeline(config_remove)
# Clean mentions and hashtags (keep text)
config_clean = {"clean_mentions": True, "clean_hashtags": True}
pipeline_clean = Pipeline(config_clean)
print(f"Original : {text}")
print(f"Remove @# : {pipeline_remove.process(text)}")
print(f"Clean @# (keep) : {pipeline_clean.process(text)}")Output:
Original : Hey @john_doe and @jane! Check out #Python #MachineLearning #AI
Remove @# : Hey and ! Check out
Clean @# (keep) : Hey john_doe and jane! Check out Python MachineLearning AI
from nahiarhdNLP.preprocessing import Pipeline
config = {"remove_emoji": True}
pipeline = Pipeline(config)
examples = [
"I love Python 🐍❤️",
"Great work! 👍😊🎉",
"Weather today ☀️🌧️⛈️",
]
for text in examples:
result = pipeline.process(text)
print(f"Input : {text}")
print(f"Output: {result}")
print()
Output:
Input : I love Python 🐍❤️
Output: I love Python
Input : Great work! 👍😊🎉
Output: Great work!
Input : Weather today ☀️🌧️⛈️
Output: Weather today
from nahiarhdNLP.preprocessing import Pipeline
config = {"remove_repeated_chars": True}
pipeline = Pipeline(config)
examples = [
"Haiiiii guys!!!",
"Kangennnnn bangetttt",
"Wowwwww kerennn",
"Makasiiih yaaaa"
]
for text in examples:
result = pipeline.process(text)
print(f"Input : {text}")
print(f"Output: {result}")Output:
Input : Haiiiii guys!!!
Output: Haiii guys!!
Input : Kangennnnn bangetttt
Output: Kangenn bangett
Input : Wowwwww kerennn
Output: Wowww kerenn
Input : Makasiiih yaaaa
Output: Makasiih yaa
from nahiarhdNLP.preprocessing.normalization.emoji import EmojiConverter
emoji = EmojiConverter()
emoji._load_data()
# Emoji to Text
text_with_emoji = "Hari ini cuaca cerah ☀️ dan saya senang 😊"
result = emoji.emoji_to_text_convert(text_with_emoji)
print(f"Emoji to Text:")
print(f"Input : {text_with_emoji}")
print(f"Output: {result}")
print()
# Text to Emoji (example - depends on your emoji dataset)
text = "saya senang wajah tersenyum"
result = emoji.text_to_emoji_convert(text)
print(f"Text to Emoji:")
print(f"Input : {text}")
print(f"Output: {result}")Output:
Emoji to Text:
Input : Hari ini cuaca cerah ☀️ dan saya senang 😊
Output: Hari ini cuaca cerah matahari dan saya senang wajah_tersenyum
Text to Emoji:
Input : saya senang wajah tersenyum
Output: saya senang 😊
from nahiarhdNLP.preprocessing.normalization.spell_corrector import SpellCorrector
spell = SpellCorrector()
# Single word correction
words = ["sya", "tdk", "gk", "org", "yg", "dgn"]
print("Word Correction:")
for word in words:
corrected = spell.correct_word(word)
print(f" {word:10s} → {corrected}")
print("\n" + "="*60 + "\n")
# Sentence correction
sentences = [
"gw lg di rmh",
"gmn kabar lo?",
"knp gk dtg?",
"jgn lupa ya"
]
print("Sentence Correction:")
for sent in sentences:
corrected = spell.correct_sentence(sent)
print(f"Input : {sent}")
print(f"Output: {corrected}")
print()
Output:
Word Correction:
sya → saya
tdk → tidak
gk → tidak
org → orang
yg → yang
dgn → dengan
============================================================
Sentence Correction:
Input : gw lg di rmh
Output: gue lagi di rumah
Input : gmn kabar lo?
Output: gimana kabar kamu?
Input : knp gk dtg?
Output: kenapa tidak datang?
Input : jgn lupa ya
Output: jangan lupa ya
from nahiarhdNLP.preprocessing import Pipeline
# Comprehensive normalization pipeline
config = {
"clean_html": True,
"clean_mentions": True,
"clean_hashtags": True,
"remove_urls": True,
"remove_emoji": True,
"remove_extra_spaces": True,
"remove_repeated_chars": True,
"spell_corrector_sentence": True,
"remove_lowercase": True
}
pipeline = Pipeline(config)
# Messy Indonesian text
text = """
Haiii @temans!! 😍 Kmrn gw udh coba apps baruu loh di https://example.com
#KerenBanget #Recommended Gkkkk nyesel dehhhh!!! 🚀🚀
"""
result = pipeline.process(text)
print("=" * 70)
print("ORIGINAL TEXT:")
print(text)
print("=" * 70)
print("NORMALIZED TEXT:")
print(result)
print("=" * 70)Output:
======================================================================
ORIGINAL TEXT:
Haiii @temans!! 😍 Kmrn gw udh coba apps baruu loh di https://example.com
#KerenBanget #Recommended Gkkkk nyesel dehhhh!!! 🚀🚀
======================================================================
NORMALIZED TEXT:
haiii temans!! kemarin gue sudah coba apps baruu loh di kerenbangett recommendedd gkk nyesell dehh!!!
======================================================================
from nahiarhdNLP.preprocessing.linguistic.stemmer import Stemmer
stemmer = Stemmer()
# Test various Indonesian words
words = [
"bermain", # playing
"berlari", # running
"kebahagiaan", # happiness
"pembelajaran", # learning
"menyenangkan", # enjoyable
"berkomunikasi" # communicate
]
print("Indonesian Stemming:")
print(f"{'Word':<20} → {'Stem'}")
print("-" * 40)
for word in words:
stem = stemmer.stem(word)
print(f"{word:<20} → {stem}")
print("\n" + "="*60 + "\n")
# Sentence stemming
sentences = [
"Saya sedang belajar pemrograman Python",
"Mereka bermain bola di lapangan",
"Kebahagiaan adalah kunci kesuksesan"
]
print("Sentence Stemming:")
for sent in sentences:
stemmed = stemmer.stem(sent)
print(f"Input : {sent}")
print(f"Output: {stemmed}")
print()
Output:
Indonesian Stemming:
Word → Stem
----------------------------------------
bermain → main
berlari → lari
kebahagiaan → bahagia
pembelajaran → ajar
menyenangkan → senang
berkomunikasi → komunikasi
============================================================
Sentence Stemming:
Input : Saya sedang belajar pemrograman Python
Output: saya sedang ajar program python
Input : Mereka bermain bola di lapangan
Output: mereka main bola di lapang
Input : Kebahagiaan adalah kunci kesuksesan
Output: bahagia adalah kunci sukses
from nahiarhdNLP.preprocessing.linguistic.stopword import StopwordRemover
stopword = StopwordRemover()
stopword._load_data()
# Test sentences
sentences = [
"Saya sedang belajar bahasa pemrograman Python untuk data science",
"Mereka akan pergi ke pasar besok pagi",
"Ini adalah contoh kalimat dengan banyak stopwords yang harus dihapus"
]
print("Stopword Removal:")
print("=" * 70)
for sent in sentences:
cleaned = stopword.remove_stopwords(sent)
print(f"Original: {sent}")
print(f"Cleaned : {cleaned}")
print("-" * 70)Output:
Stopword Removal:
======================================================================
Original: Saya sedang belajar bahasa pemrograman Python untuk data science
Cleaned : belajar bahasa pemrograman Python data science
----------------------------------------------------------------------
Original: Mereka akan pergi ke pasar besok pagi
Cleaned : pasar besok pagi
----------------------------------------------------------------------
Original: Ini adalah contoh kalimat dengan banyak stopwords yang harus dihapus
Cleaned : contoh kalimat stopwords dihapus
----------------------------------------------------------------------
from nahiarhdNLP.preprocessing import Pipeline
# Linguistic processing pipeline
config = {
"remove_lowercase": True,
"stopword": True,
"stem": True,
"remove_extra_spaces": True
}
pipeline = Pipeline(config)
texts = [
"Saya sedang mengembangkan aplikasi pembelajaran online",
"Mereka bermain musik dengan sangat menyenangkan",
"Kebahagiaan adalah perjalanan bukan tujuan"
]
print("Complete Linguistic Processing:")
print("=" * 70)
for text in texts:
result = pipeline.process(text)
print(f"Original : {text}")
print(f"Processed: {result}")
print("-" * 70)Output:
Complete Linguistic Processing:
======================================================================
Original : Saya sedang mengembangkan aplikasi pembelajaran online
Processed: kembang aplikasi ajar online
----------------------------------------------------------------------
Original : Mereka bermain musik dengan sangat menyenangkan
Processed: main musik senang
----------------------------------------------------------------------
Original : Kebahagiaan adalah perjalanan bukan tujuan
Processed: bahagia jalan tuju
----------------------------------------------------------------------
from nahiarhdNLP.preprocessing.tokenization.tokenizer import Tokenizer
tokenizer = Tokenizer()
texts = [
"Ini adalah contoh kalimat sederhana",
"Python, Java, dan JavaScript adalah bahasa pemrograman",
"Email: [email protected], Website: https://example.com"
]
print("Tokenization Examples:")
print("=" * 70)
for text in texts:
tokens = tokenizer.tokenize(text)
print(f"Text : {text}")
print(f"Tokens: {tokens}")
print("-" * 70)Output:
Tokenization Examples:
======================================================================
Text : Ini adalah contoh kalimat sederhana
Tokens: ['Ini', 'adalah', 'contoh', 'kalimat', 'sederhana']
----------------------------------------------------------------------
Text : Python, Java, dan JavaScript adalah bahasa pemrograman
Tokens: ['Python', ',', 'Java', ',', 'dan', 'JavaScript', 'adalah', 'bahasa', 'pemrograman']
----------------------------------------------------------------------
Text : Email: [email protected], Website: https://example.com
Tokens: ['Email', ':', '[email protected]', ',', 'Website', ':', 'https://example.com']
----------------------------------------------------------------------
from nahiarhdNLP.preprocessing import Pipeline
# Configure replacement pipeline
config = {
"replace_email": True,
"replace_link": True,
"replace_user": True
}
pipeline = Pipeline(config)
examples = [
"Contact me at [email protected] for more info",
"Visit https://github.com/nahiarhd for the code",
"Thanks @john and @jane for your help!",
"Email: [email protected] | Web: https://company.com | Twitter: @company"
]
print("Text Replacement:")
print("=" * 70)
for text in examples:
result = pipeline.process(text)
print(f"Input : {text}")
print(f"Output: {result}")
print("-" * 70)Output:
Text Replacement:
======================================================================
Input : Contact me at [email protected] for more info
Output: Contact me at <email> for more info
----------------------------------------------------------------------
Input : Visit https://github.com/nahiarhd for the code
Output: Visit <link> for the code
----------------------------------------------------------------------
Input : Thanks @john and @jane for your help!
Output: Thanks <user> and <user> for your help!
----------------------------------------------------------------------
Input : Email: [email protected] | Web: https://company.com | Twitter: @company
Output: Email: <email> | Web: <link> | Twitter: <user>
----------------------------------------------------------------------
from nahiarhdNLP.preprocessing import Pipeline
# Complete anonymization pipeline
config = {
"replace_email": True,
"replace_link": True,
"replace_user": True,
"remove_phones": True,
"clean_html": True
}
pipeline = Pipeline(config)
# Sensitive data example
text = """
<div class="contact">
Customer: @johndoe
Email: [email protected]
Phone: +62-812-3456-7890
Website: https://customer-site.com
</div>
"""
result = pipeline.process(text)
print("DATA ANONYMIZATION")
print("=" * 70)
print("ORIGINAL:")
print(text)
print("=" * 70)
print("ANONYMIZED:")
print(result)
print("=" * 70)Output:
DATA ANONYMIZATION
======================================================================
ORIGINAL:
<div class="contact">
Customer: @johndoe
Email: [email protected]
Phone: +62-812-3456-7890
Website: https://customer-site.com
</div>
======================================================================
ANONYMIZED:
Customer: <user> Email: <email> Phone: Website: <link>
======================================================================
from nahiarhdNLP.datasets import DatasetLoader
loader = DatasetLoader()
# Load stopwords
stopwords = loader.load_stopwords_dataset()
print(f"📚 Stopwords Dataset:")
print(f" Total words: {len(stopwords)}")
print(f" Sample: {stopwords[:10]}")
print()
# Load slang dictionary
slang_dict = loader.load_slang_dataset()
print(f"💬 Slang Dictionary:")
print(f" Total entries: {len(slang_dict)}")
print(f" Sample mappings:")
for slang, formal in list(slang_dict.items())[:5]:
print(f" {slang:10s} → {formal}")
print()
# Load emoji dictionary
emoji_dict = loader.load_emoji_dataset()
print(f"😊 Emoji Dictionary:")
print(f" Total emojis: {len(emoji_dict)}")
print(f" Sample mappings:")
for emoji, text in list(emoji_dict.items())[:5]:
print(f" {emoji:5s} → {text}")
print()
# Load wordlist
wordlist = loader.load_wordlist_dataset()
print(f"📖 Wordlist Dataset:")
print(f" Total words: {len(wordlist)}")
print(f" Sample: {wordlist[:10]}")Output:
📚 Stopwords Dataset:
Total words: 758
Sample: ['ada', 'adalah', 'adanya', 'adapun', 'agak', 'agaknya', 'agar', 'akan', 'akankah', 'akhir']
💬 Slang Dictionary:
Total entries: 3592
Sample mappings:
gw → gue
lo → kamu
gak → tidak
yg → yang
dgn → dengan
😊 Emoji Dictionary:
Total emojis: 1800
Sample mappings:
😀 → wajah_tersenyum
😁 → wajah_gembira
😂 → tertawa_terbahak
🤣 → tertawa_guling
😃 → senyum_lebar
📖 Wordlist Dataset:
Total words: 28526
Sample: ['a', 'aa', 'aaa', 'aaai', 'aai', 'aak', 'aal', 'aalim', 'aam', 'aan']
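Because the loader returns plain Python structures (lists and dicts, as the samples above show), the datasets can also be used directly, outside the pipeline. A minimal sketch of manual slang normalization followed by stopword filtering:

from nahiarhdNLP.datasets import DatasetLoader

loader = DatasetLoader()
stopwords = set(loader.load_stopwords_dataset())  # list -> set for fast membership tests
slang = loader.load_slang_dataset()               # dict mapping slang -> formal form

tokens = ["gw", "suka", "yg", "ini"]
normalized = [slang.get(t, t) for t in tokens]           # normalize slang first
content = [t for t in normalized if t not in stopwords]  # then drop stopwords
print(content)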
config = {
# ===== TEXT CLEANING =====
# HTML & Tags
"clean_html": True, # Remove HTML tags
# URLs
"remove_urls": True, # Remove complete URLs
"clean_urls": True, # Remove URL protocols (http://, https://)
# Social Media
"remove_mentions": True, # Remove @mentions completely
"clean_mentions": True, # Remove @ but keep username
"remove_hashtags": True, # Remove #hashtags completely
"clean_hashtags": True, # Remove # but keep tag text
# Content Removal
"remove_emoji": True, # Remove emoji characters
"remove_punctuation": True, # Remove punctuation marks
"remove_numbers": True, # Remove numbers
"remove_emails": True, # Remove email addresses
"remove_phones": True, # Remove phone numbers
"remove_currency": True, # Remove currency symbols
# Text Cleaning
"remove_special_chars": True, # Remove special characters
"remove_extra_spaces": True, # Normalize whitespace
"remove_repeated_chars": True, # Normalize repeated characters (e.g., "haiiii" → "haii")
"remove_whitespace": True, # Clean tabs, newlines, etc.
"remove_lowercase": True, # Convert to lowercase
# ===== TEXT NORMALIZATION =====
"emoji_to_text": True, # Convert emojis to text description
"text_to_emoji": True, # Convert text to emojis
"spell_corrector_word": True, # Correct spelling for single words
"spell_corrector_sentence": True, # Correct spelling for sentences
# ===== LINGUISTIC PROCESSING =====
"stem": True, # Apply stemming (reduce to root form)
"stopword": True, # Remove stopwords
"tokenizer": True, # Tokenize text
# ===== TEXT REPLACEMENT =====
"replace_email": True, # Replace emails with <email>
"replace_link": True, # Replace URLs with <link>
"replace_user": True, # Replace mentions with <user>
}
- For Social Media: Use clean_* instead of remove_* to keep the text content (see the sketch after this list)
- For Formal Text: Use spell_corrector_sentence to normalize slang
- For ML/NLP: Combine stem, stopword, and remove_lowercase
- For Anonymization: Use replace_* options
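Because the configuration is a plain {step_name: True/False} mapping, you can also switch between the remove_* and clean_* variants on an existing pipeline via update_config. A small sketch (assuming, as the True/False signature suggests, that a False entry disables a previously enabled step):

from nahiarhdNLP.preprocessing import Pipeline

text = "Keren banget @nahiarhdNLP! #NLP"

# Strict variant: drop mentions and hashtags entirely.
pipeline = Pipeline({"remove_mentions": True, "remove_hashtags": True})
print(pipeline.process(text))

# Relax to the clean_* variants to keep the underlying text.
# Assumption: a False value disables the step enabled above.
pipeline.update_config({
    "remove_mentions": False,
    "remove_hashtags": False,
    "clean_mentions": True,
    "clean_hashtags": True,
})
print(pipeline.process(text))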
class Pipeline:
"""
Configurable text preprocessing pipeline for Indonesian text.
Args:
config (dict): Dictionary of preprocessing steps {step_name: True/False}
Methods:
process(text: str) -> str: Process text through the pipeline
update_config(new_config: dict) -> None: Update pipeline configuration
get_enabled_steps() -> list: Get list of enabled processing steps
__call__(text: str) -> str: Allow pipeline to be called as a function
Example:
>>> config = {"clean_html": True, "stopword": True}
>>> pipeline = Pipeline(config)
>>> result = pipeline.process("<p>Saya sedang belajar NLP</p>")
>>> # or use as callable
>>> result = pipeline("<p>Saya sedang belajar NLP</p>")
"""See Pipeline Configuration Options for complete list.
# Install development dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Run tests with coverage
pytest --cov=nahiarhdNLP --cov-report=html
# Run specific test file
pytest nahiarhdNLP/tests/test_pipeline.py
# Format code with black
black nahiarhdNLP/
# Sort imports with isort
isort nahiarhdNLP/
# Lint with flake8
flake8 nahiarhdNLP/
# Install build tools
pip install build twine
# Build distributions
python -m build
# Upload to TestPyPI
twine upload --repository testpypi dist/*
# Upload to PyPI
twine upload dist/*
Contributions are welcome! Please follow these steps:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
- Follow PEP 8 style guide
- Add tests for new features
- Update documentation
- Add examples for new functionality
This project is licensed under the MIT License - see the LICENSE file for details.
Raihan Hidayatullah Djunaedi
- Email: [email protected]
- GitHub: @raihanhd12
- Sastrawi - Indonesian stemming library
- Indonesian NLP Community - For datasets and inspiration
- All contributors who helped improve this library
Made with ❤️ for Indonesian NLP Community