🇮🇩 nahiarhdNLP

Advanced Indonesian Natural Language Processing Library

Lightweight, powerful, and easy-to-use Indonesian text preprocessing library

Installation • Quick Start • Features • Examples • Documentation

📚 Table of Contents

🌟 Overview

nahiarhdNLP adalah library Python yang dirancang khusus untuk preprocessing teks Bahasa Indonesia. Library ini menyediakan berbagai fungsi untuk membersihkan, menormalisasi, dan memproses teks dengan mudah dan efisien.

✨ Key Features

🔧 Configurable Pipeline - Build custom text processing workflows
🧹 Comprehensive Text Cleaning - Remove HTML, URLs, mentions, hashtags, emojis, and more
📝 Text Normalization - Emoji conversion, spell correction, slang normalization
🔤 Linguistic Processing - Stemming, stopword removal, tokenization
🔄 Text Replacement - Replace emails, links, and mentions with tokens
📊 Built-in Datasets - Indonesian stopwords, slang dictionary, emoji mappings
⚡ High Performance - Lazy loading and optimized processing
🎯 Easy to Use - Simple, intuitive API

📦 Installation

Using pip

pip install nahiarhdNLP

From source

git clone https://github.com/raihanhd12/nahiarhdNLP.git
cd nahiarhdNLP
pip install -e .

Requirements

Python >= 3.8
pandas >= 1.3.0
sastrawi >= 1.0.1
rich >= 12.0.0

🚀 Quick Start

from nahiarhdNLP.preprocessing import Pipeline

# Create a pipeline with configuration
config = {
    "clean_html": True,
    "clean_mentions": True,
    "remove_urls": True,
    "stopword": True
}

pipeline = Pipeline(config)

# Process text
text = "Haii @user!! Cek website kita di https://example.com ya 😊"
result = pipeline.process(text)

print(result)
# Output: "Haii Cek website kita ya 😊"

🎯 Features

🧹 Text Cleaning

Feature	Description	Config Key
HTML Removal	Remove HTML tags	`clean_html`
URL Removal	Remove complete URLs	`remove_urls`
URL Cleaning	Remove URL protocols only	`clean_urls`
Mention Removal	Remove @mentions	`remove_mentions`
Mention Cleaning	Remove @ but keep username	`clean_mentions`
Hashtag Removal	Remove #hashtags	`remove_hashtags`
Hashtag Cleaning	Remove # but keep tag text	`clean_hashtags`
Emoji Removal	Remove all emojis	`remove_emoji`
Punctuation Removal	Remove punctuation marks	`remove_punctuation`
Number Removal	Remove all numbers	`remove_numbers`
Email Removal	Remove email addresses	`remove_emails`
Phone Removal	Remove phone numbers	`remove_phones`
Currency Removal	Remove currency symbols	`remove_currency`
Special Char Removal	Remove special characters	`remove_special_chars`
Extra Spaces	Normalize whitespace	`remove_extra_spaces`
Repeated Chars	Normalize repeated chars	`remove_repeated_chars`
Whitespace Cleaning	Clean tabs, newlines, etc.	`remove_whitespace`

📝 Text Normalization

Feature	Description	Config Key
Emoji to Text	Convert emojis to text	`emoji_to_text`
Text to Emoji	Convert text to emojis	`text_to_emoji`
Spell Correction (Word)	Correct spelling & slang (single word)	`spell_corrector_word`
Spell Correction (Sentence)	Correct spelling & slang (full sentence)	`spell_corrector_sentence`
Lowercase	Convert to lowercase	`remove_lowercase`

🔤 Linguistic Processing

Feature	Description	Config Key
Stemming	Reduce words to root form	`stem`
Stopword Removal	Remove Indonesian stopwords	`stopword`
Tokenization	Split text into tokens	`tokenizer`

🔄 Text Replacement

Feature	Description	Config Key
Email Replacement	Replace emails with `<email>`	`replace_email`
Link Replacement	Replace URLs with `<link>`	`replace_link`
User Replacement	Replace mentions with `<user>`	`replace_user`

💡 Comprehensive Examples

1. Pipeline Configuration

Example 1.1: Basic Pipeline

from nahiarhdNLP.preprocessing import Pipeline

# Configure pipeline
config = {
    "clean_html": True,
    "clean_mentions": True,
    "remove_urls": True
}

pipeline = Pipeline(config)

# Input
text = "Hello <b>World</b>! Mention @user123 and visit https://example.com"

# Process
result = pipeline.process(text)

print(f"Input : {text}")
print(f"Output: {result}")

Output:

Input : Hello <b>World</b>! Mention @user123 and visit https://example.com
Output: Hello World! Mention user123 and visit

Example 1.2: Social Media Text Cleaning

from nahiarhdNLP.preprocessing import Pipeline

config = {
    "clean_html": True,
    "clean_mentions": True,
    "clean_hashtags": True,
    "remove_urls": True,
    "remove_emoji": True,
    "remove_extra_spaces": True
}

pipeline = Pipeline(config)

# Input - Typical social media post
text = """
Haiii gengs!! 😍😍
Jangan lupa follow @nahiarhdNLP ya!
Cek website kita di https://github.com/nahiarhd
#NLP #IndonesianNLP #TextProcessing 🚀
"""

result = pipeline.process(text)

print("=" * 60)
print("INPUT:")
print(text)
print("=" * 60)
print("OUTPUT:")
print(result)
print("=" * 60)

Output:

============================================================
INPUT:

Haiii gengs!! 😍😍
Jangan lupa follow @nahiarhdNLP ya!
Cek website kita di https://github.com/nahiarhd
#NLP #IndonesianNLP #TextProcessing 🚀

============================================================
OUTPUT:
Haiii gengs!! Jangan lupa follow nahiarhdNLP ya! Cek website kita di NLP IndonesianNLP TextProcessing
============================================================

Example 1.3: Update Pipeline Configuration

from nahiarhdNLP.preprocessing import Pipeline

# Initial configuration
config = {"clean_html": True, "remove_urls": True}
pipeline = Pipeline(config)

text = "<p>Visit https://example.com for more info</p>"

print(f"Initial Output: {pipeline.process(text)}")
# Output: Visit for more info

# Update configuration
pipeline.update_config({"remove_punctuation": True})

print(f"Updated Output: {pipeline.process(text)}")
# Output: Visit for more info

# Check enabled steps
print(f"Enabled steps: {pipeline.get_enabled_steps()}")
# Output: ['clean_html', 'remove_urls', 'remove_punctuation']

Example 1.4: Feature Discovery

from nahiarhdNLP.preprocessing import Pipeline

# Get all available features
all_features = Pipeline.get_available_steps()

print("All Available Features:")
for feature_name, description in sorted(all_features.items()):
    print(f"  {feature_name:25} - {description}")

print(f"\nTotal Features: {len(all_features)}")

# Get features organized by category
features_by_category = Pipeline.get_available_steps_by_category()

print("\nFeatures by Category:")
for category, feature_names in features_by_category.items():
    print(f"\n{category}:")
    for feature_name in feature_names:
        description = all_features.get(feature_name, "No description")
        print(f"  {feature_name:25} - {description}")

Output:

All Available Features:
  clean_hashtags           - Remove # symbol but keep tag text
  clean_html               - Remove HTML tags from text
  clean_mentions           - Remove @ symbol but keep username
  clean_urls               - Remove URL protocols (http://, https://) but keep domain
  emoji_to_text            - Convert emojis to Indonesian text description
  remove_currency          - Remove currency symbols
  remove_emails            - Remove email addresses
  remove_emoji             - Remove all emoji characters
  ... (28 features total)

Total Features: 28

Features by Category:

HTML & Tags:
  clean_html               - Remove HTML tags from text

URLs:
  remove_urls              - Remove complete URLs from text
  clean_urls               - Remove URL protocols (http://, https://) but keep domain

... (8 categories total)

2. Text Cleaning

Example 2.1: HTML Tag Removal

from nahiarhdNLP.preprocessing import Pipeline

config = {"clean_html": True}
pipeline = Pipeline(config)

# Test various HTML tags
examples = [
    "<p>This is a paragraph</p>",
    "<div class='container'>Content here</div>",
    "Normal text <b>bold text</b> <i>italic</i>",
    "<script>alert('test')</script>Clean text"
]

for text in examples:
    result = pipeline.process(text)
    print(f"Input : {text}")
    print(f"Output: {result}")
    print("-" * 60)

Output:

Input : <p>This is a paragraph</p>
Output: This is a paragraph
------------------------------------------------------------
Input : <div class='container'>Content here</div>
Output: Content here
------------------------------------------------------------
Input : Normal text <b>bold text</b> <i>italic</i>
Output: Normal text bold text italic
------------------------------------------------------------
Input : <script>alert('test')</script>Clean text
Output: Clean text
------------------------------------------------------------

Example 2.2: URL Processing

from nahiarhdNLP.preprocessing import Pipeline

# Remove URLs completely
config_remove = {"remove_urls": True}
pipeline_remove = Pipeline(config_remove)

# Clean URLs (remove protocol only)
config_clean = {"clean_urls": True}
pipeline_clean = Pipeline(config_clean)

text = "Visit https://github.com and http://example.com for more info"

print(f"Original     : {text}")
print(f"Remove URLs  : {pipeline_remove.process(text)}")
print(f"Clean URLs   : {pipeline_clean.process(text)}")

Output:

Original     : Visit https://github.com and http://example.com for more info
Remove URLs  : Visit and for more info
Clean URLs   : Visit github.com and example.com for more info

Example 2.3: Mention & Hashtag Processing

from nahiarhdNLP.preprocessing import Pipeline

text = "Hey @john_doe and @jane! Check out #Python #MachineLearning #AI"

# Remove mentions and hashtags
config_remove = {"remove_mentions": True, "remove_hashtags": True}
pipeline_remove = Pipeline(config_remove)

# Clean mentions and hashtags (keep text)
config_clean = {"clean_mentions": True, "clean_hashtags": True}
pipeline_clean = Pipeline(config_clean)

print(f"Original        : {text}")
print(f"Remove @#       : {pipeline_remove.process(text)}")
print(f"Clean @# (keep) : {pipeline_clean.process(text)}")

Output:

Original        : Hey @john_doe and @jane! Check out #Python #MachineLearning #AI
Remove @#       : Hey and ! Check out
Clean @# (keep) : Hey john_doe and jane! Check out Python MachineLearning AI

Example 2.4: Emoji Handling

from nahiarhdNLP.preprocessing import Pipeline

config = {"remove_emoji": True}
pipeline = Pipeline(config)

examples = [
    "I love Python 🐍❤️",
    "Great work! 👍😊🎉",
    "Weather today ☀️🌧️⛈️",
]

for text in examples:
    result = pipeline.process(text)
    print(f"Input : {text}")
    print(f"Output: {result}")
    print()

Output:

Input : I love Python 🐍❤️
Output: I love Python

Input : Great work! 👍😊🎉
Output: Great work!

Input : Weather today ☀️🌧️⛈️
Output: Weather today

Example 2.5: Repeated Characters Normalization

from nahiarhdNLP.preprocessing import Pipeline

config = {"remove_repeated_chars": True}
pipeline = Pipeline(config)

examples = [
    "Haiiiii guys!!!",
    "Kangennnnn bangetttt",
    "Wowwwww kerennn",
    "Makasiiih yaaaa"
]

for text in examples:
    result = pipeline.process(text)
    print(f"Input : {text}")
    print(f"Output: {result}")

Output:

Input : Haiiiii guys!!!
Output: Haiii guys!!

Input : Kangennnnn bangetttt
Output: Kangenn bangett

Input : Wowwwww kerennn
Output: Wowww kerenn

Input : Makasiiih yaaaa
Output: Makasiih yaa

3. Text Normalization

Example 3.1: Emoji Conversion

from nahiarhdNLP.preprocessing.normalization.emoji import EmojiConverter

emoji = EmojiConverter()
emoji._load_data()

# Emoji to Text
text_with_emoji = "Hari ini cuaca cerah ☀️ dan saya senang 😊"
result = emoji.emoji_to_text_convert(text_with_emoji)
print(f"Emoji to Text:")
print(f"Input : {text_with_emoji}")
print(f"Output: {result}")
print()

# Text to Emoji (example - depends on your emoji dataset)
text = "saya senang wajah tersenyum"
result = emoji.text_to_emoji_convert(text)
print(f"Text to Emoji:")
print(f"Input : {text}")
print(f"Output: {result}")

Output:

Emoji to Text:
Input : Hari ini cuaca cerah ☀️ dan saya senang 😊
Output: Hari ini cuaca cerah matahari dan saya senang wajah_tersenyum

Text to Emoji:
Input : saya senang wajah tersenyum
Output: saya senang 😊

Example 3.2: Spell Correction & Slang Normalization

from nahiarhdNLP.preprocessing.normalization.spell_corrector import SpellCorrector

spell = SpellCorrector()

# Single word correction
words = ["sya", "tdk", "gk", "org", "yg", "dgn"]
print("Word Correction:")
for word in words:
    corrected = spell.correct_word(word)
    print(f"  {word:10s} → {corrected}")

print("\n" + "="*60 + "\n")

# Sentence correction
sentences = [
    "gw lg di rmh",
    "gmn kabar lo?",
    "knp gk dtg?",
    "jgn lupa ya"
]

print("Sentence Correction:")
for sent in sentences:
    corrected = spell.correct_sentence(sent)
    print(f"Input : {sent}")
    print(f"Output: {corrected}")
    print()

Output:

Word Correction:
  sya        → saya
  tdk        → tidak
  gk         → tidak
  org        → orang
  yg         → yang
  dgn        → dengan

============================================================

Sentence Correction:
Input : gw lg di rmh
Output: gue lagi di rumah

Input : gmn kabar lo?
Output: gimana kabar kamu?

Input : knp gk dtg?
Output: kenapa tidak datang?

Input : jgn lupa ya
Output: jangan lupa ya

Example 3.3: Complete Text Normalization Pipeline

from nahiarhdNLP.preprocessing import Pipeline

# Comprehensive normalization pipeline
config = {
    "clean_html": True,
    "clean_mentions": True,
    "clean_hashtags": True,
    "remove_urls": True,
    "remove_emoji": True,
    "remove_extra_spaces": True,
    "remove_repeated_chars": True,
    "spell_corrector_sentence": True,
    "remove_lowercase": True
}

pipeline = Pipeline(config)

# Messy Indonesian text
text = """
Haiii @temans!! 😍 Kmrn gw udh coba apps baruu loh di https://example.com
#KerenBanget #Recommended Gkkkk nyesel dehhhh!!! 🚀🚀
"""

result = pipeline.process(text)

print("=" * 70)
print("ORIGINAL TEXT:")
print(text)
print("=" * 70)
print("NORMALIZED TEXT:")
print(result)
print("=" * 70)

Output:

======================================================================
ORIGINAL TEXT:

Haiii @temans!! 😍 Kmrn gw udh coba apps baruu loh di https://example.com
#KerenBanget #Recommended Gkkkk nyesel dehhhh!!! 🚀🚀

======================================================================
NORMALIZED TEXT:
haiii temans!! kemarin gue sudah coba apps baruu loh di kerenbangett recommendedd gkk nyesell dehh!!!
======================================================================

4. Linguistic Processing

Example 4.1: Stemming

from nahiarhdNLP.preprocessing.linguistic.stemmer import Stemmer

stemmer = Stemmer()

# Test various Indonesian words
words = [
    "bermain",      # playing
    "berlari",      # running
    "kebahagiaan",  # happiness
    "pembelajaran", # learning
    "menyenangkan", # enjoyable
    "berkomunikasi" # communicate
]

print("Indonesian Stemming:")
print(f"{'Word':<20} → {'Stem'}")
print("-" * 40)
for word in words:
    stem = stemmer.stem(word)
    print(f"{word:<20} → {stem}")

print("\n" + "="*60 + "\n")

# Sentence stemming
sentences = [
    "Saya sedang belajar pemrograman Python",
    "Mereka bermain bola di lapangan",
    "Kebahagiaan adalah kunci kesuksesan"
]

print("Sentence Stemming:")
for sent in sentences:
    stemmed = stemmer.stem(sent)
    print(f"Input : {sent}")
    print(f"Output: {stemmed}")
    print()

Output:

Indonesian Stemming:
Word                 → Stem
----------------------------------------
bermain              → main
berlari              → lari
kebahagiaan          → bahagia
pembelajaran         → ajar
menyenangkan         → senang
berkomunikasi        → komunikasi

============================================================

Sentence Stemming:
Input : Saya sedang belajar pemrograman Python
Output: saya sedang ajar program python

Input : Mereka bermain bola di lapangan
Output: mereka main bola di lapang

Input : Kebahagiaan adalah kunci kesuksesan
Output: bahagia adalah kunci sukses

Example 4.2: Stopword Removal

from nahiarhdNLP.preprocessing.linguistic.stopword import StopwordRemover

stopword = StopwordRemover()
stopword._load_data()

# Test sentences
sentences = [
    "Saya sedang belajar bahasa pemrograman Python untuk data science",
    "Mereka akan pergi ke pasar besok pagi",
    "Ini adalah contoh kalimat dengan banyak stopwords yang harus dihapus"
]

print("Stopword Removal:")
print("=" * 70)
for sent in sentences:
    cleaned = stopword.remove_stopwords(sent)
    print(f"Original: {sent}")
    print(f"Cleaned : {cleaned}")
    print("-" * 70)

Output:

Stopword Removal:
======================================================================
Original: Saya sedang belajar bahasa pemrograman Python untuk data science
Cleaned : belajar bahasa pemrograman Python data science
----------------------------------------------------------------------
Original: Mereka akan pergi ke pasar besok pagi
Cleaned : pasar besok pagi
----------------------------------------------------------------------
Original: Ini adalah contoh kalimat dengan banyak stopwords yang harus dihapus
Cleaned : contoh kalimat stopwords dihapus
----------------------------------------------------------------------

Example 4.3: Complete Linguistic Pipeline

from nahiarhdNLP.preprocessing import Pipeline

# Linguistic processing pipeline
config = {
    "remove_lowercase": True,
    "stopword": True,
    "stem": True,
    "remove_extra_spaces": True
}

pipeline = Pipeline(config)

texts = [
    "Saya sedang mengembangkan aplikasi pembelajaran online",
    "Mereka bermain musik dengan sangat menyenangkan",
    "Kebahagiaan adalah perjalanan bukan tujuan"
]

print("Complete Linguistic Processing:")
print("=" * 70)
for text in texts:
    result = pipeline.process(text)
    print(f"Original : {text}")
    print(f"Processed: {result}")
    print("-" * 70)

Output:

Complete Linguistic Processing:
======================================================================
Original : Saya sedang mengembangkan aplikasi pembelajaran online
Processed: kembang aplikasi ajar online
----------------------------------------------------------------------
Original : Mereka bermain musik dengan sangat menyenangkan
Processed: main musik senang
----------------------------------------------------------------------
Original : Kebahagiaan adalah perjalanan bukan tujuan
Processed: bahagia jalan tuju
----------------------------------------------------------------------

Example 4.4: Tokenization

from nahiarhdNLP.preprocessing.tokenization.tokenizer import Tokenizer

tokenizer = Tokenizer()

texts = [
    "Ini adalah contoh kalimat sederhana",
    "Python, Java, dan JavaScript adalah bahasa pemrograman",
    "Email: [email protected], Website: https://example.com"
]

print("Tokenization Examples:")
print("=" * 70)
for text in texts:
    tokens = tokenizer.tokenize(text)
    print(f"Text  : {text}")
    print(f"Tokens: {tokens}")
    print("-" * 70)

Output:

Tokenization Examples:
======================================================================
Text  : Ini adalah contoh kalimat sederhana
Tokens: ['Ini', 'adalah', 'contoh', 'kalimat', 'sederhana']
----------------------------------------------------------------------
Text  : Python, Java, dan JavaScript adalah bahasa pemrograman
Tokens: ['Python', ',', 'Java', ',', 'dan', 'JavaScript', 'adalah', 'bahasa', 'pemrograman']
----------------------------------------------------------------------
Text  : Email: [email protected], Website: https://example.com
Tokens: ['Email', ':', '[email protected]', ',', 'Website', ':', 'https://example.com']
----------------------------------------------------------------------

5. Text Replacement

Example 5.1: Email, Link, and Mention Replacement

from nahiarhdNLP.preprocessing import Pipeline

# Configure replacement pipeline
config = {
    "replace_email": True,
    "replace_link": True,
    "replace_user": True
}

pipeline = Pipeline(config)

examples = [
    "Contact me at [email protected] for more info",
    "Visit https://github.com/nahiarhd for the code",
    "Thanks @john and @jane for your help!",
    "Email: [email protected] | Web: https://company.com | Twitter: @company"
]

print("Text Replacement:")
print("=" * 70)
for text in examples:
    result = pipeline.process(text)
    print(f"Input : {text}")
    print(f"Output: {result}")
    print("-" * 70)

Output:

Text Replacement:
======================================================================
Input : Contact me at [email protected] for more info
Output: Contact me at <email> for more info
----------------------------------------------------------------------
Input : Visit https://github.com/nahiarhd for the code
Output: Visit <link> for the code
----------------------------------------------------------------------
Input : Thanks @john and @jane for your help!
Output: Thanks <user> and <user> for your help!
----------------------------------------------------------------------
Input : Email: [email protected] | Web: https://company.com | Twitter: @company
Output: Email: <email> | Web: <link> | Twitter: <user>
----------------------------------------------------------------------

Example 5.2: Data Anonymization Pipeline

from nahiarhdNLP.preprocessing import Pipeline

# Complete anonymization pipeline
config = {
    "replace_email": True,
    "replace_link": True,
    "replace_user": True,
    "remove_phones": True,
    "clean_html": True
}

pipeline = Pipeline(config)

# Sensitive data example
text = """
<div class="contact">
Customer: @johndoe
Email: [email protected]
Phone: +62-812-3456-7890
Website: https://customer-site.com
</div>
"""

result = pipeline.process(text)

print("DATA ANONYMIZATION")
print("=" * 70)
print("ORIGINAL:")
print(text)
print("=" * 70)
print("ANONYMIZED:")
print(result)
print("=" * 70)

Output:

DATA ANONYMIZATION
======================================================================
ORIGINAL:

<div class="contact">
Customer: @johndoe
Email: [email protected]
Phone: +62-812-3456-7890
Website: https://customer-site.com
</div>

======================================================================
ANONYMIZED:
Customer: <user> Email: <email> Phone: Website: <link>
======================================================================

6. Dataset Loaders

Example 6.1: Loading Built-in Datasets

from nahiarhdNLP.datasets import DatasetLoader

loader = DatasetLoader()

# Load stopwords
stopwords = loader.load_stopwords_dataset()
print(f"📚 Stopwords Dataset:")
print(f"   Total words: {len(stopwords)}")
print(f"   Sample: {stopwords[:10]}")
print()

# Load slang dictionary
slang_dict = loader.load_slang_dataset()
print(f"💬 Slang Dictionary:")
print(f"   Total entries: {len(slang_dict)}")
print(f"   Sample mappings:")
for slang, formal in list(slang_dict.items())[:5]:
    print(f"      {slang:10s} → {formal}")
print()

# Load emoji dictionary
emoji_dict = loader.load_emoji_dataset()
print(f"😊 Emoji Dictionary:")
print(f"   Total emojis: {len(emoji_dict)}")
print(f"   Sample mappings:")
for emoji, text in list(emoji_dict.items())[:5]:
    print(f"      {emoji:5s} → {text}")
print()

# Load wordlist
wordlist = loader.load_wordlist_dataset()
print(f"📖 Wordlist Dataset:")
print(f"   Total words: {len(wordlist)}")
print(f"   Sample: {wordlist[:10]}")

Output:

📚 Stopwords Dataset:
   Total words: 758
   Sample: ['ada', 'adalah', 'adanya', 'adapun', 'agak', 'agaknya', 'agar', 'akan', 'akankah', 'akhir']

💬 Slang Dictionary:
   Total entries: 3592
   Sample mappings:
      gw         → gue
      lo         → kamu
      gak        → tidak
      yg         → yang
      dgn        → dengan

😊 Emoji Dictionary:
   Total emojis: 1800
   Sample mappings:
      😀     → wajah_tersenyum
      😁     → wajah_gembira
      😂     → tertawa_terbahak
      🤣     → tertawa_guling
      😃     → senyum_lebar

📖 Wordlist Dataset:
   Total words: 28526
   Sample: ['a', 'aa', 'aaa', 'aaai', 'aai', 'aak', 'aal', 'aalim', 'aam', 'aan']

⚙️ Pipeline Configuration Options

Complete Configuration Reference

config = {
    # ===== TEXT CLEANING =====
    # HTML & Tags
    "clean_html": True,              # Remove HTML tags

    # URLs
    "remove_urls": True,             # Remove complete URLs
    "clean_urls": True,              # Remove URL protocols (http://, https://)

    # Social Media
    "remove_mentions": True,         # Remove @mentions completely
    "clean_mentions": True,          # Remove @ but keep username
    "remove_hashtags": True,         # Remove #hashtags completely
    "clean_hashtags": True,          # Remove # but keep tag text

    # Content Removal
    "remove_emoji": True,            # Remove emoji characters
    "remove_punctuation": True,      # Remove punctuation marks
    "remove_numbers": True,          # Remove numbers
    "remove_emails": True,           # Remove email addresses
    "remove_phones": True,           # Remove phone numbers
    "remove_currency": True,         # Remove currency symbols

    # Text Cleaning
    "remove_special_chars": True,    # Remove special characters
    "remove_extra_spaces": True,     # Normalize whitespace
    "remove_repeated_chars": True,   # Normalize repeated characters (e.g., "haiiii" → "haii")
    "remove_whitespace": True,       # Clean tabs, newlines, etc.
    "remove_lowercase": True,        # Convert to lowercase

    # ===== TEXT NORMALIZATION =====
    "emoji_to_text": True,           # Convert emojis to text description
    "text_to_emoji": True,           # Convert text to emojis
    "spell_corrector_word": True,    # Correct spelling for single words
    "spell_corrector_sentence": True, # Correct spelling for sentences

    # ===== LINGUISTIC PROCESSING =====
    "stem": True,                    # Apply stemming (reduce to root form)
    "stopword": True,                # Remove stopwords
    "tokenizer": True,               # Tokenize text

    # ===== TEXT REPLACEMENT =====
    "replace_email": True,           # Replace emails with <email>
    "replace_link": True,            # Replace URLs with <link>
    "replace_user": True,            # Replace mentions with <user>
}

Configuration Tips

For Social Media: Use clean_* instead of remove_* to keep the text content
For Formal Text: Use spell_corrector_sentence to normalize slang
For ML/NLP: Combine stem, stopword, and remove_lowercase
For Anonymization: Use replace_* options

📖 API Documentation

Pipeline Class

class Pipeline:
    """
    Configurable text preprocessing pipeline for Indonesian text.

    Args:
        config (dict): Dictionary of preprocessing steps {step_name: True/False}

    Methods:
        process(text: str) -> str: Process text through the pipeline
        update_config(new_config: dict) -> None: Update pipeline configuration
        get_enabled_steps() -> list: Get list of enabled processing steps
        __call__(text: str) -> str: Allow pipeline to be called as a function

    Example:
        >>> config = {"clean_html": True, "stopword": True}
        >>> pipeline = Pipeline(config)
        >>> result = pipeline.process("<p>Saya sedang belajar NLP</p>")
        >>> # or use as callable
        >>> result = pipeline("<p>Saya sedang belajar NLP</p>")
    """

Available Processing Steps

See Pipeline Configuration Options for complete list.

🛠️ Development

Running Tests

# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run tests with coverage
pytest --cov=nahiarhdNLP --cov-report=html

# Run specific test file
pytest nahiarhdNLP/tests/test_pipeline.py

Code Formatting

# Format code with black
black nahiarhdNLP/

# Sort imports with isort
isort nahiarhdNLP/

# Lint with flake8
flake8 nahiarhdNLP/

Building Package

# Install build tools
pip install build twine

# Build distributions
python -m build

# Upload to TestPyPI
twine upload --repository testpypi dist/*

# Upload to PyPI
twine upload dist/*

🤝 Contributing

Contributions are welcome! Please follow these steps:

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

Development Guidelines

Follow PEP 8 style guide
Add tests for new features
Update documentation
Add examples for new functionality

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

👤 Author

Raihan Hidayatullah Djunaedi

Email: [email protected]
GitHub: @raihanhd12

🙏 Acknowledgments

Sastrawi - Indonesian stemming library
Indonesian NLP Community - For datasets and inspiration
All contributors who helped improve this library

📊 Project Statistics

Made with ❤️ for Indonesian NLP Community

⬆ Back to Top

Name		Name	Last commit message	Last commit date
Latest commit History 39 Commits
nahiarhdNLP		nahiarhdNLP
.gitignore		.gitignore
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

🇮🇩 nahiarhdNLP

Advanced Indonesian Natural Language Processing Library

📚 Table of Contents

🌟 Overview

✨ Key Features

📦 Installation

Using pip

From source

Requirements

🚀 Quick Start

🎯 Features

🧹 Text Cleaning

📝 Text Normalization

🔤 Linguistic Processing

🔄 Text Replacement

💡 Comprehensive Examples

1. Pipeline Configuration

Example 1.1: Basic Pipeline

Example 1.2: Social Media Text Cleaning

Example 1.3: Update Pipeline Configuration

Example 1.4: Feature Discovery

2. Text Cleaning

Example 2.1: HTML Tag Removal

Example 2.2: URL Processing

Example 2.3: Mention & Hashtag Processing

Example 2.4: Emoji Handling

Example 2.5: Repeated Characters Normalization

3. Text Normalization

Example 3.1: Emoji Conversion

Example 3.2: Spell Correction & Slang Normalization

Example 3.3: Complete Text Normalization Pipeline

4. Linguistic Processing

Example 4.1: Stemming

Example 4.2: Stopword Removal

Example 4.3: Complete Linguistic Pipeline

Example 4.4: Tokenization

5. Text Replacement

Example 5.1: Email, Link, and Mention Replacement

Example 5.2: Data Anonymization Pipeline

6. Dataset Loaders

Example 6.1: Loading Built-in Datasets

⚙️ Pipeline Configuration Options

Complete Configuration Reference

Configuration Tips

📖 API Documentation

Pipeline Class

Available Processing Steps

🛠️ Development

Running Tests

Code Formatting

Building Package

🤝 Contributing

Development Guidelines

📝 License

👤 Author

🙏 Acknowledgments

📊 Project Statistics

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages