This project analyzes online forum discussions using Natural Language Processing (NLP) and Large Language Models (LLMs) to evaluate deliberation quality. The analysis focuses on structural features, topic modeling, and interaction patterns in Kaggle forum discussions.

Project structure:
project/
├── src/
│ ├── analysis.py # Topic modeling and text analysis
│ ├── preprocess.py # Data loading and feature extraction
│ └── visualize.py # Data visualization
├── data/
│ └── competition-hosting.json
├── docker/
│ └── Dockerfile
├── notebooks/
│ └── exploratory_analysis.ipynb
├── requirements.txt
├── main.py
└── README.md

Features:
- Advanced text preprocessing and cleaning
- Structural feature extraction from discussions
- Topic modeling using LDA
- Technical term detection
- Thread depth and interaction analysis
- Expertise level evaluation
- Docker containerization
- Interactive visualization of discussion networks
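
As a rough illustration of the thread depth and interaction analysis listed above, the sketch below computes each post's depth and counts who replies to whom. It uses only the standard library. The record fields (`id`, `author`, `parent_id`) and the helper names `post_depths` and `reply_edges` are assumptions made for this example, not the schema or API of `src/preprocess.py`.

```python
from collections import Counter

# Hypothetical post records: each has an id, an author, and the id of the
# post it replies to (None for the thread's opening post).
posts = [
    {"id": 1, "author": "alice", "parent_id": None},
    {"id": 2, "author": "bob", "parent_id": 1},
    {"id": 3, "author": "carol", "parent_id": 2},
    {"id": 4, "author": "alice", "parent_id": 2},
    {"id": 5, "author": "bob", "parent_id": 1},
]


def post_depths(posts):
    """Return {post_id: depth}, where the opening post has depth 0."""
    parent = {p["id"]: p["parent_id"] for p in posts}
    depths = {}

    def depth_of(post_id):
        if post_id not in depths:
            pid = parent.get(post_id)
            # Posts with no parent, or an unknown parent, count as roots.
            if pid is None or pid not in parent:
                depths[post_id] = 0
            else:
                depths[post_id] = depth_of(pid) + 1
        return depths[post_id]

    for p in posts:
        depth_of(p["id"])
    return depths


def reply_edges(posts):
    """Count (replier, replied_to) author pairs as weighted directed edges."""
    author = {p["id"]: p["author"] for p in posts}
    edges = Counter()
    for p in posts:
        pid = p["parent_id"]
        if pid in author:
            edges[(p["author"], author[pid])] += 1
    return edges


depths = post_depths(posts)
edges = reply_edges(posts)
print("max thread depth:", max(depths.values()))
print("reply edges:", dict(edges))
```

The edge counts can serve as weights in an interaction network. The sketch assumes reply chains contain no cycles; cyclic parent references would need an explicit guard.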

Requirements:
- Python 3.10.16
- NLTK
- Gensim
- NumPy
- Docker (optional)
# Create and activate conda environment
conda create -n master_project python=3.10.16
conda activate master_project
# Install dependencies
pip install -r requirements.txt
# Download NLTK data
python -m nltk.downloader stopwords wordnet
python -c "import nltk; nltk.download('punkt'); nltk.download('punkt_tab')"

# Build Docker image
# Run from the project root; the Dockerfile lives in docker/
docker build -t deliberation-analysis -f docker/Dockerfile .
# Run container
docker run -it deliberation-analysis

from src.preprocess import load_kaggle_data, extract_structural_features
# analyze_discussions is assumed to be defined in src/analysis.py
from src.analysis import lda_analysis, analyze_discussions
# Load and analyze data
data = load_kaggle_data("./data/competition-hosting.json")
results = analyze_discussions(data)

from src.visualize import plot_interaction_network
# Generate interaction network visualization
plot_interaction_network(results)

EXPERTISE_RANKS = {
    "Novice": 1,
    "Contributor": 2,
    "Expert": 3,
    "Master": 4,
    "Grandmaster": 5,
    None: 0
}

The project uses environment variables for configuration:
PYTHONPATH=/app
DATA_DIR=/app/data
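
One way code might read this configuration is sketched below. It resolves dataset paths from `DATA_DIR` and falls back to a local directory when the variable is unset. The helper names `get_data_dir` and `dataset_path` and the `./data` fallback are assumptions for illustration; the project may read its configuration differently.

```python
import os
from pathlib import Path

# Fallback used when DATA_DIR is unset. "./data" is an assumption based on
# the project layout, not a documented default.
DEFAULT_DATA_DIR = "./data"


def get_data_dir(environ=os.environ):
    """Return the data directory from DATA_DIR, or the fallback if unset."""
    return Path(environ.get("DATA_DIR", DEFAULT_DATA_DIR))


def dataset_path(filename, environ=os.environ):
    """Build the path of a dataset file inside the data directory."""
    return get_data_dir(environ) / filename


# Passing a plain dict instead of os.environ keeps the example reproducible.
print(dataset_path("competition-hosting.json", {"DATA_DIR": "/app/data"}))
print(dataset_path("competition-hosting.json", {}))
```

Accepting the environment as a parameter keeps the helpers easy to test without modifying the real process environment.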

Implemented:
- Basic text preprocessing
- LDA implementation
- Docker support

Planned:
- Add configuration file for constants
- Improve visualization options

Notes:
- Handles missing data gracefully
- Includes debug outputs for analysis steps
- Uses adaptive parameters based on data size
- Preserves technical terminology during preprocessing
- Docker container includes all necessary NLTK data
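
The `None: 0` entry in `EXPERTISE_RANKS` suggests one way missing data can be handled: a user with no recorded tier gets rank 0 instead of raising an error. The sketch below illustrates that idea. The helpers `expertise_rank` and `mean_expertise` are hypothetical, and mapping unrecognized tier labels to 0 is an added assumption, not confirmed project behavior.

```python
EXPERTISE_RANKS = {
    "Novice": 1,
    "Contributor": 2,
    "Expert": 3,
    "Master": 4,
    "Grandmaster": 5,
    None: 0,
}


def expertise_rank(tier):
    """Map a tier label to its rank; missing or unknown tiers get rank 0."""
    # Assumption: unrecognized labels are treated like a missing tier.
    return EXPERTISE_RANKS.get(tier, EXPERTISE_RANKS[None])


def mean_expertise(tiers):
    """Average rank over a thread's participants (0.0 for an empty thread)."""
    ranks = [expertise_rank(t) for t in tiers]
    return sum(ranks) / len(ranks) if ranks else 0.0


# One participant has no tier and one has an unrecognized label.
print(mean_expertise(["Expert", None, "Grandmaster", "Unranked"]))  # 2.0
```

Counting missing tiers as 0 lowers the thread average. If that bias matters, an alternative is to exclude them from the mean.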