This directory contains all output files from the insight extraction and threading pipeline.
Data files are excluded from git via .gitignore due to their large size (27MB+ for extraction files). You will need to run the pipeline to generate these files.
threads_final.json - 20 curated threads (116 insights) - USE THIS FOR PRODUCTION
- Manually curated thread names
- Deduplicated insights (max 1 per episode per thread)
- All threads span 2+ different episodes
- Includes YouTube timestamp URLs
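
The two structural guarantees above (at most one insight per episode per thread, and at least two distinct episodes per thread) can be checked mechanically. The sketch below is only an illustration: the JSON schema is not documented in this file, so the `episode_id` key and the shape of an insight as a plain dict are assumptions that would need adjusting to the real field names.

```python
from collections import Counter


def check_thread(insights, episode_key="episode_id"):
    """Return a list of problems for one thread, given its insights as dicts.

    `episode_key` is a hypothetical field name; the real export may use another.
    """
    counts = Counter(insight[episode_key] for insight in insights)
    problems = []

    # Rule 1: deduplicated insights, max 1 per episode per thread.
    duplicated = sorted(ep for ep, n in counts.items() if n > 1)
    if duplicated:
        problems.append(f"more than one insight from episode(s): {duplicated}")

    # Rule 2: every thread spans 2+ different episodes.
    if len(counts) < 2:
        problems.append("thread spans fewer than 2 episodes")

    return problems
```

An empty list means the thread satisfies both rules; anything else names the rule that failed.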
modal_extraction_YYYYMMDD_HHMMSS.json - Raw insight extraction
- 465 insights from 13,513 chunks
- Includes novelty/specificity scores
- Video URLs and timestamps
threads_v2_YYYYMMDD_HHMMSS.json - Raw threading results
- Output from Louvain community detection
- Before quality filtering and curation
named_threads_v2_YYYYMMDD_HHMMSS.json - Named threads
- LLM-generated thread names (often ALL_CAPS)
- Before manual curation
quality_check_YYYYMMDD_HHMMSS.json - Quality scores
- STRONG / MODERATE / WEAK / REJECT verdicts
- Thread coherence metrics
- threads_v2_clean.json - Cleaned threads before final curation
- threads_v2_min2.json - All threads with 2+ insights (before filtering)
- threads_v2_final.json - Final automated output (before manual name curation)
- threads_validated.json - Validation results
- pg_extraction_YYYYMMDD_HHMMSS.json - Insights from Paul Graham essays
- pg_threads_final.json - Final threads from PG essays
Timestamped files follow the pattern: {type}_{YYYYMMDD}_{HHMMSS}.json
- YYYYMMDD: Date (e.g., 20260120 = January 20, 2026)
- HHMMSS: Time in 24-hour format
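
Because several runs can leave multiple timestamped files side by side, it can help to parse the naming pattern and pick the newest file of a given type. The following is a minimal sketch, assuming only the `{type}_{YYYYMMDD}_{HHMMSS}.json` pattern described above; the helper names `parse_stamped_name` and `latest` are illustrative and not part of the pipeline.

```python
import re
from datetime import datetime
from pathlib import Path

# Matches names such as modal_extraction_20260120_153000.json
STAMPED = re.compile(r"^(?P<type>.+)_(?P<date>\d{8})_(?P<time>\d{6})\.json$")


def parse_stamped_name(name):
    """Return (type, datetime) for a timestamped file name, or None if it does not match."""
    match = STAMPED.match(Path(name).name)
    if match is None:
        return None
    try:
        stamp = datetime.strptime(match["date"] + match["time"], "%Y%m%d%H%M%S")
    except ValueError:
        # Digits were present but do not form a valid date or time.
        return None
    return match["type"], stamp


def latest(names, file_type):
    """Return the most recent name of the given type, or None if there is none."""
    candidates = []
    for name in names:
        parsed = parse_stamped_name(name)
        if parsed is not None and parsed[0] == file_type:
            candidates.append((parsed[1], name))
    return max(candidates)[1] if candidates else None
```

For example, `latest(map(str, Path("data").glob("*.json")), "modal_extraction")` would select the newest extraction file while ignoring fixed-name files such as threads_final.json, which a bare shell glob like `threads_*.json` may also match.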
To generate these files, run:
# Extract insights
modal run modal_extract.py --db your_database.db
# Find threads
python find_threads_v2.py --input data/modal_extraction_*.json
# Name threads
modal run name_clusters.py --input data/threads_*.json
# Check quality
modal run check_thread_quality.py --input data/named_threads_*.json
# Create final export
python create_final_export.py

Typical file sizes:
- Extraction files: ~27MB (full insights with metadata)
- Thread files: ~300KB - 1MB (depends on number of threads)
- Final output: ~300KB (curated threads only)
- Quality check: ~5KB (scores only)
Database files (.db, .sqlite) are also gitignored. You'll need:
- A SQLite database with chunks and documents tables
- See README.md for schema requirements
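
Before starting a long extraction run, it may be worth confirming that the database actually contains the two required tables. This is a minimal sketch using Python's standard `sqlite3` module; it checks table names only, not columns, since the schema itself is documented in README.md rather than here. The function name `missing_tables` is illustrative.

```python
import sqlite3

# Table names taken from the requirement stated above.
REQUIRED_TABLES = {"chunks", "documents"}


def missing_tables(db_path, required=REQUIRED_TABLES):
    """Return the set of required table names that are absent from the database.

    Note: sqlite3.connect creates an empty database file if db_path does not exist.
    """
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        ).fetchall()
    finally:
        conn.close()
    return set(required) - {name for (name,) in rows}
```

An empty set means both tables are present, for example `missing_tables("your_database.db") == set()`.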