
Data Directory

This directory contains all output files from the insight extraction and threading pipeline.

Important Note

Data files are excluded from git via .gitignore due to their large size (27MB+ for extraction files). You will need to run the pipeline to generate these files.

Output Files

Final Output

  • threads_final.json - 20 curated threads (116 insights) - USE THIS FOR PRODUCTION
    • Manually curated thread names
    • Deduplicated insights (max 1 per episode per thread)
    • All threads span 2+ different episodes
    • Includes YouTube timestamp URLs
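As a sketch only (the field names here are assumptions, not the file's confirmed schema), the curation guarantees above can be sanity-checked on a loaded thread like this:

```python
# One entry in threads_final.json, with hypothetical field names --
# consult the actual file for the real schema.
sample_thread = {
    "name": "Founder Hiring Mistakes",
    "insights": [
        {"episode": "ep-042", "text": "...", "youtube_url": "https://youtu.be/abc?t=123"},
        {"episode": "ep-077", "text": "...", "youtube_url": "https://youtu.be/def?t=456"},
    ],
}

episodes = [i["episode"] for i in sample_thread["insights"]]
assert len(episodes) == len(set(episodes))  # at most 1 insight per episode
assert len(set(episodes)) >= 2              # thread spans 2+ episodes
```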

Extraction Results

  • modal_extraction_YYYYMMDD_HHMMSS.json - Raw insight extraction
    • 465 insights from 13,513 chunks
    • Includes novelty/specificity scores
    • Video URLs and timestamps

Threading Outputs

  • threads_v2_YYYYMMDD_HHMMSS.json - Raw threading results
    • Output from Louvain community detection
    • Before quality filtering and curation
  • named_threads_v2_YYYYMMDD_HHMMSS.json - Named threads
    • LLM-generated thread names (often ALL_CAPS)
    • Before manual curation

Quality Validation

  • quality_check_YYYYMMDD_HHMMSS.json - Quality scores
    • STRONG / MODERATE / WEAK / REJECT verdicts
    • Thread coherence metrics
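A minimal sketch of filtering on these verdicts (the key names are assumptions; check the actual quality_check file for its real structure):

```python
# Hypothetical entries from a quality_check_*.json file; real key
# names and score ranges may differ.
reports = [
    {"thread": "Founder Hiring Mistakes", "verdict": "STRONG", "coherence": 0.91},
    {"thread": "Scaling War Stories", "verdict": "MODERATE", "coherence": 0.67},
    {"thread": "Misc Quotes", "verdict": "REJECT", "coherence": 0.22},
]

# Keep only threads judged STRONG or MODERATE.
keep = [r["thread"] for r in reports if r["verdict"] in ("STRONG", "MODERATE")]
```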

Intermediate Files

  • threads_v2_clean.json - Cleaned threads before final curation
  • threads_v2_min2.json - All threads with 2+ insights (before filtering)
  • threads_v2_final.json - Final automated output (before manual name curation)
  • threads_validated.json - Validation results

Paul Graham Essays

  • pg_extraction_YYYYMMDD_HHMMSS.json - Insights from Paul Graham essays
  • pg_threads_final.json - Final threads from PG essays

File Naming Convention

Timestamped files follow the pattern: {type}_{YYYYMMDD}_{HHMMSS}.json

  • YYYYMMDD: Date (e.g., 20260120 = January 20, 2026)
  • HHMMSS: Time in 24-hour format
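A small helper (a sketch; only the naming pattern above is guaranteed) can parse these timestamps, for example to select the newest extraction file in data/:

```python
from datetime import datetime
from pathlib import Path

def parse_timestamp(path):
    """Extract the datetime encoded in a {type}_{YYYYMMDD}_{HHMMSS}.json name."""
    date_part, time_part = Path(path).stem.split("_")[-2:]
    return datetime.strptime(date_part + time_part, "%Y%m%d%H%M%S")

# Newest raw extraction file, or None if the pipeline hasn't run yet.
latest = max(Path("data").glob("modal_extraction_*.json"),
             key=parse_timestamp, default=None)
```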

Running the Pipeline

To generate these files, run:

# Extract insights
modal run modal_extract.py --db your_database.db

# Find threads
python find_threads_v2.py --input data/modal_extraction_*.json

# Name threads
modal run name_clusters.py --input data/threads_*.json

# Check quality
modal run check_thread_quality.py --input data/named_threads_*.json

# Create final export
python create_final_export.py

File Sizes

Typical file sizes:

  • Extraction files: ~27MB (full insights with metadata)
  • Thread files: ~300KB - 1MB (depends on number of threads)
  • Final output: ~300KB (curated threads only)
  • Quality check: ~5KB (scores only)

Database Files

Database files (.db, .sqlite) are also gitignored. You'll need:

  • A SQLite database with chunks and documents tables
  • See README.md for schema requirements
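For orientation only, a minimal database with those two tables might look like the following sketch. The column names are assumptions for illustration; README.md has the authoritative schema requirements.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE documents (
        id INTEGER PRIMARY KEY,
        title TEXT,
        url TEXT                -- source video URL (assumed column)
    );
    CREATE TABLE chunks (
        id INTEGER PRIMARY KEY,
        document_id INTEGER REFERENCES documents(id),
        start_time REAL,        -- seconds into the episode (assumed column)
        text TEXT
    );
""")
tables = {row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")}
```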