This directory contains utility scripts for maintaining and improving the Instructor documentation and project structure.
## make_clean.py

Purpose: Cleans markdown files by removing special whitespace characters and replacing em dashes with regular dashes.
What it does:
- Recursively finds all `.md` files in the `docs/` directory
- Removes special Unicode whitespace characters (non-breaking spaces, zero-width spaces, etc.)
- Replaces em dashes (`—`) and en dashes (`–`) with regular dashes (`-`)
- Preserves intentional formatting while cleaning problematic characters
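The cleaning pass boils down to a small set of character substitutions. A minimal sketch in Python (the exact character set handled by `make_clean.py` may differ):

```python
import re

# Illustrative character sets; the script's exact list may differ.
ZERO_WIDTH = re.compile("[\u200b\u200c\u200d\ufeff]")   # removed outright
NBSP_LIKE = re.compile("[\u00a0\u2007\u202f]")          # normalized to a plain space

def clean_text(text: str) -> str:
    """Remove zero-width characters and normalize spaces and dashes."""
    text = ZERO_WIDTH.sub("", text)
    text = NBSP_LIKE.sub(" ", text)
    return text.replace("\u2014", "-").replace("\u2013", "-")
```

Zero-width characters are deleted while non-breaking spaces become ordinary spaces, since the former carry no visible width and the latter do.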
Usage:
```bash
# Clean all markdown files in docs/
python scripts/make_clean.py

# Dry run to see what would be changed
python scripts/make_clean.py --dry-run

# Clean files in a different directory
python scripts/make_clean.py --docs-dir path/to/docs
```

Pre-commit Integration: This script runs automatically on commits that include markdown files in the `docs/` directory.
## check_blog_excerpts.py

Purpose: Ensures all blog posts contain the `<!-- more -->` tag for proper excerpt handling.
What it does:
- Scans all markdown files in `docs/blog/posts/`
- Checks for the presence of `<!-- more -->` tags
- Reports files missing the tag
- Exits with error code 1 if any files are missing the tag
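The check is essentially a substring scan over each post. A minimal sketch, with an illustrative function name rather than the script's actual API:

```python
from pathlib import Path

EXCERPT_TAG = "<!-- more -->"

def find_posts_missing_excerpt(posts_dir: str) -> list[Path]:
    """Return blog post files that lack the excerpt tag."""
    return [
        path
        for path in sorted(Path(posts_dir).glob("**/*.md"))
        if EXCERPT_TAG not in path.read_text(encoding="utf-8")
    ]
```

The real script prints each offending file and exits with code 1 when this list is non-empty, which is what fails the pre-commit hook.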
Usage:
```bash
# Check all blog posts
python scripts/check_blog_excerpts.py

# Check posts in a different directory
python scripts/check_blog_excerpts.py --blog-posts-dir path/to/posts
```

Pre-commit Integration: This script runs automatically on commits that include blog post files.
## make_sitemap.py

Purpose: Generates an enhanced sitemap (`sitemap.yaml`) with AI-powered content analysis and cross-link suggestions.
What it does:
- Recursively traverses the `docs/` directory
- Analyzes each markdown file using OpenAI's GPT-4o-mini
- Extracts summaries, keywords, and topics for SEO
- Identifies internal links and references
- Generates cross-link suggestions based on content similarity
- Creates a comprehensive `sitemap.yaml` file
Features:
- Caching: Reuses analysis for unchanged files (based on content hash)
- Concurrent Processing: Processes multiple files simultaneously
- Cross-linking: Suggests related documents based on content similarity
- Retry Logic: Handles API failures with exponential backoff
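The caching feature can be sketched as keying each file's analysis by a hash of its content, so unchanged files skip the API call entirely. Names here are illustrative, not the script's actual internals:

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of a file's content, used as the cache key."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def analyze_with_cache(path: str, text: str, cache: dict, analyze) -> dict:
    """Reuse the cached analysis when the content hash is unchanged."""
    digest = content_hash(text)
    entry = cache.get(path)
    if entry and entry.get("hash") == digest:
        return entry                  # unchanged file: skip the expensive LLM call
    entry = analyze(text)             # expensive LLM analysis
    entry["hash"] = digest
    cache[path] = entry
    return entry
```

This is why the generated `sitemap.yaml` stores a `hash` field per file: on the next run, only files whose content hash changed are re-analyzed.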
Usage:
```bash
# Generate sitemap with default settings
python scripts/make_sitemap.py

# Customize settings
python scripts/make_sitemap.py \
    --root-dir docs \
    --output-file sitemap.yaml \
    --max-concurrency 10 \
    --min-similarity 0.4

# Use custom API key
python scripts/make_sitemap.py --api-key your-openai-key
```

Output: Creates `sitemap.yaml` with the structure:
```yaml
file.md:
  summary: "Brief description of the content"
  keywords: ["keyword1", "keyword2", "keyword3"]
  topics: ["topic1", "topic2", "topic3"]
  references: ["other-file.md", "another-file.md"]
  ai_references: ["ai-detected-reference.md"]
  cross_links: ["suggested-related-file.md"]
  hash: "content-hash-for-caching"
```

Requirements:
- OpenAI API key (set as the `OPENAI_API_KEY` environment variable or passed via `--api-key`)
- Dependencies: `openai`, `typer`, `rich`, `tenacity`, `pyyaml`
## Pre-commit Integration

These scripts are integrated into the project's pre-commit hooks to ensure code quality:
- `make_clean.py`: Runs on commits with markdown files in `docs/`
- `check_blog_excerpts.py`: Runs on commits with blog post files
The hooks are configured in `.pre-commit-config.yaml` and run automatically during the commit process.
## Manual Usage

You can run any script manually for testing or one-time operations:
```bash
# Test markdown cleaning
python scripts/make_clean.py --dry-run

# Check blog excerpts
python scripts/check_blog_excerpts.py

# Generate fresh sitemap
python scripts/make_sitemap.py
```

## fix_api_calls.py

Purpose: Replaces old API call patterns with simplified versions for consistency.
What it does:
- Finds and replaces `client.chat.completions.create` → `client.create`
- Finds and replaces `client.chat.completions.create_partial` → `client.create_partial`
- Finds and replaces `client.chat.completions.create_iterable` → `client.create_iterable`
- Finds and replaces `client.chat.completions.create_with_completion` → `client.create_with_completion`
- Processes all markdown and notebook files in the docs directory
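Because each old call site is a literal method path, the rewrite reduces to plain string replacement, done longest-pattern-first so the shorter `create` pattern cannot clobber the others. A sketch (the real script also processes notebook files, which this ignores):

```python
# Longest patterns first so e.g. create_partial is not matched by create.
REPLACEMENTS = [
    ("client.chat.completions.create_with_completion", "client.create_with_completion"),
    ("client.chat.completions.create_iterable", "client.create_iterable"),
    ("client.chat.completions.create_partial", "client.create_partial"),
    ("client.chat.completions.create", "client.create"),
]

def fix_api_calls(text: str) -> str:
    """Rewrite old chat-completions call sites to the simplified client API."""
    for old, new in REPLACEMENTS:
        text = text.replace(old, new)
    return text
```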
Usage:
```bash
# Dry run to see what would be changed
python scripts/fix_api_calls.py --dry-run

# Apply changes to all files
python scripts/fix_api_calls.py

# Process a single file
python scripts/fix_api_calls.py --file docs/index.md

# Custom docs directory
python scripts/fix_api_calls.py --docs-dir path/to/docs
```

## fix_old_patterns.py

Purpose: Replaces old client initialization patterns with the modern `from_provider` API.
What it does:
- Replaces `instructor.from_openai(OpenAI())` → `instructor.from_provider("openai/model-name")`
- Replaces `instructor.from_anthropic(Anthropic())` → `instructor.from_provider("anthropic/model-name")`
- Replaces `instructor.patch(OpenAI())` → `instructor.from_provider("openai/model-name")`
instructor.patch(OpenAI())→instructor.from_provider("openai/model-name") - Handles all supported providers (OpenAI, Anthropic, Google, Cohere, Mistral, Groq, etc.)
- Attempts to extract model names from existing code
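The initialization rewrite can be sketched as a few regex substitutions. Note the `model-name` placeholder: the real script attempts to recover the actual model from surrounding code, which this sketch does not:

```python
import re

# Old constructor-wrapping patterns mapped to from_provider strings.
# "model-name" is a placeholder the real script tries to fill in.
PATTERNS = [
    (re.compile(r"instructor\.from_openai\(\s*OpenAI\(\)\s*\)"),
     'instructor.from_provider("openai/model-name")'),
    (re.compile(r"instructor\.from_anthropic\(\s*Anthropic\(\)\s*\)"),
     'instructor.from_provider("anthropic/model-name")'),
    (re.compile(r"instructor\.patch\(\s*OpenAI\(\)\s*\)"),
     'instructor.from_provider("openai/model-name")'),
]

def fix_old_patterns(text: str) -> str:
    """Rewrite legacy client initialization to the from_provider style."""
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```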
Usage:
```bash
# Dry run to see what would be changed
python scripts/fix_old_patterns.py --dry-run

# Apply changes to all files
python scripts/fix_old_patterns.py

# Process a single file
python scripts/fix_old_patterns.py --file docs/integrations/openai.md
```

Note: Model names are extracted from existing code when possible, but may need manual review for accuracy.
## audit_patterns.py

Purpose: Audits documentation files to find old patterns that need updating.
What it does:
- Finds old API call patterns (`client.chat.completions.*`)
- Finds old initialization patterns (`instructor.from_*`, `instructor.patch`)
- Identifies potentially unused imports
- Reports line numbers for each issue
- Provides summary statistics
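The audit can be sketched as scanning each line against a set of legacy-pattern regexes and recording the line numbers of any hits. The pattern list here is illustrative and deliberately excludes the modern `from_provider` call:

```python
import re

# Illustrative pattern list; the real audit covers more cases.
OLD_PATTERNS = {
    "old API call": re.compile(r"client\.chat\.completions\.\w+"),
    # from_provider is the modern pattern, so exclude it from the legacy match.
    "old initialization": re.compile(r"instructor\.(?:from_(?!provider)\w+|patch)\("),
}

def audit_text(text: str) -> list[tuple[int, str, str]]:
    """Return (line_number, issue, matched_text) for each old pattern found."""
    issues = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for issue, pattern in OLD_PATTERNS.items():
            for match in pattern.finditer(line):
                issues.append((lineno, issue, match.group()))
    return issues
```

Summary mode would simply aggregate this list into counts per issue type instead of printing each location.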
Usage:
```bash
# Detailed report with line numbers
python scripts/audit_patterns.py

# Summary statistics only
python scripts/audit_patterns.py --summary

# Audit a single file
python scripts/audit_patterns.py --file docs/index.md

# Custom docs directory
python scripts/audit_patterns.py --docs-dir path/to/docs
```

Output: Reports issues by file with line numbers, or summary statistics showing total counts per pattern type.
## Adding New Scripts

When adding new scripts to this directory:
- Documentation: Add a section to this README explaining the script's purpose and usage
- Pre-commit Integration: If appropriate, add the script to `.pre-commit-config.yaml`
- Error Handling: Ensure scripts exit with appropriate error codes
- Help Text: Include `--help` functionality for command-line scripts
- Testing: Test scripts manually before committing
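A minimal skeleton covering the error-handling and help-text points might look like this (argparse supplies `--help` automatically; all names here are illustrative):

```python
import argparse

def main(argv=None) -> int:
    """Entry point; the return value becomes the process exit code."""
    parser = argparse.ArgumentParser(description="Example docs maintenance script.")
    parser.add_argument("--docs-dir", default="docs", help="Directory to process")
    parser.add_argument("--dry-run", action="store_true", help="Preview changes only")
    args = parser.parse_args(argv)
    problems = 0  # a real script would scan args.docs_dir and count issues here
    return 1 if problems else 0
```

Call it with `sys.exit(main())` under an `if __name__ == "__main__"` guard so a non-zero return fails the pre-commit hook.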
## Dependencies

Most scripts use only Python standard library modules. The sitemap generator requires additional dependencies:

```bash
uv add openai typer rich tenacity pyyaml
```

## Troubleshooting

Pre-commit hooks failing:
- Check that scripts are executable: `chmod +x scripts/*.py`
- Verify script paths in `.pre-commit-config.yaml`
- Run scripts manually to identify issues
Sitemap generation issues:
- Ensure OpenAI API key is set correctly
- Check network connectivity for API calls
- Review error messages for specific file issues
Markdown cleaning issues:
- Use `--dry-run` to preview changes
- Check file permissions in the docs directory
- Verify UTF-8 encoding of markdown files