File Catalog CLI

A command-line application for cataloging and analyzing files. The tool builds a manifest of local documents, images, and videos, extracts text, generates embeddings, and enables downstream analysis with local and vector-backed AI tooling.

Features

  • Detects MIME types, extracts text from PDFs and Word documents, and performs OCR on images and scanned pages.
  • Generates embeddings and stores them in a local ChromaDB instance for fast semantic queries.
  • Integrates with a locally hosted Ollama LLM to power interactive analysis workflows.
  • Provides a modular command structure so you can swap out or extend analysis pipelines easily.

Repository structure

This repository is organized into the following directories and files:

Directories

  • .github/: Contains GitHub-specific files, such as workflow definitions and issue templates.
  • scripts/: Holds standalone or one-off scripts for maintenance, automation, or specific tasks that are not part of the main application logic.
  • src/: Contains the core application source code, organized into modules with specific responsibilities.
  • tests/: Includes all automated tests for the project. The structure of this directory mirrors the src/ directory.

Files

  • .gitignore: Specifies intentionally untracked files to be ignored by Git.
  • .pre-commit-config.yaml: Configures pre-commit hooks, which are used to enforce code style and quality checks before committing code.
  • AGENTS.md: Provides guidelines and instructions for AI agents working with this repository.
  • Makefile: Contains make commands for common development tasks such as installing dependencies, running the application, and cleaning the project.
  • README.md: This file, providing an overview of the project.
  • app.py: The entry point for the Streamlit web interface, providing an interactive UI for the application.
  • config.yaml: A configuration file for setting up application parameters, such as model names and other settings.
  • main.py: The main orchestrator script for the entire processing pipeline, from file discovery to analysis and database storage.
  • pyproject.toml: The project's build and dependency management file, used by uv.
  • pytest.ini: The configuration file for pytest, the testing framework used in this project.
  • uv.lock: The lock file generated by uv to ensure deterministic dependency installation.

Getting started

make install
make setup-ollama
make run

This launches the Streamlit web interface for interactive querying of your digital archive.
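The make targets wrap the underlying tools. If you prefer to skip make, the Streamlit app can be launched directly from its entry point; this assumes make run is a thin wrapper around the standard Streamlit launcher:

uv run streamlit run app.py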

Command-line tools

The project also includes command-line utilities for batch processing:

File discovery

Create a manifest of files in a directory:

uv run python src/discover_files.py /path/to/directory

This generates data/manifest.json with file metadata including paths, MIME types, and sizes.
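To inspect or filter the manifest from Python, a minimal sketch like the following should work; it assumes the manifest is a JSON array of per-file records, and the field names used here are illustrative, so check a real entry for the exact schema:

import json
from pathlib import Path

# Load the manifest produced by discover_files.py.
entries = json.loads(Path("data/manifest.json").read_text())

# Field names below are assumptions; verify them against an actual entry.
for entry in entries:
    print(entry["path"], entry.get("mime_type"), entry.get("size"))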

Main processing pipeline

Run the complete extraction and analysis pipeline:

uv run python main.py

This loads the manifest, runs each file through the extraction and AI analysis stages, generates embeddings, stores the results in ChromaDB, and saves progress incrementally so an interrupted run can resume where it left off.
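After a run has populated ChromaDB, you can query the store for semantically similar documents. Here is a minimal sketch using the standard ChromaDB client API; the storage path and collection name below are hypothetical, so check config.yaml or the source for the actual values:

import chromadb

# Hypothetical path and collection name; the real ones live in the project config.
client = chromadb.PersistentClient(path="data/chroma")
collection = client.get_or_create_collection("documents")

# Query by text; ChromaDB embeds the query with the collection's embedding function.
results = collection.query(query_texts=["2019 tax documents"], n_results=5)
print(results["documents"])

If the pipeline stores embeddings produced by its own model, you may need to embed the query with that same model and pass query_embeddings instead of query_texts.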

Legacy CLI (file_catalog)

uv run python -m file_catalog --help

The scan command takes a directory to catalog and writes a manifest file. The analyze command runs follow-up analyses using the manifest. Both commands currently print placeholders; fill in the implementation using the helpers in src/file_catalog/.
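Based on the descriptions above, typical invocations would look like the following; the exact arguments are illustrative, so consult --help on each subcommand for authoritative usage:

uv run python -m file_catalog scan /path/to/directory
uv run python -m file_catalog analyze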

Content extraction utilities

The src/content_extractor.py module provides helpers for pulling text from various file types:

  • preprocess_for_ocr(image): Prepares a PIL image for OCR by converting to grayscale and applying binary thresholding.
  • extract_content_from_docx(file_path): Extracts all text from a .docx file using python-docx.
  • extract_content_from_image(file_path): Opens an image with Pillow, preprocesses it for OCR, and extracts text using pytesseract.
  • extract_content_from_pdf(file_path): Extracts text from PDFs using a hybrid approach: it reads embedded digital text first, and pages that yield little text are treated as scanned and run through OCR.
  • extract_frames_from_video(file_path, output_dir, interval_sec): Extracts frames from videos at specified intervals and saves them as JPEG images.

These functions are designed to be called from the main CLI or standalone scripts for content processing.
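As a sketch of how these helpers compose, the snippet below dispatches on file extension. It assumes the repository root is on the import path so src/ resolves as a package; note that the real pipeline keys off detected MIME types rather than file suffixes:

from pathlib import Path

from src.content_extractor import (
    extract_content_from_docx,
    extract_content_from_image,
    extract_content_from_pdf,
)

# Suffix-based dispatch for illustration only; main.py uses MIME types.
EXTRACTORS = {
    ".pdf": extract_content_from_pdf,
    ".docx": extract_content_from_docx,
    ".png": extract_content_from_image,
    ".jpg": extract_content_from_image,
}

def extract_text(path: str) -> str | None:
    extractor = EXTRACTORS.get(Path(path).suffix.lower())
    return extractor(path) if extractor else None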

Development notes

  • Ensure system dependencies for OCR are installed (Tesseract, poppler for pdf2image, etc.). On macOS, run make check-tesseract to install automatically if needed.
  • Ollama must be running for AI analysis features. Run make setup-ollama to start Ollama and download required models (llama3, deepseek-coder, llava).
  • To reset the project and start fresh, run make clean to remove the manifest file and ChromaDB database.
  • Torch and transformers may require additional system packages depending on your hardware. Consult their documentation for accelerated backends.
  • Add standalone scripts to the scripts/ directory when workflows need bespoke orchestration outside the core CLI interface.
  • Manage Python dependencies with uv: run uv add <package> to include new libraries and uv sync to refresh the virtual environment.
  • Use uv run pre-commit install once to activate git hooks, then uv run pre-commit run --all-files to verify formatting and linting locally.
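Putting those notes together, a typical local verification pass (assuming the default pytest configuration in pytest.ini) looks like:

uv sync
uv run pre-commit run --all-files
uv run pytest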
