Command-line application for cataloging and analyzing files. The tool builds a manifest of local documents, images, and videos, extracts text or embeddings, and enables downstream analysis with local and vector-backed AI tooling.
- Detects MIME types, extracts text from PDFs and Word documents, and performs OCR on images and scanned pages.
- Generates embeddings and stores them in a local ChromaDB instance for fast semantic queries.
- Integrates with a locally hosted Ollama LLM to power interactive analysis workflows.
- Provides a modular command structure so you can swap out or extend analysis pipelines easily.
This repository is organized into the following directories and files:
- .github/: Contains GitHub-specific files, such as workflow definitions and issue templates.
- scripts/: Holds standalone or one-off scripts for maintenance, automation, or specific tasks that are not part of the main application logic.
- src/: Contains the core application source code, organized into modules with specific responsibilities.
- tests/: Includes all automated tests for the project. The structure of this directory mirrors the src/ directory.
- .gitignore: Specifies intentionally untracked files to be ignored by Git.
- .pre-commit-config.yaml: Configures pre-commit hooks, which are used to enforce code style and quality checks before committing code.
- AGENTS.md: Provides guidelines and instructions for AI agents working with this repository.
- Makefile: Contains make commands for common development tasks such as installing dependencies, running the application, and cleaning the project.
- README.md: This file, providing an overview of the project.
- app.py: The entry point for the Streamlit web interface, providing an interactive UI for the application.
- config.yaml: A configuration file for setting up application parameters, such as model names and other settings.
- main.py: The main orchestrator script for the entire processing pipeline, from file discovery to analysis and database storage.
- pyproject.toml: The project's build and dependency management file, used by uv.
- pytest.ini: The configuration file for pytest, the testing framework used in this project.
- uv.lock: The lock file generated by uv to ensure deterministic dependency installation.
make install
make setup-ollama
make run
This launches the Streamlit web interface for interactive querying of your digital archive.
The project also includes command-line utilities for batch processing:
Create a manifest of files in a directory:
uv run python src/discover_files.py /path/to/directory
This generates data/manifest.json with file metadata including paths, MIME types, and sizes.
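The manifest step can be pictured roughly as follows. This is a minimal sketch, not the code in src/discover_files.py: the function names `build_manifest` and `write_manifest` and the field names `path`, `mime_type`, and `size_bytes` are assumptions made for illustration, and the MIME type is guessed from the file extension using the standard library rather than by inspecting file contents.

```python
from __future__ import annotations

import json
import mimetypes
from pathlib import Path


def build_manifest(root: str) -> list[dict]:
    """Walk `root` and describe every regular file found beneath it."""
    entries = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        # guess_type looks only at the file extension, so unknown
        # extensions fall back to a generic binary type (assumed default).
        mime, _ = mimetypes.guess_type(path.name)
        entries.append(
            {
                "path": str(path),
                "mime_type": mime or "application/octet-stream",
                "size_bytes": path.stat().st_size,
            }
        )
    return entries


def write_manifest(entries: list[dict], out_path: str) -> None:
    """Write the manifest as indented JSON, creating parent directories."""
    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(entries, indent=2), encoding="utf-8")
```

Calling `write_manifest(build_manifest("/path/to/directory"), "data/manifest.json")` would produce a file of the shape described above, under these assumptions.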
Run the complete extraction and analysis pipeline:
uv run python main.py
This processes files from the manifest, extracts content, generates embeddings, and stores them in ChromaDB.
uv run python -m file_catalog --help
The scan command takes a directory to catalog and writes a manifest file. The analyze command runs follow-up analyses using the manifest. Both commands currently print placeholders; fill in the implementation using the helpers in src/file_catalog/.
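For orientation, a two-command CLI with placeholder handlers could be laid out with argparse along these lines. This is a sketch under assumptions: only the command names scan and analyze come from the description above, while the `--output` and `--manifest` options, their defaults, and the printed messages are hypothetical and may not match the actual package.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with `scan` and `analyze` subcommands."""
    parser = argparse.ArgumentParser(
        prog="file_catalog",
        description="Catalog files and run follow-up analyses.",
    )
    subparsers = parser.add_subparsers(dest="command", required=True)

    scan = subparsers.add_parser("scan", help="Catalog a directory into a manifest")
    scan.add_argument("directory", help="Directory to catalog")
    # Hypothetical option: where the manifest is written.
    scan.add_argument("--output", default="data/manifest.json")

    analyze = subparsers.add_parser("analyze", help="Analyze an existing manifest")
    # Hypothetical option: which manifest to read.
    analyze.add_argument("--manifest", default="data/manifest.json")
    return parser


def main(argv=None) -> int:
    """Dispatch to placeholder handlers; real logic would replace the prints."""
    args = build_parser().parse_args(argv)
    if args.command == "scan":
        print(f"scan: would catalog {args.directory} into {args.output}")
    else:
        print(f"analyze: would read {args.manifest}")
    return 0
```

Keeping each subcommand's arguments on its own subparser means a new analysis command can be added without touching the existing ones.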
The src/content_extractor.py module provides helpers for pulling text from
various file types:
- preprocess_for_ocr(image): Prepares a PIL image for OCR by converting to grayscale and applying binary thresholding.
- extract_content_from_docx(file_path): Extracts all text from a .docx file using python-docx.
- extract_content_from_image(file_path): Opens an image with Pillow, preprocesses it for OCR, and extracts text using pytesseract.
- extract_content_from_pdf(file_path): Extracts text from PDFs using a hybrid approach: digital text first, OCR for scanned pages (detected by low text length).
- extract_frames_from_video(file_path, output_dir, interval_sec): Extracts frames from videos at specified intervals and saves them as JPEG images.
These functions are designed to be called from the main CLI or standalone scripts for content processing.
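The hybrid PDF strategy mentioned above (digital text first, OCR when a page yields too little text) comes down to a per-page decision. The sketch below isolates that decision using only the standard library. The name `extract_hybrid`, the `ocr_page` callback, and the 50-character threshold are all assumptions for illustration; the actual module may use a different cutoff and calls its PDF and OCR libraries directly.

```python
from typing import Callable, Iterable


def extract_hybrid(
    page_texts: Iterable[str],
    ocr_page: Callable[[int], str],
    min_chars: int = 50,  # assumed threshold, not taken from the project
) -> str:
    """Return document text, running OCR only on pages with little digital text.

    `page_texts` holds the digitally extracted text of each page, and
    `ocr_page(index)` is a hypothetical callback that renders and OCRs
    the page at that index.
    """
    results = []
    for index, text in enumerate(page_texts):
        cleaned = text.strip()
        if len(cleaned) < min_chars:
            # Too little embedded text: treat the page as scanned.
            cleaned = ocr_page(index).strip()
        results.append(cleaned)
    return "\n\n".join(results)
```

Passing the OCR step in as a callback keeps the slow path lazy, so pages with enough embedded text never trigger rendering or OCR.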
To run the complete cataloging and analysis pipeline:
uv run python main.py
This script loads the manifest, processes each file through extraction and AI analysis stages, updates the database, and saves progress incrementally for resumability.
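One way to get the resumable behavior described above is to record a status on each manifest entry and rewrite the manifest after every file. The following is a sketch under assumptions, not the logic in main.py: the `status` and `error` fields, the `run_pipeline` name, and the `process` callback (standing in for extraction, analysis, and database updates) are hypothetical.

```python
import json
from pathlib import Path
from typing import Callable


def run_pipeline(manifest_path: str, process: Callable[[dict], dict]) -> int:
    """Process unfinished manifest entries, saving progress after each one.

    Returns the number of entries attempted in this run.
    """
    path = Path(manifest_path)
    entries = json.loads(path.read_text(encoding="utf-8"))
    attempted = 0
    for entry in entries:
        if entry.get("status") == "done":
            continue  # finished in an earlier run, skip on resume
        try:
            entry.update(process(entry))
            entry["status"] = "done"
        except Exception as exc:
            # Failed entries are recorded and retried on the next run.
            entry["status"] = "error"
            entry["error"] = str(exc)
        # Write to a temporary file and swap it in, so an interruption
        # mid-write does not leave a truncated manifest behind.
        tmp = path.with_suffix(path.suffix + ".tmp")
        tmp.write_text(json.dumps(entries, indent=2), encoding="utf-8")
        tmp.replace(path)
        attempted += 1
    return attempted
```

Rewriting the whole manifest per file is simple but scales linearly with manifest size; a large archive might prefer an append-only log or a small database instead.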
- Ensure system dependencies for OCR are installed (Tesseract, poppler for pdf2image, etc.). On macOS, run make check-tesseract to install automatically if needed.
- Ollama must be running for AI analysis features. Run make setup-ollama to start Ollama and download required models (llama3, deepseek-coder, llava).
- To reset the project and start fresh, run make clean to remove the manifest file and ChromaDB database.
- Torch and transformers may require additional system packages depending on your hardware. Consult their documentation for accelerated backends.
- Add standalone scripts to the scripts/ directory when workflows need bespoke orchestration outside the core CLI.
- Manage Python dependencies with uv: run uv add <package> to include new libraries and uv sync to refresh the virtual environment.
- Use uv run pre-commit install once to activate git hooks, then uv run pre-commit run --all-files to verify formatting and linting locally.