christyjacob4/mcat


🐱 mcat

cat on steroids: a drop-in cat replacement that understands Parquet, Avro, CSV, JSONL, and remote sources.

License: MIT Python 3.10+ GitHub Stars


Why mcat?

cat is everywhere, but it can't read Parquet or Avro. The existing tools (parquet-cli, avro-tools) are heavy Java dependencies that take ages to install. mcat installs with a single pip install (or uv tool install) and just works: all GNU cat flags, plus structured format support and remote sources out of the box.


Install

# With uv (recommended)
uv tool install mcat

# Or with pip
pip install mcat

# With Homebrew
brew tap christyjacob4/tap
brew install mcat

Usage

mcat works exactly like cat, and all the same flags work:

mcat file.txt                    # Same as cat
mcat -n file.txt                 # Number lines
mcat -b -s file.txt              # Number non-blank, squeeze blanks
mcat -A file.txt                 # Show all (tabs, ends, non-printing)
echo "hello" | mcat              # Stdin passthrough

But it also understands structured data:

mcat data.parquet                # Pretty table output
mcat data.parquet --format jsonl # As JSON Lines
mcat data.csv                    # CSV as table
mcat data.jsonl --head 10        # First 10 records
mcat data.parquet --schema       # Print schema only
mcat data.parquet --columns name,age  # Select columns
mcat data.parquet --grep "Smith"  # Rows where any column matches "Smith"
mcat data.csv --grep "^A" --columns name  # Names starting with A
mcat data.parquet --grep "2024" --format jsonl  # Rows mentioning 2024
mcat data.parquet --count        # Row count (instant for Parquet)
mcat data.parquet --sample 10    # Random 10 rows
mcat data.csv --sample 5 --format jsonl  # 5 random rows as JSONL
mcat data.parquet --detect       # Print detected format
mcat data.parquet --sort age           # Sort ascending by age
mcat data.parquet --sort -age          # Sort descending by age
mcat data.csv --sort "region,-sales"   # Multi-column sort
mcat data.csv --sort name --head 10    # Sort + head
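
The --sort syntax above reads as a comma-separated list of column names, where a leading - flips a column to descending order. A minimal Python sketch of that interpretation follows. It is not mcat's actual implementation: parse_sort_spec and sort_rows are hypothetical names, and rows are assumed to be plain dicts.

```python
from typing import Any


def parse_sort_spec(spec: str) -> list[tuple[str, bool]]:
    """Turn 'region,-sales' into [('region', False), ('sales', True)].

    The boolean is True when the column should sort descending.
    """
    keys = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        descending = part.startswith("-")
        keys.append((part[1:] if descending else part, descending))
    return keys


def sort_rows(rows: list[dict[str, Any]], spec: str) -> list[dict[str, Any]]:
    """Sort rows by every key in spec, with the first key taking priority.

    Python's sort is stable (also with reverse=True), so sorting by the
    least significant key first and the most significant key last gives
    a correct multi-column ordering with mixed directions.
    """
    result = list(rows)
    for column, descending in reversed(parse_sort_spec(spec)):
        result.sort(key=lambda row: row[column], reverse=descending)
    return result
```

Sorting once per key keeps mixed ascending/descending orders simple, at the cost of one pass per column.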

Comparing two structured files:

mcat --diff old.csv new.csv
mcat --diff prod.parquet staging.parquet --columns name,age

Column statistics (instant for Parquet, since only metadata is read):

mcat --stats data.parquet
mcat --stats --columns age,salary data.parquet   # specific columns only
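
Metadata-only reads are cheap because Parquet keeps its file metadata (schema, row counts, per-column statistics) in a footer at the end of the file, and the last 8 bytes record the footer length followed by the PAR1 magic. The sketch below only locates that footer with the standard library. It is an illustration rather than mcat's code, read_footer_bytes is a hypothetical name, and decoding the Thrift-encoded bytes would need a Parquet library.

```python
import io
import struct
from typing import BinaryIO

PARQUET_MAGIC = b"PAR1"


def read_footer_bytes(f: BinaryIO) -> bytes:
    """Return the raw (Thrift-encoded) footer metadata of a Parquet file.

    Only the tail of the file is read; the row data is never touched.
    """
    # Last 8 bytes: 4-byte little-endian footer length, then b"PAR1".
    f.seek(-8, io.SEEK_END)
    tail = f.read(8)
    if len(tail) != 8 or tail[4:] != PARQUET_MAGIC:
        raise ValueError("not a Parquet file (missing trailing PAR1 magic)")
    (footer_len,) = struct.unpack("<I", tail[:4])
    # The footer sits immediately before those 8 bytes.
    f.seek(-(8 + footer_len), io.SEEK_END)
    return f.read(footer_len)
```

On a remote object store the same idea maps to a couple of small ranged reads instead of a full download.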

Transparent compression (gzip, zstd, bz2, lz4, and xz all work):

mcat data.parquet.gz
mcat s3://bucket/logs.jsonl.zst --head 100
mcat data.csv.bz2 --stats
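
One plausible way to get this behaviour is to pick a decompressor from the outer file suffix and then detect the inner format from what remains of the name. The Python sketch below assumes exactly that and covers only the codecs in the standard library (gzip, bz2, xz); zstd and lz4 need third-party packages and are left out. open_decompressed is a hypothetical helper, not part of mcat.

```python
import bz2
import gzip
import lzma
from pathlib import Path
from typing import BinaryIO

# Suffix -> opener for codecs shipped with the Python standard library.
_OPENERS = {
    ".gz": gzip.open,
    ".bz2": bz2.open,
    ".xz": lzma.open,
}


def open_decompressed(path: str) -> tuple[BinaryIO, str]:
    """Open path for binary reading, decompressing if the suffix calls for it.

    Returns the stream plus the file name with the compression suffix
    removed, so the caller can still detect the inner format
    (for example 'data.csv' for 'data.csv.gz').
    """
    p = Path(path)
    opener = _OPENERS.get(p.suffix.lower())
    if opener is None:
        return open(p, "rb"), p.name
    return opener(p, "rb"), p.stem
```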

And remote sources (streaming, no full download):

mcat s3://bucket/data.parquet
mcat gs://bucket/data.parquet
mcat https://example.com/data.csv

# S3-compatible storage (MinIO, Cloudflare R2, Backblaze B2, DigitalOcean Spaces)
mcat --s3-endpoint https://play.min.io s3://mybucket/data.parquet

Format conversion with --output:

mcat data.parquet --format jsonl --output data.jsonl
mcat data.csv --format jsonl --output data.jsonl
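
For readers who want to see what a CSV-to-JSONL conversion amounts to, here is a small standard-library sketch. It is only an approximation of the second command above: csv_to_jsonl is a hypothetical name, and every value stays a string, whereas a real tool would likely infer column types.

```python
import csv
import json
from typing import TextIO


def csv_to_jsonl(src: TextIO, dst: TextIO) -> int:
    """Copy CSV rows from src to dst as JSON Lines and return the row count.

    The first CSV line is treated as the header; each later line becomes
    one JSON object keyed by the header names.
    """
    count = 0
    for row in csv.DictReader(src):
        dst.write(json.dumps(row) + "\n")
        count += 1
    return count
```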

Pager support for large output (respects $PAGER, defaults to less -R):

mcat large_data.parquet --pager          # view in pager
mcat data.csv --pager                     # page through CSV table
PAGER="more" mcat data.parquet --pager   # use 'more' instead of 'less'
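
The pager lookup described above ($PAGER, falling back to less -R) can be sketched in a few lines of Python. This is an assumption about the behaviour, not mcat's source: pager_command and page are hypothetical names, and treating an empty $PAGER as unset is a choice made for this sketch.

```python
import os
import shlex
import subprocess
from collections.abc import Mapping


def pager_command(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the pager argv: $PAGER if set and non-empty, else 'less -R'."""
    return shlex.split(env.get("PAGER") or "less -R")


def page(text: str, env: Mapping[str, str] = os.environ) -> None:
    """Send text to the pager's stdin and wait until the user quits it."""
    subprocess.run(pager_command(env), input=text, text=True, check=False)
```

Splitting with shlex lets $PAGER carry its own arguments, for example "less -S".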

Flag Reference

| Flag | Short | Description |
|------|-------|-------------|
| --number | -n | Number all output lines |
| --number-nonblank | -b | Number non-blank lines only |
| --squeeze-blank | -s | Squeeze multiple blank lines |
| --show-all | -A | Equivalent to -vET |
| --show-ends | -E | Display $ at end of each line |
| --show-tabs | -T | Display TAB as ^I |
| --show-nonprinting | -v | Use ^ and M- notation |
| | -e | Equivalent to -vE |
| | -t | Equivalent to -vT |
| --format | | Output format: table, jsonl, csv, or raw |
| --head | | Show first N rows |
| --tail | | Show last N rows |
| --schema | | Print schema only |
| --columns | | Comma-separated column names |
| --grep | | Filter rows where any column matches pattern (regex) |
| --sample | | Random sample of N rows |
| --count | -c | Print row count only |
| --sort | | Sort by column(s), prefix with - for descending |
| --query | | Filter with SQL WHERE clause (powered by DuckDB) |
| --stats | | Print column statistics summary |
| --diff | | Compare two structured files side by side |
| --detect | | Print detected format and exit |
| --output | -o | Write output to file instead of stdout |
| --pager | | Pipe output through pager (less/more) |
| --s3-endpoint | | Custom S3 endpoint URL (MinIO, R2, B2, Spaces) |
| --version | -V | Show version |
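
The ^ and M- notation used by -v, -E, and -T follows GNU cat: control bytes become ^X, DEL becomes ^?, and bytes with the high bit set get an M- prefix. The Python sketch below reproduces those rules for illustration. show_nonprinting is a hypothetical function, not mcat's code.

```python
def _visible(b: int) -> str:
    """Render one byte in cat -v notation (newline and tab handled by caller)."""
    prefix = ""
    if b >= 128:  # high bit set: M- plus the notation for the low 7 bits
        prefix = "M-"
        b -= 128
    if b < 32:  # control character, e.g. 0x0D -> ^M
        return prefix + "^" + chr(b + 64)
    if b == 127:  # DEL
        return prefix + "^?"
    return prefix + chr(b)


def show_nonprinting(data: bytes, show_tabs: bool = False,
                     show_ends: bool = False) -> str:
    """Apply -v, plus -T when show_tabs is set and -E when show_ends is set."""
    out = []
    for b in data:
        if b == 0x0A:  # newline is kept; -E marks it with a trailing $
            out.append("$\n" if show_ends else "\n")
        elif b == 0x09:  # tab is kept unless -T is given
            out.append("^I" if show_tabs else "\t")
        else:
            out.append(_visible(b))
    return "".join(out)
```

With both options enabled this corresponds to -A, so a Windows line ending shows up as ^M$.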

Format Support

| Format | Extensions | Features |
|--------|------------|----------|
| Parquet | .parquet, .pq | Stream row groups, schema inspect |
| Avro | .avro | Stream blocks |
| JSONL | .jsonl, .ndjson | Pretty-print each record |
| CSV | .csv | Table with headers |
| TSV | .tsv | Table with headers |
| Excel | .xlsx, .xls | First sheet |
| JSON | .json | Array of objects or single object |

Formats are detected by extension first, then by magic bytes (PAR1, Obj\x01) as a fallback.
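
The detection order described above (extension first, magic bytes as a fallback) could look roughly like the following Python sketch. The extension table mirrors the one in this section; detect_format is a hypothetical name, and returning None for unrecognized files is an assumption of the sketch, not documented mcat behaviour.

```python
from pathlib import Path
from typing import Optional

# Extension -> format, taken from the Format Support table.
EXTENSIONS = {
    ".parquet": "parquet", ".pq": "parquet",
    ".avro": "avro",
    ".jsonl": "jsonl", ".ndjson": "jsonl",
    ".csv": "csv",
    ".tsv": "tsv",
    ".xlsx": "excel", ".xls": "excel",
    ".json": "json",
}

# Leading 4 bytes -> format, for files with a missing or unknown extension.
MAGIC = {
    b"PAR1": "parquet",
    b"Obj\x01": "avro",
}


def detect_format(path: str) -> Optional[str]:
    """Detect a file's format by extension, then by its first 4 bytes."""
    by_extension = EXTENSIONS.get(Path(path).suffix.lower())
    if by_extension is not None:
        return by_extension  # no need to open the file at all
    with open(path, "rb") as f:
        head = f.read(4)
    return MAGIC.get(head)
```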

Output Formats

Use --format to control output:

  • table (default): Rich formatted table
  • jsonl: One JSON object per line
  • csv: CSV with headers
  • raw: Python repr

Authentication

mcat uses zero-config auth: it piggybacks on credentials you've already set up for your cloud provider. No mcat-specific credential flags are needed.

AWS S3

aws configure   # one-time setup → works everywhere
mcat s3://my-bucket/data.parquet

All standard AWS auth methods work automatically: ~/.aws/credentials, env vars (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY), named profiles (AWS_PROFILE), IAM roles, SSO, etc.

Google Cloud Storage

gcloud auth application-default login   # one-time setup
mcat gs://my-bucket/data.parquet

Also supports GOOGLE_APPLICATION_CREDENTIALS for service account keys.

Azure Blob Storage

# Set env vars once
export AZURE_STORAGE_ACCOUNT_NAME=myaccount
export AZURE_STORAGE_ACCOUNT_KEY=...
mcat az://mycontainer/data.parquet

Also works with az login and DefaultAzureCredential.

S3-Compatible Storage (MinIO, Cloudflare R2, Backblaze B2, DigitalOcean Spaces, Wasabi, Vultr)

# Option 1: AWS_ENDPOINT_URL env var (recommended; official in boto3/botocore 1.29+)
export AWS_ENDPOINT_URL=https://play.min.io
export AWS_ACCESS_KEY_ID=minioadmin
export AWS_SECRET_ACCESS_KEY=minioadmin
mcat s3://mybucket/data.parquet

# Option 2: Named profile in ~/.aws/config
# [profile minio]
# endpoint_url = https://play.min.io
# aws_access_key_id = minioadmin
# aws_secret_access_key = minioadmin
AWS_PROFILE=minio mcat s3://mybucket/data.parquet

# Option 3: Per-command --s3-endpoint override
mcat --s3-endpoint https://play.min.io s3://mybucket/data.parquet

License

MIT

About

cat on steroids - Parquet, Avro, ORC, CSV, JSONL and remote sources (S3, GCS, HTTP)
