
dragonGR/flowfilter


flowfilter

A command-line tool that filters JSON, CSV, and log files. One tool, one syntax, any format.

Instead of juggling grep, awk, and jq with different syntax for each, flowfilter lets you write a single query that works across all three formats. The query language borrows from SQL — not because it talks to a database, but because most developers already know how WHERE and SELECT work.

How it works

You feed it a file (or pipe data via stdin), and write a query that references the field names in your data. That's the key idea — the field names in the query come directly from the fields in your JSON objects, CSV column headers, or parsed log fields.

Here's a JSON file with user records. Each record has fields called name, age, role, and active:

{"name": "Alice", "age": 32, "role": "admin", "active": true}
{"name": "Bob", "age": 24, "role": "user", "active": true}
{"name": "Charlie", "age": 45, "role": "admin", "active": false}
{"name": "Diana", "age": 28, "role": "user", "active": true}
{"name": "Eve", "age": 19, "role": "user", "active": false}

To filter this, you reference those field names in a WHERE clause:

$ flowfilter 'WHERE age > 25 AND active = true' users.json
{"active":true,"age":32,"name":"Alice","role":"admin"}
{"active":true,"age":28,"name":"Diana","role":"user"}

age, active, name, role — those aren't keywords. They're the field names from the JSON above. If your data had a field called price, you'd write WHERE price > 100.

You can also pick which fields to include in the output with SELECT:

$ flowfilter 'WHERE age > 25 AND active = true SELECT name, age' users.json
{"age":32,"name":"Alice"}
{"age":28,"name":"Diana"}

Or just count the matches:

$ flowfilter 'WHERE role = "admin"' --count users.json
2

Same syntax for CSV and logs

The exact same query style works on CSV files. Here, the field names come from the CSV column headers (product, price, quantity, category):

product,price,quantity,category
Widget A,29.99,100,electronics
Widget B,149.50,25,electronics
Gadget D,75.00,50,accessories
Service E,199.99,10,services

$ flowfilter 'WHERE price > 50' sales.csv
product,price,quantity,category
Widget B,149.5,25,electronics
Gadget D,75,50,accessories
Service E,199.99,10,services

And log files work too. flowfilter parses Apache/Nginx logs into fields like method, path, status, remote_host, etc., so you can query them by name:

192.168.1.1 - - [10/Oct/2024:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326
10.0.0.1 - - [10/Oct/2024:13:57:01 +0000] "GET /favicon.ico HTTP/1.1" 404 0
192.168.1.1 - - [10/Oct/2024:13:58:22 +0000] "DELETE /api/users/5 HTTP/1.1" 500 128

$ flowfilter 'WHERE status >= 400 SELECT method, path, status' access.log
{"method":"GET","path":"/favicon.ico","status":404}
{"method":"DELETE","path":"/api/users/5","status":500}

You don't need to tell flowfilter what format your file is — it auto-detects from the content and file extension.
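The exact detection rules aren't documented here, but a plausible extension-plus-content sniffing heuristic looks like this sketch (not flowfilter's actual code):

```python
import json
import re

def sniff_format(path: str, first_line: str) -> str:
    """Guess the input format from the file extension, then from content.
    Purely illustrative -- flowfilter's real heuristic may differ."""
    ext = path.rsplit(".", 1)[-1].lower() if "." in path else ""
    if ext in ("json", "jsonl", "ndjson"):
        return "json"
    if ext in ("csv", "tsv"):
        return "csv"
    if ext == "log":
        return "log"
    # No helpful extension: sniff the first line of content.
    try:
        json.loads(first_line)
        return "json"
    except ValueError:
        pass
    # Apache-style access log lines start with an IP address.
    if re.match(r"^\d{1,3}(\.\d{1,3}){3} ", first_line):
        return "log"
    if "," in first_line:
        return "csv"
    return "unknown"
```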

Installation

From source

You'll need Rust 1.94 or later.

git clone https://github.com/dragonGR/flowfilter.git
cd flowfilter
cargo build --release
cp target/release/flowfilter ~/.local/bin/  # or wherever you keep binaries

Query syntax

The query language is intentionally simple. If you've written a SQL WHERE clause, you already know how to use it.

Basic filtering

flowfilter 'WHERE field = value'
flowfilter 'WHERE age > 25'
flowfilter 'WHERE name = "Alice"'
flowfilter 'WHERE active = true'

Operators

Operator       Example                              Description
=              WHERE status = "ok"                  Equal
!=             WHERE status != "error"              Not equal
> < >= <=      WHERE age >= 18                      Numeric/string comparison
LIKE           WHERE name LIKE "A%"                 Pattern match (% = any run, _ = single char)
IN             WHERE status IN ("ok", "pending")    Match any value in a list
IS NULL        WHERE email IS NULL                  Field is null or missing
IS NOT NULL    WHERE email IS NOT NULL              Field exists and isn't null
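flowfilter's own implementation isn't shown here, but the conventional way to evaluate LIKE is to translate the pattern into an anchored regex; a sketch:

```python
import re

def like_to_regex(pattern: str) -> re.Pattern:
    """Translate a SQL-style LIKE pattern into an anchored regex.
    % matches any run of characters, _ matches exactly one."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))  # literal chars are escaped
    return re.compile("^" + "".join(parts) + "$")

assert like_to_regex("A%").match("Alice")       # starts with A
assert not like_to_regex("A%").match("Bob")
assert like_to_regex("W_dget").match("Widget")  # _ matches the single 'i'
```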

Combining conditions

# AND - both must be true
flowfilter 'WHERE age > 25 AND active = true'

# OR - either can be true
flowfilter 'WHERE role = "admin" OR role = "superuser"'

# NOT - negate a condition
flowfilter 'WHERE NOT status = "deleted"'

# Parentheses for grouping
flowfilter 'WHERE (age > 25 OR vip = true) AND active = true'

Selecting fields

Use SELECT to pick specific fields from the output instead of getting the whole record back:

flowfilter 'WHERE age > 25 SELECT name, email'
flowfilter 'SELECT name, age'  # no filter, just project fields

Nested fields

Dot notation works for nested objects:

flowfilter 'WHERE user.address.city = "New York"'
flowfilter 'WHERE user.address.city = "NYC" SELECT user.name, user.address.zip'
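Resolving a dotted path is just a walk through nested objects. A sketch (illustrative; this assumes a missing segment resolves to null, which is what makes IS NULL behave sensibly on nested fields):

```python
def get_path(record: dict, path: str):
    """Resolve a dotted field path like 'user.address.city' against a
    nested record, returning None when any segment is missing."""
    current = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current

rec = {"user": {"name": "Alice", "address": {"city": "New York", "zip": "10001"}}}
assert get_path(rec, "user.address.city") == "New York"
assert get_path(rec, "user.phone") is None
```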

Supported formats

JSON

Handles both newline-delimited JSON (one object per line) and JSON arrays. Auto-detected.

# NDJSON
cat records.jsonl | flowfilter 'WHERE status = "active"'

# JSON array
flowfilter 'WHERE id > 100' data.json

CSV

Auto-detects headers from the first row. Supports custom delimiters and headerless files.

# Standard CSV
flowfilter 'WHERE price > 50' products.csv

# Tab-separated
flowfilter 'WHERE score >= 90' --delimiter $'\t' results.tsv

# No header row (fields become col0, col1, col2...)
flowfilter 'WHERE col1 > 25' --no-header data.csv

Note: CSV fields are strings internally. When you compare a field to a number (WHERE price > 50), flowfilter automatically tries to parse the string as a number, so numeric comparisons behave the way you'd expect.
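The source doesn't spell out what happens when a cell can't be parsed; this sketch assumes non-numeric cells simply fail a numeric comparison:

```python
def coerce(value: str):
    """Try to read a CSV cell as a number; fall back to the raw string."""
    try:
        return float(value)
    except ValueError:
        return value

def numeric_gt(cell: str, threshold: float) -> bool:
    """Numeric comparison that only succeeds when the cell parses as a
    number (assumption: non-numeric cells never match)."""
    v = coerce(cell)
    return isinstance(v, float) and v > threshold

assert numeric_gt("149.50", 50)        # parses and compares numerically
assert not numeric_gt("Widget B", 50)  # non-numeric cell fails the comparison
```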

Log files

Built-in patterns for Apache Combined and syslog formats. You can also supply your own regex.

# Apache/Nginx access logs (auto-detected)
flowfilter 'WHERE status >= 400' access.log

# Syslog
flowfilter 'WHERE program = "sshd"' /var/log/syslog

# Custom pattern with named capture groups
flowfilter 'WHERE level = "ERROR"' \
  --log-pattern '(?P<timestamp>\S+) (?P<level>\S+) (?P<message>.*)' \
  app.log

Log fields depend on the pattern. Apache logs give you remote_host, user, timestamp, method, path, status, size. Syslog gives you timestamp, hostname, program, pid, message.
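Named capture groups are what turn a raw line into queryable fields. Here is the example pattern from above applied with Python's re module (the sample log line is invented for illustration):

```python
import re

# The custom pattern from the example above, as a Python regex.
pattern = re.compile(r"(?P<timestamp>\S+) (?P<level>\S+) (?P<message>.*)")

line = "2024-10-10T13:55:36Z ERROR connection refused"
m = pattern.match(line)
fields = m.groupdict()
# Each named group becomes a field you can reference in WHERE/SELECT:
# {"timestamp": "2024-10-10T13:55:36Z", "level": "ERROR",
#  "message": "connection refused"}
```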

Output options

# JSON output (default for JSON/CSV input)
flowfilter 'WHERE age > 25' data.json

# Pretty table
flowfilter 'WHERE age > 25' -o table data.json

# CSV output
flowfilter 'WHERE age > 25' -o csv data.json

# Raw (one value per line)
flowfilter 'WHERE age > 25' -o raw data.json

Useful flags

-f, --format <auto|json|csv|log>         Force input format (otherwise auto-detected)
-o, --output <auto|json|csv|table|raw>   Choose output format
    --delimiter <CHAR>                   CSV delimiter (default: comma)
    --no-header                          CSV has no header row
    --log-pattern <REGEX>                Custom log regex with named groups
-c, --count                              Just print how many records matched
    --first <N>                          Stop after N matches
    --last <N>                           Show only the last N matches
    --stats                              Print field statistics instead of records
    --no-color                           Disable colored output
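--last is the one filter that requires buffering: the last N matches aren't known until input ends. A fixed-size ring buffer keeps that memory at O(N) regardless of input size; a sketch with collections.deque:

```python
from collections import deque

def last_n(matches, n: int):
    """Keep only the most recent n matches in a bounded deque; older
    entries are evicted automatically, so memory stays O(n)."""
    ring = deque(maxlen=n)
    for record in matches:
        ring.append(record)
    return list(ring)

assert last_n(range(10), 3) == [7, 8, 9]
assert last_n([1], 3) == [1]  # fewer matches than n is fine
```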

Stats mode

The --stats flag gives you a quick overview of the matching data instead of dumping every record:

$ flowfilter 'WHERE active = true' --stats users.json
Field       Count  Nulls  Unique  Min    Max    Mean
name        3      0      3       -      -      -
age         3      0      3       24.0   32.0   28.0
role        3      0      2       -      -      -
active      3      0      1       -      -      -
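Per-field statistics like these can be gathered in a single pass. An illustrative sketch for one field (not flowfilter's actual code; min/max/mean are computed only over numeric values):

```python
def field_stats(records, field):
    """One-pass count/nulls/unique/min/max/mean for a single field."""
    count = nulls = numeric = total = 0
    seen = set()
    lo = hi = None
    for rec in records:
        value = rec.get(field)
        if value is None:          # null or missing field
            nulls += 1
            continue
        count += 1
        seen.add(value)
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            numeric += 1
            total += value
            lo = value if lo is None else min(lo, value)
            hi = value if hi is None else max(hi, value)
    mean = total / numeric if numeric else None
    return {"count": count, "nulls": nulls, "unique": len(seen),
            "min": lo, "max": hi, "mean": mean}

users = [{"age": 32}, {"age": 24}, {"age": 28}]
stats = field_stats(users, "age")
# stats["min"] == 24, stats["max"] == 32, stats["mean"] == 28.0
```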

Under the hood

Internally, flowfilter processes data as a streaming pipeline:

  1. Parse the query expression into an AST
  2. Read input one record at a time (constant memory usage)
  3. Evaluate the filter against each record
  4. Project selected fields (if using SELECT)
  5. Write matching records to stdout

This means you can pipe arbitrarily large files through it without worrying about memory. A 10GB log file uses the same amount of RAM as a 10KB one.
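Steps 2 through 5 can be sketched as a generator pipeline, which is exactly what keeps memory constant (step 1 is assumed to have already produced the predicate; none of this is flowfilter's actual code):

```python
import json

def stream_filter(lines, predicate, fields=None):
    """Streaming pipeline sketch: parse, filter, and project one record
    at a time, so memory use is independent of input size."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)          # 2. read/parse one record
        if not predicate(record):          # 3. evaluate the filter
            continue
        if fields is not None:             # 4. project selected fields
            record = {k: record[k] for k in fields if k in record}
        yield record                       # 5. hand downstream for output

data = ['{"name": "Alice", "age": 32}', '{"name": "Eve", "age": 19}']
out = list(stream_filter(data, lambda r: r["age"] > 25, ["name"]))
assert out == [{"name": "Alice"}]
```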

The input format is auto-detected from file extensions and content sniffing, but you can always override it with --format.

Performance

flowfilter is written in Rust with zero-copy parsing where possible. It processes data in a single pass with no buffering (except for --last which uses a ring buffer, and table output which needs to compute column widths).

On a typical machine, expect throughput in the hundreds of MB/s range for simple filters. The hand-written recursive descent parser adds negligible overhead compared to the I/O.
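As an illustration of the parser style (not flowfilter's actual grammar), a recursive descent parser for WHERE-like expressions, with AND binding tighter than OR, fits in a few dozen lines:

```python
import re

# Tokens: comparison operators, parens, quoted strings, bare words/numbers.
TOKEN = re.compile(r'\s*(>=|<=|!=|=|>|<|\(|\)|"[^"]*"|[\w.]+)')

def tokenize(query):
    tokens, pos = [], 0
    while pos < len(query):
        m = TOKEN.match(query, pos)
        if not m:
            raise SyntaxError(f"bad token at {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

class Parser:
    """Minimal recursive descent parser producing a tuple-based AST."""
    def __init__(self, tokens):
        self.tokens = tokens
        self.i = 0

    def peek(self):
        return self.tokens[self.i] if self.i < len(self.tokens) else None

    def next(self):
        tok = self.peek()
        self.i += 1
        return tok

    def parse_or(self):                      # OR binds loosest
        node = self.parse_and()
        while self.peek() == "OR":
            self.next()
            node = ("or", node, self.parse_and())
        return node

    def parse_and(self):                     # AND binds tighter than OR
        node = self.parse_atom()
        while self.peek() == "AND":
            self.next()
            node = ("and", node, self.parse_atom())
        return node

    def parse_atom(self):                    # parenthesized group or comparison
        if self.peek() == "(":
            self.next()
            node = self.parse_or()
            assert self.next() == ")", "expected )"
            return node
        field, op, value = self.next(), self.next(), self.next()
        return ("cmp", field, op, value)

ast = Parser(tokenize('age > 25 AND (role = "admin" OR vip = true)')).parse_or()
# ast == ("and", ("cmp", "age", ">", "25"),
#                ("or", ("cmp", "role", "=", '"admin"'),
#                       ("cmp", "vip", "=", "true")))
```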

Building from source

git clone https://github.com/dragonGR/flowfilter.git
cd flowfilter

# Run tests
cargo test

# Run clippy
cargo clippy

# Build release binary
cargo build --release

# Run benchmarks
cargo bench

License

MIT
