> ORIGINAL PROJECT: https://github.com/ANSSI-FR/AnoMark. Created with Claude.ai but supervised by a human (me, apparently). Just for fun.

# Anomaly detection in command lines with Markov chains

A pure Rust implementation of the AnoMark algorithm for detecting malicious command lines using Markov chains and n-grams.

## Features
- 🦀 Pure Rust - No Python dependencies, native performance
- 🚀 Fast execution - Leverages Rust's performance for large datasets
- 📊 Progress tracking - Visual progress bars for long-running operations
- 💾 Binary serialization - Efficient model storage with bincode
- 🎨 Colored output - Highlights anomalous characters in terminal
- 🔧 CLI tools - One `train` command for CSV / JSONL / TXT character models; `apply-model`, token tools, etc.
- Explainability - See which n-grams contributed to a low score (character or token level)
- Token-level models - Optional word/token n-gram models (e.g. split by whitespace or path segments)
- Training exclusions - Drop Linux kernel-thread-style names (`[nvme-wq]`) and/or custom regex patterns before training
## Requirements

- Rust 1.70 or later
- Cargo (comes with Rust)
## Installation

```bash
# Clone the repository
git clone <repo-url>
cd anomark-rust

# Build release binaries
cargo build --release

# Binaries will be in target/release/
```

## Training

Format is auto-detected from each file's extension (`.csv`, `.jsonl`, `.txt`). You can force a format with `--format csv|jsonl|txt` (e.g. to treat a `.log` file as JSONL). Do not mix `.txt` with CSV/JSONL in the same run; train those separately.
```bash
# CSV
cargo run --bin train -- -d data/train_data.csv -c CommandLine -o 4 --placeholder

# Plain text (one corpus per file, concatenated)
cargo run --bin train -- -d data/train_data.txt -o 4

# JSONL (process events; -c defaults to `command` if omitted for JSONL-only input)
cargo run --bin train -- \
    -d data/events.jsonl \
    -c command \
    -o 3 \
    --filter event_type process \
    --output models/process_model.bin
```

## Applying a model

```bash
cargo run --bin apply-model -- \
    -m models/model.bin \
    -d data/test_data.csv \
    -c CommandLine \
    --store \
    --color \
    -n 100
```

## Excluding kernel threads and custom patterns

Linux and similar systems often report kernel threads with a command line that is only a bracketed name, e.g. `[kthreadd]` or `[nvme-wq]`. Those strings are usually not useful for learning "normal" userland commands.
Both `train` and `train-token-model` support:

| Flag | Meaning |
|---|---|
| `--exclude-kernel-threads` | Drop lines where the entire command (after trimming) matches `[something]`: one pair of brackets, with no nested `[`/`]` inside. |
| `--exclude-regex <PATTERN>` | Drop lines matching this Rust regex (repeatable). Checked against the full command string (CSV/JSONL/token) or each line (TXT). |

Filtering runs after loading and before `-n`/`-p` slicing, so those limits apply to the kept lines. If everything is excluded, the tool exits with an error. When applying the model, pass the same `--exclude-kernel-threads` and/or `--exclude-regex` to `apply-model` (or `apply-token-model`) so those rows are skipped rather than reported as anomalies.
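The bracket rule above can be sketched in a few lines of std-only Rust. This is a hypothetical helper for illustration, not the crate's actual implementation:

```rust
/// Returns true when the trimmed command is exactly one bracketed name,
/// e.g. "[kthreadd]" or "[nvme-wq]", with no nested brackets inside.
/// Illustrative helper; the CLI's real check may differ in details.
fn is_kernel_thread(cmd: &str) -> bool {
    let t = cmd.trim();
    if t.len() < 2 || !t.starts_with('[') || !t.ends_with(']') {
        return false;
    }
    let inner = &t[1..t.len() - 1];
    !inner.contains('[') && !inner.contains(']')
}

fn main() {
    assert!(is_kernel_thread("[kthreadd]"));
    assert!(is_kernel_thread("  [nvme-wq]  "));
    assert!(!is_kernel_thread("echo [ok]")); // brackets are not the entire command
    assert!(!is_kernel_thread("[a[b]]"));    // nested brackets
}
```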
```bash
# JSONL: train on process commands but skip kernel thread names
cargo run --bin train -- \
    -d data/events.jsonl -c command -o 3 \
    --filter event_type process \
    --exclude-kernel-threads \
    --output models/process_userland.bin

# CSV: kernel threads plus any extra pattern (e.g. lines containing "kworker")
cargo run --bin train -- \
    -d data.csv -c CommandLine -o 4 \
    --exclude-kernel-threads \
    --exclude-regex 'kworker'
```

## `train`

Train a character-level Markov model from CSV, JSONL, and/or `.txt` files with one CLI. With `--format auto` (the default), each file's type is inferred from its extension; directories collect matching files (under `auto`, all of `.csv`, `.jsonl`, and `.txt` in that tree). CSV and JSONL can be combined in one run (with the same `-c` field name). Plain `.txt` cannot be mixed with CSV/JSONL in the same invocation.
```bash
cargo run --bin train -- [OPTIONS] --data <PATH>... --order <NUM>
```
Options:

```
      --format <FMT>              auto | csv | jsonl | txt [default: auto]
  -d, --data <PATH>...            File(s) and/or directories (repeatable); use --recursive for subdirs
      --recursive                 Recurse into subdirectories
  -c, --column <NAME>             CSV/JSONL field (required for CSV; JSONL-only input defaults to command)
      --filter <FIELD> <VALUE>    JSONL only: keep lines where field equals value
  -o, --order <NUM>               N-gram order
      --count-column <NAME>       CSV only: per-row counts
      --output <PATH>             Output model path
  -n, --n-lines <NUM>             Subsample this many lines (after exclusions)
  -p, --percentage <PCT>          Subsample this percentage of lines
      --from-end                  Take the sample from the end of the input
  -r, --randomize                 Randomize the sample
      --placeholder               Apply placeholders to training data
      --filepath-placeholder      Apply filepath placeholders
      --resume                    Resume training from an existing model
  -m, --model <PATH>              Model to resume from
      --parallel                  Process files in parallel
      --exclude-kernel-threads    Drop [name]-style kernel-thread lines
      --exclude-regex <PATTERN>   Drop lines matching this regex (repeatable)
```

Examples:
```bash
cargo run --bin train -- -d data.csv -c CommandLine -o 4
cargo run --bin train -- -d data.csv -c CommandLine -o 4 -n 1000 --placeholder
cargo run --bin train -- -d data/ -c CommandLine -o 4 --recursive
cargo run --bin train -- -d data/commands.txt -o 4
cargo run --bin train -- -d events.jsonl -o 3 --filter event_type process --output models/proc.bin

# Force JSONL for a file without a .jsonl extension:
cargo run --bin train -- --format jsonl -d events.log -c command -o 3

# Merge two CSVs
cargo run --bin train -- -d a.csv -d b.csv -c CommandLine -o 4
```

## `apply-model`

Apply a trained model to detect anomalies in new data. Input can be CSV or JSONL; the format is auto-detected from the file extension (`.jsonl` → JSONL) or set with `--format`.
Mixed JSONL (different fields per line): if your JSONL contains different event types (e.g. some lines with `date`, `file_path`, `event_type` and others with `command`, `pid`), the tool scans all rows to find the first one that has the requested column. Rows without that field (e.g. file events when you use `-c command`) are skipped; at the end you'll see a message like `Skipped N rows missing field 'command' (e.g. other event types)`.
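That row-skipping behaviour can be sketched with plain maps standing in for parsed JSON rows (a std-only illustration; the real tool works on deserialized JSONL):

```rust
use std::collections::HashMap;

/// Collect the values of `field` from each row, counting rows that
/// lack the field instead of failing. Illustrative helper only.
fn extract_field(rows: &[HashMap<String, String>], field: &str) -> (Vec<String>, usize) {
    let mut values = Vec::new();
    let mut skipped = 0;
    for row in rows {
        match row.get(field) {
            Some(v) => values.push(v.clone()),
            None => skipped += 1, // e.g. a file event when scoring `command`
        }
    }
    (values, skipped)
}

fn main() {
    let mut proc_row = HashMap::new();
    proc_row.insert("command".to_string(), "ls -la".to_string());
    let mut file_row = HashMap::new();
    file_row.insert("file_path".to_string(), "/tmp/a".to_string());

    let (values, skipped) = extract_field(&[proc_row, file_row], "command");
    assert_eq!(values, ["ls -la"]);
    assert_eq!(skipped, 1);
}
```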
```bash
cargo run --bin apply-model -- [OPTIONS] --data <PATH> --model <PATH> --column <NAME>
```

Options:

```
  -d, --data <PATH>              Path to the CSV or JSONL file to analyze
  -m, --model <PATH>             Path to the trained model
  -c, --column <NAME>            Column/field to score (e.g. CommandLine for CSV, command for JSONL)
      --format <FMT>             Input format: csv, jsonl, or auto [default: auto]
  -s, --store                    Save results to CSV
  -o, --output <PATH>            Custom output path
      --color                    Highlight anomalous characters
  -n, --n-lines <NUM>            Number of results to display [default: 50]
      --silent                   Suppress terminal output
      --placeholder              Apply placeholders to test data
      --filepath-placeholder     Apply filepath placeholders
      --show-percentage          Show anomaly percentage scores
      --explain                  Show unusual n-grams for each result
      --exclude-kernel-threads   Skip rows whose command is [name]-style (e.g. [kthreadd]);
                                 use if you trained with --exclude-kernel-threads
      --exclude-regex <PATTERN>  Skip rows whose command matches this regex (repeatable)
      --machine-field <COLUMN>   Column/field containing the machine/host name (e.g. hostname);
                                 output gains a Machine column for filtering
      --machine <NAME>           Use this value as Machine for every row
                                 (e.g. when the input has no host column)
```

Machine / host for filtering: use `--machine-field hostname` (or your column name) so the CSV and terminal output include a `Machine` column; you can then filter or group by machine. If the input has no host column, use `--machine server01` to tag all rows from the run.
Apply-time exclusions: if you trained with `--exclude-kernel-threads` (or `--exclude-regex`), pass the same flags when applying so those rows are skipped rather than reported as anomalies. Otherwise kernel threads (e.g. `[nvme-wq]`) will show up with low scores simply because they were never in the training set.
Suspect commands: each printed line is labeled `SUSPECT` (flagged as unusual) or left unflagged, based on comparing `markovScore` against the model baseline (95% of the prior log-probability). The exported CSV includes a `Suspect` column (`yes`/`no`) immediately after `markovScore`. Results remain sorted with the most unusual first (#1).
Examples (CSV):

```bash
# Basic execution with colored output
cargo run --bin apply-model -- -m models/model.bin -d test.csv -c CommandLine --color

# Save results and show top 100 anomalies
cargo run --bin apply-model -- -m models/model.bin -d test.csv -c CommandLine -s -n 100

# Apply with placeholders and percentage scores
cargo run --bin apply-model -- -m models/model.bin -d test.csv -c CommandLine \
    --placeholder --show-percentage --store

# Explain why a command is anomalous (show unusual n-grams)
cargo run --bin apply-model -- -m models/model.bin -d test.csv -c CommandLine --explain -n 20
```

Examples (JSONL):
```bash
# Run on JSONL (e.g. process events); format auto-detected from .jsonl extension
cargo run --bin apply-model -- -m models/process_model.bin -d data/events.jsonl -c command -n 20

# Force JSONL when the file has no .jsonl extension
cargo run --bin apply-model -- -m models/process_model.bin -d data/events.log --format jsonl -c command

# JSONL with explain and store
cargo run --bin apply-model -- -m models/process_model.bin -d data/events.jsonl -c command --explain -s -o results/anomalies.csv
```

## Explainability

You can get explanations for why a command was scored as anomalous: the model reports which unusual n-grams (character or token sequences) had low probability and contributed to the low score.
- Character model: use `apply-model` with `--explain`. The CLI prints unusual character n-grams for each result, and the CSV export includes an `UnusualNgrams` column (a semicolon-separated list of `ngram (log_prob)`).
- Token model: use `apply-token-model` with `--explain` for token-level explanations (e.g. which token transitions were rare).
How it works: the model scores each (order+1)-gram in the sequence. N-grams with log-probability below a threshold (e.g. 95% of the prior) are flagged as "unusual" and attached to the result. A lower log-probability means the transition was rarer in training, so it pulls the overall score toward anomaly.
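That flagging step can be sketched in std-only Rust. The function and the way the threshold is passed in are illustrative assumptions, not the crate's API; the CLI derives the threshold from the model prior:

```rust
/// Keep only the n-grams whose log-probability falls below `threshold`,
/// rarest first. Illustrative helper; the threshold derivation (e.g. from
/// 95% of the prior) is handled by the CLI, so here it is a parameter.
fn unusual_ngrams(grams: &[(String, f64)], threshold: f64) -> Vec<(String, f64)> {
    let mut flagged: Vec<(String, f64)> = grams
        .iter()
        .filter(|(_, log_prob)| *log_prob < threshold)
        .cloned()
        .collect();
    // Most unusual (lowest log-probability) first, as in the CLI output.
    flagged.sort_by(|a, b| a.1.partial_cmp(&b.1).unwrap());
    flagged
}

fn main() {
    let grams = vec![
        ("ab".to_string(), -1.2),
        ("xy".to_string(), -7.2),
        ("bc".to_string(), -6.8),
    ];
    let flagged = unusual_ngrams(&grams, -5.0);
    assert_eq!(flagged.len(), 2);
    assert_eq!(flagged[0].0, "xy"); // rarest transition comes first
}
```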
Example (character model):

```bash
cargo run --bin apply-model -- -m models/demo_char.bin -d data/demo_logs.jsonl -c command -n 10 --explain
```

Output includes lines like:

```
unusual n-grams: "xyz" (-7.2), "ab" (-6.8), ...
```
## Token-level models

Besides character n-grams, you can train a token-level Markov model (e.g. over words or path segments). This can help when anomalies are better expressed as "unusual token sequences" rather than unusual character sequences.
Train a token model (from JSONL or CSV). The same `--exclude-kernel-threads` and `--exclude-regex` flags apply as in the other trainers:

```bash
cargo run --bin train-token-model -- -d data/commands.jsonl -c command -o 2 --tokenizer whitespace --output models/token.bin
cargo run --bin train-token-model -- -d data/commands.jsonl -c command -o 2 --exclude-kernel-threads --output models/token.bin
```

Tokenizer options:

- `whitespace` – split on spaces (default)
- `path` – split on `/` and `\`, keeping the separators as tokens
- `whitespace_and_path` – path split first, then whitespace within segments
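The first two tokenizers can be sketched as follows (hypothetical std-only helpers matching the descriptions above, not the crate's actual code):

```rust
/// `whitespace`: split on whitespace (the default tokenizer).
fn whitespace_tokens(cmd: &str) -> Vec<String> {
    cmd.split_whitespace().map(str::to_string).collect()
}

/// `path`: split on '/' and '\', keeping the separators as their own tokens.
fn path_tokens(cmd: &str) -> Vec<String> {
    let mut out = Vec::new();
    let mut current = String::new();
    for ch in cmd.chars() {
        if ch == '/' || ch == '\\' {
            if !current.is_empty() {
                out.push(std::mem::take(&mut current));
            }
            out.push(ch.to_string());
        } else {
            current.push(ch);
        }
    }
    if !current.is_empty() {
        out.push(current);
    }
    out
}

fn main() {
    assert_eq!(whitespace_tokens("curl -s example.com"), ["curl", "-s", "example.com"]);
    assert_eq!(path_tokens("/usr/bin"), ["/", "usr", "/", "bin"]);
}
```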
Apply the token model:

```bash
cargo run --bin apply-token-model -- -m models/token.bin -d data/test.jsonl -c command -n 20 --explain
```

Token models use the same explainability as the character model: with `--explain`, results include unusual token transitions (e.g. `"curl -> http" (-5.1)`).

Demo script (generates data, trains both character and token models, runs detection with explain):

```bash
./demo_explain_and_token.sh
```

## Inspecting models

Show summary statistics for a saved model (character or token). The tool first tries to load the file as a character model, then falls back to a token model.
```bash
cargo run --bin inspect-model -- -m models/my_model.bin
```

The human-readable output includes: file size, model type, order, prior, whether the chain is trained, number of context n-grams, transition count, and (for character models) alphabet size.

JSON output (for scripts / dashboards):

```bash
cargo run --bin inspect-model -- -m models/my_model.bin --json
```

From Rust, you can also call `MarkovModel::num_contexts()`, `num_transitions()`, and `alphabet_len()`, and the equivalents on `TokenMarkovModel` (`num_contexts`, `num_transitions`), after loading with `ModelHandler::load_model` / `load_token_model`.
## Placeholders

The `--placeholder` flag replaces common variable elements with placeholders to reduce false positives:

| Pattern | Regex | Placeholder |
|---|---|---|
| GUID | `\{?[0-9A-Fa-f]{8}-([0-9A-Fa-f]{4}-){3}[0-9A-Fa-f]{12}\}?` | `<GUID>` |
| SID | `S-1-([0-9]+-)+[0-9]+` | `<SID>` |
| User Path | `(C:\\Users)\\[^\\]*\\` | `<USER>` |
| Hash | `\b(?:[A-Fa-f0-9]{64}\|[A-Fa-f0-9]{40}\|[A-Fa-f0-9]{32}\|[A-Fa-f0-9]{20})\b` | `<HASH>` |

The `--filepath-placeholder` flag replaces full file paths with `<FILEPATH>`. Use it cautiously, as it may reduce true-positive detection.
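To illustrate the idea behind the `<HASH>` rule, here is a simplified std-only sketch. The CLI applies the regexes in the table; this version only handles space-separated tokens instead of word boundaries:

```rust
/// Replace any standalone hex token of length 20, 32, 40, or 64 with <HASH>.
/// Simplified sketch: splits on single spaces rather than using regex
/// word boundaries like the real placeholder pass.
fn hash_placeholder(cmd: &str) -> String {
    cmd.split(' ')
        .map(|tok| {
            let all_hex = !tok.is_empty() && tok.chars().all(|c| c.is_ascii_hexdigit());
            if all_hex && matches!(tok.len(), 20 | 32 | 40 | 64) {
                "<HASH>"
            } else {
                tok
            }
        })
        .collect::<Vec<_>>()
        .join(" ")
}

fn main() {
    let out = hash_placeholder("certutil -hashfile a.exe d41d8cd98f00b204e9800998ecf8427e");
    assert_eq!(out, "certutil -hashfile a.exe <HASH>");
}
```

Normalizing these variable elements means two otherwise-identical commands with different hashes train the same n-grams, which is exactly what reduces false positives.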
## Library usage

You can also use AnoMark as a Rust library:

```rust
use anomark::{ModelHandler, MarkovModel};

fn main() -> anyhow::Result<()> {
    // Train a model
    let training_data = "normal command line patterns...";
    let mut model = ModelHandler::train_from_txt(training_data, 4, None)?;
    model.normalize_model_and_compute_prior();

    // Score new data
    let test_text = "suspicious command";
    let score = model.log_likelihood(test_text);
    println!("Anomaly score: {}", score);
    Ok(())
}
```

## Performance

The Rust version offers significant performance improvements over the Python implementation:
- Training: ~5-10x faster than Python
- Execution: ~3-5x faster than Python
- Memory: ~30-50% less memory usage
- Binary size: Compiled models are more compact
## Testing

```bash
# Run all tests
cargo test

# Run with output
cargo test -- --nocapture

# Run a specific test
cargo test test_train_from_txt
```

## Development

```bash
# Check code
cargo check

# Format code
cargo fmt

# Lint code
cargo clippy

# Build documentation
cargo doc --open
```

## How it works

AnoMark uses character-level n-grams to build a Markov chain model:
- Training: Learns transition probabilities between character sequences
- Normalization: Converts counts to probabilities and computes prior
- Scoring: Computes average log-likelihood for test sequences
- Detection: Lower scores indicate more anomalous patterns
The order parameter determines the context window size (typically 3-5 characters).
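The four steps can be sketched end to end with a toy std-only chain. This is an illustration of the technique, not the crate's implementation (which adds a prior, serialization, and smoothing details):

```rust
use std::collections::HashMap;

struct TinyChain {
    n: usize,
    log_probs: HashMap<(String, char), f64>,
    floor: f64, // log-probability assigned to unseen transitions
}

/// Training + normalization: count (context -> next char) transitions
/// of order `n`, then convert the counts to log-probabilities.
fn train(corpus: &str, n: usize) -> TinyChain {
    let chars: Vec<char> = corpus.chars().collect();
    let mut counts: HashMap<(String, char), f64> = HashMap::new();
    let mut totals: HashMap<String, f64> = HashMap::new();
    for w in chars.windows(n + 1) {
        let ctx: String = w[..n].iter().collect();
        *counts.entry((ctx.clone(), w[n])).or_insert(0.0) += 1.0;
        *totals.entry(ctx).or_insert(0.0) += 1.0;
    }
    let log_probs = counts
        .into_iter()
        .map(|((ctx, c), cnt)| {
            let lp = (cnt / totals[&ctx]).ln();
            ((ctx, c), lp)
        })
        .collect();
    TinyChain { n, log_probs, floor: -20.0 }
}

/// Scoring: average log-likelihood over the text's (n+1)-grams.
/// Detection follows from this: lower scores mean more anomalous.
fn score(m: &TinyChain, text: &str) -> f64 {
    let chars: Vec<char> = text.chars().collect();
    let (mut sum, mut count) = (0.0, 0.0);
    for w in chars.windows(m.n + 1) {
        let ctx: String = w[..m.n].iter().collect();
        sum += m.log_probs.get(&(ctx, w[m.n])).copied().unwrap_or(m.floor);
        count += 1.0;
    }
    if count > 0.0 { sum / count } else { m.floor }
}

fn main() {
    let model = train("ls -la /home\nls -la /tmp\ncat /etc/hosts\n", 2);
    let normal = score(&model, "ls -la /etc");
    let weird = score(&model, "xZ!qv9@@p");
    assert!(normal > weird); // familiar text scores higher
}
```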
## References

- Original Python implementation: [ANSSI-FR/AnoMark](https://github.com/ANSSI-FR/AnoMark)
- SSTIC 2022 presentation (French): Link
- FIRST 2023 conference talk (English): Video
## Troubleshooting

Issue: `Column/field '…' not found` (or the older message `Column not found`) when running `apply-model`

- JSONL: field names are case-sensitive in the file, but `-c`/`--column` is matched case-insensitively (e.g. `-c Command` matches `"command"`). If it still fails, the error lists the available fields from the first row; use one of those names exactly.
- Wrong format: if the path does not end in `.jsonl` but the file is JSONL, add `--format jsonl`.
- CSV: use the exact header name from the first line (whitespace matters); case-insensitive matching also applies.

Issue: Model file not found

```bash
# Ensure the models directory exists
mkdir -p models
```

Issue: CSV parsing errors

```bash
# Check the CSV format and column names
cargo run --bin apply-model -- --help
```

Issue: Out of memory during training

```bash
# Use line-limiting flags
cargo run --bin train -- -d data.csv -c CommandLine -o 4 -n 10000
```