High-performance behavior intelligence engine for Rust
Vec-Eyes Core is a modular, high-performance behavior classification engine written in Rust, designed to power advanced detection systems across security, data science, and biological domains.
It combines machine learning, NLP, vector embeddings, and rule-based matching into a unified engine for pattern detection and classification.
Vec-Eyes Core is the engine behind Vec-Eyes CLI.
It provides:
- 🧠 Machine learning (KNN, Naive Bayes, SVM, Random Forest, and more)
- 💡 NLP pipelines (tokenization, TF-IDF, embeddings)
- ⚡ Vector similarity (Word2Vec, FastText)
- 🔍 Rule engine (regex / optional VectorScan)
- 📊 Hybrid scoring system
Typical use cases include:
- Spam classification
- Phishing detection
- Web attack identification (SQLi, XSS, fuzzing)
- Malware behavior analysis
- Log anomaly detection
- Transaction anomaly detection
- Behavioral fraud patterns
- Suspicious activity classification
Vec-Eyes Core can be adapted for:
- Virus pattern classification
- Human / biological signal classification
- Bacteria and fungus pattern detection
- Biomedical text/log classification
Processing pipeline:

```
Input Text / Data
        ↓
Normalization / Tokenization
        ↓
Feature Extraction (TF-IDF / Embeddings)
        ↓
ML Engine (classifiers)
        ↓
Rule Engine (Regex / VectorScan)
        ↓
Hybrid Scoring
        ↓
Final Classification
```
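The stages above can be read as a chain of plain transformations. The toy sketch below walks one input through normalization, tokenization, feature extraction, a model score, a rule score, and a hybrid blend; every function name, keyword list, and threshold here is illustrative, not the actual Vec-Eyes API.

```rust
use std::collections::HashMap;

// Illustrative pipeline sketch; names and logic are NOT the Vec-Eyes API.
fn normalize(s: &str) -> String {
    s.to_lowercase()
}

fn tokenize(s: &str) -> Vec<String> {
    s.split_whitespace().map(str::to_string).collect()
}

// Stand-in for feature extraction: raw token counts instead of TF-IDF.
fn features(tokens: &[String]) -> HashMap<String, usize> {
    let mut m = HashMap::new();
    for t in tokens {
        *m.entry(t.clone()).or_insert(0) += 1;
    }
    m
}

// Stand-in ML score: fraction of tokens on a tiny "spammy" list.
fn ml_score(feats: &HashMap<String, usize>) -> f64 {
    let spammy = ["free", "bonus", "casino"];
    let hits: usize = spammy.iter().filter_map(|w| feats.get(*w)).sum();
    hits as f64 / feats.values().sum::<usize>().max(1) as f64
}

// Stand-in rule score: flat bonus when a rule keyword appears.
fn rule_score(text: &str) -> f64 {
    if text.contains("free") { 0.7 } else { 0.0 }
}

fn main() {
    let text = normalize("Claim your FREE bonus now");
    let feats = features(&tokenize(&text));
    // Hybrid scoring: blend model output with rule output.
    let hybrid = 0.5 * ml_score(&feats) + 0.5 * rule_score(&text);
    let label = if hybrid > 0.3 { "suspicious" } else { "normal" };
    println!("hybrid = {hybrid:.2} -> {label}");
}
```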
Supported classifiers:
- KNN (Cosine, Euclidean, Manhattan, Minkowski)
- Naive Bayes (Count, TF-IDF)
- Logistic Regression
- SVM (Linear, RBF, Polynomial, Sigmoid)
- Random Forest (Standard, Balanced, ExtraTrees + OOB)
- Gradient Boosting
- Isolation Forest
NLP features:
- Tokenization
- Normalization
- TF-IDF
- Word2Vec (lightweight training)
- FastText-style embeddings (subword support)
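For intuition on the TF-IDF weighting used in the pipeline, here is a minimal, dependency-free sketch of the classic `tf(t, d) * ln(N / df(t))` scheme. It is illustrative only; the engine's internal weighting and smoothing may differ.

```rust
use std::collections::{HashMap, HashSet};

// Minimal TF-IDF sketch: weight(t, d) = tf(t, d) * ln(N / df(t)).
// Illustrative only; not the Vec-Eyes implementation.
fn tf_idf(docs: &[Vec<&str>]) -> Vec<HashMap<String, f64>> {
    let n = docs.len() as f64;

    // Document frequency: how many documents contain each term.
    let mut df: HashMap<&str, f64> = HashMap::new();
    for doc in docs {
        let unique: HashSet<_> = doc.iter().copied().collect();
        for t in unique {
            *df.entry(t).or_insert(0.0) += 1.0;
        }
    }

    docs.iter()
        .map(|doc| {
            let len = doc.len() as f64;
            // Term frequency within this document.
            let mut weights: HashMap<String, f64> = HashMap::new();
            for t in doc {
                *weights.entry(t.to_string()).or_insert(0.0) += 1.0;
            }
            // Rescale counts into tf * idf.
            for (t, w) in weights.iter_mut() {
                let tf = *w / len;
                let idf = (n / df[t.as_str()]).ln();
                *w = tf * idf;
            }
            weights
        })
        .collect()
}

fn main() {
    let docs = vec![vec!["free", "bonus", "now"], vec!["meeting", "notes", "now"]];
    let weights = tf_idf(&docs);
    // "free" appears in only one document, so it keeps a positive weight;
    // "now" appears in every document, so its idf (and weight) is zero.
    println!("free: {:.3}, now: {:.3}", weights[0]["free"], weights[0]["now"]);
}
```

Terms that appear in every document carry no discriminative signal, which is exactly why spam-specific vocabulary stands out under TF-IDF.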
Vec-Eyes supports rule-based matching with scoring.
- Regex-based matcher
- VectorScan (Hyperscan fork for high-speed matching)
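As a rough sketch of how scored rules can feed a hybrid verdict, the snippet below sums the scores of all matching rules. It mirrors the title/match_rule/score shape of the YAML rules in this README, but uses plain substring checks to stay dependency-free; the real engine matches via regex or VectorScan, and its aggregation logic may differ.

```rust
// Illustrative rule matcher; NOT the Vec-Eyes rule engine.
// Each rule carries keywords and a score; matching rules' scores are summed.
struct Rule {
    title: &'static str,
    keywords: &'static [&'static str],
    score: u32,
}

fn rule_score(text: &str, rules: &[Rule]) -> u32 {
    rules
        .iter()
        .filter(|r| r.keywords.iter().any(|k| text.contains(k)))
        .map(|r| r.score)
        .sum()
}

fn main() {
    let rules = [
        Rule { title: "Spam Keywords", keywords: &["free", "bonus", "casino"], score: 70 },
        Rule { title: "Suspicious URL", keywords: &["http://", "promo"], score: 80 },
    ];
    let text = "claim your free bonus at http://promo.example";
    for r in rules.iter().filter(|r| r.keywords.iter().any(|k| text.contains(k))) {
        println!("matched: {} (+{})", r.title, r.score);
    }
    println!("total rule score: {}", rule_score(text, &rules)); // 70 + 80 = 150
}
```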
Vec-Eyes allows full pipeline definition via YAML.
The examples below are designed to be:
- realistic
- maintainable
- clear for contributors
- close to production usage
A strong default for email classification and noisy text detection.
```yaml
method: KnnCosine
nlp: FastText
k: 5
threads: 4

datasets:
  hot:
    - /data/email/spam/
  cold:
    - /data/email/normal/

rules:
  - title: Spam Keywords
    description: Detect common spam patterns
    match_rule: "free|bonus|win|casino|urgent"
    score: 70

  - title: Suspicious URL
    description: Detect promotional or deceptive links
    match_rule: "http://.*(promo|deal|bonus)"
    score: 80
```

Good for:
- spam detection
- phishing-like text
- noisy or typo-heavy messages
- unstructured text with strong lexical patterns
Useful for request classification, payload similarity, and attack family grouping.
```yaml
method: KnnEuclidean
nlp: Word2Vec
k: 3
threads: 4

datasets:
  hot:
    - /data/http/attacks/
  cold:
    - /data/http/normal/

rules:
  - title: SQL Injection Pattern
    description: Common SQLi fragments
    match_rule: "union select|or 1=1|information_schema"
    score: 90

  - title: XSS Attempt
    description: Typical XSS payload markers
    match_rule: '<script>|alert\(|onerror='
    score: 85
```

Good for:
- HTTP request classification
- attack similarity analysis
- fuzzing / malicious payload detection
A simple, fast baseline for suspicious transaction narratives and fraud-related documents.
```yaml
method: Bayes
nlp: TfIdf
threads: 2

datasets:
  hot:
    - /data/fraud/transactions/
  cold:
    - /data/legit/transactions/

rules:
  - title: Suspicious Transaction
    description: Transaction language associated with urgency or manipulation
    match_rule: "transfer|urgent|wire|immediate"
    score: 60

  - title: Known Fraud Pattern
    description: Indicators of laundering, anonymity, or offshore movement
    match_rule: "offshore|crypto|anonymous|shell company"
    score: 75
```

Good for:
- fraud screening
- suspicious transaction review
- baseline text classification for risk teams
A lightweight example for biological text grouping and domain-specific keyword reinforcement.
```yaml
method: KnnCosine
nlp: FastText
k: 4
threads: 4

datasets:
  hot:
    - /data/bio/virus/
  cold:
    - /data/bio/human/

rules:
  - title: Virus Signature
    description: Vocabulary linked to viral sequences and mutations
    match_rule: "rna|mutation|viral|capsid"
    score: 80

  - title: Human Marker
    description: Terms associated with normal human biological context
    match_rule: "human tissue|somatic|host response"
    score: 20
```

Good for:
- biological text classification
- biosignal labeling
- domain experiments in genomics / virology corpora
Example of a richer structured model configuration.
```yaml
method: RandomForest
nlp: FastText
threads: 8

random_forest_mode: ExtraTrees
random_forest_n_trees: 200
random_forest_max_depth: null
random_forest_max_features: sqrt
random_forest_min_samples_split: 2
random_forest_min_samples_leaf: 1
random_forest_bootstrap: true
random_forest_oob_score: true

datasets:
  hot:
    - /data/http/attacks/
  cold:
    - /data/http/normal/

rules:
  - title: High Risk Attack Rule
    match_rule: "union select|<script>|../|xp_cmdshell"
    score: 90
```

Good for:
- structured or semi-structured risk signals
- richer classification experiments
- Random Forest benchmarking
- OOB-based internal validation
A clean example for more advanced text classification.
```yaml
method: SVM
nlp: TfIdf
threads: 4

svm_kernel: Linear
svm_c: 1.0
svm_learning_rate: 0.01
svm_epochs: 50

datasets:
  hot:
    - /data/email/spam/
  cold:
    - /data/email/normal/

rules:
  - title: Spam Promotion Rule
    match_rule: "bonus|prize|winner|cash"
    score: 50
```

Supported `svm_kernel` values: Linear, Rbf, Polynomial, Sigmoid.
Good for more structured scoring scenarios.
```yaml
method: GradientBoosting
nlp: TfIdf
threads: 4

gradient_boosting_n_estimators: 100
gradient_boosting_learning_rate: 0.1
gradient_boosting_max_depth: 3

datasets:
  hot:
    - /data/fraud/high-risk/
  cold:
    - /data/fraud/low-risk/

rules:
  - title: High Risk Pattern
    match_rule: "urgent transfer|offshore|anonymous wallet"
    score: 65
```

Best suited when your main signal is "normal vs strange".
```yaml
method: IsolationForest
nlp: FastText
threads: 4

isolation_forest_n_trees: 150
isolation_forest_contamination: 0.02
isolation_forest_subsample_size: 256

datasets:
  hot:
    - /data/anomaly/known_outliers/
  cold:
    - /data/anomaly/normal/

rules:
  - title: Rare Pattern
    match_rule: "unexpected syscall|rare endpoint|unusual payload"
    score: 40
```

Using the library directly from Rust:

KNN (cosine) with FastText:

```rust
use vec_eyes_lib::{ClassifierFactory, MethodKind, NlpOption};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let classifier = ClassifierFactory::new()
        .method(MethodKind::KnnCosine)
        .nlp(NlpOption::FastText)
        .k(Some(5))
        .threads(Some(4))
        .build()?;

    let result = classifier.classify_text("claim your free casino bonus now")?;
    println!("{result:?}");
    Ok(())
}
```

Naive Bayes with TF-IDF:

```rust
use vec_eyes_lib::{ClassifierFactory, MethodKind, NlpOption};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let classifier = ClassifierFactory::new()
        .method(MethodKind::Bayes)
        .nlp(NlpOption::TfIdf)
        .threads(Some(2))
        .build()?;

    let result = classifier.classify_text("urgent offshore transfer to anonymous account")?;
    println!("{result:?}");
    Ok(())
}
```

Random Forest (ExtraTrees) with FastText:

```rust
use vec_eyes_lib::{
    ClassifierFactory,
    MethodKind,
    NlpOption,
    RandomForestMaxFeatures,
    RandomForestMode,
};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let classifier = ClassifierFactory::new()
        .method(MethodKind::RandomForest)
        .nlp(NlpOption::FastText)
        .threads(Some(8))
        .random_forest_mode(Some(RandomForestMode::ExtraTrees))
        .random_forest_n_trees(Some(200))
        .random_forest_max_features(Some(RandomForestMaxFeatures::Sqrt))
        .random_forest_bootstrap(Some(true))
        .random_forest_oob_score(Some(true))
        .build()?;

    let result = classifier.classify_text("union select password from users where 1=1")?;
    println!("{result:?}");
    Ok(())
}
```

Linear SVM with TF-IDF:

```rust
use vec_eyes_lib::{ClassifierFactory, MethodKind, NlpOption, SvmKernel};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let classifier = ClassifierFactory::new()
        .method(MethodKind::SVM)
        .nlp(NlpOption::TfIdf)
        .threads(Some(4))
        .svm_kernel(Some(SvmKernel::Linear))
        .svm_c(Some(1.0))
        .svm_learning_rate(Some(0.01))
        .svm_epochs(Some(50))
        .build()?;

    let result = classifier.classify_text("win cash now bonus offer")?;
    println!("{result:?}");
    Ok(())
}
```

The matrix below is designed to make Vec-Eyes easier to understand and safer to configure. It summarizes what each classifier is best at, which NLP representations fit best, which parameters are required, and which non-trivial parameters strongly affect results.
| Classifier | Best NLP / Feature Input | Required Parameters | Important Non-Trivial Parameters | Best Use Cases | Notes |
|---|---|---|---|---|---|
| Bayes | Count, TfIdf | None | threads | Spam detection, fast baseline text classification, simple fraud text screening | Very fast and stable. Best as a baseline. Not ideal for dense embeddings like Word2Vec/FastText. |
| KnnCosine | FastText, Word2Vec | k | threads | Similarity-based classification, noisy text, phishing, behavioral text matching | Strong default for embedding-based text classification. Cosine is usually the best first KNN metric for dense vectors. |
| KnnEuclidean | FastText, Word2Vec | k | threads | Distance-based embedding experiments, attack clustering | More sensitive to magnitude than cosine. Useful for controlled experiments. |
| KnnManhattan | FastText, Word2Vec | k | threads | Alternative distance profile for embeddings | Often used for experimentation rather than as the default production KNN choice. |
| KnnMinkowski | FastText, Word2Vec | k, p | threads | Research-style distance tuning, anomaly/similarity experiments | p changes the geometry of distance. Use only when you explicitly want to tune distance behavior. |
| LogisticRegression | TfIdf, Count, sometimes dense embeddings | logistic_learning_rate, logistic_epochs | logistic_lambda, threads | Fraud classification, text classification, strong production baseline | Great balance between interpretability, speed, and quality. Very practical model. |
| SVM | TfIdf, Count, sometimes dense embeddings | svm_kernel, svm_c | svm_learning_rate, svm_epochs, svm_gamma, svm_degree, svm_coef0, threads | Security text classification, spam, fraud, web attack text | Linear is usually the best first choice. Rbf is more expressive but more sensitive to tuning. |
| RandomForest | TfIdf, FastText, structured-ish feature sets | random_forest_n_trees | random_forest_mode, random_forest_max_depth, random_forest_max_features, random_forest_min_samples_split, random_forest_min_samples_leaf, random_forest_bootstrap, random_forest_oob_score, threads | Richer risk scoring, structured features, fraud, mixed-signal classification | Good when you want ensembles and model diversity. Supports Standard, Balanced, and ExtraTrees. |
| GradientBoosting | TfIdf, structured-ish feature sets | gradient_boosting_n_estimators, gradient_boosting_learning_rate | gradient_boosting_max_depth, threads | Fraud/risk scoring, more expressive tabular-like classification | More sensitive to hyperparameters than Bayes or Logistic Regression. |
| IsolationForest | FastText, Word2Vec, anomaly-oriented feature sets | isolation_forest_n_trees, isolation_forest_contamination | isolation_forest_subsample_size, threads | Anomaly detection, unusual behavior detection, outlier hunting | Best when the goal is finding what looks abnormal rather than choosing among many known labels. |
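The matrix notes that Euclidean KNN is "more sensitive to magnitude than cosine". A tiny sketch makes the difference concrete: two vectors pointing the same way but with different lengths have cosine similarity 1, yet a nonzero Euclidean distance. (These helper functions are illustrative, not the library's internals.)

```rust
// Cosine similarity ignores vector magnitude; Euclidean distance does not.
// Illustrative helpers only, not the Vec-Eyes implementation.
fn cosine(a: &[f64], b: &[f64]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    dot / (na * nb)
}

fn euclidean(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f64>().sqrt()
}

fn main() {
    let a = [1.0, 2.0];
    let b = [2.0, 4.0]; // same direction, twice the magnitude
    println!("cosine:    {:.3}", cosine(&a, &b));    // 1.000
    println!("euclidean: {:.3}", euclidean(&a, &b)); // ~2.236
}
```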
Dataset semantics:
- `hot` directories → labeled as target class
- `cold` directories → baseline / normal behavior
- All files are read recursively
- Multiple directories are supported
- Each file contributes to training vectors
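The recursive walk described above could look roughly like this std-only sketch (illustrative; the actual Vec-Eyes loader may differ):

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

// Recursively collect every file beneath a dataset directory, mirroring
// how hot/cold directories are consumed. Illustrative sketch only.
fn collect_files(dir: &Path, out: &mut Vec<PathBuf>) -> io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_dir() {
            collect_files(&path, out)?; // descend into nested directories
        } else {
            out.push(path); // each file feeds the training vectors
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    // In practice this would point at the hot/cold dataset roots
    // (e.g. /data/email/spam/); "." keeps the sketch runnable anywhere.
    let mut files = Vec::new();
    collect_files(Path::new("."), &mut files)?;
    println!("{} files found", files.len());
    Ok(())
}
```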
VectorScan build dependencies on Fedora:

```shell
sudo dnf install boost-devel cmake gcc gcc-c++
```

On Debian/Ubuntu:

```shell
sudo apt install libboost-all-dev cmake build-essential
```

Build with VectorScan support:

```shell
cargo build --features vectorscan
```

- KnnCosine
- KnnEuclidean
- KnnManhattan
- KnnMinkowski (requires p)
- Bayes

Parameter rules:
- KNN requires k
- Minkowski requires p
- Bayes does not require extra parameters
- YAML is validated before execution
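For intuition on why Minkowski KNN needs `p`: the parameter interpolates between familiar distance metrics. A small sketch (not the library's implementation):

```rust
// Minkowski distance: (Σ |a_i − b_i|^p)^(1/p).
// p = 1 gives Manhattan, p = 2 gives Euclidean; larger p weights the
// largest coordinate difference more heavily. Illustrative sketch only.
fn minkowski(a: &[f64], b: &[f64], p: f64) -> f64 {
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y).abs().powf(p))
        .sum::<f64>()
        .powf(1.0 / p)
}

fn main() {
    let a = [0.0, 0.0];
    let b = [3.0, 4.0];
    println!("p=1: {}", minkowski(&a, &b, 1.0)); // 7 (Manhattan)
    println!("p=2: {}", minkowski(&a, &b, 2.0)); // 5 (Euclidean)
}
```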
- Rust-native
- Rayon parallelism
- ndarray + BLAS ready
```rust
use vec_eyes_core::*;

let classifier = build_classifier(...)
    .with_method(MethodKind::KnnCosine { k: 5 })
    .with_nlp(NlpOption::FastText)
    .load_rules("rules.yaml")
    .train(datasets)?;

let result = classifier.classify("input data");
```

Crates:
- vec-eyes-lib — core engine
- vec-eyes-cli — interface layer
- https://github.com/Orangewarrior/vec-eyes-lib/blob/main/tests/tests.md
- https://github.com/Orangewarrior/vec-eyes-lib/wiki/%F0%9F%A7%AA-Testing-Guide
We welcome contributions in:
- ML improvements
- Performance optimization
- Rule engine enhancements
- Dataset integrations
- Biological classification extensions
Orangewarrior
If you like Vec-Eyes:
- ⭐ Star the repo
- 💡 Open issues
- 🔧 Contribute