
Vec-Eyes Core 🧠🔬

High-performance behavior intelligence engine for Rust

Vec-Eyes Core is a modular, high-performance behavior classification engine written in Rust, designed to power advanced detection systems across security, data science, and biological domains.

It combines machine learning, NLP, vector embeddings, and rule-based matching into a unified engine for pattern detection and classification.


🚀 What is Vec-Eyes Core?

Vec-Eyes Core is the engine behind Vec-Eyes CLI.

It provides:

  • 🧠 Machine Learning (KNN, Naive Bayes)
  • 🔑 NLP pipelines (Tokenization, TF-IDF, Embeddings)
  • ⚡ Vector similarity (Word2Vec, FastText)
  • 🔎 Rule engine (Regex / optional VectorScan)
  • 📊 Hybrid scoring system

🎯 Use Cases

πŸ” Security & Threat Detection

  • Spam classification
  • Phishing detection
  • Web attack identification (SQLi, XSS, fuzzing)
  • Malware behavior analysis
  • Log anomaly detection

💰 Fraud Detection

  • Transaction anomaly detection
  • Behavioral fraud patterns
  • Suspicious activity classification

🧬 Biological & Scientific Analysis

Vec-Eyes Core can be adapted for:

  • Virus pattern classification
  • Human / biological signal classification
  • Bacteria and fungus pattern detection
  • Biomedical text/log classification

βš™οΈ Core Architecture

```
Input Text / Data
        ↓
Normalization / Tokenization
        ↓
Feature Extraction (TF-IDF / Embeddings)
        ↓
ML Engine (classifiers)
        ↓
Rule Engine (Regex / VectorScan)
        ↓
Hybrid Scoring
        ↓
Final Classification
```
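The last two stages fold rule-engine scores into the ML prediction. As a minimal sketch of how such a hybrid score could be combined (the `hybrid_score` function, its weights, and the 0..100 scale are illustrative assumptions, not the crate's actual API):

```rust
/// Combine an ML probability (0.0..=1.0) with an accumulated rule
/// score (0..=100) into one 0..=100 hybrid score.
/// The weights here are illustrative, not the engine's real values.
fn hybrid_score(ml_probability: f64, rule_score: f64) -> f64 {
    const ML_WEIGHT: f64 = 0.6;
    const RULE_WEIGHT: f64 = 0.4;
    let combined = ML_WEIGHT * (ml_probability * 100.0) + RULE_WEIGHT * rule_score;
    combined.clamp(0.0, 100.0)
}

fn main() {
    // ML says 90% spam probability; matched rules contributed 70 points.
    let score = hybrid_score(0.9, 70.0);
    println!("hybrid score: {score}");
    let label = if score >= 50.0 { "flagged" } else { "clean" };
    println!("classification: {label}");
}
```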

🧠 Machine Learning

✔ Multiple ML Algorithms:

  • KNN (Cosine, Euclidean, Manhattan, Minkowski)
  • Naive Bayes (Count, TF-IDF)
  • Logistic Regression
  • SVM (Linear, RBF, Polynomial, Sigmoid)
  • Random Forest (Standard, Balanced, ExtraTrees + OOB)
  • Gradient Boosting
  • Isolation Forest
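As an illustration of the distance side of this list, the cosine metric behind `KnnCosine` can be written in a few lines (standard definition; the function below is not the crate's API):

```rust
/// Cosine similarity between two dense embedding vectors, the metric
/// used by `KnnCosine`. Returns 0.0 for zero-length vectors.
fn cosine_similarity(a: &[f64], b: &[f64]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f64 = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let norm_b: f64 = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

fn main() {
    // Two hypothetical 3-dimensional embeddings.
    let spammy = [0.9, 0.1, 0.4];
    let query = [0.8, 0.2, 0.5];
    println!("similarity: {:.3}", cosine_similarity(&spammy, &query));
}
```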

🔑 NLP Pipeline

  • Tokenization
  • Normalization
  • TF-IDF
  • Word2Vec (lightweight training)
  • FastText-style embeddings (subword support)
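For reference, the TF-IDF weight used by the pipeline follows the standard definition, where t is a term, d a document, and D the corpus (general background, not a crate-specific formula):

```latex
\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \log \frac{|D|}{|\{d' \in D : t \in d'\}|}
```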

🔎 Rule Engine

Vec-Eyes supports rule-based matching with scoring.

✔ Default (no dependencies)

  • Regex-based matcher

✔ Optional (feature flag)

  • VectorScan (Hyperscan fork for high-speed matching)
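A dependency-free sketch of what the default matcher does with the YAML `rules` entries. The real engine compiles `match_rule` as a regex (or a VectorScan pattern when enabled); to stay stdlib-only, this sketch treats each `|`-separated alternative as a literal substring, and the `Rule` struct and `matched_rules` function are illustrative names, not the crate's API:

```rust
/// A rule as it appears in the YAML pipeline.
struct Rule {
    title: &'static str,
    match_rule: &'static str, // '|'-separated alternatives
    score: u32,
}

/// Return every rule whose pattern matches the input. The real engine
/// uses regex matching; here each alternative is a literal substring.
fn matched_rules<'a>(text: &str, rules: &'a [Rule]) -> Vec<&'a Rule> {
    let haystack = text.to_lowercase();
    rules
        .iter()
        .filter(|r| r.match_rule.split('|').any(|alt| haystack.contains(alt)))
        .collect()
}

fn main() {
    let rules = [
        Rule { title: "Spam Keywords", match_rule: "free|bonus|casino", score: 70 },
        Rule { title: "Urgency", match_rule: "urgent|immediately", score: 30 },
    ];
    let hits = matched_rules("Claim your FREE casino bonus now", &rules);
    for r in &hits {
        println!("matched rule: {} (+{})", r.title, r.score);
    }
    let total: u32 = hits.iter().map(|r| r.score).sum();
    println!("total rule score: {total}");
}
```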

🔥 YAML Pipeline (Advanced Examples)

Vec-Eyes allows full pipeline definition via YAML.


The examples below are designed to be:

  • realistic
  • maintainable
  • clear for contributors
  • close to production usage

1. YAML Examples

📄 1.1 KNN + FastText for Spam / Security

A strong default for email classification and noisy text detection.

```yaml
method: KnnCosine
nlp: FastText

k: 5
threads: 4

datasets:
  hot:
    - /data/email/spam/
  cold:
    - /data/email/normal/

rules:
  - title: Spam Keywords
    description: Detect common spam patterns
    match_rule: "free|bonus|win|casino|urgent"
    score: 70

  - title: Suspicious URL
    description: Detect promotional or deceptive links
    match_rule: "http://.*(promo|deal|bonus)"
    score: 80
```

When to use

  • spam detection
  • phishing-like text
  • noisy or typo-heavy messages
  • unstructured text with strong lexical patterns

📄 1.2 KNN + Word2Vec for Web Attack Detection

Useful for request classification, payload similarity, and attack family grouping.

```yaml
method: KnnEuclidean
nlp: Word2Vec

k: 3
threads: 4

datasets:
  hot:
    - /data/http/attacks/
  cold:
    - /data/http/normal/

rules:
  - title: SQL Injection Pattern
    description: Common SQLi fragments
    match_rule: "union select|or 1=1|information_schema"
    score: 90

  - title: XSS Attempt
    description: Typical XSS payload markers
    match_rule: '<script>|alert\(|onerror='
    score: 85
```

When to use

  • HTTP request classification
  • attack similarity analysis
  • fuzzing / malicious payload detection

📄 1.3 Bayes + TF-IDF for Financial Fraud Text Classification

A simple, fast baseline for suspicious transaction narratives and fraud-related documents.

```yaml
method: Bayes
nlp: TfIdf

threads: 2

datasets:
  hot:
    - /data/fraud/transactions/
  cold:
    - /data/legit/transactions/

rules:
  - title: Suspicious Transaction
    description: Transaction language associated with urgency or manipulation
    match_rule: "transfer|urgent|wire|immediate"
    score: 60

  - title: Known Fraud Pattern
    description: Indicators of laundering, anonymity, or offshore movement
    match_rule: "offshore|crypto|anonymous|shell company"
    score: 75
```

When to use

  • fraud screening
  • suspicious transaction review
  • baseline text classification for risk teams

📄 1.4 Biological Classification with FastText

A lightweight example for biological text grouping and domain-specific keyword reinforcement.

```yaml
method: KnnCosine
nlp: FastText

k: 4
threads: 4

datasets:
  hot:
    - /data/bio/virus/
  cold:
    - /data/bio/human/

rules:
  - title: Virus Signature
    description: Vocabulary linked to viral sequences and mutations
    match_rule: "rna|mutation|viral|capsid"
    score: 80

  - title: Human Marker
    description: Terms associated with normal human biological context
    match_rule: "human tissue|somatic|host response"
    score: 20
```

When to use

  • biological text classification
  • biosignal labeling
  • domain experiments in genomics / virology corpora

📄 1.5 Random Forest + OOB + ExtraTrees

Example of a richer structured model configuration.

```yaml
method: RandomForest
nlp: FastText

threads: 8

random_forest_mode: ExtraTrees
random_forest_n_trees: 200
random_forest_max_depth: null
random_forest_max_features: sqrt
random_forest_min_samples_split: 2
random_forest_min_samples_leaf: 1
random_forest_bootstrap: true
random_forest_oob_score: true

datasets:
  hot:
    - /data/http/attacks/
  cold:
    - /data/http/normal/

rules:
  - title: High Risk Attack Rule
    match_rule: "union select|<script>|../|xp_cmdshell"
    score: 90
```

When to use

  • structured or semi-structured risk signals
  • richer classification experiments
  • Random Forest benchmarking
  • OOB-based internal validation

📄 1.6 SVM with Explicit Kernel Configuration

A clean example for more advanced text classification.

```yaml
method: SVM
nlp: TfIdf

threads: 4

svm_kernel: Linear
svm_c: 1.0
svm_learning_rate: 0.01
svm_epochs: 50

datasets:
  hot:
    - /data/email/spam/
  cold:
    - /data/email/normal/

rules:
  - title: Spam Promotion Rule
    match_rule: "bonus|prize|winner|cash"
    score: 50
```

Other valid kernels

  • Linear
  • Rbf
  • Polynomial
  • Sigmoid
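For reference, the four kernels correspond to the standard SVM kernel functions, with `svm_gamma`, `svm_degree`, and `svm_coef0` mapping to γ, d, and c₀ below (textbook definitions, not library-specific formulas):

```latex
K_{\text{linear}}(x, z) = x^\top z \\
K_{\text{rbf}}(x, z) = \exp\!\left(-\gamma \lVert x - z \rVert^2\right) \\
K_{\text{poly}}(x, z) = \left(\gamma\, x^\top z + c_0\right)^{d} \\
K_{\text{sigmoid}}(x, z) = \tanh\!\left(\gamma\, x^\top z + c_0\right)
```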

📄 1.7 Gradient Boosting

Good for more structured scoring scenarios.

```yaml
method: GradientBoosting
nlp: TfIdf

threads: 4

gradient_boosting_n_estimators: 100
gradient_boosting_learning_rate: 0.1
gradient_boosting_max_depth: 3

datasets:
  hot:
    - /data/fraud/high-risk/
  cold:
    - /data/fraud/low-risk/

rules:
  - title: High Risk Pattern
    match_rule: "urgent transfer|offshore|anonymous wallet"
    score: 65
```

📄 1.8 Isolation Forest for Anomaly Detection

Best suited when your main signal is “normal vs strange”.

```yaml
method: IsolationForest
nlp: FastText

threads: 4

isolation_forest_n_trees: 150
isolation_forest_contamination: 0.02
isolation_forest_subsample_size: 256

datasets:
  hot:
    - /data/anomaly/known_outliers/
  cold:
    - /data/anomaly/normal/

rules:
  - title: Rare Pattern
    match_rule: "unexpected syscall|rare endpoint|unusual payload"
    score: 40
```

2. Rust API Examples

📄 2.1 KNN + FastText

```rust
use vec_eyes_lib::{ClassifierFactory, MethodKind, NlpOption};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let classifier = ClassifierFactory::new()
        .method(MethodKind::KnnCosine)
        .nlp(NlpOption::FastText)
        .k(Some(5))
        .threads(Some(4))
        .build()?;

    let result = classifier.classify_text("claim your free casino bonus now")?;
    println!("{result:?}");

    Ok(())
}
```

📄 2.2 Bayes + TF-IDF

```rust
use vec_eyes_lib::{ClassifierFactory, MethodKind, NlpOption};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let classifier = ClassifierFactory::new()
        .method(MethodKind::Bayes)
        .nlp(NlpOption::TfIdf)
        .threads(Some(2))
        .build()?;

    let result = classifier.classify_text("urgent offshore transfer to anonymous account")?;
    println!("{result:?}");

    Ok(())
}
```

📄 2.3 Random Forest + Advanced Parameters

```rust
use vec_eyes_lib::{
    ClassifierFactory,
    MethodKind,
    NlpOption,
    RandomForestMaxFeatures,
    RandomForestMode,
};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let classifier = ClassifierFactory::new()
        .method(MethodKind::RandomForest)
        .nlp(NlpOption::FastText)
        .threads(Some(8))
        .random_forest_mode(Some(RandomForestMode::ExtraTrees))
        .random_forest_n_trees(Some(200))
        .random_forest_max_features(Some(RandomForestMaxFeatures::Sqrt))
        .random_forest_bootstrap(Some(true))
        .random_forest_oob_score(Some(true))
        .build()?;

    let result = classifier.classify_text("union select password from users where 1=1")?;
    println!("{result:?}");

    Ok(())
}
```

📄 2.4 SVM + Explicit Kernel

```rust
use vec_eyes_lib::{ClassifierFactory, MethodKind, NlpOption, SvmKernel};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let classifier = ClassifierFactory::new()
        .method(MethodKind::SVM)
        .nlp(NlpOption::TfIdf)
        .threads(Some(4))
        .svm_kernel(Some(SvmKernel::Linear))
        .svm_c(Some(1.0))
        .svm_learning_rate(Some(0.01))
        .svm_epochs(Some(50))
        .build()?;

    let result = classifier.classify_text("win cash now bonus offer")?;
    println!("{result:?}");

    Ok(())
}
```

Compatibility Matrix

The matrix below is designed to make Vec-Eyes easier to understand and safer to configure. It summarizes what each classifier is best at, which NLP representations fit best, which parameters are required, and which non-trivial parameters strongly affect results.

| Classifier | Best NLP / Feature Input | Required Parameters | Important Non-Trivial Parameters | Best Use Cases | Notes |
|---|---|---|---|---|---|
| Bayes | Count, TfIdf | None | threads | Spam detection, fast baseline text classification, simple fraud text screening | Very fast and stable. Best as a baseline. Not ideal for dense embeddings like Word2Vec/FastText. |
| KnnCosine | FastText, Word2Vec | k | threads | Similarity-based classification, noisy text, phishing, behavioral text matching | Strong default for embedding-based text classification. Cosine is usually the best first KNN metric for dense vectors. |
| KnnEuclidean | FastText, Word2Vec | k | threads | Distance-based embedding experiments, attack clustering | More sensitive to magnitude than cosine. Useful for controlled experiments. |
| KnnManhattan | FastText, Word2Vec | k | threads | Alternative distance profile for embeddings | Often used for experimentation rather than as the default production KNN choice. |
| KnnMinkowski | FastText, Word2Vec | k, p | threads | Research-style distance tuning, anomaly/similarity experiments | p changes the geometry of distance. Use only when you explicitly want to tune distance behavior. |
| LogisticRegression | TfIdf, Count, sometimes dense embeddings | logistic_learning_rate, logistic_epochs | logistic_lambda, threads | Fraud classification, text classification, strong production baseline | Great balance between interpretability, speed, and quality. Very practical model. |
| SVM | TfIdf, Count, sometimes dense embeddings | svm_kernel, svm_c | svm_learning_rate, svm_epochs, svm_gamma, svm_degree, svm_coef0, threads | Security text classification, spam, fraud, web attack text | Linear is usually the best first choice. Rbf is more expressive but more sensitive to tuning. |
| RandomForest | TfIdf, FastText, structured-ish feature sets | random_forest_n_trees | random_forest_mode, random_forest_max_depth, random_forest_max_features, random_forest_min_samples_split, random_forest_min_samples_leaf, random_forest_bootstrap, random_forest_oob_score, threads | Richer risk scoring, structured features, fraud, mixed-signal classification | Good when you want ensembles and model diversity. Supports Standard, Balanced, and ExtraTrees. |
| GradientBoosting | TfIdf, structured-ish feature sets | gradient_boosting_n_estimators, gradient_boosting_learning_rate | gradient_boosting_max_depth, threads | Fraud/risk scoring, more expressive tabular-like classification | More sensitive to hyperparameters than Bayes or Logistic Regression. |
| IsolationForest | FastText, Word2Vec, anomaly-oriented feature sets | isolation_forest_n_trees, isolation_forest_contamination | isolation_forest_subsample_size, threads | Anomaly detection, unusual behavior detection, outlier hunting | Best when the goal is finding what looks abnormal rather than choosing among many known labels. |

🧠 How Dataset Loading Works

  • hot directories → labeled as target class
  • cold directories → baseline / normal behavior
  • All files are read recursively
  • Multiple directories supported
  • Each file contributes to training vectors
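A self-contained sketch of this loading scheme using only the standard library (the `load_dir` function is illustrative; the engine's actual loader may differ in ordering and error handling):

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Recursively collect the contents of every file under `dir`,
/// mirroring how each hot/cold directory feeds training vectors.
fn load_dir(dir: &Path, docs: &mut Vec<String>) -> io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_dir() {
            load_dir(&path, docs)?; // descend into subdirectories
        } else if let Ok(text) = fs::read_to_string(&path) {
            docs.push(text); // each file becomes one training document
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    // Build a throwaway hot/ directory so the example is self-contained.
    let root = std::env::temp_dir().join("vec_eyes_demo_hot");
    let _ = fs::remove_dir_all(&root);
    fs::create_dir_all(root.join("sub"))?;
    fs::write(root.join("a.txt"), "free casino bonus")?;
    fs::write(root.join("sub/b.txt"), "win a prize now")?;

    let mut docs = Vec::new();
    load_dir(&root, &mut docs)?;
    println!("loaded {} documents", docs.len());
    fs::remove_dir_all(&root)?;
    Ok(())
}
```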

🔧 Optional VectorScan Support

Fedora

```shell
sudo dnf install boost-devel cmake gcc gcc-c++
```

Debian / Ubuntu

```shell
sudo apt install libboost-all-dev cmake build-essential
```

Then build with the feature enabled:

```shell
cargo build --features vectorscan
```

βš™οΈ Supported Methods

  • KnnCosine
  • KnnEuclidean
  • KnnManhattan
  • KnnMinkowski (requires p)
  • Bayes

⚠️ Validation Rules

  • KNN requires k
  • Minkowski requires p
  • Bayes does not require extra parameters
  • YAML is validated before execution
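These rules are easy to mirror in code. A minimal sketch, assuming illustrative names (`Method`, `validate`) rather than the crate's real validator:

```rust
/// Methods relevant to the validation rules above.
enum Method {
    KnnCosine,
    KnnMinkowski,
    Bayes,
}

/// Check a configuration before execution: KNN methods need `k`,
/// Minkowski additionally needs `p`, Bayes needs neither.
fn validate(method: &Method, k: Option<usize>, p: Option<f64>) -> Result<(), String> {
    match method {
        Method::KnnCosine if k.is_none() => Err("KNN requires k".into()),
        Method::KnnMinkowski if k.is_none() => Err("KNN requires k".into()),
        Method::KnnMinkowski if p.is_none() => Err("Minkowski requires p".into()),
        _ => Ok(()), // Bayes and fully-specified KNN configs pass
    }
}

fn main() {
    assert!(validate(&Method::Bayes, None, None).is_ok());
    assert!(validate(&Method::KnnCosine, Some(5), None).is_ok());
    assert!(validate(&Method::KnnMinkowski, Some(3), None).is_err());
    println!("all validation checks behave as documented");
}
```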

⚡ Performance

  • Rust-native
  • Rayon parallelism
  • ndarray + BLAS ready

🧩 Embedding in Your Project

```rust
use vec_eyes_lib::*;

let classifier = build_classifier(...)
    .with_method(MethodKind::KnnCosine { k: 5 })
    .with_nlp(NlpOption::FastText)
    .load_rules("rules.yaml")
    .train(datasets)?;

let result = classifier.classify("input data");
```

🔗 Relationship with CLI

  • vec-eyes-lib → core engine
  • vec-eyes-cli → interface layer

🧪 Testing Guide

  • https://github.com/Orangewarrior/vec-eyes-lib/blob/main/tests/tests.md
  • https://github.com/Orangewarrior/vec-eyes-lib/wiki/%F0%9F%A7%AA-Testing-Guide

🤝 Contributing

We welcome contributions in:

  • ML improvements
  • Performance optimization
  • Rule engine enhancements
  • Dataset integrations
  • Biological classification extensions

👤 Author

Orangewarrior

If you like Vec-Eyes:

  • ⭐ Star the repo
  • 💡 Open issues
  • 🔧 Contribute
