⭐ If fuzzly helped you learn, please leave a star on GitHub! It helps others discover the project and keeps the motivation going.
A comprehensive, educational implementation of fuzzy search algorithms in Rust with an interactive Levenshtein visualizer, benchmarked implementations, and heavily annotated source code.
Quick links: Docs site · First release · Discussions
- Learn fuzzy search from a real, readable Rust implementation
- Compare
compute,compute_optimized,compute_fast, andcompute_myersside-by-side - Explore an interactive docs site instead of reading source alone
- Use it as a reference for typo-tolerant search, ranking, and edit distance
Add fuzzly directly from GitHub today:
[dependencies]
fuzzly = { git = "https://github.com/Fanaperana/fuzzy-search-rs" }Minimal example:
use fuzzly::{FuzzySearcher, LevenshteinDistance};
fn main() {
let distance = LevenshteinDistance::compute("kitten", "sitting");
println!("distance: {distance}");
let searcher = FuzzySearcher::new(0.6);
let results = searcher.search("aple", &["apple", "application", "banana"]);
for result in results {
println!("{}: {:.0}%", result.text, result.score * 100.0);
}
}Best entry points:
- Read the live docs: fanaperana.github.io/fuzzy-search-rs
- Browse the release: v0.1.0
- Join the conversation: GitHub Discussions
Fuzzy search (also known as approximate string matching) finds strings that approximately match a pattern, rather than requiring an exact match. This is useful for:
- Typo tolerance: Finding "apple" when user types "aple"
- Autocomplete: Suggesting completions as users type
- File finders: Tools like fzf, Ctrl+P in editors
- Search engines: Handling misspellings gracefully
This implementation uses the Levenshtein Distance (Edit Distance) algorithm as its core, enhanced with scoring and ranking mechanisms.
The edit distance between two strings is the minimum number of single-character operations needed to transform one string into another:
| Operation | Description | Example |
|---|---|---|
| Insert | Add a character | cat → cart (insert 'r') |
| Delete | Remove a character | cart → cat (delete 'r') |
| Substitute | Replace a character | cat → bat (substitute 'c' → 'b') |
We use dynamic programming to efficiently compute the edit distance.
FUNCTION levenshtein_distance(source, target):
m = length(source)
n = length(target)
// Handle edge cases
IF m == 0: RETURN n
IF n == 0: RETURN m
// Create matrix of size (m+1) x (n+1)
matrix = new Matrix[m+1][n+1]
// STEP 1: Initialize first column (deletions from source)
FOR i FROM 0 TO m:
matrix[i][0] = i
// STEP 2: Initialize first row (insertions into source)
FOR j FROM 0 TO n:
matrix[0][j] = j
// STEP 3: Fill the matrix
FOR i FROM 1 TO m:
FOR j FROM 1 TO n:
// Cost is 0 if characters match, 1 otherwise
IF source[i-1] == target[j-1]:
cost = 0
ELSE:
cost = 1
// Take minimum of three operations:
deletion = matrix[i-1][j] + 1 // Delete from source
insertion = matrix[i][j-1] + 1 // Insert into source
substitution = matrix[i-1][j-1] + cost // Substitute (or match)
matrix[i][j] = MIN(deletion, insertion, substitution)
// STEP 4: Result is in bottom-right cell
RETURN matrix[m][n]
FUNCTION similarity_score(source, target):
distance = levenshtein_distance(source, target)
max_length = MAX(length(source), length(target))
IF max_length == 0:
RETURN 1.0 // Both empty strings are identical
// Convert distance to similarity (1.0 = identical, 0.0 = completely different)
RETURN 1.0 - (distance / max_length)
FUNCTION fuzzy_match(query, candidates, threshold):
results = empty list
FOR each candidate IN candidates:
score = similarity_score(query, candidate)
IF score >= threshold:
ADD (candidate, score) TO results
// Sort by score descending (best matches first)
SORT results BY score DESCENDING
RETURN results
FUNCTION levenshtein_distance_optimized(source, target):
m = length(source)
n = length(target)
IF m == 0: RETURN n
IF n == 0: RETURN m
// Only need two rows instead of full matrix
previous_row = new Array[n+1]
current_row = new Array[n+1]
// Initialize first row
FOR j FROM 0 TO n:
previous_row[j] = j
// Process each character of source
FOR i FROM 1 TO m:
current_row[0] = i
FOR j FROM 1 TO n:
IF source[i-1] == target[j-1]:
cost = 0
ELSE:
cost = 1
current_row[j] = MIN(
current_row[j-1] + 1, // Insertion
previous_row[j] + 1, // Deletion
previous_row[j-1] + cost // Substitution
)
// Swap rows
SWAP(previous_row, current_row)
RETURN previous_row[n]
Computing distance between "kitten" and "sitting":
"" s i t t i n g
"" 0 1 2 3 4 5 6 7
k 1 1 2 3 4 5 6 7
i 2 2 1 2 3 4 5 6
t 3 3 2 1 2 3 4 5
t 4 4 3 2 1 2 3 4
e 5 5 4 3 2 2 3 4
n 6 6 5 4 3 3 2 3
Distance = 3 (substitute k→s, substitute e→i, insert g)
Similarity = 1 - (3 / 7) = 0.571 or 57.1%
use fuzzly::{FuzzySearcher, MatchResult};
fn main() {
// Create a fuzzy searcher with a threshold (0.0 to 1.0)
let searcher = FuzzySearcher::new(0.6);
// Define candidates to search through
let candidates = vec![
"apple", "application", "applet",
"banana", "bandana", "orange"
];
// Search for matches
let results = searcher.search("aple", &candidates);
for result in results {
println!("{}: {:.1}%", result.text, result.score * 100.0);
}
}apple: 80.0%
applet: 66.7%
use fuzzly::{FuzzySearcher, LevenshteinDistance};
fn main() {
// Direct distance calculation
let distance = LevenshteinDistance::compute("kitten", "sitting");
println!("Edit distance: {}", distance); // 3
// Get similarity score
let similarity = LevenshteinDistance::similarity("kitten", "sitting");
println!("Similarity: {:.1}%", similarity * 100.0); // 57.1%
// Case-insensitive matching
let searcher = FuzzySearcher::new(0.5)
.case_insensitive(true);
let results = searcher.search("APPLE", &["apple", "Apple", "APPLE"]);
// All will match with score 1.0
}fuzzy-search-rs/
├── .github/ # Workflows, issue templates, PR template, discussions
├── benches/ # Criterion benchmarks
├── docs/ # GitHub Pages site and branding assets
├── src/
│ ├── lib.rs # Library: Levenshtein + fuzzy-search implementations
│ └── main.rs # CLI/demo entry point
├── CHANGELOG.md # Release notes by version
├── CODE_OF_CONDUCT.md # Community expectations
├── CONTRIBUTING.md # Contribution guide
├── Cargo.toml # Project metadata and dependencies
├── LICENSE # MIT License
├── README.md # Project overview and usage guide
└── SECURITY.md # Security reporting guidance
- O(m × n) where m and n are the lengths of the two strings
- Standard version: O(m × n) for the full matrix
- Optimized version: O(min(m, n)) using two-row technique
- ✅ Levenshtein distance calculation
- ✅ Similarity scoring (0.0 to 1.0)
- ✅ Threshold-based filtering
- ✅ Case-insensitive matching
- ✅ Memory-optimized implementation
- ✅ Detailed match information (operations performed)
- ✅ Ranked results by relevance
The pseudo code above can be directly translated to any language:
Python
def levenshtein_distance(source: str, target: str) -> int:
m, n = len(source), len(target)
if m == 0: return n
if n == 0: return m
# Use two rows for optimization
prev_row = list(range(n + 1))
curr_row = [0] * (n + 1)
for i in range(1, m + 1):
curr_row[0] = i
for j in range(1, n + 1):
cost = 0 if source[i-1] == target[j-1] else 1
curr_row[j] = min(
curr_row[j-1] + 1, # insertion
prev_row[j] + 1, # deletion
prev_row[j-1] + cost # substitution
)
prev_row, curr_row = curr_row, prev_row
return prev_row[n]JavaScript
function levenshteinDistance(source, target) {
const m = source.length;
const n = target.length;
if (m === 0) return n;
if (n === 0) return m;
let prevRow = Array.from({ length: n + 1 }, (_, i) => i);
let currRow = new Array(n + 1).fill(0);
for (let i = 1; i <= m; i++) {
currRow[0] = i;
for (let j = 1; j <= n; j++) {
const cost = source[i-1] === target[j-1] ? 0 : 1;
currRow[j] = Math.min(
currRow[j-1] + 1, // insertion
prevRow[j] + 1, // deletion
prevRow[j-1] + cost // substitution
);
}
[prevRow, currRow] = [currRow, prevRow];
}
return prevRow[n];
}Go
func levenshteinDistance(source, target string) int {
m, n := len(source), len(target)
if m == 0 { return n }
if n == 0 { return m }
prevRow := make([]int, n+1)
currRow := make([]int, n+1)
for j := 0; j <= n; j++ {
prevRow[j] = j
}
for i := 1; i <= m; i++ {
currRow[0] = i
for j := 1; j <= n; j++ {
cost := 1
if source[i-1] == target[j-1] {
cost = 0
}
currRow[j] = min(
currRow[j-1]+1, // insertion
prevRow[j]+1, // deletion
prevRow[j-1]+cost, // substitution
)
}
prevRow, currRow = currRow, prevRow
}
return prevRow[n]
}cargo benchHTML reports with charts are written to target/criterion/. Open any
target/criterion/<group>/report/index.html in a browser to see the full
analysis.
Quick run (without the slow measurement warmup):
cargo bench -- --quickThe single levenshtein benchmark group runs all three implementations on the
same input, at five stress levels, so you can see the progression side-by-side.
| Tier | String lengths | What it stresses |
|---|---|---|
| tiny | 6 vs 7 | Function-call overhead dominates — gap is small |
| short | 22 vs 24 | Matrix allocation cost starts to show |
| medium | 66 vs 68 | Cache pressure — full matrix no longer fits in L1 |
| long | 182 vs 191 | O(m×n) full matrix vs two-row gap fully visible |
| stress | 270 vs 268 | Worst-case: alphabet vs reversed alphabet, every DP cell forced |
Measured on an x86-64 Linux machine (median of 40 samples, release build). Numbers are the median time per call.
| Tier | compute |
compute_optimized |
compute_fast |
compute_myers |
fast vs compute |
|---|---|---|---|---|---|
| tiny (6 vs 7) | 332 ns | 286 ns | 39 ns | 40 ns | 8.5× faster |
| short (22 vs 24) | 2,478 ns | 1,342 ns | 105 ns | 106 ns | 23× faster |
| medium (66 vs 68) | 14,003 ns | 9,497 ns | 10,085 ns | 10,011 ns | 1.4× faster |
| long (182 vs 191) | 99,178 ns | 104,900 ns | 79,650 ns | 79,973 ns | 1.2× faster |
| stress (270 vs 268) | 119,310 ns | 79,560 ns | 96,125 ns | 94,962 ns | 1.2× faster |
How to read the table:
computeis the learning-friendly full-matrix baseline.compute_optimizedsaves memory (two rows instead of full matrix) but keepsusize(8-byte) cells — it wins on the stress tier where the full matrix exceeds cache capacity.compute_fastuses an ASCII byte path (noVec<char>allocation) and picks the narrowest DP cell type that can't overflow —u8for strings under 256 chars,u16under 65 536, otherwiseu32. On tiny/short inputs theu8buffers stay entirely in L1 and LLVM auto-vectorises the inner loop, yielding the 8–23× speedups you see above.compute_myersis Myers' 1999 bit-parallel algorithm: the whole DP column is packed into a singleu64and advanced with ~7 bitwise ops per text character — O(n) instead of O(m·n) when the shorter string is ≤ 64 chars. At these sizes it tiescompute_fast(the auto-vectorisedu8DP is already memory-bandwidth-bound), but it scales better as patterns grow and is the foundation for future block-Myers and SIMD work. For patterns longer than 64 chars it falls back tocompute_fast.Reproduce on your machine:
cargo bench
Four implementations are provided, each building on the previous:
| Method | Time | Space | Notes |
|---|---|---|---|
compute |
O(m·n) | O(m·n) | Full matrix — easiest to follow |
compute_optimized |
O(m·n) | O(min(m,n)) | Two-row rolling buffer, usize cells |
compute_fast |
O(m·n) | O(min(m,n)) | Two-row rolling buffer, adaptive u8/u16/u32 cells + ASCII byte path |
compute_myers |
O(n) words | O(1) words | Myers 1999 bit-parallel; pattern ≤ 64 chars, ASCII |
-
ASCII fast path — if both strings are ASCII,
as_bytes()is used directly instead of collecting aVec<char>. This eliminates a heap allocation per call and keeps the data as a tight byte slice. -
Adaptive DP cell width (
u8/u16/u32) — with gap cost 1 every DP cell satisfiescell ≤ max(m, n), so we pick the narrowest integer type that can't overflow. For strings under 256 chars every cell is a single byte — 8× smaller thanusize, 4× smaller thanu32. The two rolling rows then fit in a handful of cache lines and LLVM auto-vectorises the inner loop, turning the DP into a near-streaming memory operation. -
Length-difference early exit in
search— before running the DP,FuzzySearcher::searchchecks whether the length difference between the query and a candidate already makes it impossible to reach the similarity threshold. Candidates that can't possibly match are skipped without computing anything.
-
Bit-parallel column state — the differences between consecutive cells in a DP column are encoded as two
u64bit-vectors (Pv,Mv). The running score of the bottom cell is maintained as a plain integer. Advancing one column (consuming one text character) is a fixed sequence of ~7 bitwise operations — no inner loop over the column. -
Peqlookup table — a 128-entry stack array (one bitmask per ASCII byte) encodes at which positions of the pattern each character occurs. Populated in O(m) before the main loop; indexed once per text character inside it. -
Automatic fallback — patterns longer than 64 chars or non-ASCII input fall through to
compute_fast. The function always returns the same result ascompute; only the path through the code differs.
Reference: Gene Myers, "A fast bit-vector algorithm for approximate string matching based on dynamic programming", J. ACM 46(3), 1999.
This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome.
- Read CONTRIBUTING.md before opening a pull request
- Use Discussions for questions and ideas
- Open issues for bugs or feature requests using the repository templates
- Report sensitive issues through the guidance in SECURITY.md
