String Analyzer

String Analyzer extracts and analyzes printable strings from binary files. It is designed for malware analysts, reverse engineers, and forensics investigators who need to quickly surface URLs, IPs, registry keys, API names, and other indicators from executables, memory dumps, or disk images—and optionally generate an AI-ready analysis prompt.

Zero runtime dependencies (Python standard library only).
Single entry point: one CLI with batch and interactive modes.
Library-friendly API: use analyze_file() or lower-level functions in your own scripts.

📖 Practical guide (Medium) — step-by-step usage, workflows, and examples.

Features

| Feature | Description |
| --- | --- |
| String extraction | ASCII and UTF-16LE (Windows PE); configurable min length and `max_bytes`; chunked read for large files. |
| Entropy | Shannon entropy (chunked when `max_bytes` set); high entropy suggests packed/encrypted content. |
| Pattern detection | Strict IPv4 (0–255), IPv6 (full and abbreviated), URLs (http/https/ftp/file/ws/wss), obfuscated URLs (hxxp, etc.), emails, MAC addresses, registry keys, system paths, DLLs, 300+ Windows APIs, CMD/PowerShell, obfuscation patterns. |
| Embedded extraction | URLs, IPs, emails, MACs found inside long strings (not only whole-line matches). |
| Decoding | Base64 (standard and URL-safe) and hex; decoded candidates in report. |
| Suspicious keywords | Extended set: malware, miner, steal, persist, evasion, etc., plus .NET namespaces. |
| Sensitive mode | `--sensitive`: lower obfuscation thresholds and more keywords for stricter triage. |
| Output formats | Unfiltered dump, categorized report, or AI-ready markdown prompt. |
| CLI & API | Full CLI (`--encoding`, `--sensitive`, `--no-embedded`); programmatic `analyze_file()`; no global state. |
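
To make the string extraction row concrete, here is a minimal sketch of how printable ASCII and UTF-16LE runs can be pulled out of raw bytes with regular expressions. The function name `extract_printable_strings` and its regexes are assumptions for illustration; the package's own `extract_strings()` may work differently (for example, it reads files in chunks).

```python
import re
from typing import Set


def extract_printable_strings(data: bytes, min_length: int = 4) -> Set[str]:
    """Return unique printable ASCII and UTF-16LE strings found in data."""
    found: Set[str] = set()

    # ASCII: runs of printable bytes (space through tilde).
    ascii_run = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    for match in ascii_run.finditer(data):
        found.add(match.group().decode("ascii"))

    # UTF-16LE: printable ASCII characters, each followed by a NUL byte.
    utf16_run = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_length)
    for match in utf16_run.finditer(data):
        found.add(match.group().decode("utf-16-le"))

    return found


# Synthetic sample: one ASCII URL followed by "kernel32" encoded as UTF-16LE.
sample = b"\x01\x02http://example.com\x00\xff" + "kernel32".encode("utf-16-le")
print(sorted(extract_printable_strings(sample)))
# ['http://example.com', 'kernel32']
```

Returning a set mirrors the documented behavior of unique strings; sort the result when a stable order matters.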

Installation

Requirements: Python 3.8 or newer.

git clone https://github.com/anpa1200/String-Analyzer-.git && cd String-Analyzer-
python3 -m venv venv
source venv/bin/activate   # Windows: venv\Scripts\activate
pip install -e .

After installation you get the string-analyzer command. From the project root you can also run:

python -m string_analyzer

Development (optional): pip install -e ".[dev]" adds pytest and ruff for tests and linting.

Quick start

# Categorized report (default)
string-analyzer /path/to/binary -o report.txt

# All extracted strings, no categorization
string-analyzer /path/to/binary --unfiltered -o strings.txt

# AI-ready analysis prompt
string-analyzer /path/to/binary --ai-prompt -o prompt.md

# Interactive: prompt for file and output type
string-analyzer

Usage

Command-line options

| Option | Description |
| --- | --- |
| `file` | Path to the binary file. Omit to run interactive mode. |
| `-o`, `--output PATH` | Output file (default: `<basename>_strings.txt`). |
| `--min-length N` | Minimum string length to extract (default: 4). |
| `--max-bytes N` | Stop reading after N bytes (safety for very large files). |
| `--unfiltered` | Output all extracted strings, one per line (no categories). |
| `--filtered` | Output categorized report (default when not using `--unfiltered` or `--ai-prompt`). |
| `--ai-prompt` | Generate markdown prompt for an AI assistant. |
| `--analyze-with {gemini,codex}` | Send categorized prompt to gemini-cli or codex-cli and print the AI analysis. Saves the prompt to `-o`; use `--ai-output` to save the AI response. |
| `--ai-output PATH` | Save the AI response to this file (when using `--analyze-with`). |
| `--encoding {ascii,utf16,both}` | Extract ASCII only, UTF-16LE only, or both (default: both). |
| `--sensitive` | Lower obfuscation thresholds; more suspicious keywords. |
| `--no-embedded` | Do not extract URLs/IPs/emails from inside long strings. |
| `-i`, `--interactive` | Force interactive mode (prompt for file and options). |
| `-q`, `--quiet` | Suppress non-error messages. |
| `-v`, `--verbose` | Verbose logging. |
| `--version` | Show version. |
| `--help` | Show help. |

Output modes

Unfiltered (--unfiltered): sorted list of all extracted strings. Use for grepping or feeding into other tools.
Filtered (default): categorized report with entropy, plus sections such as URLS, IPS, WINDOWS_API_COMMANDS, DLLS, OBFUSCATED, etc.
AI prompt (--ai-prompt): same categories in a markdown prompt asking an AI to analyze behavior and functionality (e.g. for malware triage).

External AI analysis (`--analyze-with`)

The --analyze-with option sends the categorized string report directly to an AI CLI so you get an analysis in one command instead of copying a prompt by hand.

What it does: After extracting and categorizing strings (URLs, IPs, APIs, DLLs, obfuscation, etc.), the tool builds the same markdown prompt used by --ai-prompt, writes it to the path given by -o (so you can keep or reuse it), then pipes that prompt into the chosen CLI. The AI’s reply is printed to the terminal; you can save it with --ai-output PATH.
Values: gemini — uses gemini-cli (looks for gemini or gemini-cli on your PATH). codex — uses Codex CLI (codex exec - with the prompt on stdin).
Requirements: You must have one of these installed and on your PATH: Gemini CLI (e.g. npm i -g @google/gemini-cli) or Codex CLI. The tool does not call cloud APIs itself; it only invokes the local CLI, which handles authentication and the model.
Example:
string-analyzer suspect.exe --analyze-with gemini -o prompt.txt --ai-output analysis.md
This saves the prompt to prompt.txt, sends it to Gemini, and writes the AI’s analysis to analysis.md.
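
The flow described above (locate a CLI on PATH, feed it the prompt on stdin, capture the reply) can be sketched with the standard library. This is an illustration only, not the tool's implementation: `find_cli` and `pipe_prompt` are hypothetical helper names, and a Python one-liner stands in for the AI CLI so the example runs without Gemini or Codex installed.

```python
import shutil
import subprocess
import sys
from typing import List, Optional


def find_cli(candidates: List[str]) -> Optional[str]:
    """Return the full path of the first candidate found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


def pipe_prompt(command: List[str], prompt: str, timeout: int = 300) -> str:
    """Run command, send prompt on stdin, and return what it wrote to stdout."""
    completed = subprocess.run(
        command,
        input=prompt,
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return completed.stdout


# Stand-in for a real AI CLI: upper-cases whatever arrives on stdin.
fake_cli = [sys.executable, "-c", "import sys; print(sys.stdin.read().upper(), end='')"]
reply = pipe_prompt(fake_cli, "analyze these strings")
print(reply)  # ANALYZE THESE STRINGS

# With a real CLI the command would be built from the located binary instead,
# e.g. find_cli(["gemini", "gemini-cli"]) as described in the text above.
```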

Interactive mode

Run string-analyzer with no file argument (or use string-analyzer -i). The tool will:

1. Ask for the file path.
2. Ask whether to output all strings (unfiltered) or a filtered report.
3. If filtered: ask whether to generate an AI prompt or a normal report.
4. Ask for the output file path (with a default suggestion).

Interactive mode limits input to 50 MB by default to avoid accidental resource use.

Pattern categories

Strings are classified into the following categories (empty categories are omitted from output):

| Category | Description |
| --- | --- |
| `WINDOWS_API_COMMANDS` | Known Windows API function names (300+). |
| `DLLS` | Strings matching typical DLL names (e.g. `*.dll`). |
| `URLS` | HTTP/HTTPS and similar URLs (ftp, file, ws, wss). |
| `IPS` | IPv4 addresses (each octet 0–255). |
| `IPV6` | IPv6 addresses (full and abbreviated). |
| `EMAILS` | Email-like strings. |
| `WINDOWS_REGISTRY_KEYS` | Registry path patterns. |
| `POWERSHELL_COMMANDS` | PowerShell cmdlets/commands. |
| `CMD_COMMANDS` | CMD shell commands. |
| `FILES` | File path / filename patterns. |
| `SYSTEM_PATHS` | System directory paths. |
| `OBFUSCATED` | Patterns suggesting obfuscation (e.g. `h[.]xxp`, dotted IPs). |
| `DECODED_BASE64` | Strings that successfully decode from Base64 to printable text. |
| `DECODED_HEX` | Strings that successfully decode from hex to printable text. |
| `SUSPICIOUS_KEYWORDS` | Substrings associated with malware (e.g. `miner`, `steal`, `persist`). |
| `SUSPICIOUS_DOTNET` | .NET-related suspicious namespaces/keywords. |
| `MAC_ADDRESSES` | MAC addresses (e.g. `00:1A:2B:3C:4D:5E`). |
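
As an illustration of what a strict IPv4 match and embedded extraction involve, the sketch below finds addresses inside a longer string while rejecting octets above 255 and dotted version numbers. The regex and the name `find_ipv4` are assumptions written for this example; the patterns shipped with the package may differ.

```python
import re
from typing import List

# One octet in the range 0-255, without accepting values such as 256 or 999.
OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"

# Four octets; the lookarounds stop matches inside longer dotted numbers.
IPV4_RE = re.compile(rf"(?<![\d.]){OCTET}(?:\.{OCTET}){{3}}(?![\d.])")


def find_ipv4(text: str) -> List[str]:
    """Return every strict IPv4 address embedded anywhere in text."""
    return IPV4_RE.findall(text)


print(find_ipv4("beacon to 192.168.1.10:8080, fallback 10.0.0.256"))
# ['192.168.1.10']
print(find_ipv4("library v1.2.3.4.5"))
# []
```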

The tool also computes file entropy. Combined with a low count of “useful” patterns (APIs, DLLs, CMD/PowerShell), high entropy can indicate a packed or obfuscated binary; this is noted in the report and in the AI prompt.
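
For readers unfamiliar with the entropy figure, here is a minimal sketch of byte-level Shannon entropy computed in chunks, with an optional cap on how much is read. `stream_entropy` is a hypothetical helper for illustration, not the package's `compute_file_entropy()`. The result ranges from 0.0 (one repeated byte value) to 8.0 (all 256 byte values equally likely).

```python
import io
import math
from collections import Counter
from typing import BinaryIO, Optional


def stream_entropy(
    stream: BinaryIO, max_bytes: Optional[int] = None, chunk_size: int = 65536
) -> float:
    """Shannon entropy in bits per byte, reading at most max_bytes bytes."""
    counts: Counter = Counter()
    total = 0
    while max_bytes is None or total < max_bytes:
        want = chunk_size if max_bytes is None else min(chunk_size, max_bytes - total)
        chunk = stream.read(want)
        if not chunk:
            break
        counts.update(chunk)  # iterating bytes yields integer byte values
        total += len(chunk)
    if total == 0:
        return 0.0
    entropy = 0.0
    for n in counts.values():
        p = n / total
        entropy -= p * math.log2(p)
    return entropy


print(stream_entropy(io.BytesIO(b"\x00" * 1024)))          # 0.0
print(stream_entropy(io.BytesIO(bytes(range(256)) * 4)))   # 8.0
```

Packed or encrypted sections tend to push the value toward the upper end of that range, which is why the report pairs entropy with the count of recognizable patterns.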

Programmatic API

Use the package in your own Python code:

from string_analyzer import (
    analyze_file,
    extract_strings,
    detect_patterns,
    compute_file_entropy,
    generate_normal_output,
    generate_ai_prompt,
    shannon_entropy,
)
from string_analyzer.analyzer import (
    is_likely_obfuscated,
    is_mostly_printable,
    try_base64_decode,
    try_hex_decode,
)

One-shot analysis

result = analyze_file(
    "/path/to/binary",
    min_length=4,
    max_bytes=None,
    encoding="both",        # "ascii", "utf16", or "both"
    extract_embedded=True,  # find URLs/IPs inside long strings
    sensitive=False,        # True: lower obfuscation thresholds
)
# result["file"], result["entropy"], result["strings"], result["patterns"], result["obfuscated"]

Step-by-step

from pathlib import Path
path = Path("sample.bin")
entropy = compute_file_entropy(path)
strings = extract_strings(path, min_length=4, max_bytes=10_000_000)
patterns = detect_patterns(strings)  # New dict every time; no global state
obfuscated = is_likely_obfuscated(patterns, entropy)
report = generate_normal_output(patterns, entropy, obfuscated)
# Or: prompt_text = generate_ai_prompt(patterns, entropy, obfuscated)
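
Because `detect_patterns()` is documented as returning `dict[str, set[str]]`, the result cannot be passed to `json.dumps` directly (sets are not JSON-serializable). A small sketch of one way to export it follows; `patterns_to_json` is a hypothetical helper, and the dictionary below is hand-written sample data rather than real tool output.

```python
import json
from typing import Dict, Set


def patterns_to_json(patterns: Dict[str, Set[str]]) -> str:
    """Serialize categorized strings as JSON, dropping empty categories."""
    serializable = {
        category: sorted(values)  # sets become sorted lists for stable output
        for category, values in patterns.items()
        if values
    }
    return json.dumps(serializable, indent=2, sort_keys=True)


# Hand-written sample in the documented shape.
example_patterns = {
    "URLS": {"http://example.com/a"},
    "IPS": {"10.0.0.2", "10.0.0.1"},
    "EMAILS": set(),
}
print(patterns_to_json(example_patterns))
```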

Function reference

| Function | Description |
| --- | --- |
| `analyze_file(path, min_length=4, max_bytes=None, encoding="both", extract_embedded=True, sensitive=False)` | Full analysis; returns dict with `file`, `entropy`, `strings`, `patterns`, `obfuscated`. |
| `extract_strings(path, min_length=4, max_bytes=None)` | Extract unique printable strings; returns `set[str]`. |
| `compute_file_entropy(path)` | Shannon entropy of file bytes. |
| `shannon_entropy(s)` | Shannon entropy of a string. |
| `detect_patterns(strings)` | Categorize strings; returns new `dict[str, set[str]]`. |
| `is_likely_obfuscated(patterns, file_entropy)` | Heuristic: few “useful” patterns and entropy > threshold. |
| `generate_normal_output(patterns, entropy, obfuscated)` | Formatted filtered report text. |
| `generate_ai_prompt(patterns, entropy, obfuscated)` | Markdown prompt text for AI analysis. |
| `is_mostly_printable(s, threshold=0.9)` | Whether the string is mostly printable ASCII. |
| `try_base64_decode(s)` | Decode Base64 if valid and printable; else `None`. |
| `try_hex_decode(s)` | Decode hex if valid and printable; else `None`. |
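
The "decode if valid and printable, else `None`" contract of the decoding helpers can be sketched as follows. This is an assumption-laden illustration, not the package's `try_base64_decode()`: the name `decode_base64_if_printable`, the `min_length` cutoff, and the exact printability test are all hypothetical.

```python
import base64
import binascii
import string
from typing import Optional

PRINTABLE = set(string.printable)


def decode_base64_if_printable(candidate: str, min_length: int = 8) -> Optional[str]:
    """Return the decoded text if candidate is valid Base64 for printable text."""
    if len(candidate) < min_length:
        return None  # very short strings decode by accident too often
    # Map the URL-safe alphabet to the standard one and restore padding.
    normalized = candidate.replace("-", "+").replace("_", "/")
    normalized += "=" * (-len(normalized) % 4)
    try:
        raw = base64.b64decode(normalized, validate=True)
        text = raw.decode("utf-8")
    except (binascii.Error, ValueError):  # UnicodeDecodeError is a ValueError
        return None
    if not text or any(ch not in PRINTABLE for ch in text):
        return None
    return text


encoded = base64.b64encode(b"http://evil.example").decode("ascii")
print(decode_base64_if_printable(encoded))         # http://evil.example
print(decode_base64_if_printable("not base64!!"))  # None
```

Requiring printable output is what keeps ordinary identifiers, which are often valid Base64 by coincidence, from flooding the report.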

Examples

Malware triage — get an AI prompt for a sample:

string-analyzer suspect.exe --ai-prompt -o triage_prompt.md
# Then paste triage_prompt.md into your AI assistant.

Large file — limit read size and get a filtered report:

string-analyzer memory.dump --max-bytes 100000000 -o report.txt

Script — use API and only print URLs and IPs:

from string_analyzer import analyze_file
r = analyze_file("sample.bin")
for s in r["patterns"].get("URLS", []):
    print(s)
for s in r["patterns"].get("IPS", []):
    print(s)

Longer strings only:

string-analyzer binary --min-length 8 -o long_strings.txt

Maximum sensitivity (UTF-16 + embedded URLs + lower obfuscation bar):

string-analyzer suspect.exe --encoding both --sensitive -o report.txt

Send to Gemini or Codex for AI analysis (requires gemini-cli or codex on PATH):

string-analyzer suspect.exe --analyze-with gemini -o prompt.txt --ai-output analysis.md
string-analyzer suspect.exe --analyze-with codex --ai-output analysis.md

Configuration and limits

Minimum string length: --min-length (default 4). Higher values reduce noise and speed up analysis.
Maximum bytes read: --max-bytes. Omit for no limit; set for very large files to avoid high memory use.
Obfuscation heuristic: Implemented using MIN_USEFUL_COUNT (default 10) and ENTROPY_THRESHOLD (default 5.0) in string_analyzer.patterns. A file is flagged as likely obfuscated when the number of “useful” patterns (Windows API, DLLs, CMD, PowerShell) is below the count threshold and file entropy is above the entropy threshold.
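
The obfuscation rule described above reduces to a two-condition check. The sketch below reimplements it from that description for illustration; `looks_obfuscated` and `USEFUL_CATEGORIES` are hypothetical names, and the real `is_likely_obfuscated()` may differ in detail (for example, under `--sensitive`).

```python
from typing import Dict, Set

MIN_USEFUL_COUNT = 10    # default stated in the text above
ENTROPY_THRESHOLD = 5.0  # default stated in the text above

# Categories the text counts as "useful" patterns.
USEFUL_CATEGORIES = (
    "WINDOWS_API_COMMANDS",
    "DLLS",
    "CMD_COMMANDS",
    "POWERSHELL_COMMANDS",
)


def looks_obfuscated(
    patterns: Dict[str, Set[str]],
    file_entropy: float,
    min_useful: int = MIN_USEFUL_COUNT,
    entropy_threshold: float = ENTROPY_THRESHOLD,
) -> bool:
    """Flag a file with few useful patterns and entropy above the threshold."""
    useful = sum(len(patterns.get(category, ())) for category in USEFUL_CATEGORIES)
    return useful < min_useful and file_entropy > entropy_threshold


# Hand-written samples: a sparse, high-entropy file and a pattern-rich one.
packed = {"DLLS": {"kernel32.dll"}}
plain = {"WINDOWS_API_COMMANDS": {f"Api{i}" for i in range(12)}}
print(looks_obfuscated(packed, 7.6))  # True
print(looks_obfuscated(plain, 6.1))   # False
```

Both conditions must hold, so a high-entropy file that still exposes many API names is not flagged.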

Security and safety

Input files: String Analyzer only reads the file and extracts printable strings; it does not execute or interpret code. Still, avoid running it on untrusted binaries in a sensitive environment without proper isolation.
Large files: Use --max-bytes (or the max_bytes parameter in the API) to cap how much is read; interactive mode uses a 50 MB default.
Output: Reports may contain URLs, IPs, and other indicators. Handle output according to your security and privacy policies.

Development

pip install -e ".[dev]"
ruff check string_analyzer tests
pytest tests/ -v

CI runs on push/PR: Ruff lint and pytest on Python 3.8, 3.10, and 3.12.

Documentation: Practical guide (Medium) · docs/DOCUMENTATION.md (patterns, heuristics, workflows)

Related repositories & articles

| Resource | Link |
| --- | --- |
| String-Analyzer (this repo) | GitHub · Medium: String Analyzer Guide |
| Static-malware-Analysis-Orchestrator | GitHub: one-command pipeline (triage, strings, PE imports, unpack) · Medium: Full workflow |
| PE-Import-Analyzer | GitHub · Medium: PE Import Analyzer Guide |
| Unpacker | GitHub · Medium: Unpacker Guide |
| Basic-File-Information-Gathering-Script | GitHub · Medium: File Metadata & Static Analysis |
| Author | Medium @1200km |

License

Distributed under the GNU General Public License v3.0. See LICENSE for details.

Contributions are welcome; please open an issue or submit a pull request.