String Analyzer extracts and analyzes printable strings from binary files. It is designed for malware analysts, reverse engineers, and forensics investigators who need to quickly surface URLs, IPs, registry keys, API names, and other indicators from executables, memory dumps, or disk images—and optionally generate an AI-ready analysis prompt.
- Zero runtime dependencies (Python standard library only).
- Single entry point: one CLI with batch and interactive modes.
- Library-friendly API: use `analyze_file()` or lower-level functions in your own scripts.
📖 Practical guide (Medium) — step-by-step usage, workflows, and examples.
- Features
- Installation
- Quick start
- Usage
- Pattern categories
- Programmatic API
- Examples
- Configuration and limits
- Security and safety
- Development
- License
| Feature | Description |
|---|---|
| String extraction | ASCII and UTF-16LE (Windows PE); configurable min length and max_bytes; chunked read for large files. |
| Entropy | Shannon entropy (chunked when max_bytes set); high entropy suggests packed/encrypted content. |
| Pattern detection | Strict IPv4 (0–255), IPv6 (full and abbreviated), URLs (http/https/ftp/file/ws/wss), obfuscated URLs (hxxp, etc.), emails, MAC addresses, registry keys, system paths, DLLs, 300+ Windows APIs, CMD/PowerShell, obfuscation patterns. |
| Embedded extraction | URLs, IPs, emails, MACs found inside long strings (not only whole-line matches). |
| Decoding | Base64 (standard and URL-safe) and hex; decoded candidates in report. |
| Suspicious keywords | Extended set: malware, miner, steal, persist, evasion, etc., plus .NET namespaces. |
| Sensitive mode | --sensitive: lower obfuscation thresholds and more keywords for stricter triage. |
| Output formats | Unfiltered dump, categorized report, or AI-ready markdown prompt. |
| CLI & API | Full CLI (--encoding, --sensitive, --no-embedded); programmatic analyze_file(); no global state. |
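The string-extraction feature can be pictured with a minimal sketch. It assumes the whole input fits in memory and uses a hypothetical helper, `extract_printable`, which is not part of the package; the real tool reads in chunks and may apply different rules.

```python
import re
from typing import Set


def extract_printable(data: bytes, min_length: int = 4) -> Set[str]:
    """Return printable ASCII and UTF-16LE strings of at least min_length chars.

    Hypothetical helper for illustration; not the package's extract_strings().
    """
    # ASCII: runs of printable bytes (space through tilde).
    ascii_re = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    # UTF-16LE: printable ASCII characters, each followed by a NUL byte.
    utf16_re = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_length)

    found = {m.group().decode("ascii") for m in ascii_re.finditer(data)}
    found |= {m.group().decode("utf-16-le") for m in utf16_re.finditer(data)}
    return found
```

Returning a set mirrors the documented behavior of yielding unique strings; sort the result if a stable order is needed.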
Requirements: Python 3.8 or newer.
git clone https://github.com/anpa1200/String-Analyzer-.git && cd String-Analyzer-
python3 -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -e .

After installation you get the string-analyzer command. From the project root you can also run:

python -m string_analyzer

Development (optional): pip install -e ".[dev]" adds pytest and ruff for tests and linting.
# Categorized report (default)
string-analyzer /path/to/binary -o report.txt
# All extracted strings, no categorization
string-analyzer /path/to/binary --unfiltered -o strings.txt
# AI-ready analysis prompt
string-analyzer /path/to/binary --ai-prompt -o prompt.md
# Interactive: prompt for file and output type
string-analyzer

| Option | Description |
|---|---|
| `file` | Path to the binary file. Omit to run interactive mode. |
| `-o, --output PATH` | Output file (default: `<basename>_strings.txt`). |
| `--min-length N` | Minimum string length to extract (default: 4). |
| `--max-bytes N` | Stop reading after N bytes (safety for very large files). |
| `--unfiltered` | Output all extracted strings, one per line (no categories). |
| `--filtered` | Output categorized report (default when not using `--unfiltered` or `--ai-prompt`). |
| `--ai-prompt` | Generate markdown prompt for an AI assistant. |
| `--analyze-with {gemini,codex}` | Send categorized prompt to gemini-cli or codex-cli and print the AI analysis. Saves the prompt to `-o`; use `--ai-output` to save the AI response. |
| `--ai-output PATH` | Save the AI response to this file (when using `--analyze-with`). |
| `--encoding {ascii,utf16,both}` | Extract ASCII only, UTF-16LE only, or both (default: both). |
| `--sensitive` | Lower obfuscation thresholds; more suspicious keywords. |
| `--no-embedded` | Do not extract URLs/IPs/emails from inside long strings. |
| `-i, --interactive` | Force interactive mode (prompt for file and options). |
| `-q, --quiet` | Suppress non-error messages. |
| `-v, --verbose` | Verbose logging. |
| `--version` | Show version. |
| `--help` | Show help. |
- Unfiltered (`--unfiltered`): sorted list of all extracted strings. Use for grepping or feeding into other tools.
- Filtered (default): categorized report with entropy, plus sections such as URLS, IPS, WINDOWS_API_COMMANDS, DLLS, OBFUSCATED, etc.
- AI prompt (`--ai-prompt`): same categories in a markdown prompt asking an AI to analyze behavior and functionality (e.g. for malware triage).
The --analyze-with option sends the categorized string report directly to an AI CLI so you get an analysis in one command instead of copying a prompt by hand.
- What it does: After extracting and categorizing strings (URLs, IPs, APIs, DLLs, obfuscation, etc.), the tool builds the same markdown prompt used by `--ai-prompt`, writes it to the path given by `-o` (so you can keep or reuse it), then pipes that prompt into the chosen CLI. The AI’s reply is printed to the terminal; you can save it with `--ai-output PATH`.
- Values: `gemini` uses gemini-cli (looks for `gemini` or `gemini-cli` on your PATH); `codex` uses Codex CLI (`codex exec -` with the prompt on stdin).
- Requirements: You must have one of these installed and on your PATH: Gemini CLI (e.g. `npm i -g @google/generative-ai-cli`) or Codex CLI. The tool does not call cloud APIs itself; it only invokes the local CLI, which handles authentication and the model.
- Example:

string-analyzer suspect.exe --analyze-with gemini -o prompt.txt --ai-output analysis.md

This saves the prompt to `prompt.txt`, sends it to Gemini, and writes the AI’s analysis to `analysis.md`.
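The "find a local CLI, then pipe the prompt to it on stdin" flow can be sketched as follows. The helper names `find_cli` and `pipe_prompt` and the timeout value are assumptions made for illustration; this is not the tool's actual implementation.

```python
import shutil
import subprocess
from typing import Optional, Sequence


def find_cli(candidates: Sequence[str]) -> Optional[str]:
    """Return the full path of the first candidate found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


def pipe_prompt(command: Sequence[str], prompt: str, timeout: float = 300.0) -> str:
    """Run command, feed prompt on stdin, and return what it wrote to stdout.

    Raises subprocess.CalledProcessError if the command exits non-zero.
    """
    completed = subprocess.run(
        list(command),
        input=prompt,
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return completed.stdout
```

A caller might resolve the executable with `find_cli(["gemini", "gemini-cli"])` and report a clear error when it returns `None`, rather than letting the subprocess call fail.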
Run string-analyzer with no file argument (or use string-analyzer -i). The tool will:
- Ask for the file path.
- Ask whether to output all strings (unfiltered) or a filtered report.
- If filtered: ask whether to generate an AI prompt or a normal report.
- Ask for the output file path (with a default suggestion).
Interactive mode limits input to 50 MB by default to avoid accidental resource use.
Strings are classified into the following categories (empty categories are omitted from output):
| Category | Description |
|---|---|
| `WINDOWS_API_COMMANDS` | Known Windows API function names (300+). |
| `DLLS` | Strings matching typical DLL names (e.g. *.dll). |
| `URLS` | HTTP/HTTPS and similar URLs. |
| `IPS` | IPv4 addresses. |
| `IPV6` | IPv6 addresses. |
| `EMAILS` | Email-like strings. |
| `WINDOWS_REGISTRY_KEYS` | Registry path patterns. |
| `POWERSHELL_COMMANDS` | PowerShell cmdlets/commands. |
| `CMD_COMMANDS` | CMD shell commands. |
| `FILES` | File path / filename patterns. |
| `SYSTEM_PATHS` | System directory paths. |
| `OBFUSCATED` | Patterns suggesting obfuscation (e.g. h[.]xxp, dotted IPs). |
| `DECODED_BASE64` | Strings that successfully decode from Base64 to printable text. |
| `DECODED_HEX` | Strings that successfully decode from hex to printable text. |
| `SUSPICIOUS_KEYWORDS` | Substrings associated with malware (e.g. key terms). |
| `SUSPICIOUS_DOTNET` | .NET-related suspicious namespaces/keywords. |
| `MAC_ADDRESSES` | MAC addresses (e.g. 00:1A:2B:3C:4D:5E). |
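To illustrate what "strict" IPv4 matching and embedded extraction can look like, here is a small sketch. The regex and the names `IPV4_RE`, `refang`, and `find_ipv4` are assumptions for illustration only; the tool's own patterns may differ.

```python
import re
from typing import List

# Each octet is limited to 0-255, so "999.1.1.1" is rejected.
_OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"
# The lookarounds stop a match from starting or ending inside a longer
# dotted number such as "1.2.3.4.5".
IPV4_RE = re.compile(r"(?<![\d.])" + r"\.".join([_OCTET] * 4) + r"(?![\d.])")


def refang(text: str) -> str:
    """Undo two common defanging tricks: hxxp -> http and [.] -> ."""
    text = re.sub(r"hxxp(s?)://", r"http\1://", text, flags=re.IGNORECASE)
    return text.replace("[.]", ".")


def find_ipv4(text: str) -> List[str]:
    """Return strict IPv4 addresses found anywhere inside text."""
    return IPV4_RE.findall(text)
```

Because `find_ipv4` searches inside the string instead of requiring a whole-line match, it also picks up an address embedded in a longer command line or URL.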
The tool also computes file entropy. Combined with a low count of “useful” patterns (APIs, DLLs, CMD/PowerShell), high entropy can indicate a packed or obfuscated binary; this is noted in the report and in the AI prompt.
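For reference, Shannon entropy over bytes is H = -Σ p(b) log2 p(b), which ranges from 0 (one repeated byte) to 8 bits per byte (uniformly random data). The sketch below computes it chunk by chunk with an optional byte cap; `stream_entropy` and its parameters are hypothetical names, not the package's `compute_file_entropy`.

```python
import math
from collections import Counter
from typing import BinaryIO, Optional


def stream_entropy(stream: BinaryIO, max_bytes: Optional[int] = None,
                   chunk_size: int = 1 << 20) -> float:
    """Shannon entropy in bits per byte (0.0 to 8.0), read chunk by chunk."""
    counts: Counter = Counter()
    total = 0
    while max_bytes is None or total < max_bytes:
        want = chunk_size if max_bytes is None else min(chunk_size, max_bytes - total)
        chunk = stream.read(want)
        if not chunk:
            break
        counts.update(chunk)  # iterating bytes yields ints 0..255
        total += len(chunk)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

Only the 256 byte counters are kept in memory, so the cost is independent of file size; open the file with `open(path, "rb")` and pass the handle in.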
Use the package in your own Python code:
from string_analyzer import (
    analyze_file,
    extract_strings,
    detect_patterns,
    compute_file_entropy,
    generate_normal_output,
    generate_ai_prompt,
    shannon_entropy,
)
from string_analyzer.analyzer import (
    is_likely_obfuscated,
    is_mostly_printable,
    try_base64_decode,
    try_hex_decode,
)

result = analyze_file(
    "/path/to/binary",
    min_length=4,
    max_bytes=None,
    encoding="both",          # "ascii", "utf16", or "both"
    extract_embedded=True,    # find URLs/IPs inside long strings
    sensitive=False,          # True: lower obfuscation thresholds
)
# result["file"], result["entropy"], result["strings"], result["patterns"], result["obfuscated"]

from pathlib import Path

path = Path("sample.bin")
entropy = compute_file_entropy(path)
strings = extract_strings(path, min_length=4, max_bytes=10_000_000)
patterns = detect_patterns(strings)  # New dict every time; no global state
obfuscated = is_likely_obfuscated(patterns, entropy)
report = generate_normal_output(patterns, entropy, obfuscated)
# Or: prompt_text = generate_ai_prompt(patterns, entropy, obfuscated)

| Function | Description |
|---|---|
| `analyze_file(path, min_length=4, max_bytes=None)` | Full analysis; returns dict with file, entropy, strings, patterns, obfuscated. |
| `extract_strings(path, min_length=4, max_bytes=None)` | Extract unique printable strings; returns set[str]. |
| `compute_file_entropy(path)` | Shannon entropy of file bytes. |
| `shannon_entropy(s)` | Shannon entropy of a string. |
| `detect_patterns(strings)` | Categorize strings; returns new dict[str, set[str]]. |
| `is_likely_obfuscated(patterns, file_entropy)` | Heuristic: few “useful” patterns and entropy > threshold. |
| `generate_normal_output(patterns, entropy, obfuscated)` | Formatted filtered report text. |
| `generate_ai_prompt(patterns, entropy, obfuscated)` | Markdown prompt text for AI analysis. |
| `is_mostly_printable(s, threshold=0.9)` | Whether the string is mostly printable ASCII. |
| `try_base64_decode(s)` | Decode Base64 if valid and printable; else None. |
| `try_hex_decode(s)` | Decode hex if valid and printable; else None. |
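The "decode if valid and printable, else None" contract of the two decoding helpers can be approximated as below. This is a sketch under assumptions: `mostly_printable`, `decode_base64_text`, and `decode_hex_text` are hypothetical stand-ins, and the package's own functions may use different validity rules and thresholds.

```python
import base64
import binascii
import string
from typing import Optional

_PRINTABLE = set(string.printable.encode("ascii"))


def mostly_printable(data: bytes, threshold: float = 0.9) -> bool:
    """True if at least `threshold` of the bytes are printable ASCII."""
    if not data:
        return False
    return sum(byte in _PRINTABLE for byte in data) / len(data) >= threshold


def decode_base64_text(s: str) -> Optional[str]:
    """Decode standard or URL-safe Base64; return text only if mostly printable."""
    # Map the URL-safe alphabet to the standard one and restore stripped padding.
    normalized = s.strip().replace("-", "+").replace("_", "/")
    normalized += "=" * (-len(normalized) % 4)
    try:
        raw = base64.b64decode(normalized, validate=True)
    except (binascii.Error, ValueError):
        return None
    return raw.decode("ascii", errors="replace") if mostly_printable(raw) else None


def decode_hex_text(s: str) -> Optional[str]:
    """Decode a hex string; return text only if mostly printable."""
    try:
        raw = bytes.fromhex(s.strip())
    except ValueError:
        return None
    return raw.decode("ascii", errors="replace") if mostly_printable(raw) else None
```

The printability check matters because almost any short alphanumeric string is valid Base64; without it, ordinary identifiers would flood the decoded categories.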
Malware triage (get an AI prompt for a sample):

string-analyzer suspect.exe --ai-prompt -o triage_prompt.md
# Then paste triage_prompt.md into your AI assistant.

Large file (limit read size and get a filtered report):

string-analyzer memory.dump --max-bytes 100000000 -o report.txt

Script (use the API and print only URLs and IPs):

from string_analyzer import analyze_file
r = analyze_file("sample.bin")
for s in r["patterns"].get("URLS", []):
    print(s)
for s in r["patterns"].get("IPS", []):
    print(s)

Longer strings only:

string-analyzer binary --min-length 8 -o long_strings.txt

Maximum sensitivity (UTF-16 + embedded URLs + lower obfuscation bar):

string-analyzer suspect.exe --encoding both --sensitive -o report.txt

Send to Gemini or Codex for AI analysis (requires gemini-cli or codex on PATH):

string-analyzer suspect.exe --analyze-with gemini -o prompt.txt --ai-output analysis.md
string-analyzer suspect.exe --analyze-with codex --ai-output analysis.md

- Minimum string length: `--min-length` (default 4). Larger values reduce noise and speed up analysis.
- Maximum bytes read: `--max-bytes`. Omit for no limit; set it for very large files to avoid high memory use.
- Obfuscation heuristic: Implemented using `MIN_USEFUL_COUNT` (default 10) and `ENTROPY_THRESHOLD` (default 5.0) in `string_analyzer.patterns`. A file is flagged as likely obfuscated when the number of “useful” patterns (Windows API, DLLs, CMD, PowerShell) is below the count threshold and file entropy is above the entropy threshold.
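The obfuscation rule described above reduces to a two-condition check. The sketch below restates it with the documented defaults; the function name `looks_obfuscated` and the assumption that exactly these four categories count as "useful" are illustrative, not taken from the package source.

```python
from typing import Dict, Set

MIN_USEFUL_COUNT = 10      # documented default
ENTROPY_THRESHOLD = 5.0    # documented default
# Assumption: the "useful" categories are the four named in the description.
USEFUL_CATEGORIES = (
    "WINDOWS_API_COMMANDS",
    "DLLS",
    "CMD_COMMANDS",
    "POWERSHELL_COMMANDS",
)


def looks_obfuscated(patterns: Dict[str, Set[str]], file_entropy: float) -> bool:
    """Flag a file with few useful patterns AND high byte entropy."""
    useful = sum(len(patterns.get(name, ())) for name in USEFUL_CATEGORIES)
    return useful < MIN_USEFUL_COUNT and file_entropy > ENTROPY_THRESHOLD
```

Requiring both conditions keeps ordinary compressed resources inside a normal executable (high entropy, but plenty of API and DLL names) from being flagged.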
- Input files: String Analyzer only reads the file and extracts printable strings; it does not execute or interpret code. Still, avoid running it on untrusted binaries in a sensitive environment without proper isolation.
- Large files: Use `--max-bytes` (or the `max_bytes` parameter in the API) to cap how much is read; interactive mode uses a 50 MB default.
- Output: Reports may contain URLs, IPs, and other indicators. Handle output according to your security and privacy policies.
pip install -e ".[dev]"
ruff check string_analyzer tests
pytest tests/ -v

CI runs on push/PR: Ruff lint and pytest on Python 3.8, 3.10, and 3.12.
Documentation: Practical guide (Medium) · docs/DOCUMENTATION.md (patterns, heuristics, workflows)
| Resource | Link |
|---|---|
| String-Analyzer (this repo) | GitHub · Medium: String Analyzer Guide |
| Static-malware-Analysis-Orchestrator | GitHub — one-command pipeline (triage, strings, PE imports, unpack) · Medium: Full workflow |
| PE-Import-Analyzer | GitHub · Medium: PE Import Analyzer Guide |
| Unpacker | GitHub · Medium: Unpacker Guide |
| Basic-File-Information-Gathering-Script | GitHub · Medium: File Metadata & Static Analysis |
| Author | Medium @1200km |
Distributed under the GNU General Public License v3.0. See LICENSE for details.
Contributions are welcome; please open an issue or submit a pull request.