Malicious URL Classification

A short summary, originally written in Korean, appears at the end of this document.

A hybrid deep learning model that classifies URLs into 4 categories — Benign, Phishing, Malware, and Defacement — achieving ~92% overall accuracy.

Overview

Raw URL strings alone are insufficient for reliable threat detection. This project combines two complementary signals in a multi-input neural network:

LSTM branch — treats the URL as a character sequence, learning structural patterns (subdomain nesting, path depth, suspicious suffixes)
Handcrafted feature branch — encodes domain-knowledge signals a sequence model might overlook (entropy, IP address presence, special characters)

The two branches are concatenated and jointly trained end-to-end.
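As a rough illustration of what "treating the URL as a character sequence" can look like, the sketch below maps each character to an integer ID and pads to a fixed length. The vocabulary (printable ASCII), the reserved IDs, the length of 200, and the name `encode_url` are all assumptions made for this example; the notebook's actual tokenization may differ.

```python
import string

# Assumed vocabulary: printable ASCII. ID 0 is reserved for padding and
# ID 1 for any character outside the vocabulary.
PAD, UNK = 0, 1
VOCAB = {ch: i + 2 for i, ch in enumerate(string.printable)}
MAX_LEN = 200  # assumed fixed sequence length


def encode_url(url, max_len=MAX_LEN):
    """Map a URL to a fixed-length list of integer character IDs."""
    # Truncate long URLs, then right-pad short ones with PAD.
    ids = [VOCAB.get(ch, UNK) for ch in url[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


if __name__ == "__main__":
    print(encode_url("http://example.com/login", max_len=30))
```

With this scheme the embedding layer would need `len(VOCAB) + 2` rows, one per character plus the padding and unknown IDs.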

Model Architecture

[URL text sequence]          [Handcrafted features (9)]
        │                              │
Embedding → LSTM(64)           Dense(32, ReLU)
        │                              │
        └──────── Concatenate(96) ─────┘
                        │
                 Dense(64, ReLU)
                        │
                  Dropout(0.5)
                        │
               Output(4, Softmax)
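The diagram above could be expressed with the Keras functional API roughly as follows. Layer widths, the dropout rate, and the number of outputs come from the diagram; the vocabulary size, sequence length, embedding dimension, optimizer, and loss are assumptions chosen for illustration, and `build_model` is a hypothetical name rather than a function from the notebook.

```python
LSTM_UNITS = 64      # from the diagram
FEATURE_UNITS = 32   # from the diagram
CONCAT_WIDTH = LSTM_UNITS + FEATURE_UNITS  # 96, matching Concatenate(96)


def build_model(vocab_size=102, max_len=200, embed_dim=32,
                n_features=9, n_classes=4):
    """Two-input model: character-level LSTM branch plus dense feature branch."""
    # Imported here so the constants above can be used without TensorFlow.
    from tensorflow import keras
    from tensorflow.keras import layers

    # Branch 1: URL as a sequence of character IDs.
    url_in = keras.Input(shape=(max_len,), name="url_chars")
    x = layers.Embedding(vocab_size, embed_dim)(url_in)
    x = layers.LSTM(LSTM_UNITS)(x)

    # Branch 2: the nine handcrafted features.
    feat_in = keras.Input(shape=(n_features,), name="handcrafted")
    f = layers.Dense(FEATURE_UNITS, activation="relu")(feat_in)

    # Merge both branches and classify.
    merged = layers.Concatenate()([x, f])
    h = layers.Dense(64, activation="relu")(merged)
    h = layers.Dropout(0.5)(h)
    out = layers.Dense(n_classes, activation="softmax")(h)

    model = keras.Model(inputs=[url_in, feat_in], outputs=out)
    # Assumed training setup: integer class labels with Adam.
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Training such a model means passing both inputs together, for example `model.fit([url_ids, features], labels, ...)`, so that the branches are optimized jointly.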

Extracted features (URLFeatureAnalyzer)

Feature	Rationale
URL entropy	High entropy → random/obfuscated strings
URL length	Malicious URLs tend to be longer
Digit count	Excessive digits signal IP-based or generated URLs
Parameter count	Many params → potential injection or tracking
Subdomain depth	Deep nesting is a common phishing pattern
HTTP / HTTPS	Protocol as a weak signal
IP address presence	Direct-IP URLs bypass DNS — high-risk signal
`%20` presence	URL encoding often used to obscure payloads
`@` presence	Text before `@` is parsed as credentials, so the real host is whatever follows it; often used to disguise an attacker-controlled host
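The nine features in the table can be computed with the standard library alone. The sketch below is one possible reading of the table, not the notebook's `URLFeatureAnalyzer`: the function names, the feature order, the IPv4-only check, the definition of subdomain depth, and the encoding of the protocol as `1` for HTTPS are all assumptions.

```python
import math
import re
from collections import Counter
from urllib.parse import parse_qsl, urlsplit

IPV4_RE = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def shannon_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in Counter(text).values())


def extract_features(url):
    """Return the nine features from the table as a list of numbers."""
    # Prefix "//" when no scheme is present so urlsplit still finds the host.
    parts = urlsplit(url if "://" in url else "//" + url)
    host = parts.hostname or ""
    has_ip = bool(IPV4_RE.match(host))
    # Assumed definition: labels left of the last two (2 for a.b.example.com).
    subdomain_depth = 0 if has_ip else max(host.count(".") - 1, 0)
    return [
        shannon_entropy(url),                                 # URL entropy
        len(url),                                             # URL length
        sum(ch.isdigit() for ch in url),                      # digit count
        len(parse_qsl(parts.query, keep_blank_values=True)),  # parameter count
        subdomain_depth,                                      # subdomain depth
        int(parts.scheme == "https"),                         # 1 if HTTPS
        int(has_ip),                                          # IP address as host
        int("%20" in url),                                    # `%20` presence
        int("@" in url),                                      # `@` presence
    ]


if __name__ == "__main__":
    print(extract_features("http://paypal.com@192.168.0.1/login?id=1&next=%20x"))
```

In the example URL the parsed host is `192.168.0.1` rather than `paypal.com`, which is exactly the confusion the `@` feature is meant to flag.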

Results

Class	Precision	Recall	F1-Score
Benign	0.87	0.94	0.90
Defacement	1.00	0.99	0.99
Malware	0.97	0.87	0.92
Phishing	0.86	0.87	0.86
Overall	0.92	0.92	0.92

Defacement achieves near-perfect scores — its URL patterns (injected paths, foreign TLDs) are highly distinctive. Phishing is the hardest class, as attackers actively mimic legitimate URL structures.
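As a quick consistency check, the F1 column can be recomputed from the precision and recall columns, since F1 is their harmonic mean. The snippet below uses only the rounded values from the table, so it can confirm agreement to two decimals and nothing finer. The table does not say how the "Overall" row was averaged; the unweighted (macro) mean computed here is only one possibility.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the results table.
reported = {
    "Benign": (0.87, 0.94),
    "Defacement": (1.00, 0.99),
    "Malware": (0.97, 0.87),
    "Phishing": (0.86, 0.87),
}

f1_by_class = {name: round(f1_score(p, r), 2) for name, (p, r) in reported.items()}
macro_f1 = sum(f1_score(p, r) for p, r in reported.values()) / len(reported)

if __name__ == "__main__":
    print(f1_by_class)
    print(round(macro_f1, 2))
```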

Dataset

malicious_phish.csv — Kaggle

Class	Samples
Benign	30,000
Phishing	30,000
Defacement	30,000
Malware	23,645
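The README does not state how these counts were produced. They are consistent with capping each class at 30,000 rows, with Malware having fewer rows available, but that is an inference. Under that assumption, a per-class cap could be sketched as below; `cap_per_class`, the `(url, label)` row format, and the seed are hypothetical.

```python
import random
from collections import defaultdict


def cap_per_class(rows, cap=30_000, seed=42):
    """Keep at most `cap` randomly chosen (url, label) rows per label."""
    by_label = defaultdict(list)
    for url, label in rows:
        by_label[label].append((url, label))

    rng = random.Random(seed)  # fixed seed for reproducible sampling
    sampled = []
    for items in by_label.values():
        # Classes at or below the cap are kept in full.
        if len(items) > cap:
            items = rng.sample(items, cap)
        sampled.extend(items)

    rng.shuffle(sampled)  # avoid long runs of a single class
    return sampled
```

Capping the large classes keeps the majority class from dominating training while leaving smaller classes untouched.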

Environment

Python 3.10 (Google Colab)
TensorFlow / Keras, Pandas, NumPy, scikit-learn, Matplotlib, Seaborn

How to run

Upload malicious_phish.csv to Google Drive
Open the notebook (인공지능)_악성_URL_분류.ipynb in Colab
Mount Google Drive and set the dataset path
Run all cells
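The mount and path steps might look like the following in the first notebook cell. `drive.mount` is the standard Colab call; the folder `/content/drive/MyDrive` assumes the CSV sits at the top level of your Drive, and the helper names are made up for this sketch.

```python
from pathlib import Path


def mount_drive(mount_point="/content/drive"):
    """Mount Google Drive when running inside Colab; return False elsewhere."""
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:
        return False
    drive.mount(mount_point)
    return True


def dataset_path(folder="/content/drive/MyDrive", filename="malicious_phish.csv"):
    """Build the dataset path; adjust `folder` if the file lives in a subfolder."""
    return Path(folder) / filename


if __name__ == "__main__":
    if mount_drive():
        print("Dataset expected at", dataset_path())
```

The resulting path can then be passed to `pandas.read_csv` in the cell that loads the data.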

License

MIT — use freely for research and educational purposes.

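Summary (translated from Korean)

A deep learning project that uses an LSTM-based hybrid neural network to classify URLs into four categories: benign, phishing, malware, and defacement. It achieves roughly 92% overall accuracy.

Key idea: because raw URL strings alone make detection difficult, the model is a multi-input neural network that combines the URL character sequence (LSTM) with nine handcrafted features (entropy, IP address presence, special characters, and so on).

Results

Defacement: F1 0.99 (distinctive URL patterns allow near-perfect classification)
Phishing: F1 0.86 (the hardest class, because phishing URLs mimic legitimate ones)
Overall accuracy / precision / recall: 0.92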
