PhishNet

PhishNet is an open-source .NET application written in C# whose purpose is to mass-harvest email addresses from the internet by crawling website content fetched over HTTP.

Installation and usage

Clone the GitHub repository and navigate to the PhishNet project folder, then run the console app using dotnet:

git clone https://github.com/rickd3ckard/PhishNet.git
cd PhishNet/PhishNet
dotnet run -- domain https://www.era.be/fr 

Commands

Here is a list of the available commands:

Command   Description                                         Argument         Type
domain    Scrape a single domain URL                          Domain URL       string
domains   Scrape multiple domain URLs from a text file list   Text file path   string
help      Display help for the application                    -                -

Modifiers

Here is a list of the available modifiers:

Modifier    Description                                       Argument            Type
-d          Max depth on a single website for the crawler     Max depth           int
-m          Max mails for a single website                    Max mails           int
-o          Custom path of the output file                    Path                string
-sd         Allow the crawling of subdomains                  Allowed?            bool
-t          Number of threads in the thread pool              Number of threads   int
-f          Filter for the allowed domain extensions          Domain extension    string
-username   Username for the SQL database                     Username            string
-password   Password for the SQL database                     Password            string
-database   SQL database name                                 Database            string
-address    Address of the SQL server                         Address             string

Max depth represents how deep the crawler will dig into the website. A depth of 1 scrapes only the landing page of the domain: internal links found on that page are saved, but not followed. A depth of 2 makes the crawler visit each internal link saved on the first page and gather a new list of internal links that will not be followed. In general, a depth of n recursively scrapes the links collected at depth n-1 until n is reached. A depth of -1 crawls recursively until no new internal links are found.
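The depth rule above can be sketched as a breadth-first loop. This is a minimal Python sketch, not the actual C# implementation; `fetch_links` stands in for the HTTP fetch and internal-link extraction:

```python
def crawl(start_url, max_depth, fetch_links):
    """Breadth-first crawl: depth 1 visits only the landing page,
    depth n follows the links collected at depth n-1, and a
    max_depth of -1 keeps going until no new internal links remain."""
    visited = set()
    frontier = [start_url]
    depth = 0
    while frontier and (max_depth == -1 or depth < max_depth):
        next_frontier = []
        for url in frontier:
            if url in visited:
                continue
            visited.add(url)
            # internal links found on this page, saved for the next level
            next_frontier.extend(fetch_links(url))
        frontier = next_frontier
        depth += 1
    return visited
```

With this shape, a depth of 1 leaves the first page's links in the frontier unused, exactly as described above.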

A sub domain is a part of a website that branches off from the main domain and functions as a separate host name. For example, given the domain era.be, the addresses blog.era.be, shop.era.be, and dev.era.be are all subdomains because they are derived from and associated with the main domain.
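The subdomain relationship used by the -sd modifier amounts to a host-suffix test. A small illustrative Python sketch (the function name is hypothetical, not part of the app):

```python
def is_subdomain(host, main_domain):
    """blog.era.be is a subdomain of era.be, but era.be itself
    and an unrelated host like notera.be are not."""
    return host != main_domain and host.endswith("." + main_domain)
```

The leading dot in the suffix check prevents an unrelated domain that merely ends with the same characters from being treated as a subdomain.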

The number of threads roughly represents the maximum number of domains being crawled in parallel. The crawler will detect and store external links found on the domain inside a domain queue. Once the crawling of the domain is completed, another domain is dequeued from the list and crawled - this process repeats indefinitely (until memory overflows). Each domain crawling routine is executed on a single thread. Increasing the number of threads in the thread pool allows multiple domains to be dequeued and crawled simultaneously until the thread pool is exhausted. When a thread finishes its task, it is returned to the thread pool and becomes available for another domain crawling routine.
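The queue-plus-pool loop can be sketched as follows. This is a hedged Python sketch rather than the real C# routine: `crawl_domain` is a stand-in for crawling one domain and returning the external domains it found, and unlike the real crawler (which blocks on the queue and runs indefinitely) this version stops once the queue drains so it can terminate:

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

def crawl_pool(seed_domains, crawl_domain, num_threads):
    """Each worker dequeues a domain, crawls it on its own thread,
    and enqueues the external domains discovered along the way."""
    domains = queue.Queue()
    for d in seed_domains:
        domains.put(d)
    crawled, lock = [], threading.Lock()

    def worker():
        while True:
            try:
                domain = domains.get_nowait()
            except queue.Empty:
                return  # the real crawler would block and wait here
            externals = crawl_domain(domain)  # crawl one domain
            with lock:
                crawled.append(domain)
            for ext in externals:
                domains.put(ext)  # discovered domains join the queue

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        for f in [pool.submit(worker) for _ in range(num_threads)]:
            f.result()
    return crawled
```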

The filter restricts the crawler to a specific domain extension. For example, if the filter value .be is provided, external links whose extension does not match the filter will not be enqueued.
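The -f check boils down to comparing the URL's host against the extension suffix. A minimal Python sketch (the function name is hypothetical):

```python
from urllib.parse import urlparse

def passes_filter(url, extension_filter):
    """Return True if the URL's host ends with the -f extension
    filter (e.g. '.be'); with no filter, every domain is allowed."""
    if not extension_filter:
        return True
    host = urlparse(url).hostname or ""
    return host.endswith(extension_filter)
```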

Command Examples

Here are a couple of commands showing different uses.


dotnet run -- domain https://www.era.be/fr

This is the most basic command. The crawler will scrape the provided domain with a depth of 1, will output emails to the program installation folder, and crawl all gathered external domains one by one indefinitely.


dotnet run -- domain https://www.era.be/fr -f .be -t 50 -d 3

The crawler will scrape the provided domain recursively with a depth of 3, only enqueue domains with the .be extension, and crawl up to 50 domains concurrently using the thread pool.


dotnet run -- domain https://www.era.be/fr -o c:/users/rick/desktop/emails.txt -sd false

The crawler will output the emails inside the specified path. Sub-domains derived from the provided domain will not be crawled.


dotnet run -- domain https://www.era.be/fr -username "u30869404_Scrappy" -password "NicePasswors123" -database "u30869404_HttpScraper" -address "srv1424.hstgr.io"

The crawler will output the emails inside the provided SQL database.

SQL database setup

Here is the SQL database setup required to properly store crawled mails:

Table: mails

Column    Type       Description
website   TINYTEXT   Stores website names (e.g. example.com)
mail      TINYTEXT   Stores associated email addresses (e.g. [email protected])

Table: visiteddomains

Column    Type       Description
domain    TINYTEXT   Stores domain names (e.g. example.com)
date      DATETIME   Stores the date and time the domain was visited

SQL Command

Here is the SQL command to create these tables:

CREATE TABLE mails (
    website TINYTEXT,
    mail TINYTEXT
);

CREATE TABLE visiteddomains (
    domain TINYTEXT,
    date DATETIME
);
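A quick way to sanity-check the schema is to run it against a local database. This sketch uses Python's sqlite3 purely as a stand-in (an assumption: the real app targets a networked SQL server via -address/-database, not SQLite), and the inserted row values are made up for illustration:

```python
import sqlite3

# sqlite3 stands in for the real SQL server; SQLite accepts the
# TINYTEXT and DATETIME type names and treats them as TEXT/NUMERIC.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE mails (website TINYTEXT, mail TINYTEXT)")
con.execute("CREATE TABLE visiteddomains (domain TINYTEXT, date DATETIME)")

# Store one crawled result (example values, not real crawler output).
con.execute("INSERT INTO mails VALUES (?, ?)",
            ("example.com", "info@example.com"))
rows = con.execute("SELECT website, mail FROM mails").fetchall()
```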
