This repository was archived by the owner on Mar 3, 2025. It is now read-only.

Crawly

Crawly was created by Patryk 'UltiPro' Wójtowicz using Python.

The project is a web crawler and scraper that implements both breadth-first (BFS) and depth-first (DFS) search. It can be configured through options for the search method, time limit, search depth, whether to generate a full graph, and optional proxy server settings. The application collects only URLs and the contents of "a" tags, but the code can be adapted to specific needs in the "_process_page" function. During execution, the program launches a browser using the Playwright package. The browser navigates through web pages and, if necessary, pauses to let the user solve captchas and similar challenges. The output consists of a CSV file containing the URLs and "a" tag contents, as well as an HTML page with a graph representing the connections between websites.
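To illustrate how a single crawl loop can serve both search methods, here is a minimal sketch, not Crawly's actual code. The function name `crawl`, the `get_links` callback, and the toy `SITE` graph are all hypothetical; a real run would fetch pages through Playwright instead. The only difference between the two methods is which end of the frontier the next URL is taken from.

```python
import time
from collections import deque


def crawl(start_url, get_links, method="bfs", max_depth=10, time_limit=60.0):
    """Return URLs in visit order (hypothetical helper, not Crawly's API)."""
    deadline = time.monotonic() + time_limit
    frontier = deque([(start_url, 0)])  # entries are (url, depth)
    seen = {start_url}
    order = []
    while frontier and time.monotonic() < deadline:
        # BFS takes the oldest entry (queue); DFS takes the newest (stack).
        url, depth = frontier.popleft() if method == "bfs" else frontier.pop()
        order.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append((link, depth + 1))
    return order


# Toy link graph standing in for real pages (assumption for the example).
SITE = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["e"],
    "d": [],
    "e": [],
}


def toy_links(url):
    """Return the outgoing links of a page in the toy graph."""
    return SITE.get(url, [])


if __name__ == "__main__":
    print(crawl("a", toy_links, method="bfs"))
    print(crawl("a", toy_links, method="dfs"))
```

Marking a URL as seen when it is queued, rather than when it is visited, keeps the frontier free of duplicates. For DFS this gives a stack-based traversal whose order can differ slightly from a recursive one.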

Dependencies and Usage

Dependencies:

  • beautifulsoup4 4.13.3
  • bs4 0.0.2
  • fake-useragent 2.0.3
  • greenlet 3.1.1
  • narwhals 1.28.0
  • networkx 3.4.2
  • numpy 2.2.3
  • packaging 24.2
  • playwright 1.50.0
  • plotly 6.0.0
  • pyee 12.1.1
  • soupsieve 2.6
  • typing_extensions 4.12.2

Installation:

cd "/Crawly"

pip install -r requirements.txt

playwright install

Using the app

python main.py [url-address] [options]

| Option | Short | Description | Default Value |
| --- | --- | --- | --- |
| --method | -m | Search method | bfs |
| --time | -t | Execution time (s) | 60 |
| --depth | -d | Maximum search depth | 10 |
| --full_graph | -fg | Generate a full graph | False |
| --proxy_server | -ps | Proxy server IP/address | |
| --proxy_username | -pu | Proxy username | |
| --proxy_password | -pp | Proxy password | |
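The options in the table above could be declared with `argparse` roughly as follows. This is a sketch under assumptions, not the contents of main.py: the `build_parser` name, the `bfs`/`dfs` choices, the integer types, and treating `--full_graph` as a switch are guesses based on the table's defaults.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("url", help="start URL for the crawl")
    # Assumption: only "bfs" and "dfs" are accepted values.
    parser.add_argument("-m", "--method", choices=["bfs", "dfs"], default="bfs")
    parser.add_argument("-t", "--time", type=int, default=60,
                        help="execution time in seconds")
    parser.add_argument("-d", "--depth", type=int, default=10,
                        help="maximum search depth")
    # Assumption: the option is a flag that takes no value.
    parser.add_argument("-fg", "--full_graph", action="store_true",
                        help="generate a full graph")
    # Proxy options have no documented default, so None is used here.
    parser.add_argument("-ps", "--proxy_server", default=None)
    parser.add_argument("-pu", "--proxy_username", default=None)
    parser.add_argument("-pp", "--proxy_password", default=None)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

With this sketch, running the parser with only a URL yields the defaults from the table, and the proxy settings stay unset unless passed explicitly.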

Preview

Terminal Preview

CSV Preview

HTML Preview