A really nice web crawler that focuses on branching out across the internet rather than harvesting all your data and selling it to some company that will use it to train an AI model.
pip install WebVisCrawl
The pip package exposes two executables: webviscrawler and webvisualiser. Run either with --help for usage information.
git clone https://github.com/atomtables/WebVisCrawl
Don't feel like waiting 10 hours for 80,000 hits across different websites? Here are some ready-made demos crawled from info.cern.ch:
* 50,000+ samples not included due to file size and GitHub upload limits.
Create a virtual environment and install the dependencies from requirements.txt. Then run:
python main.py <START_URL>
Or run with -h for help.
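The steps above can be sketched as a shell session. The crawl command and start URL follow the usage shown here; the virtual-environment name (.venv) and the example start URL are illustrative, not prescribed by the project:

```shell
# From a clone of the repository (assumes requirements.txt at the repo root).
python3 -m venv .venv
. .venv/bin/activate
pip install -r requirements.txt

# Start crawling from a seed URL of your choice.
python main.py https://example.com
```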
Run:
python vis.py --head <START_URL>
The HTML should open in your web browser. You can also run with -h for help.
Machine: 13-inch MacBook Pro (M2) under maximum load, without IntelliJ running.
Example: https://hackclub.com to three levels:
Redesigned to use queues and centralized message handling to avoid race conditions and improve accuracy (at some speed cost). Now includes Bloom filters for fast URL deduplication.
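The combination described above can be sketched in plain Python: a Bloom filter answers "have I seen this URL?" cheaply and without false negatives, and a single coordinator owns both the filter and the frontier queue so workers never race on shared state. This is a minimal illustration, not WebVisCrawl's actual implementation; the sizes and helper names are made up for the example:

```python
import hashlib
import queue

class BloomFilter:
    """Tiny Bloom filter for fast, memory-cheap URL deduplication.

    False positives are possible (a URL may be wrongly reported as seen);
    false negatives are not. Sizes here are illustrative only.
    """
    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive k independent bit positions by salting the hash input.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

def enqueue_if_new(url, frontier, seen):
    """Centralized admission: called only by the coordinator thread, so the
    check-then-add on the shared filter cannot race with other workers."""
    if url in seen:
        return False
    seen.add(url)
    frontier.put(url)
    return True
```

For example, offering the same URL twice queues it only once: the first `enqueue_if_new("https://hackclub.com", frontier, seen)` returns True and puts it on the queue, the second returns False.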
Example: https://hackclub.com to three levels:
*The dip in performance is likely due to rate limiting.
While this project makes use of web crawling, it is not representative of all use cases of web crawling. It does not respect robots.txt files, although it takes measures to avoid aggressive crawling. Use this project at your own risk, for educational purposes only; if you cause trouble, you alone are liable.