WebVisCrawl Demo

Available on PyPI for your own demo!

pip install WebVisCrawl

The pip package installs two executables: webviscrawler and webvisualiser. Run either with --help for usage information.
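
For example:

webviscrawler --help
webvisualiser --help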

Source on GitHub

git clone https://github.com/atomtables/WebVisCrawl

Check out some ready-made demos

Don't feel like waiting 10 hours for 80,000 hits across different websites? Here are some ready-made demos generated from info.cern.ch:

* 50,000+ samples not included due to file size and GitHub upload limits.

Running

Create a venv and install the dependencies from requirements.txt (a typical setup is sketched below). Then run:

python main.py <START_URL>

Or run with -h for help.
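
For reference, a typical venv setup on macOS or Linux might look like:

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt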

To Visualize

Run:

python vis.py --head <START_URL>

The generated HTML page should open in your web browser. You can also run with -h for help.

Speed Tests

Machine: 13-inch MacBook Pro (M2), under maximum load, with IntelliJ not running.

Original Implementation (Multithreading only)

Example: https://hackclub.com crawled to a depth of three levels:

  • 1 process:
    • 76.88s, 4501 nodes, 7663 edges
    • 89.53s, 4792 nodes, 8058 edges
    • 92.21s, 5555 nodes, 8500 edges
    • 59.06s, 4405 nodes, 7052 edges
    • 90.55s, 4159 nodes, 7283 edges
  • 2 processes:
    • 50.07s, 4977 nodes, 7963 edges
    • 37.63s, 2322 nodes, 3067 edges (exception in thread)
    • 40.19s, 956 nodes, 1541 edges
    • 38.43s, 3655 nodes, 6203 edges
    • 36.08s, 1285 nodes, 1786 edges
  • 4 processes: (data not listed)
  • 8 processes: (data not listed)

Redesigned to use queues and centralized message handling, which avoids race conditions and improves accuracy (at some cost in raw speed). The crawler now also uses Bloom filters for fast URL deduplication.
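
As a rough illustration of that design (a minimal sketch, not the project's actual code: the fetch-and-parse step is stubbed out and the Bloom filter parameters are arbitrary), worker processes pull URLs from a shared task queue and report results to a single coordinator process, which owns the Bloom filter so deduplication never races:

    import hashlib
    from multiprocessing import Process, Queue

    class BloomFilter:
        """Tiny Bloom filter: k bit positions per item, derived from blake2b."""
        def __init__(self, size_bits=1 << 20, hashes=4):
            self.size = size_bits
            self.k = hashes
            self.bits = bytearray(size_bits // 8)

        def _positions(self, item):
            # Derive k independent bit positions by salting the hash input.
            for i in range(self.k):
                digest = hashlib.blake2b(f"{i}:{item}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.size

        def add(self, item):
            for p in self._positions(item):
                self.bits[p // 8] |= 1 << (p % 8)

        def __contains__(self, item):
            return all(self.bits[p // 8] & (1 << (p % 8))
                       for p in self._positions(item))

    def worker(tasks, results):
        # Pull a URL, "fetch" it, and report the links found. The real
        # fetch-and-parse step is stubbed out here.
        while True:
            url = tasks.get()
            if url is None:        # sentinel: shut down
                break
            links = []             # ...fetch url, extract outgoing links...
            results.put((url, links))

    def crawl(start_url, n_workers=4):
        tasks, results = Queue(), Queue()
        seen = BloomFilter()       # lives only in this process: no races
        workers = [Process(target=worker, args=(tasks, results))
                   for _ in range(n_workers)]
        for w in workers:
            w.start()
        seen.add(start_url)
        tasks.put(start_url)
        pending = 1                # URLs handed out but not yet reported back
        while pending:
            url, links = results.get()   # centralized message handling
            pending -= 1
            for link in links:
                if link not in seen:
                    seen.add(link)
                    tasks.put(link)
                    pending += 1
        for _ in workers:
            tasks.put(None)
        for w in workers:
            w.join()

    if __name__ == "__main__":
        crawl("https://hackclub.com")

Note that a Bloom filter can return false positives, so an unseen URL is occasionally skipped; that trade-off buys constant-memory membership tests no matter how many URLs the crawl touches.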

New Implementation (with Queues & Bloom Filters)

Example: https://hackclub.com crawled to a depth of three levels:

  • 1 process:
    • 252% CPU, 24.10s, 5786 nodes, 13376 edges
    • 304% CPU, 24.59s, 5134 nodes, 12645 edges
    • 309% CPU, 21.09s, 5153 nodes, 12316 edges
    • 328% CPU, 21.20s, 5226 nodes, 13191 edges
    • 349% CPU, 24.54s, 5709 nodes, 12572 edges
  • 2 processes:
    • 393% CPU, 15.29s, 5165 nodes, 11732 edges
    • 388% CPU, 19.64s, 4559 nodes, 10296 edges*
    • 392% CPU, 18.50s, 5598 nodes, 12410 edges
    • 339% CPU, 19.00s, 4754 nodes, 9577 edges*
    • 354% CPU, 17.19s, 5231 nodes, 11774 edges
  • 4 processes:
    • 501% CPU, 16.34s, 5149 nodes, 11129 edges
    • 476% CPU, 16.98s, 4681 nodes, 9674 edges*
    • 493% CPU, 16.42s, 5251 nodes, 11402 edges
    • 481% CPU, 17.22s, 5760 nodes, 11717 edges
    • 482% CPU, 15.55s, 4888 nodes, 11470 edges*
  • 8 processes:
    • 577% CPU, 15.24s, 5320 nodes, 10127 edges
    • 610% CPU, 18.26s, 5665 nodes, 11293 edges
    • 594% CPU, 16.19s, 5335 nodes, 11936 edges
    • 578% CPU, 15.22s, 4312 nodes, 8807 edges*
    • 578% CPU, 15.35s, 5811 nodes, 13200 edges

*Dips in performance are likely due to rate limiting.

DISCLAIMER

While this project makes use of web crawling, it is not representative of all use cases of web crawling. It does not respect robots.txt files, although it does take measures to avoid aggressive crawling. Use this project at your own risk, for educational purposes only; you alone are liable if you cause trouble.