A simple web crawler, written with pentesting in mind, with a few hacks for smart crawling
Recursively crawls a given website up to a specified depth, extracting all hrefs from the same domain (or subdomains, if specified)
Finds all the input and POST forms on the crawled webpages
Reduces the total number of requests sent by crawling each unique parameter only once (explained below)
Of course, there are a lot of other crawlers available on the internet (Burp being the best, imo), but they all have the problem of duplicate parameters, which sometimes puts them in a position of infinite crawling.
For example, all of us have faced URLs like site.com/?id=1, where the "id" parameter can take a huge number of values; it could go up to "id=99999" or more. Other crawlers would visit every single one of those pages, treating each of them as a unique URL, which can generate an enormous amount of traffic and effectively infinite crawling (which slows down the overall crawl).
This crawler was written for that specific reason: it detects duplicate parameters and visits each unique parameter only once. Thus, it can crawl very large websites in a matter of minutes.
The main aim of this crawler is to gather as many GET and POST parameters as possible in a short amount of time. It does this by reducing the number of URLs it has to visit, visiting only unique parameters, which drastically cuts down the total number of URLs to crawl.
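The deduplication idea described above can be sketched as follows. This is a minimal illustration, not the crawler's actual code: each URL is keyed by its host, path, and the *set of parameter names* (ignoring values), so that `?id=1` and `?id=99999` count as the same page:

```python
from urllib.parse import urlsplit, parse_qsl

def crawl_signature(url):
    """Reduce a URL to (host, path, parameter-name set), ignoring parameter values."""
    parts = urlsplit(url)
    names = frozenset(name for name, _ in parse_qsl(parts.query))
    return (parts.netloc, parts.path, names)

def dedupe(urls):
    """Keep only the first URL seen for each signature."""
    seen, unique = set(), []
    for url in urls:
        sig = crawl_signature(url)
        if sig not in seen:
            seen.add(sig)
            unique.append(url)
    return unique

urls = [
    "http://site.com/?id=1",
    "http://site.com/?id=2",        # same parameter, different value -> skipped
    "http://site.com/?id=99999",    # skipped too
    "http://site.com/?page=about",  # new parameter name -> crawled
]
print(dedupe(urls))  # ['http://site.com/?id=1', 'http://site.com/?page=about']
```

With this scheme, a thousand `?id=...` variants collapse into a single request, which is where the speedup on large sites comes from.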
Just clone this repo and install the requirements (or run the following commands):
git clone https://github.com/drigg3r/SnCrawler.git
cd SnCrawler/
pip install -r requirements.txt
To see all the available options, just run the file with:
$ python SnCrawl.py --help
usage: SnCrawl.py [-h] [-w "https://domain.com/"] [-d DEPTH]
                  [-c "cooke1=val1; cookie2=val2"] [--subdomains]
                  [-e "http://domain.com/logout"] [-v]
                  [-o /home/user/saveLocation.txt]

optional arguments:
  -h, --help            show this help message and exit
  -d DEPTH, --depth DEPTH
                        How many layers deep to crawl(defaults to 3)
  -c "cooke1=val1; cookie2=val2", --cookie "cooke1=val1; cookie2=val2"
                        The cookies to use(if doing authenticated crawling)
  --subdomains          To include subdomains
  -e "http://domain.com/logout", --exclude "http://domain.com/logout"
                        The url to exclude from being crawled(like logout page)
  -v, --verbose         To display verbose output
  -o /home/user/saveLocation.txt, --output /home/user/saveLocation.txt
                        The output file where you want to write the scraped URL's(in Json)

required arguments:
  -w "http(s)://domain.com/", --website "http(s)://domain.com/"
                        The website you want to crawl
Here is how you can do a simple scraping of a website with a depth of 2:
python SnCrawl.py --website "http://domainToCrawl.com" --depth 2
You can also specify if you want to include subdomains:
python SnCrawl.py -w "http://domainToCrawl.com" --subdomains #It will also crawl the subdomains now
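With subdomains enabled, the in-scope check typically boils down to a suffix match on the hostname. A rough sketch of that logic (an assumption about how such a check works, not SnCrawl's actual code):

```python
from urllib.parse import urlsplit

def in_scope(url, root="domainToCrawl.com", include_subdomains=True):
    """Accept the root domain itself, and optionally any of its subdomains."""
    host = urlsplit(url).netloc
    if host == root:
        return True
    # The "." prefix prevents lookalikes such as evil-domainToCrawl.com matching.
    return include_subdomains and host.endswith("." + root)

print(in_scope("http://blog.domainToCrawl.com/post", include_subdomains=True))   # True
print(in_scope("http://blog.domainToCrawl.com/post", include_subdomains=False))  # False
```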
By default, it will display all the scraped URLs and POST parameters in the terminal itself, which can get pretty messy sometimes (especially for larger sites).
To cope with that, there is the -o option, which will write all the scraped URLs to the specified output file as JSON (for easier parsing during later use).
python SnCrawl.py --website "http://domainToCrawl.com" -o "/home/user/output.txt" #Will write all URLs to /home/user/output.txt
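Since the output is JSON, later tooling can load it directly. As a sketch, assuming the file contains a JSON array of URL strings — the tool's actual schema may differ, so inspect your own output first:

```python
import json
from urllib.parse import urlsplit

# Hypothetical output contents -- the real schema may differ.
sample = '["http://domainToCrawl.com/", "http://domainToCrawl.com/?id=1&sort=asc"]'

urls = json.loads(sample)  # for a real file: json.load(open("/home/user/output.txt"))

# Keep only URLs that carry GET parameters, e.g. to feed into a fuzzer later.
with_params = [u for u in urls if urlsplit(u).query]
print(with_params)  # ['http://domainToCrawl.com/?id=1&sort=asc']
```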
For cookies, you can use the -c option. You can copy them directly from Burp (or any other intercepting proxy).
python SnCrawl.py -w "https://domainToCrawl.com" -c "cookie1=val1 ; cookie2=val2" #Will send all requests with the cookie values
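Under the hood, a cookie string like that just needs to be split into name/value pairs before being attached to each request. A minimal sketch of such parsing (the actual parsing inside SnCrawl may differ), tolerant of the stray spaces and trailing semicolons you get when copying from a proxy:

```python
def parse_cookies(cookie_string):
    """Turn 'cookie1=val1 ; cookie2=val2' into a dict of cookie names to values."""
    cookies = {}
    for pair in cookie_string.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # skip empty segments, e.g. from a trailing ';'
        name, _, value = pair.partition("=")
        cookies[name.strip()] = value.strip()
    return cookies

print(parse_cookies("cookie1=val1 ; cookie2=val2"))
# {'cookie1': 'val1', 'cookie2': 'val2'}
```

The resulting dict is in the shape most HTTP client libraries (e.g. requests) accept for per-request cookies.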
Sometimes there are URLs you wouldn't want the crawler to visit, like logout pages, which might destroy your session. You can specify them with the -e option. For multiple URLs, you can specify -e multiple times.
python SnCrawl.py -w "http://domainToCrawl.com" -c "cookie1=val1;" -e "http://domainToCrawl.com/logout" -e "http://domainToCrawl.com/destroy" #It will not send requests to either of these URLs
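Exclusion can be as simple as a set-membership check before each request. A minimal sketch (this assumes exact URL matching; whether SnCrawl matches exactly or by prefix isn't documented here):

```python
# Hypothetical excluded URLs, mirroring the -e examples above.
EXCLUDED = {
    "http://domainToCrawl.com/logout",
    "http://domainToCrawl.com/destroy",
}

def should_visit(url, excluded=EXCLUDED):
    """Return True unless the URL exactly matches an excluded entry."""
    return url not in excluded

print(should_visit("http://domainToCrawl.com/profile"))  # True
print(should_visit("http://domainToCrawl.com/logout"))   # False
```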
You can use the -v flag to display verbose output of every request being sent and every piece of data being parsed.
python SnCrawl.py -w "http://domainToCrawl.com" -v #Will display verbose information about every request being sent
Thanks for reading!
Know your enemy!