Creepy Crawlies
A plethora of web crawling tools are available to assist you, each with its own strengths and specialties. These tools automate the crawling process, making it faster and more efficient, allowing you to focus on analyzing the extracted data.
Popular Web Crawlers
- Burp Suite Spider: Burp Suite, a widely used web application testing platform, includes a powerful active crawler called Spider. Spider excels at mapping out web applications, identifying hidden content, and uncovering potential vulnerabilities.
- OWASP ZAP (Zed Attack Proxy): ZAP is a free, open-source web application security scanner. It can be used in automated and manual modes and includes a spider component to crawl web applications and identify potential vulnerabilities.
- Scrapy (Python Framework): Scrapy is a versatile and scalable Python framework for building custom web crawlers. It provides rich features for extracting structured data from websites, handling complex crawling scenarios, and automating data processing. Its flexibility makes it ideal for tailored reconnaissance tasks.
- Apache Nutch (Scalable Crawler): Nutch is a highly extensible and scalable open-source web crawler written in Java. It's designed to handle massive crawls across the entire web or focus on specific domains. While it requires more technical expertise to set up and configure, its power and flexibility make it a valuable asset for large-scale reconnaissance projects.
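At their core, all of these tools repeat the same loop: fetch a page, pull out its links, and queue any new in-scope URLs for the next fetch. A minimal sketch of the link-extraction step using only the Python standard library (parsing an inline HTML snippet instead of making live requests; `example.com` and the helper names are placeholders, not part of any of the tools above):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag it encounters."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_in_scope_links(base_url, html, scope_domain):
    """Resolve relative links against base_url; keep only in-scope URLs (rough suffix check)."""
    parser = LinkExtractor()
    parser.feed(html)
    resolved = (urljoin(base_url, href) for href in parser.links)
    return sorted({u for u in resolved if urlparse(u).netloc.endswith(scope_domain)})

# Stand-in for a page a crawler might fetch from a target.
sample = ('<a href="/about/">About</a> '
          '<a href="https://www.example.com/careers/">Jobs</a> '
          '<a href="https://other.tld/">External</a>')
print(extract_in_scope_links("https://www.example.com/", sample, "example.com"))
```

A real crawler would then fetch each returned URL and repeat, respecting robots.txt and rate limits, which is exactly the machinery the tools above package up for you.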
Adhering to ethical and responsible crawling practices is crucial no matter which tool you choose. Always obtain permission before crawling a website, especially if you plan to perform extensive or intrusive scans. Be mindful of the website's server resources and avoid overloading them with excessive requests.
Scrapy
Installing Scrapy
You can easily install Scrapy using pip, the Python package installer:
m4cc18@htb[/htb]$ pip3 install scrapy
- Note: you may first need to create and activate a Python virtual environment.
- This command downloads and installs Scrapy along with its dependencies, preparing your environment for building the spider.
ReconSpider
First, run the following commands in your terminal to download the custom Scrapy spider, ReconSpider, and extract it to the current working directory.
m4cc18@htb[/htb]$ wget -O ReconSpider.zip https://academy.hackthebox.com/storage/modules/144/ReconSpider.v1.2.zip
m4cc18@htb[/htb]$ unzip ReconSpider.zip
With the files extracted, you can run ReconSpider.py using the following command:
m4cc18@htb[/htb]$ python3 ReconSpider.py http://inlanefreight.com
- Replace inlanefreight.com with the domain you want to spider. The spider will crawl the target and collect valuable information.
results.json
After running ReconSpider.py, the data will be saved in a JSON file, results.json. This file can be explored using any text editor. Below is the structure of the JSON file produced:
{
"emails": [
"lily.floid@inlanefreight.com",
"cvs@inlanefreight.com",
...
],
"links": [
"https://www.themeansar.com",
"https://www.inlanefreight.com/index.php/offices/",
...
],
"external_files": [
"https://www.inlanefreight.com/wp-content/uploads/2020/09/goals.pdf",
...
],
"js_files": [
"https://www.inlanefreight.com/wp-includes/js/jquery/jquery-migrate.min.js?ver=3.3.2",
...
],
"form_fields": [],
"images": [
"https://www.inlanefreight.com/wp-content/uploads/2021/03/AboutUs_01-1024x810.png",
...
],
"videos": [],
"audio": [],
"comments": [
"<!-- #masthead -->",
...
]
}
Each key in the JSON file represents a different type of data extracted from the target website:
| JSON Key | Description |
|---|---|
| `emails` | Lists email addresses found on the domain. |
| `links` | Lists URLs of links found within the domain. |
| `external_files` | Lists URLs of external files such as PDFs. |
| `js_files` | Lists URLs of JavaScript files used by the website. |
| `form_fields` | Lists form fields found on the domain (empty in this example). |
| `images` | Lists URLs of images found on the domain. |
| `videos` | Lists URLs of videos found on the domain (empty in this example). |
| `audio` | Lists URLs of audio files found on the domain (empty in this example). |
| `comments` | Lists HTML comments found in the source code. |
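On larger targets, results.json can get long, so a quick per-key count helps you decide where to dig first. A small sketch (run against an inline, abridged sample mirroring the file's structure so it works standalone; swap in `json.load(open("results.json"))` for a real run):

```python
import json

# Inline stand-in for results.json (abridged; real runs contain many more entries).
raw = '''
{
  "emails": ["lily.floid@inlanefreight.com", "cvs@inlanefreight.com"],
  "links": ["https://www.inlanefreight.com/index.php/offices/"],
  "external_files": ["https://www.inlanefreight.com/wp-content/uploads/2020/09/goals.pdf"],
  "js_files": [],
  "form_fields": [],
  "images": [],
  "videos": [],
  "audio": [],
  "comments": ["<!-- #masthead -->"]
}
'''

results = json.loads(raw)  # real run: results = json.load(open("results.json"))

# Quick triage: how many items of each type did the spider collect?
for key, items in results.items():
    print(f"{key:15} {len(items)}")
```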
By exploring this JSON structure, you can gain valuable insights into the web application's architecture, content, and potential points of interest for further investigation.
Exercise
After spidering inlanefreight.com, identify the location where future reports will be stored. Respond with the full domain, e.g., files.inlanefreight.com.
Perform the spidering:
┌──(myenv)─(macc㉿kaliLab)-[~/htb]
└─$ python3 ReconSpider.py http://inlanefreight.com
Look at the contents of the results.json file produced by the spidering:
{
"emails": [
"info@inlanefreight.com",
"lily.floid@inlanefreight.com",
"manuel.pernilious@inlanefreight.com",
"emma.williams@inlanefreight.com",
"hans.mueller@inlanefreight.com",
"jeremy-ceo@inlanefreight.com",
"john.smith4@inlanefreight.com",
"freya.kartboom@inlanefreight.com",
"cvs@inlanefreight.com",
"david.jones@inlanefreight.com",
"enterprise-support@inlanefreight.com",
"info@themeansar.com",
"enterprise@inlanefreight.com",
"support@inlanefreight.com",
"samuel.dot@inlanefreight.com",
"fiona.dante@inlanefreight.com"
],
"links": [
"https://www.inlanefreight.com/index.php/offices/",
"https://www.themeansar.com",
"https://www.inlanefreight.com/index.php/news/",
"https://www.inlanefreight.com/index.php/career/#content",
"https://www.inlanefreight.com/index.php/news/#content",
"https://www.inlanefreight.com/index.php/about-us/#content",
"https://www.inlanefreight.com/index.php/about-us/",
"https://www.inlanefreight.com",
"https://www.inlanefreight.com/wp-content/uploads/2020/09/goals.pdf",
"https://www.inlanefreight.com/index.php/contact/",
"https://www.inlanefreight.com/",
"https://www.inlanefreight.com/index.php/contact/#content",
"https://www.inlanefreight.com/index.php/offices/#content",
"https://www.inlanefreight.com/#content",
"https://www.inlanefreight.com/index.php/career/"
],
"external_files": [
"https://www.inlanefreight.com/wp-content/uploads/2020/09/goals.pdf",
"https://www.inlanefreight.com/index.php/news/pdf"
],
"js_files": [
"https://www.inlanefreight.com/wp-content/themes/ben_theme/js/owl.carousel.min.js?ver=5.6.16",
"https://www.inlanefreight.com/wp-content/themes/ben_theme/js/navigation.js?ver=5.6.16",
"https://www.inlanefreight.com/wp-content/themes/ben_theme/js/jquery.smartmenus.js?ver=5.6.16",
"https://www.inlanefreight.com/wp-includes/js/jquery/jquery-migrate.min.js?ver=3.3.2",
"https://www.inlanefreight.com/wp-includes/js/jquery/jquery.min.js?ver=3.5.1",
"https://www.inlanefreight.com/wp-content/themes/ben_theme/js/bootstrap.min.js?ver=5.6.16",
"https://www.inlanefreight.com/wp-content/themes/ben_theme/js/jquery.smartmenus.bootstrap.js?ver=5.6.16",
"https://www.inlanefreight.com/wp-includes/js/wp-embed.min.js?ver=5.6.16"
],
"form_fields": [],
"images": [
"https://www.inlanefreight.com/wp-content/uploads/2021/03/Offices_01-1024x359.png",
"https://www.inlanefreight.com/wp-content/uploads/2021/03/Career_01-300x235.jpg",
"https://www.inlanefreight.com/wp-content/uploads/2021/03/Career_02-300x235.jpg",
"https://www.inlanefreight.com/wp-content/uploads/2021/03/AboutUs_03-1024x810.png",
"https://www.inlanefreight.com/wp-content/uploads/2021/03/AboutUs_04-1024x810.png",
"https://www.inlanefreight.com/wp-content/uploads/2021/03/AboutUs_01-1024x810.png",
"https://www.inlanefreight.com/wp-content/uploads/2021/03/AboutUs_02-1024x810.png"
],
"videos": [],
"audio": [],
"comments": [
"<!-- navbar-toggle -->",
"<!-- change Jeremy's email to jeremy-ceo@inlanefreight.com -->",
"<!--==================== feature-product ====================-->",
"<!-- Logo -->",
"<!-- TO-DO: change the location of future reports to inlanefreight-comp133.s3.amazonaws.htb -->",
"<!-- /Navigation -->",
"<!--==================== transportex-FOOTER AREA ====================-->",
"<!--/overlay-->",
"<!-- Right nav -->",
"<!-- /navbar-toggle -->",
"<!-- #masthead -->",
"<!--==================== TOP BAR ====================-->",
"<!-- Blog Area -->",
"<!--\nSkip to content<div class=\"wrapper\">\n<header class=\"transportex-trhead\">\n\t<!--==================== Header ====================-->",
"<!--Sidebar Area-->",
"<!-- #secondary -->",
"<!-- /Right nav -->",
"<!-- Navigation -->"
]
}
- Try to search for something that may indicate future reports being stored.
- Look at the "comments" section; there is a very obvious comment there: "<!-- TO-DO: change the location of future reports to inlanefreight-comp133.s3.amazonaws.htb -->"
flag: inlanefreight-comp133.s3.amazonaws.htb