Creepy Crawlies

Many web crawling tools are available, each with its own strengths and specialties. They automate the crawling process, making it faster and more efficient so that you can focus on analyzing the extracted data.

  1. Burp Suite Spider: Burp Suite, a widely used web application testing platform, includes a powerful active crawler called Spider. Spider excels at mapping out web applications, identifying hidden content, and uncovering potential vulnerabilities.
  2. OWASP ZAP (Zed Attack Proxy): ZAP is a free, open-source web application security scanner. It can be used in automated and manual modes and includes a spider component to crawl web applications and identify potential vulnerabilities.
  3. Scrapy (Python Framework): Scrapy is a versatile and scalable Python framework for building custom web crawlers. It provides rich features for extracting structured data from websites, handling complex crawling scenarios, and automating data processing. Its flexibility makes it ideal for tailored reconnaissance tasks.
  4. Apache Nutch (Scalable Crawler): Nutch is a highly extensible and scalable open-source web crawler written in Java. It's designed to handle massive crawls across the entire web or focus on specific domains. While it requires more technical expertise to set up and configure, its power and flexibility make it a valuable asset for large-scale reconnaissance projects.
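Whatever the tool, the core step being automated is the same: parse a page, collect the links to follow, and note anything else of interest. The following is a minimal sketch of that step using only the Python standard library. It is not how any of the tools above are implemented, and the names PageScanner and scan_page are made up for this illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageScanner(HTMLParser):
    """Collect absolute link URLs and HTML comments from a single page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.comments = []

    def handle_starttag(self, tag, attrs):
        # Only anchor tags with an href attribute count as links here.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_comment(self, data):
        # data is the text between <!-- and -->.
        self.comments.append(data.strip())


def scan_page(base_url, html):
    """Return (links, comments) found in the given HTML string."""
    scanner = PageScanner(base_url)
    scanner.feed(html)
    scanner.close()
    return scanner.links, scanner.comments


if __name__ == "__main__":
    sample = '<!-- Logo --><a href="/index.php/news/">News</a>'
    links, comments = scan_page("https://www.inlanefreight.com", sample)
    print(links)
    print(comments)
```

A real crawler would add fetching, a queue of URLs still to visit, and a set of URLs already seen; those parts are left out here on purpose.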

Adhering to ethical and responsible crawling practices is crucial no matter which tool you choose. Always obtain permission before crawling a website, especially if you plan to perform extensive or intrusive scans. Be mindful of the website's server resources and avoid overloading them with excessive requests.
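One practical way to respect a site's wishes is to read its robots.txt before requesting anything else. The sketch below uses Python's standard urllib.robotparser module. The robots.txt content, the example.com URLs, and the user agent name ReconBot are all assumptions made for illustration; a robots.txt file is no substitute for explicit permission from the site owner.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used so the example runs offline.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_policy(robots_txt):
    """Parse robots.txt text and return a RobotFileParser ready for queries."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


if __name__ == "__main__":
    policy = build_policy(SAMPLE_ROBOTS)
    for url in (
        "https://example.com/index.html",
        "https://example.com/private/report.pdf",
    ):
        allowed = policy.can_fetch("ReconBot", url)
        print(f"{url}: {'allowed' if allowed else 'disallowed'}")
    # Seconds to wait between requests, or None if the file sets no delay.
    print("crawl delay:", policy.crawl_delay("ReconBot"))
```

Against a live site, the same parser can load the file itself with set_url() followed by read() instead of parse().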

Scrapy

Installing Scrapy

You can easily install Scrapy using pip, the Python package installer:

m4cc18@htb[/htb]$ pip3 install scrapy

ReconSpider

First, run the following commands in your terminal to download the custom Scrapy spider, ReconSpider, and extract it to the current working directory.

m4cc18@htb[/htb]$ wget -O ReconSpider.zip https://academy.hackthebox.com/storage/modules/144/ReconSpider.v1.2.zip

m4cc18@htb[/htb]$ unzip ReconSpider.zip 

With the files extracted, you can run ReconSpider.py using the following command:

m4cc18@htb[/htb]$ python3 ReconSpider.py http://inlanefreight.com

results.json

After running ReconSpider.py, the data will be saved in a JSON file, results.json. This file can be explored using any text editor. Below is the structure of the JSON file produced:

{
    "emails": [
        "lily.floid@inlanefreight.com",
        "cvs@inlanefreight.com",
        ...
    ],
    "links": [
        "https://www.themeansar.com",
        "https://www.inlanefreight.com/index.php/offices/",
        ...
    ],
    "external_files": [
        "https://www.inlanefreight.com/wp-content/uploads/2020/09/goals.pdf",
        ...
    ],
    "js_files": [
        "https://www.inlanefreight.com/wp-includes/js/jquery/jquery-migrate.min.js?ver=3.3.2",
        ...
    ],
    "form_fields": [],
    "images": [
        "https://www.inlanefreight.com/wp-content/uploads/2021/03/AboutUs_01-1024x810.png",
        ...
    ],
    "videos": [],
    "audio": [],
    "comments": [
        "<!-- #masthead -->",
        ...
    ]
}

Each key in the JSON file represents a different type of data extracted from the target website:

JSON Key         Description
emails           Lists email addresses found on the domain.
links            Lists URLs of links found within the domain.
external_files   Lists URLs of external files such as PDFs.
js_files         Lists URLs of JavaScript files used by the website.
form_fields      Lists form fields found on the domain (empty in this example).
images           Lists URLs of images found on the domain.
videos           Lists URLs of videos found on the domain (empty in this example).
audio            Lists URLs of audio files found on the domain (empty in this example).
comments         Lists HTML comments found in the source code.

By exploring this JSON structure, you can gain valuable insights into the web application's architecture, content, and potential points of interest for further investigation.
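For a quick overview before opening the file in an editor, a few lines of Python can count the entries under each key. This is a minimal sketch that assumes results.json has the structure shown above (a top-level object whose values are lists); the function names are chosen for this example only.

```python
import json


def load_results(path="results.json"):
    """Read the JSON file written by ReconSpider.py."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def summarize(results):
    """Map each key to the number of items collected under it."""
    return {key: len(values) for key, values in results.items()}


if __name__ == "__main__":
    # Small excerpt in the same shape as results.json, so this runs
    # without the real file. Swap in load_results() after a crawl.
    sample = json.loads(
        '{"emails": ["cvs@inlanefreight.com"],'
        ' "form_fields": [],'
        ' "comments": ["<!-- #masthead -->"]}'
    )
    for key, count in summarize(sample).items():
        print(f"{key:15} {count}")
```

Empty lists show up as a count of 0, which makes it easy to see at a glance which categories the crawl actually populated.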


Exercise

After spidering inlanefreight.com, identify the location where future reports will be stored. Respond with the full domain, e.g., files.inlanefreight.com.

Perform the spidering:

┌──(myenv)─(macc㉿kaliLab)-[~/htb]
└─$ python3 ReconSpider.py http://inlanefreight.com

Look at the contents of the results.json file produced by the spidering:

{
    "emails": [
        "info@inlanefreight.com",
        "lily.floid@inlanefreight.com",
        "manuel.pernilious@inlanefreight.com",
        "emma.williams@inlanefreight.com",
        "hans.mueller@inlanefreight.com",
        "jeremy-ceo@inlanefreight.com",
        "john.smith4@inlanefreight.com",
        "freya.kartboom@inlanefreight.com",
        "cvs@inlanefreight.com",
        "david.jones@inlanefreight.com",
        "enterprise-support@inlanefreight.com",
        "info@themeansar.com",
        "enterprise@inlanefreight.com",
        "support@inlanefreight.com",
        "samuel.dot@inlanefreight.com",
        "fiona.dante@inlanefreight.com"
    ],
    "links": [
        "https://www.inlanefreight.com/index.php/offices/",
        "https://www.themeansar.com",
        "https://www.inlanefreight.com/index.php/news/",
        "https://www.inlanefreight.com/index.php/career/#content",
        "https://www.inlanefreight.com/index.php/news/#content",
        "https://www.inlanefreight.com/index.php/about-us/#content",
        "https://www.inlanefreight.com/index.php/about-us/",
        "https://www.inlanefreight.com",
        "https://www.inlanefreight.com/wp-content/uploads/2020/09/goals.pdf",
        "https://www.inlanefreight.com/index.php/contact/",
        "https://www.inlanefreight.com/",
        "https://www.inlanefreight.com/index.php/contact/#content",
        "https://www.inlanefreight.com/index.php/offices/#content",
        "https://www.inlanefreight.com/#content",
        "https://www.inlanefreight.com/index.php/career/"
    ],
    "external_files": [
        "https://www.inlanefreight.com/wp-content/uploads/2020/09/goals.pdf",
        "https://www.inlanefreight.com/index.php/news/pdf"
    ],
    "js_files": [
        "https://www.inlanefreight.com/wp-content/themes/ben_theme/js/owl.carousel.min.js?ver=5.6.16",
        "https://www.inlanefreight.com/wp-content/themes/ben_theme/js/navigation.js?ver=5.6.16",
        "https://www.inlanefreight.com/wp-content/themes/ben_theme/js/jquery.smartmenus.js?ver=5.6.16",
        "https://www.inlanefreight.com/wp-includes/js/jquery/jquery-migrate.min.js?ver=3.3.2",
        "https://www.inlanefreight.com/wp-includes/js/jquery/jquery.min.js?ver=3.5.1",
        "https://www.inlanefreight.com/wp-content/themes/ben_theme/js/bootstrap.min.js?ver=5.6.16",
        "https://www.inlanefreight.com/wp-content/themes/ben_theme/js/jquery.smartmenus.bootstrap.js?ver=5.6.16",
        "https://www.inlanefreight.com/wp-includes/js/wp-embed.min.js?ver=5.6.16"
    ],
    "form_fields": [],
    "images": [
        "https://www.inlanefreight.com/wp-content/uploads/2021/03/Offices_01-1024x359.png",
        "https://www.inlanefreight.com/wp-content/uploads/2021/03/Career_01-300x235.jpg",
        "https://www.inlanefreight.com/wp-content/uploads/2021/03/Career_02-300x235.jpg",
        "https://www.inlanefreight.com/wp-content/uploads/2021/03/AboutUs_03-1024x810.png",
        "https://www.inlanefreight.com/wp-content/uploads/2021/03/AboutUs_04-1024x810.png",
        "https://www.inlanefreight.com/wp-content/uploads/2021/03/AboutUs_01-1024x810.png",
        "https://www.inlanefreight.com/wp-content/uploads/2021/03/AboutUs_02-1024x810.png"
    ],
    "videos": [],
    "audio": [],
    "comments": [
        "<!-- navbar-toggle -->",
        "<!-- change Jeremy's email to jeremy-ceo@inlanefreight.com -->",
        "<!--==================== feature-product ====================-->",
        "<!-- Logo -->",
        "<!-- TO-DO: change the location of future reports to inlanefreight-comp133.s3.amazonaws.htb -->",
        "<!-- /Navigation -->",
        "<!--==================== transportex-FOOTER AREA ====================-->",
        "<!--/overlay-->",
        "<!-- Right nav -->",
        "<!-- /navbar-toggle -->",
        "<!-- #masthead -->",
        "<!--==================== TOP BAR ====================-->",
        "<!-- Blog Area -->",
        "<!--\nSkip to content<div class=\"wrapper\">\n<header class=\"transportex-trhead\">\n\t<!--==================== Header ====================-->",
        "<!--Sidebar Area-->",
        "<!-- #secondary -->",
        "<!-- /Right nav -->",
        "<!-- Navigation -->"
    ]
}
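Reading the comments list by eye works for a small site, but the same check can be scripted. The sketch below filters comments for a few keywords and pulls out anything that looks like a hostname. The keyword list and the hostname pattern are assumptions chosen for this exercise, not an exhaustive or authoritative rule set.

```python
import re

# Assumed keywords that often mark developer notes worth reading.
KEYWORDS = ("to-do", "todo", "fixme", "password", "change")

# Rough hostname pattern: dot-separated labels ending in a 2+ letter label.
HOST_RE = re.compile(r"\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b", re.IGNORECASE)


def interesting_comments(comments):
    """Return the comments that contain any of the keywords."""
    return [c for c in comments if any(k in c.lower() for k in KEYWORDS)]


def hostnames(text):
    """Return the sorted, de-duplicated hostname-like strings in text."""
    return sorted({match.lower() for match in HOST_RE.findall(text)})


if __name__ == "__main__":
    # Two comments copied from the results.json output above.
    comments = [
        "<!-- Logo -->",
        "<!-- TO-DO: change the location of future reports to "
        "inlanefreight-comp133.s3.amazonaws.htb -->",
    ]
    for comment in interesting_comments(comments):
        print(comment)
        print("  hosts:", hostnames(comment))
```

The pattern also matches the domain part of an email address, so results still need a manual look.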

flag: inlanefreight-comp133.s3.amazonaws.htb