Build a website crawler using Scrapy Framework

Scrapy spider (image by ClkerFreeVectorImages)

A while ago I discovered the website-crawling framework Scrapy, as mentioned in my earlier post Nice Python website crawler framework.

Since then I have written a small website crawler using this excellent framework. Here are the steps I took to get my blog https://www.ask-sheldon.com crawled:

Installation of Scrapy

  1. Install Python’s package management system (pip):
  2. Install the required networking engine Twisted and other dependencies:
  3. Install Scrapy via Python’s package manager pip:

On Ubuntu you can install Scrapy by simply following the instructions in the manual (http://doc.scrapy.org/en/latest/topics/ubuntu.html#topics-ubuntu).

Try out Scrapy for the first time:

  1. Generate a new Scrapy project:
  2. Generate a new Scrapy spider module (named sheldon):

    This creates a complete module structure like this:
    Directory structure of Scrapy after initialization
  3. Now you can do a simple test run:

    This produces output like this:

    As you can see, the crawler only crawled a single URL:

    That’s because this is only the generated test dummy with no useful configuration.

Building your own website spider based on the Scrapy framework

After this initial experiment I decided to implement my own Scrapy-based page crawler to initialize (warm) the full-page cache of this blog (WordPress’s WP Super Cache). The spider should have the following features:

  1. The spider crawls all links of a given domain recursively, so every page of the domain is loaded once and the page cache for it is warmed.
  2. The links to be crawled can be specified via simple CSS selectors; only matching links are processed.
  3. Everything can be configured in a single configuration file.
  4. All application logs are written to a log file in a separate folder (one per day, in the logs folder).
  5. The crawled URLs, page titles, headers and statuses are exported to a CSV file (one per day, in the export folder).

You can download the crawler from my GitHub project page at https://github.com/Bravehartk2/ScrapyCrawler.

Steps taken

These are the changes I made to the generated code to get my blog crawled:

  1. To anonymize my spider I renamed spiders/sheldon.py to spiders/cachewarmer.py.
  2. Implemented a CrawlSpider (SheldonSpider) that uses LinkExtractor-based rules to extract all relevant links from my blog pages, based on various settings (in cachewarmer.py).
  3. Moved all crawler settings out into the settings file (settings_sheldon.py). For example:
    • Crawler.settings.CRAWLER_DOMAINS => domains to accept / analyse
    • Crawler.settings.CSS_SELECTORS => CSS selectors that address the links to crawl
  4. Implemented a filter_links function to ignore all links that have the nofollow attribute set:
    • the function is called for every link found on a page to decide whether to follow it (append it to the crawlable link list) or not
    • it logs dropped links
  5. Implemented a scrapy.Item-based item class (PageCrawlerItem in Crawler/items.py) that stores metadata from the crawled pages:
    • HTTP status
    • page title
    • page URL
    • HTTP response headers
    • The fields are set in the parse_item function of the SheldonSpider. Scrapy uses this function to write the crawling results to a feed (CSV, XML, JSON => see feed exports in the Scrapy documentation).
  6. Activated the CSV export in the settings (settings_sheldon.py).
  7. Activated file-based logging in the settings (settings_sheldon.py).
  8. The other settings are mostly standard configuration fields.
  9. Symlinked settings_sheldon.py to settings.py.

How to run it

To get the crawler running you have to take the following steps:

  1. Install Scrapy
  2. Clone the repository (or just download it via GitHub)
  3. Set a symlink to a project specific settings file (in the Crawler folder)
  4. Copy settings from settings_sheldon.py and adapt them for your needs (CRAWLER_NAME, CRAWLER_DOMAINS, CRAWLER_START_URLS, CSS_SELECTORS etc.)
  5. Now you can run the crawler from the project folder with the following command:
  6. That’s it! The crawler will now crawl everything you’ve configured.
  7. You can follow the progress in the daily log file (logs folder).
  8. Crawled metadata (status, title, URL and headers) is stored in export/whateveryouveconfiguredinsettingspy.csv.

More information about Scrapy

For more information about the Scrapy library, have a look at the following resources:
