Build a website crawler using the Scrapy framework

A while ago I discovered the website crawling framework Scrapy, as mentioned in my earlier post Nice Python website crawler framework. Now I have written a little website crawler using this ingenious framework. Here are the steps I took to get my blog https://www.ask-sheldon.com crawled: Installation of Scrapy: Install the Python package management system (pip):

Install required networking […]

Nice Python website crawler framework

Today I stumbled upon http://scrapy.org/ while searching for an open-source website crawler. It's an interesting crawling and scraping framework for Python. It looks very convenient and easy to use. The most interesting feature seems to be the ability to select website elements (e.g. hyperlinks) via CSS selectors. In any case, I'll give it a try.

Deliver status 503 with your Python CGI script

I searched for a long time for a solution to this simple-looking problem. I found a lot of irritating and confusing information about it on the web, ranging from implementing a socket application to setting up a self-made HTTP server. But the answer is much simpler:

The empty line after the “Retry-After” header is very important!
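The original snippet is not reproduced in this excerpt, so here is a minimal sketch of what such a CGI script could look like. The function name build_503_response, the 3600 second retry delay, and the Content-Type header are my own assumptions for illustration, not part of the original script. The point it shows is the blank line that closes the header block.

```python
#!/usr/bin/env python3
import sys


def build_503_response(retry_after_seconds=3600):
    """Return the CGI header block for a 503 response.

    The default of 3600 seconds is an arbitrary placeholder.
    """
    headers = [
        "Status: 503 Service Unavailable",
        # Assumption: added as a conservative default, may not be required.
        "Content-Type: text/plain",
        "Retry-After: {}".format(retry_after_seconds),
    ]
    # The final empty line terminates the header block. Without it the
    # web server cannot tell where the headers end.
    return "\n".join(headers) + "\n\n"


if __name__ == "__main__":
    sys.stdout.write(build_503_response())
```

Building the headers as a string first keeps the terminating blank line in one visible place, which is the detail that is easy to forget.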

Find and replace malware code blocks in php files via shell

Today I was attacked by an unknown bot or something similar. Because the FTP password had been cracked, it placed the following code in many hundreds of index.php files on one of my servers.

The solution was the following little Python script that walks through the filesystem tree and searches for index.php files. In every matched file it […]
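Neither the injected code nor the original script is shown in this excerpt, so the following is only a rough sketch of the approach described: walk the tree, pick out every index.php, and strip a matching block. The pattern MALWARE_PATTERN is a hypothetical placeholder and would have to be replaced by an expression that matches the real injected code. The function name clean_index_files is likewise my own.

```python
import os
import re
import sys

# Hypothetical placeholder: stands in for the real injected block,
# which is not shown in this excerpt.
MALWARE_PATTERN = re.compile(
    r"<\?php /\*INJECTED\*/.*?/\*END\*/ \?>", re.DOTALL
)


def clean_index_files(root, pattern=MALWARE_PATTERN, replacement=""):
    """Strip pattern matches from every index.php below root.

    Returns the list of files that were actually changed.
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if "index.php" not in filenames:
            continue
        path = os.path.join(dirpath, "index.php")
        # surrogateescape lets non-UTF-8 bytes survive the round trip;
        # newline="" keeps the original line endings untouched.
        with open(path, "r", encoding="utf-8",
                  errors="surrogateescape", newline="") as handle:
            original = handle.read()
        cleaned = pattern.sub(replacement, original)
        if cleaned != original:
            with open(path, "w", encoding="utf-8",
                      errors="surrogateescape", newline="") as handle:
                handle.write(cleaned)
            changed.append(path)
    return changed


# Only runs when a root directory is passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    for cleaned_path in clean_index_files(sys.argv[1]):
        print("cleaned", cleaned_path)
```

It is worth running something like this against a copy of the tree first, since a pattern that is too greedy would also remove legitimate code.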

Protect directory with username and password

To protect a folder with a password prompt, you only need to place a .htaccess and a .htpasswd file in the target directory.

.htaccess

.htpasswd

The password can be hashed with crypt or MD5. On http://de.selfhtml.org/servercgi/server/htaccess.htm#verzeichnisschutz you can find a useful hash generator. If you have a Linux shell you can use:

Or you can […]