by Philip Guo (firstname.lastname@example.org)
I have developed an easy-to-use Python script for automatically harvesting JPEG images from a website and a selected number of websites linked from that starting site. It uses the free GNU Wget program to download images, and a number of heuristics to try to grab only images from the most relevant sites. It can be thought of as a more specialized and ‘intelligent’ Wget.
To use my Image Harvester script, download image-harvester.py and, on a computer with Wget installed, run it from the directory where you want the images to be saved:
python image-harvester.py <url-to-harvest>
This script downloads all .jpg images on and linked from <url-to-harvest>, then follows every webpage link on that page and downloads the images on those pages. Finally, it follows one more level of webpage links from those pages to grab images, but this time it only follows URLs in the SAME domain to avoid jumping to outside sites. It creates one sub-directory for the images downloaded from each page that it crawls.
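The crawl order described above can be sketched in a few lines of Python. This is only an illustration, not the code in image-harvester.py: the names plan_crawl, same_domain, and get_links are hypothetical, and treating 'same domain' as 'same host name' is an assumption.

```python
from urllib.parse import urlparse


def same_domain(url_a, url_b):
    """Assumption: two URLs are in the 'same domain' when their host names match."""
    return urlparse(url_a).netloc.lower() == urlparse(url_b).netloc.lower()


def plan_crawl(start_url, get_links):
    """Return the pages to harvest images from, in visiting order.

    get_links is a hypothetical callable that maps a page URL to the list
    of webpage URLs linked from that page.
    """
    pages = [start_url]
    seen = {start_url}

    # Level 1: follow every link from the starting page, whatever its domain.
    level_one = []
    for link in get_links(start_url):
        if link not in seen:
            seen.add(link)
            level_one.append(link)
    pages.extend(level_one)

    # Level 2: from each level-1 page, follow only links that stay in
    # that page's own domain.
    for page in level_one:
        for link in get_links(page):
            if link not in seen and same_domain(page, link):
                seen.add(link)
                pages.append(link)
    return pages


if __name__ == "__main__":
    # A made-up link graph standing in for real pages.
    links = {
        "http://a.example/index.html": ["http://b.example/gallery.html"],
        "http://b.example/gallery.html": [
            "http://b.example/page2.html",
            "http://ads.example/banner.html",
        ],
    }
    for url in plan_crawl("http://a.example/index.html", lambda u: links.get(u, [])):
        print(url)
```

In the made-up graph, the ads.example link is skipped because it is reached at the second level and leaves the b.example domain.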
Your choice of <url-to-harvest> is important in determining how many images this script can harvest. For optimum results, try to choose a page that contains lots of images that you want and also lots of links to other pages with lots of images. The maximum depth of webpage links that this script follows is 2, but that should be enough for most image harvesting purposes. Additional levels of recursion usually result in undesired crawling to irrelevant sites.
The Image Harvester script cannot distinguish the images that you want to keep from the ones that you don’t (e.g., thumbnails, ads, and banners). I have written an Image Filterer BASH shell script that tries to filter out undesired images based on a heuristic of dimensions. If either an image’s width or its height is below some minimum threshold (350×350 is what I use), then it’s probably a thumbnail, ad, or banner that you don’t want to keep. This script uses the ImageMagick identify program to inspect the dimensions of all .jpg images, throws away the ones that don’t meet the minimum threshold, and then throws away sub-directories that no longer contain any images.
To filter your images, download keep-images-larger-than.sh and run it in the same directory where you ran image-harvester.py:
./keep-images-larger-than.sh <min-width> <min-height>
This will first create sub-directories named small-images-trash and small-images-trash/no-jpgs-dirs to store the filtered-out files and directories, respectively. Then it will find all .jpg images within all sub-directories and move any file whose width or height is less than <min-width> or <min-height>, respectively, into small-images-trash. As a last step, it will move any directories that contain no .jpg images into small-images-trash/no-jpgs-dirs. These trash directories provide a safety net to protect against accidental deletions. After running the Image Harvester and Image Filterer, your sub-directories should be filled only with full-sized images that you want to keep.
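The filtering rule itself is simple enough to sketch in Python. The real filterer is a BASH script, so this is only an illustration of the same heuristic: the function names are hypothetical, and reading dimensions with ImageMagick's identify command (via its -format option) is an assumption about how one might do it, not a copy of keep-images-larger-than.sh.

```python
import os
import subprocess


def is_too_small(width, height, min_width, min_height):
    """An image is filtered out if EITHER dimension is below its minimum."""
    return width < min_width or height < min_height


def image_dimensions(path):
    """Read (width, height) of a single-frame image such as a JPEG.

    Assumes ImageMagick's `identify` program is installed and on the PATH.
    """
    out = subprocess.run(
        ["identify", "-format", "%w %h", path],
        capture_output=True, text=True, check=True,
    ).stdout
    width, height = out.split()[:2]
    return int(width), int(height)


def plan_filter(sizes, min_width, min_height):
    """Decide what to move, without touching the disk.

    sizes maps an image path to its (width, height). Returns a pair:
    the files to move into small-images-trash, and the directories that
    would be left with no images afterwards. Simplification: directories
    that never held a .jpg do not appear in sizes, so they are not listed.
    """
    trash = {
        path for path, (w, h) in sizes.items()
        if is_too_small(w, h, min_width, min_height)
    }
    all_dirs = {os.path.dirname(path) for path in sizes}
    kept_dirs = {os.path.dirname(path) for path in sizes if path not in trash}
    return sorted(trash), sorted(all_dirs - kept_dirs)
```

Separating the decision (plan_filter) from the measurement (image_dimensions) makes the rule easy to check on made-up sizes before any file is moved.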
Here is the code. Please give it a shot and email me with feedback if you have trouble getting it to work or want me to add additional features.
Here are the two main problems that I’ve experienced with automated web crawling and downloading tools, and how this project tries to solve them:
- The recursion grows out of control, and the tool ends up crawling to irrelevant sites and downloading tons of images that you don’t want, like annoying ads or banners.
To solve this problem, I apply a few heuristics to ensure that my script has the best chance of only grabbing the images that you want without crawling to irrelevant sites:
- It first grabs all images on and linked from <url-to-harvest>, which is either a page with lots of images or a ‘links page’ that links to other pages with lots of images. Your choice of the starting page is important in determining how many images this script can retrieve.
- Then it crawls to all page links from <url-to-harvest> and grabs all images off of those pages. It must follow all links during this step because you may start at a ‘links page’ which provides links to sites at many domains with images that you want to grab.
- Then it performs one more level of crawling from those pages (the ones linked from the starting page), but only to pages in the SAME domain. When you are already at a site that contains the images you want, it may have links to related pages in the same domain with additional images. Links to outside domains at this point are likely to be ads, redirects, or other irrelevant sites.
- Automated tools get blocked by many webservers because they hog up bandwidth and allow people to download images without viewing the requisite ads and generating revenue for the site.
Here is how my script tries to prevent getting rejected by servers:
- Tells Wget to imitate the Firefox web browser in its HTTP request User Agent field in order to hopefully not trip the anti-leeching mechanisms of servers.
- Tells Wget to slow down its requests and randomize the time between them, both to avoid overloading web servers and to reduce suspicion that it’s an automated tool rather than a human clicking on and downloading images.
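As a rough sketch of the two measures above, here is how a Python script might assemble such a Wget command. The function name build_wget_command, the User Agent string, and the wait time are illustrative assumptions, not the values used by image-harvester.py. The Wget options themselves (--user-agent, --wait, and --random-wait) are standard.

```python
def build_wget_command(url, user_agent, wait_seconds=2):
    """Build a Wget argument list that identifies itself with the given
    User Agent and pauses between requests.

    --wait sets the base delay in seconds between retrievals, and
    --random-wait makes Wget vary that delay from request to request.
    """
    return [
        "wget",
        "--user-agent=" + user_agent,
        "--wait=" + str(wait_seconds),
        "--random-wait",
        url,
    ]


if __name__ == "__main__":
    # Example values only; substitute whatever browser string and delay you prefer.
    firefox_like = "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0"
    print(" ".join(build_wget_command("http://example.com/gallery.html", firefox_like)))
```

Building the command as a list (rather than one long string) lets it be passed directly to subprocess.run without shell quoting problems.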
Warning: It is not polite to download large numbers of images from websites in an automated fashion because it eats up bandwidth without the need to actually view the content of the sites. Please do not abuse my Image Harvester script by using it to download too many images at once. Whenever possible, browse the actual sites first to show courtesy and support to them, because the webmasters expect you to view their contents.