About the Gallery Download Web Spider
This is a simple Python script I wrote that downloads all the full size images in a
Flickr gallery for you. It was my first Python program, and I was basically documenting
the process. I've since done a lot more but
haven't taken the time to put anything on here. I will probably get around to it, but most of it
was pretty specialized stuff and I'm not sure anyone else could use it anyway.
I wrote this because I have a friend (Charlie Beacham) who
uploads pictures to Flickr, but I couldn't download them at their original size without
doing it one at a time and clicking through 3 pages for each image. When there are over 100
images in a gallery, that's just nuts! And so this script was born.
The script relies on the fact that the links to the thumbnails on the gallery preview pages
can be transformed into the urls for the full size original images. This is explained in more
detail below.
Also, you can download the full source code for the system including testing images at the bottom
of this page.
New to the Python Programming Language?
To use Python, I recommend the Python IDE called "SPE". The main project website has been down
for a number of weeks now (used to be pythonide.stani.be)
but you can still download the program for both Windows
and Linux here: download.
Python is unique in that it has both files you can run just like any other scripting language, and a
fully interactive shell as well. From the shell, you can load modules, call functions, or even
write full Python code straight up if you like. For this tool, we will use the shell just to
load our program file and call the start() function to get things running.
You can also find a plethora of screenshots showing the SPE interface from
google image search.
Getting things set up to run the script
First, we need to load the module. In the Python shell, you can do this by typing the following.
Please note that ">>>" is the shell prompt; you don't actually type it, since it should already
be there (if it's not, just ignore it).
>>> import imagespider
Next, we create an instance of the "imagespider" class, which is defined inside the
imagespider.py module file that we have just loaded. To do this, we simply run the command below
in the Python shell:
>>> spider = imagespider.imagespider()
Now that we have an instance of the class named "spider", we simply call the start() function:
>>> spider.start()
As a side note, if you edit the imagespider.py file, you will need to reload the module and then
recreate the spider instance and call start() again. To reload a module, simply type the following:
>>> reload(imagespider)
How the code works
The first thing you will probably notice is that the class starts off by defining a few
hard coded values. One of these is the folder where the files are saved. I've set this to a
folder in Windows, so if you're using this on Linux you'll need to change it and reload
the module if you've already imported it. You might also notice that the path has two backslashes
where normally there is only one. That is to "escape" the backslash, since the backslash
is a special character in Python strings. If that does not explain why you need the double
backslashes, just believe me, because that's beyond the scope of this article.
# the Flickr url we will use to download the html of the gallery
url = "http://www.flickr.com/photos/%s/sets/%s/?page=%s"
# this is the location the files will be saved in
path = "c:\\flickr\\"
# other parameters.
page = "1"
pages = ["1"]
image = ""
images = []
html = ""
gid = ""
user = ""
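For anyone curious about the double backslashes in the path setting above, here is a minimal sketch of what Python does with them. It is only an illustration, not part of the script; the variable names `escaped` and `broken` are made up for this example.

```python
# Each "\\" in the source code is ONE backslash in the resulting string.
escaped = "c:\\flickr\\"
print(escaped)       # c:\flickr\
print(len(escaped))  # 10

# With a single backslash, Python reads "\f" as one form-feed character,
# so the folder name silently loses its "f".
broken = "c:\flickr"
print(len(broken))   # 8
```

A single backslash right before the closing quote would be even worse, because it escapes the quote itself and the string never ends.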
You may also notice that the url has three "%s" signs in it. This is set up so we can easily replace
those %s signs with the values that define the user account, the gallery ID, and the page number.
For instance, a url such as http://www.flickr.com/photos/cooluserdude/sets/72157594455598587/ is the
url for the first page of the gallery with ID 72157594455598587 for the user "cooluserdude".
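As a quick sketch of how the three %s placeholders get filled in, the snippet below builds the url for page 1 of that example gallery. The helper name `gallery_page_url` is hypothetical and only used here for illustration; the script itself does the same substitution inline.

```python
# Same template string as the url setting in the class.
url = "http://www.flickr.com/photos/%s/sets/%s/?page=%s"

def gallery_page_url(user, gid, page):
    """Fill the placeholders in order: user account, gallery ID, page number."""
    return url % (user, gid, page)

print(gallery_page_url("cooluserdude", "72157594455598587", "1"))
# http://www.flickr.com/photos/cooluserdude/sets/72157594455598587/?page=1
```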
When you run the start() method as shown in the previous section, you will be prompted for the
user account name and the gallery ID. The script finds the page numbers and goes through them automatically.
>>> spider.start()
---------------------------- Begin Instructions --------------------------------------
Welcome to my Flickr Image Downloader Script! (c) Reece Pegues
You will need to provide the User and the Gallery ID. For instance, in the url below:
http://www.flickr.com/photos/theusername/sets/72157594167927335/
The part that says "theusername" is the user account name.
The part that says "72157594167927335" is the gallery ID
----------------------------- End Instructions ---------------------------------------
Enter Flickr User Account Name (or type quit): cooluserdude
Enter Flickr Gallery ID (or type quit): 72157594455598587
Here is the code for the start() function that prints the instructions above, asks for the
user account name and gallery ID, and then calls the other functions. Notice that it first calls
getpages(), then getimages(), then download(). These functions are discussed later on.
def start(self):
    # print out instructions
    print ""
    print "---------------------------- Begin Instructions --------------------------------------"
    print "Welcome to my Flickr Image Downloader Script! (c) Reece Pegues"
    print "You will need to provide the User and the Gallery ID. For instance, in the url below:"
    print "http://www.flickr.com/photos/theusername/sets/72157594167927335/"
    print ""
    print "The part that says \"theusername\" is the user account name."
    print "The part that says \"72157594167927335\" is the gallery ID"
    print "----------------------------- End Instructions ---------------------------------------"
    # get the user account name
    self.user = raw_input("Enter Flickr User Account Name (or type quit): ")
    if self.user=="quit":
        return
    # get the gallery ID.
    self.gid = raw_input("Enter Flickr Gallery ID (or type quit): ")
    if self.gid=="quit":
        return
    # trim the inputs and make sure they are not blank
    self.user = self.user.strip()
    self.gid = self.gid.strip()
    if self.user=="" or self.gid=="":
        print "Invalid User or Gallery ID. Exiting."
        return
    # get the number of pages in the gallery
    if self.getpages(self.gid):
        # start looping through the pages and getting all image urls
        self.getimages()
        # download all the images
        self.download()
Fetching the page html
The first thing to do is find out how many pages the gallery has. To do this, we fetch the
first page of the gallery, find the navigation bar, and simply count the numbers! The
getpages() function parses the html and adds the page numbers it finds to an array of numbers
called pages. The fetchpage() function simply grabs the html for the current page's url and saves it in a
variable called "html". The code for these functions is below.
def getpages(self, gid):
    # fetch the html from the first page of the gallery
    self.fetchpage()
    # if gallery not found, return error and quit
    # (find() returns -1 when the text is not present)
    if self.html.find("Page not found") != -1:
        print "Invalid Gallery ID given. Flickr returned page not found."
        return 0
    # locate area where the page navigation is and loop through it
    t = self.html.find("Paginator")
    e = self.html.find("Next",t)
    while t < e:
        # get page numbers out of the page navigation
        t = self.html.find("<a",t)
        t = self.html.find(">",t)+1
        f = self.html.find("<",t)
        p = self.html[t:f]
        p = p.strip()
        # if link is a digit (page number) add it to array of page numbers
        if(p.isdigit()):
            self.pages.append(p)
    return 1

def fetchpage(self):
    # set timeout higher so we don't screw up the download
    urllib.socket.setdefaulttimeout(120)
    print "Fetching html from %s" % (self.url % (self.user, self.gid, self.page))
    # try fetching the html, return 0 if it errors out
    try:
        sock = urllib.urlopen(self.url % (self.user, self.gid, self.page))
        self.html = sock.read()
        return 1
    except:
        return 0
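To make the page counting in getpages() easier to follow, here is a small standalone sketch of the same idea using a regular expression instead of find(). It is not the script's code. The function name `find_page_numbers` and the sample markup are made up for illustration, and the real Flickr html may look different.

```python
import re

# Hypothetical navigation bar markup, invented for this example.
sample_html = (
    '<div class="Paginator">'
    '<span class="this-page">1</span>'
    '<a href="?page=2">2</a>'
    '<a href="?page=3">3</a>'
    '<a href="?page=2" class="Next">Next</a>'
    '</div>'
)

def find_page_numbers(html):
    """Return the page numbers linked between "Paginator" and "Next"."""
    pages = ["1"]  # the first page always exists
    start = html.find("Paginator")
    if start == -1:
        return pages  # no navigation bar, so only one page
    end = html.find("Next", start)
    segment = html[start:end] if end != -1 else html[start:]
    # keep the text of every complete <a>...</a> link that is a number
    for text in re.findall(r"<a\b[^>]*>([^<]*)</a>", segment):
        text = text.strip()
        if text.isdigit() and text not in pages:
            pages.append(text)
    return pages

print(find_page_numbers(sample_html))  # ['1', '2', '3']
```

Like the original, it starts the list with page "1", since the current page is assumed not to be a link in the navigation bar.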
Finding the images in the html
Next, we go through the html of each page looking for the images we wish to download. To do this,
we simply find every <img> tag in the html and check whether the src attribute has the
word "farm" in the url. If you look at the html source, only images associated with the gallery pictures
have this distinction. If the word "farm" is found, we make sure that the image is not already in our
saved array of image urls, and then add it to the array.
Another thing to notice is that we cut off the last character of the filename in the url. The reason is
that the urls we fetch are only thumbnails, and we want to modify the url to download the
original full size image. To do this, we take advantage of the naming scheme used by Flickr.
The urls for the different sizes of an image all start the same, but differ in a single character
at the very end. For instance, an image ending in "_s.jpg" is a thumbnail, while an image
ending in "_o.jpg" is the original full size image. Therefore, we simply cut off the last character
and .jpg from the url to simplify matters, and we will add the "o.jpg" just before we download the file.
The code for this, as well as the method for downloading and saving the file, is below.
def getimages(self):
    # loop through page numbers
    for p in self.pages:
        # set current page and fetch the html for that page
        self.page = p
        self.fetchpage()
        t = 1
        n = 0
        # loop until we find all image tags, then break
        while 1:
            t = self.html.find("<img", t)
            if t == -1:
                break
            # get the src of the image tag: skip past src=" and stop
            # before the size character, ".jpg" and the closing quote
            t = self.html.find("src=",t) + 5
            e = self.html.find(" ",t) - 6
            i = self.html[t:e]
            # if the src is one of the gallery images, save that src url
            if i.find("farm") > 0:
                # if index does not error, the image is already in the array.
                try:
                    self.images.index(i)
                except:
                    self.images.append(i)
                    n = n + 1
        print "Found %d images in page %s" % (n, p)

def fetchimage(self, i, name):
    # set timeout higher so we don't screw up a download
    urllib.socket.setdefaulttimeout(120)
    print "%s ----> %s" % (i,name)
    # try saving the image, else print an error and return 0
    try:
        sock = urllib.URLopener()
        sock.retrieve(i, name)
        return 1
    except:
        print "Error fetching image %s" % name
        return 0
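The thumbnail trick described above can also be sketched on its own, away from the spider class. This is just an illustration of the naming scheme as described in this article. The function names `original_url` and `find_originals` and the sample urls are made up, and this is not a promise about what Flickr serves today.

```python
import re

# Hypothetical gallery markup and image urls, invented for this example.
sample_html = (
    '<img src="http://farm1.static.flickr.com/12/345_abcdef_s.jpg" width="75" />'
    '<img src="http://www.flickr.com/images/logo.gif" />'
    '<img src="http://farm1.static.flickr.com/12/345_abcdef_s.jpg" width="75" />'
    '<img src="http://farm1.static.flickr.com/34/678_fedcba_s.jpg" width="75" />'
)

def original_url(thumb_url):
    """Swap the one-character size suffix before ".jpg" for "o"."""
    return re.sub(r"_[a-z]\.jpg$", "_o.jpg", thumb_url)

def find_originals(html):
    """Collect unique full size urls for <img> tags whose src contains "farm"."""
    found = []
    for src in re.findall(r'<img[^>]*\bsrc="([^"]*)"', html):
        if "farm" in src:
            full = original_url(src)
            if full not in found:  # skip duplicates
                found.append(full)
    return found

for image_url in find_originals(sample_html):
    print(image_url)
```

The script gets the same result by slicing the url off just before the size character and adding "o.jpg" later in download().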
Actually Downloading the files
Once we have completed the array of image urls, we call the download() method which will
loop through the array, add the "o.jpg" to the url, and download the images to the specified
location. The code for this can be seen below.
def download(self):
    # loop through all images found
    for p in self.images:
        # add the suffix for the original full size image
        p = "%so.jpg" % p
        # the filename is the last piece of the url
        name = p.split("/")[-1]
        # combine the path we are saving to and the filename being saved
        file = "%s%s" % (self.path, name)
        # try to fetch the image and save it
        if self.fetchimage(p, file) == 0:
            print "Downloading image failed, quitting and exiting"
            break
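If you want the save location to work on both Windows and Linux without editing the backslashes, one possible approach (not what the script does) is to let os.path.join pick the separator. The helper name `local_path`, the folder name, and the url below are all hypothetical.

```python
import os
import posixpath

def local_path(folder, image_url):
    """Join the save folder with the last path segment of the image url."""
    name = posixpath.basename(image_url)  # text after the final "/"
    return os.path.join(folder, name)     # separator depends on the OS

# Made-up folder and url, for illustration only.
print(local_path("flickr", "http://farm1.static.flickr.com/12/345_abcdef_o.jpg"))
```

The printed path uses a backslash on Windows and a forward slash on Linux, so no expected output is shown here.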
Try it out yourself
Well, that's basically it. You can download the code below to test this out yourself.
If you find any problems with it, feel free to contact me and let me know. I always appreciate
constructive criticism. Also, if you're wondering about the license, there is not one. You're
welcome to use the code all you want; I just put it online here in case someone else
found it interesting or could make use of it. One final note: you might notice that when you
use the script it downloads 3 images unrelated to the gallery we want. Those are the previews of
other galleries seen at the bottom of the page. I could probably code it to ignore those, but an extra
three images is no big deal so I didn't bother.