
Flickr Gallery Downloader
About the Gallery Download Web Spider
This is a simple Python script I wrote that downloads all the full-size images in a Flickr gallery for you. It was my first Python program, and I was basically documenting the process. I've since done a lot more but haven't taken the time to put any of it up here. I will probably get around to it, but most of it was pretty specialized stuff and I'm not sure anyone else could use it anyway.

I wrote this because I have a friend (Charlie Beacham) who uploads pictures to Flickr, but I couldn't download them at their original size without doing it one at a time and clicking through 3 pages for each image. When there are over 100 images in a gallery, that's just nuts! And so this script was born.

The script relies on the fact that the links to the thumbnails on the gallery preview pages can be transformed into the URLs for the full-size original images. This is explained in more detail below.

Also, you can download the full source code, including test images, at the bottom of this page.


New to the Python Programming Language?
To use python, I recommend the Python IDE called "SPE". The main project website has been down for a number of weeks now (used to be pythonide.stani.be) but you can still download the program for both Windows and Linux here: download.

Python is nice in that it has both files you can run just like any other scripting language and a fully interactive shell as well. From the shell, you can load modules, call functions, or even write full Python code directly if you like. For this tool, we will use the shell just to load our program file and call the start() function to get things running.

You can also find plenty of screenshots showing the SPE interface on Google image search.


Getting things set up to run the script
First, we need to load the module. In the Python shell, you can do this by typing the following. Please note that the ">>>" is the shell prompt. You don't actually type that; it should already be there (if it's not, just ignore it).

>>> import imagespider


Next, we will create an instance of the "imagespider" class that is defined inside the imagespider.py module file we just loaded. To do this, we simply run the command below in the Python shell, which creates a new imagespider class instance.

>>> spider = imagespider.imagespider()


Now that we have an instance of the class named "spider", we simply call the start() function:

>>> spider.start()


As a side note, if you edit the imagespider.py file, you will need to reload the module file and then create a new spider instance and call start() again, because an instance created before the reload keeps using the old code. To reload a module, simply type the following:

>>> reload(imagespider)
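
One caveat: reload() is only a built-in in Python 2. On Python 3 the same function lives in the importlib module. Here is a minimal sketch of that, written in Python 3 syntax and using the standard json module as a stand-in for imagespider so it runs without the script being present:

```python
import importlib
import json  # stand-in for imagespider, so this runs without the script

# reload() re-executes the module's source and hands back the module object
reloaded = importlib.reload(json)

# the name "json" still points at the same (now refreshed) module object
print(reloaded is json)
```

With the real script you would pass imagespider instead of json, then create a fresh instance as described above.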



How the Code Works
The first thing you will probably notice is that the class starts off by defining a few hard-coded values. One of these is the folder location to save the files. I've set this to a folder in Windows, so if you're using this on Linux you'll need to change that, and reload the module if you've already imported it. You might also notice that the path has two backslashes where normally there is only one backslash. That is to "escape" the backslash, since the backslash is a special character in Python strings. If that does not answer why you need the double backslashes, just believe me, because the details are beyond the scope of this article.
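
For the curious, here is a small sketch of what the escaping does. It is separate from the script itself and written in Python 3 syntax; the variable names are made up for illustration.

```python
# with doubled backslashes, each "\\" in the source is ONE backslash in the string
escaped = "c:\\flickr\\"

# with a single backslash, "\f" is read as a form feed control character
broken = "c:\flickr"

# a raw string leaves backslashes alone, but it cannot end in a single
# backslash, so the trailing one is added separately
raw = r"c:\flickr" + "\\"

print(escaped)            # the path as it is actually stored
print(escaped == raw)     # both spellings give the same string
print("\f" in broken)     # the unescaped version silently contains a form feed
```

The unescaped version does not raise an error, which is what makes this mistake easy to miss.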

# the flickr url we will use to download the html of the gallery
url = "http://www.flickr.com/photos/%s/sets/%s/?page=%s"
# this is the location the files will be saved in
path = "c:\\flickr\\"
# other parameters. 
page = "1"
pages = ["1"]
image = ""
images = []
html = ""
gid = ""
user = ""

You may also notice that the url has three "%s" placeholders in it. This is set up so we can easily replace each %s with the values that define the user account, the gallery ID, and the page number. For instance, a url such as http://www.flickr.com/photos/cooluserdude/sets/72157594455598587/ is the url for the first page of the gallery with ID 72157594455598587 for the user "cooluserdude". When you run the start() method as shown in the previous section, you will be prompted for the user account name and the gallery ID. The script finds the page numbers and goes through them automatically.
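
As a quick illustration of the substitution, here is a minimal sketch using the example user and gallery ID from the paragraph above. It is written in Python 3 syntax, and the name full_url is just for this example.

```python
# the url template from the class, with one %s per value to fill in
url = "http://www.flickr.com/photos/%s/sets/%s/?page=%s"

user = "cooluserdude"
gid = "72157594455598587"
page = "1"

# the % operator fills the placeholders in order from the tuple
full_url = url % (user, gid, page)
print(full_url)
```

The values must be given in the same order as the placeholders: user, then gallery ID, then page.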

>>> spider.start()
---------------------------- Begin Instructions --------------------------------------
Welcome to my Flickr Image Downloader Script! (c) Reece Pegues
You will need to provide the User and the Gallery ID. For instance, in the url below:
http://www.flickr.com/photos/theusername/sets/72157594167927335/

The part that says "theusername" is the user account name.
The part that says "72157594167927335" is the gallery ID
----------------------------- End Instructions ---------------------------------------
Enter Flickr User Account Name (or type quit): cooluserdude
Enter Flickr Gallery ID (or type quit): 72157594455598587


Here is the code for the start() function that prints the instructions above, asks for the user account name and gallery ID, and then calls the other functions. Notice that it first calls getpages(), then getimages(), then download(). These functions will be discussed later on.

    def start(self):
        # print out instructions
        print ""
        print "---------------------------- Begin Instructions --------------------------------------"
        print "Welcome to my Flickr Image Downloader Script!  (c) Reece Pegues"
        print "You will need to provide the User and the Gallery ID.  For instance, in the url below:"
        print "http://www.flickr.com/photos/theusername/sets/72157594167927335/"
        print ""
        print "The part that says \"theusername\" is the user account name."
        print "The part that says \"72157594167927335\" is the gallery ID"
        print "----------------------------- End Instructions ---------------------------------------"
        # get the user account name
        self.user = raw_input("Enter Flickr User Account Name (or type quit): ")
        if self.user=="quit":
            return
        # get the gallery ID.
        self.gid = raw_input("Enter Flickr Gallery ID (or type quit): ")
        if self.gid=="quit":
            return
        # trim the inputs and make sure they are not blank
        self.user = self.user.strip()
        self.gid = self.gid.strip()
        if self.user=="" or self.gid=="":
            print "Invalid User or Gallery ID. Exiting."
            return
        # get the number of pages in the gallery
        if self.getpages(self.gid):
            # start looping through the pages and getting all image urls
            self.getimages()
            # download all the images
            self.download()


Fetching the page html
The first thing to be done is to find out how many pages the gallery has. To do this, we fetch the first page of the gallery, find the navigation bar, and simply count the numbers! The getpages() function parses the html and adds the page numbers it finds to an array of numbers called pages. The fetchpage() function simply grabs the html from a given url and saves it in a variable called "html". The code for these functions is below.

def getpages(self, gid):
	# fetch the html from the first page of the gallery
	if not self.fetchpage():
		print "Could not fetch the gallery page."
		return 0
	# if gallery not found, return error and quit
	if self.html.find("Page not found") != -1:
		print "Invalid Gallery ID given.  Flickr returned page not found."
		return 0
	# locate area where the page navigation is and loop through it
	t = self.html.find("Paginator")
	e = self.html.find("Next",t)
	while t != -1 and t < e:
		# get page numbers out of the page navigation
		t = self.html.find("<a",t)
		if t == -1:
			# no more links, so stop instead of wrapping around to the start
			break
		t = self.html.find(">",t)+1
		f = self.html.find("<",t)
		p = self.html[t:f]
		p = p.strip()
		# if link is a digit (page number) add it to array of page numbers
		if p.isdigit() and p not in self.pages:
			self.pages.append(p)
	return 1

def fetchpage(self):
	# set timeout higher so we don't screw up the download
	urllib.socket.setdefaulttimeout(120)
	print "Fetching html from %s" % (self.url % (self.user, self.gid, self.page))
	# try fetching the html, return 0 if it errors out
	try:
		sock = urllib.urlopen(self.url % (self.user, self.gid, self.page))
		self.html = sock.read()
		return 1
	except:
		return 0
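
To try the page-number scan without hitting the network, here is a standalone sketch of the same find()-based idea in Python 3 syntax. The function name find_page_numbers and the sample html are made up for illustration; the real Flickr markup may well differ.

```python
def find_page_numbers(html):
    """Collect digit-only link texts between "Paginator" and "Next"."""
    pages = ["1"]  # the first page always exists
    start = html.find("Paginator")
    end = html.find("Next", start) if start != -1 else -1
    if start == -1 or end == -1:
        return pages  # no navigation bar, so a single-page gallery
    pos = start
    while True:
        pos = html.find("<a", pos)
        if pos == -1 or pos > end:
            break  # ran out of links inside the navigation bar
        close = html.find(">", pos)
        if close == -1:
            break
        text_end = html.find("<", close)
        if text_end == -1:
            break
        text = html[close + 1:text_end].strip()
        if text.isdigit() and text not in pages:
            pages.append(text)
        pos = text_end
    return pages

# hypothetical navigation bar, NOT copied from Flickr
sample = (
    '<div class="Paginator">'
    '<span class="this-page">1</span> '
    '<a href="?page=2">2</a> '
    '<a href="?page=3">3</a> '
    '<a href="?page=2">Next</a>'
    '</div>'
)
print(find_page_numbers(sample))
```

Every find() result is checked against -1 here, since a missing match would otherwise send the scan back to the start of the html.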


Finding the images in the html
Next, we will go through the html of each page looking for the images we wish to download. To do this, we simply find every <img> tag in the html and check if the src attribute has the word "farm" in the url. If you look at the html source, only images associated with the gallery pictures have this distinction. If the word "farm" is found, we make sure that the image is not already in our saved array of image urls, and then add it to the array.

Another thing to notice is that we cut off the last character of the filename in the url. The reason is that the urls we fetch are only thumbnails, and we want to modify the url to download the original full size image. To do this, we take advantage of the naming scheme used by Flickr. The urls for different sized images all start the same, but then have a single different character on the very end. For instance, an image ending in "_s.jpg" is a thumbnail, while an image ending in "_o.jpg" is the original full size image. Therefore, we simply cut off the last character and .jpg from the url to simplify matters, and we will add the "o.jpg" just before we download the file. The code for this, as well as the method for downloading and saving the file, is below.

def getimages(self):
	# loop through page numbers
	for p in self.pages:
		# set current page and fetch the html for that page
		self.page = p
		self.fetchpage()
		t = 1
		n = 0
		# loop until we find all image tags, then break
		while 1:
			t = self.html.find("<img", t)
			if t == -1:
				break
			# get the src of the image tag
			t = self.html.find("src=",t) + 5
			e = self.html.find(" ",t) - 6
			i = self.html[t:e]
			# if the src is one of the gallery images we have not seen yet, save it
			if i.find("farm") > 0 and i not in self.images:
				self.images.append(i)
				n = n + 1
		print "Found %d images in page %s" % (n, p)

def fetchimage(self, i, name):
	# set timeout higher so we don't screw up a download
	urllib.socket.setdefaulttimeout(120)
	print "%s ----> %s" % (i,name)
	# try saving the image, else print an error and return 0
	try:
		sock = urllib.URLopener()
		sock.retrieve(i, name)
		return 1
	except:
		print "Error fetching image %s" % name
		return 0
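
The image scan can be tried on its own as well. Below is a sketch in Python 3 syntax that follows the same steps: find each <img> tag, keep the src values containing "farm", cut off the size letter and ".jpg", and skip duplicates. The function name find_gallery_images and the sample urls are invented for this example.

```python
def find_gallery_images(html):
    """Return de-duplicated gallery image urls with the size suffix removed."""
    images = []
    pos = 0
    while True:
        pos = html.find("<img", pos)
        if pos == -1:
            break  # no more image tags
        src_start = html.find('src="', pos)
        if src_start == -1:
            break
        src_start += len('src="')
        src_end = html.find('"', src_start)
        if src_end == -1:
            break
        src = html[src_start:src_end]
        pos = src_end
        # only gallery images have "farm" in the url
        if "farm" in src and src.endswith(".jpg"):
            base = src[:-len("s.jpg")]  # drop the size letter and ".jpg"
            if base not in images:
                images.append(base)
    return images

# made-up thumbnails; the second farm url appears twice on purpose
sample = (
    '<img src="http://farm1.static.flickr.com/100/1111_aaaa_s.jpg" width="75" />'
    '<img src="http://www.flickr.com/images/spaceball.gif" />'
    '<img src="http://farm1.static.flickr.com/100/2222_bbbb_s.jpg" width="75" />'
    '<img src="http://farm1.static.flickr.com/100/2222_bbbb_s.jpg" width="75" />'
)
for base in find_gallery_images(sample):
    print(base)
```

Unlike the script, this version looks for the closing quote of the src attribute rather than the next space, so it does not depend on what follows the attribute.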


Actually Downloading the files
Once we have completed the array of image urls, we call the download() method which will loop through the array, add the "o.jpg" to the url, and download the images to the specified location. The code for this can be seen below.

    def download(self):
        # loop through all images found
        for p in self.images:
            p = "%so.jpg" % p
            # get filename (the last part of the url)
            name = p.split("/")[-1]
            # combine the path we are saving to and the filename being saved
            filename = "%s%s" % (self.path, name)
            # try to fetch the image and save it
            if self.fetchimage(p, filename) == 0:
                print "Downloading image failed, quitting and exiting"
                break


Try it out yourself
Well, that's basically it. You can download the code below to test this out yourself.

Download the code here.

If you find any problems with it, feel free to contact me and let me know. I always appreciate constructive criticism. Also, if you're wondering about the license, there is not one. You're welcome to use the code all you want; I just put it online here in case someone else found it interesting or could make use of it. One final note: you might notice that when you use the script it downloads 3 images unrelated to the gallery we want. Those are the previews of other galleries shown at the bottom of the page. I could probably code it to ignore those, but an extra three images is no big deal so I didn't bother.


