{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# MOAR HELP PLEASE\n", "\n", "Thanks to your assistance, the toy shop is back on track. But sadly, the logistics elves are in a bit of a bind: They need help developing their deadlines for Santa. To better plan Santa's routes, they need to know when the sun sets and rises in a sample of places he'll visit.\n", "\n", "Thankfully, there exists a [site](http://www.sunrisesunset.com/) that has the information we need. All we need to do is **scrape** it from the site.\n", "\n", "## Our scraper\n", "\n", "Broadly, [web scraping](http://en.wikipedia.org/wiki/Web_scraping) refers to any time we extract information from websites. It can vary from simple human copy-pasting to sophisticated techniques that use machine learning.\n", "\n", "We'll build a simple program to:\n", "\n", " 1. Load the sunrisesunset.com home page\n", " 2. Find all the cities listed on the page\n", " 3. Get and save their sunrise and sunset times\n", " 4. Save the times to a file\n", "\n", "Along the way, we'll learn some stuff about web pages, and more Python syntax.\n", "\n", "### What's my browser *do* anyway?\n", "\n", "When you load a website, your browser sends a request through the interwebs; the server for the site you're trying to visit sends back data for your browser, which includes HTML and JavaScript. Your browser then **renders** that page to make it look friendly to humans. We're going to do pretty much the same thing, but instead of rendering the page, we're going to use clues in the HTML to find the data we need. 
Check out this little script:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import urllib2, bs4 # Here we're importing two separate modules\n", "\n", "page = urllib2.urlopen(\"http://www.sunrisesunset.com/\") # This opens a connection to a site\n", "soup = bs4.BeautifulSoup(page.read()) # This reads the HTML the connection sends back\n", "print soup.prettify()" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's still a huge mess. Let's make it easy on ourselves and use the browser. Right-click \"Selected U.S. cities\" on the page, then click \"Inspect element.\" This should bring up Chrome's developer panel, which shows you a (much more legible) view of the page's HTML.\n", "\n", "Notice that all of the data we want is within an HTML `<table>` tag. `BeautifulSoup` has a `find()` method that returns the first table that matches some criteria we pass it. But the table we're looking for isn't the very first table in the document. We'll have to figure out some way to differentiate between it and the other `<table>` tags.\n", "\n", "Can you spot anything we might be able to use to differentiate the table we want?" ] }, { "cell_type": "code", "collapsed": false, "input": [ "table = soup.find(\"table\", {\"width\":\"900\", \"class\":\"tableCenteredNoTopBottomMargin\"})" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 6 }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a pattern you'll leverage a lot if you program more scrapers:\n", "\n", " 1. Find the information you're looking for\n", " 2. Find what differentiates it from other elements of the page\n", "\n", "Now that we've found the right table, we need to repeat a certain pattern over its rows. This illustrates a very powerful concept called **iteration**. When we have a group of items we need to do the same thing to, we can make our lives way easier by breaking it all down:\n", "\n", " 1. Figure out, what do I have to do to *just one* of these items?\n", " 2. How can I repeat that action over *each* item?\n", " \n", "Applying that process to our scraper: Each row holds a small table for each of its cities, and from each of those tables we can get the city and its sunrise and sunset times." ] }, { "cell_type": "code", "collapsed": false, "input": [ "def getRowData(row):\n", "    tables = row.find_all(\"table\")\n", "    data = [] # Using empty brackets this way creates an empty list\n", "    for table in tables:\n", "        city = table.find(\"b\").string # What's this doing?\n", "        timeString = table.find(\"td\", {\"valign\" : \"top\"}).get_text() # What about this?\n", "        sunrise, sunset = timeString.lower().split(\"sunrise: \")[1].split(\"sunset: \") # Okay, this one was a LITTLE unfair. Any guesses?\n", "        data.append({\"city\" : city,\n", "                     \"sunrise\": sunrise,\n", "                     \"sunset\": sunset}) # .append() is a method of the list class: it adds an item to the list\n", "    if data != []:\n", "        return data\n", "rows = table.find_all(\"tr\")\n", "allData = [getRowData(row) for row in rows] # Um, what?\n", "\n", "\"\"\"\n", "Remember what we did with data up there? It was something like this:\n", "\n", "data = []\n", "for thing in someListOfThings:\n", "    data.append(someFunction(thing))\n", "\n", "Python gives us a succinct way of doing this called list comprehensions. \n", "Here's an equivalent way of doing that for loop, with a list comprehension:\n", "\n", "data = [someFunction(thing) for thing in someListOfThings]\n", "\n", "(You also just learned about Python multiline comments. \n", "The # symbol comments out the rest of a line. \n", "Triple quotes can be used to block out lengthier comments, \n", "without putting yourself through the tedium of typing a dozen #s.)\n", "\"\"\"\n", "\n", "allData = [_ for _ in allData if _ != None] # Any guesses?\n", "print allData[0], \"\\n\", allData[1]\n", "allData = [item for collection in allData for item in collection] # WHOAH. 
Hold the presses, what's going on?\n", "print allData[0], \"\\n\", allData[1]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[{'city': u'New York, New York', 'sunset': u'4:30pm', 'sunrise': u'7:14am'}, {'city': u'Chicago, Illinois', 'sunset': u'4:21pm', 'sunrise': u'7:13am'}, {'city': u'Denver, Colorado', 'sunset': u'4:37pm', 'sunrise': u'7:15am'}] \n", "[{'city': u'Phoenix, Arizona', 'sunset': u'5:23pm', 'sunrise': u'7:26am'}, {'city': u'Los Angeles, California', 'sunset': u'4:46pm', 'sunrise': u'6:53am'}, {'city': u'San Francisco, California', 'sunset': u'4:53pm', 'sunrise': u'7:19am'}]\n", "{'city': u'New York, New York', 'sunset': u'4:30pm', 'sunrise': u'7:14am'} \n", "{'city': u'Chicago, Illinois', 'sunset': u'4:21pm', 'sunrise': u'7:13am'}\n" ] } ], "prompt_number": 11 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Those last two list comprehensions are a *little* dense, so let's unpack 'em. List comprehensions are really intuitive after a little practice, but not at first. And they get confusing when you use a list comprehension to replace **nested loops**, groups of for loops within one another.\n", "\n", "Let's start with that first line after the comment:\n", "\n", "    allData = [_ for _ in allData if _ != None]\n", "    \n", "This is really close to what we saw in the multiline comment above. The only difference is that it filters the items in the list. It's equivalent to this for loop, which collects the keepers in a temporary list (`filtered`) so we don't empty `allData` before we've looped over it:\n", "\n", "    filtered = []\n", "    for _ in allData:\n", "        if _ != None:\n", "            filtered.append(_)\n", "    allData = filtered\n", "\n", "Why the underscore? There's really nothing special about it: It's just a commonly used throwaway variable. I've used it here so that if you see it out in the wild, you won't get confused wondering if it's something special. \n", "\n", "That last line though, what a doozy. 
Compare it to these equivalent lines and see if you can parse out what's going on:\n", "\n", "    allData = [item for collection in allData for item in collection]\n", "    \n", "    # ...is equivalent to...\n", "    \n", "    flattened = []\n", "    for collection in allData:\n", "        for item in collection:\n", "            flattened.append(item)\n", "    allData = flattened\n", "\n", "It's tricky at first. Start the list comprehension with the innermost part of the loop (the item you're collecting), then write the for loops in the same order they're nested, starting with the outermost and moving down. There's really no *need* to do this, but as your projects grow in scope, and you have many sets of iterations to perform, you'll come to appreciate the succinct syntax. (It's also generally faster, which may not matter for most applications, but could become a concern if your dataset is large.)\n", "\n", "Now all we need is a quick way to export this data to `.csv`, the way the elves like." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd\n", "pd.DataFrame(allData).to_csv(\"citySunData.csv\", encoding=\"utf-8\")\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 12 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Say what? Don't we have to store objects as variables? We have up to this point, usually because we wanted to make repeated use of them. Namely, we wanted to call multiple methods of the same object, with the same data. This time, we're only creating the `DataFrame` so we can write it to a file, so there's no need to give it a name." ] } ], "metadata": {} } ] }