{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# MOAR HELP PLEASE\n", "\n", "Thanks to your assistance, the toy shop is back on track. But sadly, the logistics elves are in a bit of a bind: They need help setting deadlines for Santa. To better plan Santa's routes, they need to know when the sun rises and sets in a sample of places he'll visit.\n", "\n", "Thankfully, there exists a [site](http://www.sunrisesunset.com/) that has the information we need. All we need to do is **scrape** it from the site.\n", "\n", "## Our scraper\n", "\n", "Broadly, [web scraping](http://en.wikipedia.org/wiki/Web_scraping) refers to any time we extract information from websites. It can vary from simple human copy-pasting to sophisticated techniques that use machine learning.\n", "\n", "We'll build a simple program to:\n", "\n", " 1. Load the sunrisesunset.com home page\n", " 2. Find all the cities listed on the page\n", " 3. Get each city's sunrise and sunset times\n", " 4. Save the times to a file\n", "\n", "Along the way, we'll learn some stuff about web pages and some more Python syntax.\n", "\n", "### What's my browser *do* anyway?\n", "\n", "When you load a website, your browser sends a request through the interwebs; the server for the site you're trying to visit sends back data, including HTML and JavaScript. Your browser then **renders** that page to make it look friendly to humans. We're going to do pretty much the same thing, but instead of rendering the page, we're going to use clues in the HTML to find the data we need. 
Check out this little script:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import urllib2, bs4 # Here we're importing two separate modules\n", "\n", "page = urllib2.urlopen(\"http://www.sunrisesunset.com/\") # Open a connection to the site and fetch the page\n", "soup = bs4.BeautifulSoup(page.read(), \"html.parser\") # Parse the HTML the connection sends back into a navigable tree\n", "print soup.prettify()" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's still a huge mess. Let's make it easy on ourselves and use the browser. Right-click \"Selected U.S. cities\" on the page, then click \"Inspect element.\" This should bring up Chrome's developer panel, which should be showing you a (much more legible) view of the page's HTML.\n", "\n", "Notice that all of the data we want is within an HTML `
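To make step 2 concrete — finding the city links on the page — here's a minimal sketch using only the Python 3 standard library's `html.parser` (so it runs even without bs4 installed). The HTML snippet and the link paths in it are made up for illustration; the real page's markup will differ, and you'd feed the parser the HTML your scraper downloaded instead.

```python
from html.parser import HTMLParser

# A made-up stand-in for the home page's list of cities.
SNIPPET = """
<table>
  <tr><td><a href="/usa/anchorage.asp">Anchorage</a></td></tr>
  <tr><td><a href="/usa/chicago.asp">Chicago</a></td></tr>
</table>
"""

class LinkFinder(HTMLParser):
    """Collects (link text, href) pairs for every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> tag we're currently inside, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        # Only record text that appears inside an <a> tag.
        if self._href is not None and data.strip():
            self.links.append((data.strip(), self._href))

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None

finder = LinkFinder()
finder.feed(SNIPPET)
# finder.links is now a list of (city name, relative URL) pairs.
```

With bs4 you'd get the same result more directly via `soup.find_all("a")`; the hand-rolled parser just shows what's happening under the hood.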
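And for step 4 — saving the times to a file — a small sketch with the standard library's `csv` module. The city names and times below are invented placeholders; in the real scraper, `times` would be filled with whatever you extracted from the page.

```python
import csv

# Hypothetical scraped results: (city, sunrise, sunset) rows.
times = [
    ("Anchorage", "10:14", "15:42"),
    ("Chicago", "7:15", "16:23"),
]

# Write a header row, then one row per city.
with open("sun_times.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["city", "sunrise", "sunset"])
    writer.writerows(times)
```

The resulting `sun_times.csv` opens cleanly in a spreadsheet, which is handy for handing the schedule over to the logistics elves.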