Commit 75d4dad

Author: SeanEaster
Commit message: Initial
0 parents  commit 75d4dad

File tree

7 files changed: +1120 −0 lines changed


ExampleOne.ipynb

Lines changed: 413 additions & 0 deletions
Large diffs are not rendered by default.

ExampleOneAnswered.ipynb

Lines changed: 404 additions & 0 deletions
Large diffs are not rendered by default.

citySunData.csv

Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@
,city,sunrise,sunset
0,"New York, New York",7:14am,4:30pm
1,"Chicago, Illinois",7:13am,4:21pm
2,"Denver, Colorado",7:15am,4:37pm
3,"Phoenix, Arizona",7:26am,5:23pm
4,"Los Angeles, California",6:53am,4:46pm
5,"San Francisco, California",7:19am,4:53pm
6,"Seattle, Washington",7:53am,4:19pm
7,"Anchorage, Alaska",10:12am,3:40pm
8,"Honolulu, Hawaii",7:03am,5:53pm
9,"Augusta National, Georgia, USA",7:26am,5:22pm
10,"St Andrews, Fife, Scotland",8:41am,3:34pm
11,"Windsor Park, 東茨城郡, Japan",6:45am,4:25pm
12,"Grand Canyon, South Rim, Arizona",7:33am,5:16pm
13,"Yosemite, El Capitan, California",7:08am,4:42pm
14,"Old Faithful Geyser, Wyoming",7:54am,4:45pm
15,"Haleakala, Maui, Hawaii",6:55am,5:48pm
16,"Carlsbad Caverns, New Mexico",6:53am,4:55pm
17,"Denali National Park, Alaska",10:41am,3:03pm
18,"Calgary, Alberta",8:35am,4:30pm
19,"Halifax, Nova Scotia",7:46am,4:35pm
20,"Montréal, Québec",7:29am,4:12pm
21,"Ottawa, Ontario",7:37am,4:21pm
22,"Regina, Saskatchewan",8:54am,4:55pm
23,"St. John's, Newfoundland",7:44am,4:10pm
24,"Toronto, Ontario",7:46am,4:42pm
25,"Vancouver, British Columbia",8:03am,4:15pm
26,"Winnipeg, Manitoba",8:22am,4:28pm
27,"Amsterdam, Netherlands",8:46am,4:27pm
28,"Athens, Greece",7:35am,5:07pm
29,"Bangkok, Thailand",6:35am,5:54pm
30,"Bern, Switzerland",8:11am,4:42pm
31,"Bogota, Colombia",5:57am,5:48pm
32,"Brussels, Belgium",8:40am,4:37pm
33,"Buenos Aires, Argentina",5:36am,8:04pm
34,"Cairo, Egypt",6:45am,4:58pm
35,"Calcutta, India",6:10am,4:55pm
36,"Caracas, Venezuela",6:09am,5:39pm
37,"Casablanca, Morocco",7:29am,5:25pm
38,"København, Denmark",8:35am,3:37pm
39,"Dublin, Ireland",8:36am,4:07pm
40,"Frankfurt, Germany",8:20am,4:24pm
41,"Glasgow, Scotland",8:44am,3:43pm
42,"Havana, Cuba",7:04am,5:47pm
43,"Hong Kong, China",6:57am,5:43pm
44,"Jerusalem, Israel",6:33am,4:38pm
45,"Johannesburg, South Africa",5:11am,6:57pm
46,"Lima, Peru",5:40am,6:29pm
47,"London, United Kingdom",8:02am,3:52pm
48,"Madrid, Spain",8:32am,5:50pm
49,"Mexico City, Mexico",7:04am,6:02pm
50,"Moscow, Russia",9:55am,4:56pm
51,"Nairobi, Kenya",6:23am,6:35pm
52,"Paris, France",8:39am,4:55pm
53,"Reykjavik, Iceland",11:19am,3:29pm
54,"Rio de Janeiro, Brazil",6:03am,7:35pm
55,"Rome, Italy",7:32am,4:40pm
56,"San Jose, Costa Rica",5:46am,5:19pm
57,"Santiago, Chile",6:28am,8:49pm
58,"Seoul, South Korea",7:41am,5:16pm
59,"Shanghai, China",6:47am,4:54pm
60,"Singapore, Singapore",7:00am,7:02pm
61,"Stanley, Falkland Islands",4:28am,9:08pm
62,"Stockholm, Sweden",8:41am,2:47pm
63,"Sydney, Australia",5:39am,8:03pm
64,"Taipei, Taiwan",6:33am,5:08pm
65,"東京 (Tokyo), Japan",6:45am,4:30pm
66,"Vienna, Austria",7:40am,4:01pm
67,"Wellington, New Zealand",5:43am,8:52pm
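
The sunrise and sunset values above are plain strings; to do anything quantitative with them (daylight hours, say) they need parsing. A minimal, self-contained sketch, with three rows of the CSV inlined so it runs on its own (`daylight_hours` is a helper name introduced here, not part of the repo):

```python
from datetime import datetime
from io import StringIO

import pandas as pd

# Three rows inlined from citySunData.csv so the sketch is self-contained.
csv_text = """,city,sunrise,sunset
0,"New York, New York",7:14am,4:30pm
7,"Anchorage, Alaska",10:12am,3:40pm
63,"Sydney, Australia",5:39am,8:03pm
"""

df = pd.read_csv(StringIO(csv_text), index_col=0)

def daylight_hours(row):
    """Hours of daylight, parsed from strings like '7:14am'."""
    rise = datetime.strptime(row["sunrise"], "%I:%M%p")
    sset = datetime.strptime(row["sunset"], "%I:%M%p")
    return (sset - rise).total_seconds() / 3600

df["daylight"] = df.apply(daylight_hours, axis=1)
print(df[["city", "daylight"]])
```

`%I:%M%p` matches 12-hour times, and `strptime` accepts the lowercase `am`/`pm` used in the file.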

exampleTwo.ipynb

Lines changed: 208 additions & 0 deletions
@@ -0,0 +1,208 @@
{
"metadata": {
"name": ""
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# MOAR HELP PLEASE\n",
"\n",
"Thanks to your assistance, the toy shop is back on track. But sadly, the logistics elves are in a bit of a bind: They need help setting deadlines for Santa. To better plan Santa's routes, they need to know when the sun sets and rises in a sample of places he'll visit.\n",
"\n",
"Thankfully, there exists a [site](http://www.sunrisesunset.com/) that has the information we need. All we need to do is **scrape** it from the site.\n",
"\n",
"## Our scraper\n",
"\n",
"Broadly, [web scraping](http://en.wikipedia.org/wiki/Web_scraping) refers to any time we extract information from websites. It can vary from simple human copy-pasting to sophisticated techniques that use machine learning.\n",
"\n",
"We'll build a simple program to:\n",
"\n",
" 1. Load the sunrisesunset.com home page\n",
" 2. Find all the cities listed on the page\n",
" 3. Get their sunrise and sunset times\n",
" 4. Save the times to a file\n",
"\n",
"Along the way, we'll learn some stuff about web pages, and more Python syntax.\n",
"\n",
"### What's my browser *do* anyway?\n",
"\n",
"When you load a website, your browser sends a request through the interwebs; the server for the site you're trying to visit sends back data for your browser, which includes HTML and JavaScript. Your browser then **renders** that page to make it look friendly to humans. We're going to do pretty much the same thing, but instead of rendering the page, we're going to use clues in the HTML to find the data we need. Check out this little script:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import urllib2, bs4 # Here we're importing two separate modules\n",
"\n",
"page = urllib2.urlopen(\"http://www.sunrisesunset.com/\") # this opens a connection to a site\n",
"soup = bs4.BeautifulSoup(page.read()) # This reads the html the connection sends back\n",
"print soup.prettify()"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That's still a huge mess. Let's make it easy on ourselves and use the browser. Right-click \"Selected U.S. cities\" on the page, then click \"Inspect element.\" This should bring up Chrome's developer panel, which should be showing you a (much more legible) view of the page's html.\n",
"\n",
"Notice that all of the data we want is within an html `<table>` tag. `BeautifulSoup` has a `find()` method that can find the first table that matches some criteria we pass it. But the table we're looking for isn't the very first table in the document, so we'll have to figure out some way to differentiate it from the other `<table>` tags.\n",
"\n",
"Can you spot anything we might be able to use to differentiate the table we want?"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"table = soup.find(\"table\", {\"width\":\"900\", \"class\":\"tableCenteredNoTopBottomMargin\"})"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 6
74+
},
75+
{
76+
"cell_type": "markdown",
77+
"metadata": {},
78+
"source": [
79+
"This is a pattern you'll leverage a lot if you program more scrapers:\n",
80+
"\n",
81+
" 1. Find the information you're looking for\n",
82+
" 2. Find what differentiates it from other elements of the page\n",
83+
"\n",
84+
"Now that we've found the right table, we need to repeat a certain pattern over rows. This illustrates a very powerful concept called **iteration**. When we have a group of items we need to do the same thing to, we can make our lives way easier breaking it all down.\n",
85+
"\n",
86+
" 1. Figure out, what do I have to do to *just one* of these items?\n",
87+
" 2. How can I repeat that action over *each* item?\n",
88+
" \n",
89+
"Applying that process to our scraper: Each row has a table in it, and from that table we can get the city and its sunrise and sunset times."
90+
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def getRowData(row):\n",
"    tables = row.find_all(\"table\")\n",
"    data = [] # Using empty brackets this way creates an empty list\n",
"    for table in tables:\n",
"        city = table.find(\"b\").string # What's this doing?\n",
"        timeString = table.find(\"td\", {\"valign\" : \"top\"}).get_text() # What about this?\n",
"        sunrise, sunset = timeString.lower().split(\"sunrise: \")[1].split(\"sunset: \") # Okay, this one was a LITTLE unfair. Any guesses?\n",
"        data.append({\"city\" : city,\n",
"                     \"sunrise\": sunrise,\n",
"                     \"sunset\": sunset}) # .append() is a method of the list class: it adds an item to the list\n",
"    if data != []:\n",
"        return data\n",
"rows = table.find_all(\"tr\")\n",
"allData = [getRowData(row) for row in rows] # Um, what?\n",
"\n",
"\"\"\"\n",
"Remember what we did with data up there? It was something like this:\n",
"\n",
"data = []\n",
"for thing in someListOfThings:\n",
"    data.append(someFunction(thing))\n",
"\n",
"Python gives us a succinct way of doing this called list comprehensions. \n",
"Here's an equivalent way of doing that for loop, with a list comprehension:\n",
"\n",
"data = [someFunction(thing) for thing in someListOfThings]\n",
"\n",
"(You also just learned about Python multiline comments. \n",
"The # symbol comments out the rest of a line. \n",
"Triple quotes can be used to block out lengthier comments, \n",
"without putting yourself through the tedium of typing a dozen #s.)\n",
"\"\"\"\n",
"\n",
"allData = [_ for _ in allData if _ != None] # Any guesses?\n",
"print allData[0], \"\\n\", allData[1]\n",
"allData = [item for collection in allData for item in collection] # WHOAH. Hold the presses, what's going on?\n",
"print allData[0], \"\\n\", allData[1]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"[{'city': u'New York, New York', 'sunset': u'4:30pm', 'sunrise': u'7:14am'}, {'city': u'Chicago, Illinois', 'sunset': u'4:21pm', 'sunrise': u'7:13am'}, {'city': u'Denver, Colorado', 'sunset': u'4:37pm', 'sunrise': u'7:15am'}] \n",
"[{'city': u'Phoenix, Arizona', 'sunset': u'5:23pm', 'sunrise': u'7:26am'}, {'city': u'Los Angeles, California', 'sunset': u'4:46pm', 'sunrise': u'6:53am'}, {'city': u'San Francisco, California', 'sunset': u'4:53pm', 'sunrise': u'7:19am'}]\n",
"{'city': u'New York, New York', 'sunset': u'4:30pm', 'sunrise': u'7:14am'} \n",
"{'city': u'Chicago, Illinois', 'sunset': u'4:21pm', 'sunrise': u'7:13am'}\n"
]
}
],
"prompt_number": 11
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Those last two lines are a *little* dense, so let's unpack 'em. List comprehensions are really intuitive after a little practice, but not at first. And they get confusing when you use a list comprehension to replace **nested loops**, groups of for loops within one another.\n",
"\n",
"Let's start with that first line after the comment:\n",
"\n",
"    allData = [_ for _ in allData if _ != None]\n",
"    \n",
"This is really close to what we saw in the multiline comment above. The only difference is that it filters the items in the list. It's equivalent to this for loop:\n",
"\n",
"    filtered = []\n",
"    for _ in allData:\n",
"        if _ != None:\n",
"            filtered.append(_)\n",
"    allData = filtered\n",
"\n",
"Why the underscore? There's really nothing special about it: It's just a commonly used throwaway variable. I've used it here so that if you see it out in the wild, you won't get confused wondering if it's something special. \n",
"\n",
"That last line though, what a doozy. Compare it to these equivalent lines and see if you can parse out what's going on:\n",
"\n",
"    allData = [item for collection in allData for item in collection]\n",
"    \n",
"    # ...is equivalent to...\n",
"    \n",
"    flattened = []\n",
"    for collection in allData:\n",
"        for item in collection:\n",
"            flattened.append(item)\n",
"    allData = flattened\n",
"\n",
"It's tricky at first. Read the list comprehension starting with the innermost part of the loop, then follow the for clauses from the outermost loop inward. There's really no *need* to do this, but as your projects grow in scope and you have many sets of iterations to perform, you'll come to appreciate the succinct syntax. (It's also generally faster, which may not matter for most applications, but could become a concern as your datasets grow.)\n",
"\n",
"Now all we need is a quick way to export this data to `.csv`, the way the elves like."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import pandas as pd\n",
"pd.DataFrame(allData).to_csv(\"citySunData.csv\", encoding=\"utf-8\")\n"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 12
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Say what? Don't we have to store objects as variables? We have up to this point, usually because we wanted to make repeated use of them. Namely, we wanted to call multiple methods of the same object, on the same data. This time, we're only creating the `DataFrame` so we can write it to a file."
]
}
],
"metadata": {}
}
]
}
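
For readers without the notebook handy, the `find()`-with-attributes pattern the cells above rely on can be tried against a miniature, made-up snippet of markup. This is only a sketch: the attribute values are copied from the notebook, the HTML below is invented for illustration, and the real site's markup is more complex (and may have changed since):

```python
import bs4

# A made-up stand-in for the markup the notebook targets: an outer table
# identified by its width and class, with a nested per-city table inside.
html = """
<table width="900" class="tableCenteredNoTopBottomMargin">
  <tr><td><table>
    <b>New York, New York</b>
    <td valign="top">Sunrise: 7:14am Sunset: 4:30pm</td>
  </table></td></tr>
</table>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

# find() with an attribute dict picks out the one table we want...
table = soup.find("table", {"width": "900",
                            "class": "tableCenteredNoTopBottomMargin"})
inner = table.find("table")

# ...and the same extraction logic as the notebook's getRowData():
city = inner.find("b").string
timeString = inner.find("td", {"valign": "top"}).get_text()
sunrise, sunset = timeString.lower().split("sunrise: ")[1].split("sunset: ")
print(city, sunrise.strip(), sunset.strip())
```

(The notebook is written for Python 2; this sketch uses Python 3's `print()`.)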

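The notebook's two trickiest lines, the filtering and flattening comprehensions, can also be replayed on a tiny hand-made stand-in for `allData` (the city values below are taken from the notebook's printed output; the structure is the point, not the data):

```python
# A stand-in for the scraper's allData: some entries are None (rows that
# yielded no city tables), the rest are lists of per-city dicts.
allData = [
    [{"city": "New York, New York", "sunrise": "7:14am", "sunset": "4:30pm"},
     {"city": "Chicago, Illinois", "sunrise": "7:13am", "sunset": "4:21pm"}],
    None,
    [{"city": "Denver, Colorado", "sunrise": "7:15am", "sunset": "4:37pm"}],
]

# Drop the Nones...
allData = [rows for rows in allData if rows is not None]
# ...then flatten the list of lists into one flat list of dicts.
allData = [item for collection in allData for item in collection]

print(len(allData))  # one dict per city: 3
```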
readme.md

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
# Munging and Scraping

This basic tutorial covers using Python's `pandas` library to clean and filter data, along with basic web scraping, and explains a little bit about Python syntax along the way. It was first given to a group of GIS analysts: It's aimed at people who know plenty about data, but assumes no knowledge of how to code.

Easy steps to getting ready:

1. To quickly get all the modules you'll need, download and install [Anaconda](https://store.continuum.io/cshop/anaconda/).
2. Open a terminal, navigate to the repo folder, and enter `ipython notebook`.
3. Start with the first example. First-timers, note that you'll generally have to run the code snippets in order. Do this for each by clicking the code block, and then clicking the play button.

The first example aimed to give participants a chance to code with an experienced tutor present. One version leaves some coding to the participant, and another has all the answers coded. The second example is a walk-through of how to build a basic scraper.

santaSample.csv

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
Elf,Toy,Count
Buddy,Etch-a-Sketch,2
,Jack in the Box,7
Jovie,Etch-a-Sketch,14
,Jack in the Box,25
,Wooden Horse,6
Papa,Etch-a-Sketch,435
,Jack in the Box,680
,Wooden Horse,39
Gary Easter,Etch-a-Sketch,415
,Jack in the Box,675
,Wooden Horse,184
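
This commit doesn't show how `santaSampleFiltered.csv` (below) is derived from this file, but one plausible pandas recipe that reproduces it — forward-filling the blank `Elf` cells, then keeping only the Wooden Horse rows — is:

```python
from io import StringIO

import pandas as pd

# santaSample.csv, inlined; a blank Elf cell means "same elf as the row above".
csv_text = """Elf,Toy,Count
Buddy,Etch-a-Sketch,2
,Jack in the Box,7
Jovie,Etch-a-Sketch,14
,Jack in the Box,25
,Wooden Horse,6
Papa,Etch-a-Sketch,435
,Jack in the Box,680
,Wooden Horse,39
Gary Easter,Etch-a-Sketch,415
,Jack in the Box,675
,Wooden Horse,184
"""

df = pd.read_csv(StringIO(csv_text))
df["Elf"] = df["Elf"].ffill()                      # fill blanks from above
horses = df[df["Toy"] == "Wooden Horse"][["Elf", "Count"]]
print(horses.to_csv(index=False))
```

Buddy made no Wooden Horses, so he drops out of the filtered output, matching `santaSampleFiltered.csv`.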

santaSampleFiltered.csv

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
Elf,Count
Jovie,6
Papa,39
Gary Easter,184
