{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "#Help for the Elf management\n", "\n", "It's that time of the year again: The toy crunch is upon us. The elves are *way* behind on wooden horses, and they need to put their best wooden horse cobblers on the assembly line.\n", "\n", "Unfortunately, the diligent toy makers are not the best record keepers. They don't like retyping names, so they nest their data like this:\n", "\n", "| Elf | Toy | Count |\n", "| --- | --- | -----:|\n", "|Buddy|Etch-a-Sketch|2|\n", "||Jack in the Box|7|\n", "|Jovie|Etch-a-Sketch|14|\n", "||Jack in the Box|25|\n", "||Wooden Horse|6|\n", "\n", "If you have a spreadsheet program installed, try opening `santaSample.csv` in it. It's a sample of the elves' records. We need to write a script that edits this data to fill in the right name; then, we need to filter out everything that's not a wooden horse.\n", "\n", "Thankfully, the library `pandas` can help us out.\n", "\n", "## What's `pandas`?\n", "\n", "[`pandas`](http://pandas.pydata.org/) is a library for data analysis. For now, think of a library like this: It's an existing batch of code with instructions\u2014called \"documentation\"\u2014on how to use it. In particular, a user doesn't need to know the internal details of how a library works to make good use of it.\n", "\n", "`pandas` is also an example of a Python **module**: A collection of related code, packaged together. If you think this sounds a lot like a library, [you are not alone](http://stackoverflow.com/questions/19198166/whats-the-difference-between-a-module-and-a-library-in-python). In broad strokes, \"library\" is a term commonly used across programming languages. The term \"module\" has a specific [definition](https://docs.python.org/2/tutorial/modules.html) within Python.\n", "\n", "Why `pandas`? It uses something called a `DataFrame` object, which can handle lots of this routine data munging type stuff for us.\n", "\n", "To use a module like `pandas`, we first need to **import** it into our program." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "What's going on here? The `import` statement is used to access code in a module. Using `as` lets us use whatever name we like for the module we've just pulled in. We could name it anything we like: `import pandas as santasBeard` would work just as well. \n", "\n", "Why `pd`? It's conventional (lots of Python programmers use it), it's convenient, and it's distinctive enough that we won't likely want to use these two letters for something else in our code.\n", "\n", "Now lets play with a `DataFrame`." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd\n", "elfStats = [{\"name\":\"Buddy\", \"height\": 75, \"weight\": 185},\n", " {\"name\":\"Papa\", \"height\":48, \"weight\":105}]\n", "elfStatsDf = pd.DataFrame(elfStats)\n", "print elfStatsDf" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, a little syntax lesson on the equals sign `=`, also called the assignment operator. To the Python interpreter, it says two things:\n", "\n", " - Evaluate whatever's to my right\n", " - Give it the name to my left\n", " \n", "This is the syntax used by Python, and many other languages, to assign names to **variables**. Variables are how we store a value so we can reference it elsewhere in the program, a convenient sort of shorthand.\n", "\n", "Next, a bit on the different brackets. Square brackets `[]` usually indicate some sort of data structure that is, intuitively, like a list. Python explicitly calls this structure a 'list', but other languages use other terms, and may use 'list' differently. Curly brackets `{}` create a Python `dict`, or **dictionary**. Lists are a collection of things in order; dicts store data using **attributes**. \n", "\n", " - To a list, you'd say, \"Give me your third item.\"\n", " - To a dict, \"Give me your `name` attribute.\"\n", " \n", "Using `print` lets us see what a `DataFrame`, or anything else, looks like up close and personal. \n", "\n", "Notice that the `DataFrame` resembles the row-and-column structure of spreadsheet. Remember all those times you've summed a column in Excel? `pandas` can do that too:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print elfStatsDf.sum()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "height 123\n", "name BuddyPapa\n", "weight 290\n", "dtype: object\n" ] } ], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "`.sum()` is a special type of **function**. Functions are just like the ones you studied in math class:\n", "\n", "$$f(x) = x^2$$\n", "\n", "Python would do the same thing, like this:\n", "\n", " def square(x):\n", " return pow(x,2)\n", "\n", "Intuitively, functions are just things we rely on to complete tasks. They can be simple, like `square`, or much more complex. Most importantly, we use them to accomplish things we need to do time and again. They're our way of taking a bunch of steps, blocking them together, and saving them in a form we can reuse later. Some functions return a value, but there are some times when they just run a few operations and don't return anything, like this:\n", "\n", " def printFullName(person):\n", " print person[\"firstName\"], \" \", person[\"lastName\"]\n", "\n", "Remember when we discussed `dict` attributes above? This function demonstrates how to access an attribute of a dict: Place the name of the attribute, in quotes, in square brackets after the name of the dict. This is called **indexing**. In many programming languages, brackets indicate an index operation. Brackets say:\n", "\n", " - Look for the term or integer inside me, but only look within the object to my left\n", "\n", "Items in a list are indexed by integer values, i.e. the order they appear in the list. Attributes in a dictionary are indexed by their **key**, a string.\n", "\n", "Back to `.sum()`. In particular, `.sum()` is a **method** belonging to the `DataFrame` **class**. This is where definitions can get confusing, because they all sort of reference each other. Here's a rough and tumble guide to keeping them straight:\n", "\n", " - A class is a blueprint for something called an instance. Your bike is an instance; the class is the set of specifications used to build it.\n", " - Not every bike is identical, even if they're the same model. Yours might be a stock blue, and someone else's might be green. Maybe someone else raised the seat. That is, objects have **attributes**, and each instance of an object has their own.\n", " - Methods associated with the class perform operations. Frequently, these operations rely on some data attribute of the class instance. `andrewsBike.getColor()` might tell you the color of Andrew's bike. Call `andrewsBike.setColor(\"blue\")`, and you just painted his bike.\n", "\n", "So in the parlance we just outlined above, we want to create an instance of the class `DataFrame` that uses the data in `santaSample.csv`, then use some methods to perform operations on that data.\n", "\n", "Does `pandas` include a method to do this? Google around a bit and try to find out. If you don't have any luck, start [here](http://pandas.pydata.org/pandas-docs/stable/io.html). " ] }, { "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd\n", "# toyData = ?\n", "print toyData" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ " Elf Toy Count\n", "0 Buddy Etch-a-Sketch 2\n", "1 NaN Jack in the Box 7\n", "2 Jovie Etch-a-Sketch 14\n", "3 NaN Jack in the Box 25\n", "4 NaN Wooden Horse 6\n", "5 Papa Etch-a-Sketch 435\n", "6 NaN Jack in the Box 680\n", "7 NaN Wooden Horse 39\n", "8 Gary Easter Etch-a-Sketch 415\n", "9 NaN Jack in the Box 675\n", "10 NaN Wooden Horse 184\n", "\n", "[11 rows x 3 columns]\n" ] } ], "prompt_number": 7 }, { "cell_type": "markdown", "metadata": {}, "source": [ "See? Just like a spreadsheet. We even still have those pesky blank cells, expect they're replaced with `NaN`. This stands for \"not a number\"; by default, `pandas` treats blank cells as `NaN`. (My dad builds helicopters. Apparently assembling Black Hawks and wooden horses are transferable skills\u2014who knew?) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## \"Read the docs\"\n", "\n", "As you learn to program, you'll spend tons of time on the web: You'll read tutorials, look through examples, and read Q&A sites\u2014primarily Stack Overflow. Sooner or later, you'll come across a refrain, \"Read the docs.\" It's an almost-pleasant way of telling someone that the answer to their question is answered in the library documentation. The documentation is always the first place to turn for an answer to a question about a library or module.\n", "\n", "Since we've got our data properly represented, now is a good time to look through the [`DataFrame` docs](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html#pandas.DataFrame) for a method that'll help replace those `NaN` values. \n", "\n", "(Hint: The operation we're trying to accomplish here is called a 'forward fill.')" ] }, { "cell_type": "code", "collapsed": false, "input": [ "\n", "print toyData" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ " Elf Toy Count\n", "0 Buddy Etch-a-Sketch 2\n", "1 Buddy Jack in the Box 7\n", "2 Jovie Etch-a-Sketch 14\n", "3 Jovie Jack in the Box 25\n", "4 Jovie Wooden Horse 6\n", "5 Papa Etch-a-Sketch 435\n", "6 Papa Jack in the Box 680\n", "7 Papa Wooden Horse 39\n", "8 Gary Easter Etch-a-Sketch 415\n", "9 Gary Easter Jack in the Box 675\n", "10 Gary Easter Wooden Horse 184\n", "\n", "[11 rows x 3 columns]\n" ] } ], "prompt_number": 5 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Perfect! Now each row is a full record. We just need to filter out the rest of those other, irrelevant toys. To do that, we'll use a `pandas` **boolean array**. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "print toyData['Toy'] == \"Wooden Horse\"" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "0 False\n", "1 False\n", "2 False\n", "3 False\n", "4 True\n", "5 False\n", "6 False\n", "7 True\n", "8 False\n", "9 False\n", "10 True\n", "Name: Toy, dtype: bool\n" ] } ], "prompt_number": 6 }, { "cell_type": "markdown", "metadata": {}, "source": [ "It's essentially a big list of whether or not a certain condition is true for each row in the `DataFrame`. Cool, but the big win is that `pandas` lets you use these for indexing." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd\n", "toyData = pd.read_csv('santaSample.csv')\n", "toyData = toyData.ffill()\n", "toyData = toyData[toyData['Toy'] == \"Wooden Horse\"]\n", "print toyData" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ " Elf Toy Count\n", "4 Jovie Wooden Horse 6\n", "7 Papa Wooden Horse 39\n", "10 Gary Easter Wooden Horse 184\n", "\n", "[3 rows x 3 columns]\n" ] } ], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "See what happened there? When we place a boolean array in between `[]` after a `DataFrame`, we get only the `True` rows. \n", "\n", "Typical Python indexing looks like this:\n", "\n", " elves = [\"Buddy\", \"Papa\", \"Gary\"]\n", " print elves[1]\n", " >>> 'Papa'\n", "\n", "In addition to single indexes, Python supports ranges:\n", "\n", " print elves[0:2]\n", " >>> ['Buddy', 'Papa']\n", "\n", "But that's about all the indexing Python supports out of box. This `True` and `False` stunt we're pulling is a `pandas` addition.\n", "\n", "Thanks to `pandas` extra muscle, we've got almost exactly what we want! Let's get rid of that `Toy` column: We already know we're only interested in wooden horses. To do this, we'll leverage a little more of `pandas` indexing muscle." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd\n", "toyData = pd.read_csv('santaSample.csv')\n", "toyData = toyData.ffill()\n", "toyData = toyData.ix[toyData['Toy'] == \"Wooden Horse\", [\"Elf\", \"Count\"]]\n", "print toyData" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ " Elf Count\n", "4 Jovie 6\n", "7 Papa 39\n", "10 Gary Easter 184\n", "\n", "[3 rows x 2 columns]\n" ] } ], "prompt_number": 15 }, { "cell_type": "markdown", "metadata": {}, "source": [ "`pandas` uses the `.ix` attribute of a `DataFrame` instance to do some more heavy-duty indexing. Notice that there's a comma in the bracketed section. Before the comma, we place the row index like we did above; after the comma, we give a list of columns to include.\n", "\n", "Now, we just need to write it to file." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd\n", "toyData = pd.read_csv('santaSample.csv')\n", "toyData = toyData.ffill()\n", "toyData = toyData.ix[toyData['Toy'] == \"Wooden Horse\", [\"Elf\", \"Count\"]]\n", "toyData.to_csv(\"santaSampleFiltered.csv\", index = False)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "That `index = False` in the last line tells `to_csv` not to save the row index of the `DataFrame`. The row index is much like the numbers you'd see to the left of a sheet in a spreadsheet program. If you take a look at the docs, you'll see `index = True` in the `to_csv` arguments. This means that `index` is an **optional argument**: The user is free not to pass it to the function, but it has a default value to use when the user makes that choice." ] } ], "metadata": {} } ] }