programming - Stringfest Analytics

Python was not designed for data analysis (and why that’s OK)

George Mount — Sat, 02 Jul 2022 15:52:49 +0000

A major reason I think it’s easier for Excel users to pick up R than Python is that Excel and R tend to “think” more alike than Excel and Python do. See what I mean here: let’s take a range of numbers and attempt to multiply it by two using the built-in range, vector and list objects in Excel, R and Python respectively:

[Figure: multiplying a range by two in Excel, R and Python]

Looks pretty straightforward in Excel and R, right? Take the range, multiply by two, get each number times two. By contrast, Python does something rather different: it literally takes the range, and duplicates it (so we get eight numbers not four). Weird, right?
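The Python side of that comparison is easy to reproduce. Here is a minimal sketch, assuming a four-number list (the values and variable names are my own illustrative choices, not necessarily the ones used in the comparison above):

```python
# Illustrative values; any four numbers would behave the same way
my_list = [1, 2, 3, 4]

# The * operator on a list repeats the list rather than scaling each element
repeated = my_list * 2
print(repeated)       # [1, 2, 3, 4, 1, 2, 3, 4]
print(len(repeated))  # 8

# To double each element in plain Python, spell out the loop yourself
doubled = [x * 2 for x in my_list]
print(doubled)        # [2, 4, 6, 8]
```

The list comprehension gets the Excel-and-R result, but you have to ask for the element-by-element behavior explicitly.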

Well, not necessarily. Excel and R were designed for statistics and arithmetic. Python was designed more generally to communicate with the operating system, process errors, and so forth. The way a program ought to “think” for these tasks is rather different than for analyzing data.

“You’re crazy, bud. Python’s cleaning up in the data space right now,” you may be thinking (pun intended). That’s true. And it’s with the help of a fantastic set of packages that make analyzing data there feel a lot more natural (you may have heard of some of these: pandas, scikit-learn, numpy, etc.).

This post isn’t a takedown of Python or an endorsement of R. You could never pick a favorite child. It’s just an exploration of how software objectives inform software behavior, with a very simple example.

To get started with this great set of tools for data analysis, check out my book Advancing into Analytics.

The post Python was not designed for data analysis (and why that’s OK) first appeared on Stringfest Analytics.

Here’s how R and Excel think similarly about data

George Mount — Mon, 17 Jan 2022 20:08:51 +0000

If you’re an Excel user interested in building one-of-a-kind visualizations and stretching your analytics limits, R is a great program to add to your toolkit. It’s also a natural stepping-stone into programming, as R’s core data structures are quite similar to Excel’s.

Here’s how:

Excel’s named ranges are a lot like R’s vectors

Named ranges in Excel

We’ll start the comparisons with Excel’s named ranges. This is a fantastic Excel feature that will let you refer to any given range of cells with one object name.

For example, I’ve created named ranges my_numbers and my_strings containing the sets of numbers and text in the following example. Find a blank cell in the worksheet and type =my_numbers * 2. What happens?

The gist here is that when we define this data as a named range, we can refer to and operate on all that data in one fell swoop. No more absolute references and dragging formulas across cells to operate on a range.

This works for operations on numbers just as well as on strings. Check out the following examples and try it for yourself:

Vectors in R

Through the magic of named ranges in the previous example, Excel operated on all of the values in our range at the same time. This operation is known as vectorization, and it’s a powerful concept in numerical computing.

Guess what other program uses vectorization extensively? That’s right, R. In fact, R’s core building block is even called a vector, and it’s very similar to an Excel named range.

In the following example, we’ll combine multiple numbers into one R vector using the c() function (standing for combine). This is very similar to naming a range of cells in Excel. We can then operate on the data very similarly to how we would in Excel.

Go ahead and run the code to see the results for yourself:



# Create a vector

my_vector <- c(5, 6, 7, 3, 1)
# Multiply vector
my_vector * 2
# Take square root of vector
sqrt(my_vector)




And yes, in case you were wondering, this will work on text just as well:



# Create a vector

my_vector <- c('you', 'are', 'an', 'awesome', 'analyst!')
# Convert to uppercase
toupper(my_vector)
# Get number of characters
nchar(my_vector)




So far we've been working on so-called one-dimensional ranges of data (i.e. rows only). Now it's time to move to two dimensions: rows and columns. Vectorization is a powerful concept... too powerful not to use here too. The similarities between Excel and R continue.

Tables are like data frames

Excel Tables

If you're not already using Excel tables... do it! This is one of my favorite Excel features of all time, and one that an alleged single-digit percentage of Excel users are taking advantage of:

Excel tables not only look great, they make operating on data a breeze for much the same reason that ranges do: you can work on a whole series of cells at once, rather than one at a time.

The following example has a very small roster table with some basic operations: finding a mean, indexing the table, and creating a calculated column.

If you've not worked with tables before, the syntax may be a little strange, but stick with it: to get the average height, we took the average of the height column of the roster table like so:

=AVERAGE(roster[height])

To index the contents of the table to get, for example, the value of the cell in the third row and second column, all we needed to do was use the INDEX() function:

=INDEX(roster, 3, 2)

Finally, I created a calculated column to convert player heights from inches to feet. This is another vectorized operation! As soon as I create the calculation, it applies down the entire column. The @ symbol indicates to use the corresponding row of the height column as the basis for reference:

=[@height]/12

(If you want to try this last one for yourself especially, I suggest downloading your own copy as it's hard to recreate using the embedded feature.)

R Data Frames

To know Excel tables is to love them... so does it follow that to know R is to love it, because its data frames are so similar? I'll let you be the judge with the following example.

A few things to know about the below syntax:

The $ in R can be used to access any individual column of a data frame. Go ahead and try it for yourself!
- This notation can also be used to create new columns in a data frame, and we indeed do that by converting the height from inches to feet.
To index the data frame, we place square brackets next to its name. The first argument is the row position we want; the second, the column position. This is so similar to what you did in Excel!

The data frame roster has already been created for you. Go ahead and run the code to see the Excel similarities come to life:



# Create the roster data frame

roster <- data.frame(
  name = c('Jack', 'Jill', 'Billy', 'Susie', 'Johnny'),
  height = c(72, 65, 68, 69, 66),
  injured = c(FALSE, TRUE, FALSE, FALSE, TRUE))



# Average height?

mean(roster$height)

# Third row, second column?

roster[3, 2]

# Calculate height in feet

roster$height_feet <- roster$height / 12
# Print data frame -- check out the new column
roster




Recap

OK, these examples were not identical. But they're pretty similar. Here's a recap of how range-and-table-powered Excel compares to R. Vectorization FTW:

Operation	How it's done in Excel	How it's done in R
Multiply a range by 2	`=my_range * 2`	`my_range * 2`
Index a table to get the item in the 3rd row, 4th column	`=INDEX(my_table, 3, 4)`	`my_table[3, 4]`
Get the average of column `x` from table `y`	`=AVERAGE(y[x])`	`mean(y$x)`
Create a calculated column that is the square root of column `x` in table `y`	`=SQRT([@x])`	`sqrt(y$x)`

The differences

Of course, the analogy between these Excel and R data types isn't perfect. One major difference: In R, every element in a vector or the column of a data frame must be of the same type. However, this shouldn't come as too big of a surprise for you Power Query users: it's the same deal there with columns.

"What about Python?"

I can hear someone on the interwebs already asking about how Python fits in here.

Depending on what you're looking to do with coding, Python may be a fine tool for you to pick up. It does things well that R is just decent at, and vice versa. All that said, I personally find Python a little harder for data analysts to pick up because it "thinks about" data in quite a different way than R or Excel. You can read more about that here:

Here’s how R and Python think differently about data
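To make that difference concrete, here is a rough sketch of the roster example from earlier using only Python's built-in objects. Storing the table as a dictionary of lists is just one assumption about how you might lay it out; it is not the only way.

```python
from statistics import mean

# One possible plain-Python stand-in for the roster table (a dict of lists)
roster = {
    "name": ["Jack", "Jill", "Billy", "Susie", "Johnny"],
    "height": [72, 65, 68, 69, 66],
    "injured": [False, True, False, False, True],
}

# Average height: mean() accepts the list directly
avg_height = mean(roster["height"])
print(avg_height)

# Third row, second column: positions start at 0 in Python, and there is no
# built-in [row, column] indexing, so we pick the column and then the row
columns = list(roster)
third_row_second_col = roster[columns[1]][2]
print(third_row_second_col)

# Height in feet: dividing the whole list by 12 raises a TypeError,
# so each element is divided explicitly
roster["height_feet"] = [h / 12 for h in roster["height"]]
print(roster["height_feet"])
```

None of this is hard, but notice how often you have to tell Python to go element by element, where Excel tables and R data frames do it for you.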

Making pRogress

My major goal of this post was to show you that as an Excel user, R is well within your wheelhouse. So what questions do you have about picking up this tool? Let me know in the comments.


Five ways to get help in Python

George Mount — Mon, 21 Jun 2021 14:21:00 +0000

In an earlier blog post I offered five ways to get help in R. This post serves as the Python equivalent, assuming you are working in Jupyter notebooks.

0. Web search it

This one seems obvious, but is worth pointing out. If you get an error message, plug it into a search engine. If you would like to know “How to do X…” well, chances are others are looking for how to do X as well, and the search engines want you to find that content (“How to” content is some of YouTube’s most popular).

This can of course become a rabbit hole; it’s easy to scroll listlessly through pages of results, give up, and watch cat videos on YouTube. The more you can bring the resources to you, the better your productivity will be. That’s why working in an integrated development environment is so helpful — it’s meant to put all the resources you need under one application.

Unlike RStudio, Jupyter is not quite a full IDE, so you may need to go outside the application for help more often. (If you are looking for an IDE for working with Python, check out PyCharm. RStudio now also features Python capabilities.)

All that said, you can get some basic assistance without leaving Jupyter… such as checking the help documentation.

1. Get the help documentation with `?`

To learn more about a Python function, simply place a question mark in front of its name (no parentheses) and run:

Here you can see all necessary and optional arguments of the function with additional notes and examples.
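The ? shortcut is a Jupyter (IPython) convenience rather than part of the Python language itself. As a rough sketch, much of the same information can be pulled up with plain Python; statistics.mean is just an example function I picked for illustration:

```python
import inspect
import statistics

# In a Jupyter cell you would type:  ?statistics.mean
# Outside Jupyter, help(statistics.mean) prints the documentation instead.

# The pieces can also be inspected separately
signature = inspect.signature(statistics.mean)  # the function's arguments
docstring = inspect.getdoc(statistics.mean)     # notes and examples

print(signature)
print(docstring)
```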

2. Check the package’s documentation

Step 0 suggested to start with a web search, but when in doubt about Python code it’s often better to go straight to the source: the package’s documentation. The Help menu links out to some of the most popular Python packages; you can find the docs for other packages with a web search.

3. Visualize your code

When it comes to data, we analysts have a simple rule: When in doubt, visualize it. The same principle can fortunately be applied to Python coding with PythonTutor.com. This free service will visually step through the code so you see exactly how inputs transform to outputs.

If your code is throwing errors and you just can’t pinpoint where in the process things are breaking down, check out this tool.

Try your own code or use my example from the previous:

4. Compose a minimally reproducible example

If you’ve gone through the steps so far and are still stumped… first, take some time away from the problem and come back. It’s amazing how often getting this distance works.

Second… did you restart your computer? (That works often, too!)

If all that fails, you may be approaching the point at which you need to “phone a friend…” or at least some person on the internet.

Now, if you are going to do that, you want to make your problem as easy as possible to follow along with: preferably by including a copy-and-pasteable code program containing a small dataset, what you are trying to achieve, the steps you are taking and where you are failing. This is known as a minimally reproducible example (MRE).

If you’ve worked in R you may be familiar with the datasets that ship out of the box like iris or mtcars. These are great datasets to use for an MRE because everybody’s got them. Python doesn’t ship with any datasets, so you can either make your own or find a package that includes some.
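If you go the make-your-own route, a handful of rows typed directly into the code is usually enough. The sketch below uses only the standard library; the column names and values are invented for illustration:

```python
import csv
import io

# A tiny made-up dataset that anyone can copy, paste and run
rows = [
    {"player": "A", "height": 72, "injured": False},
    {"player": "B", "height": 65, "injured": True},
    {"player": "C", "height": 68, "injured": False},
]

# Render it as CSV text so it can be pasted straight into a forum post
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["player", "height", "injured"], lineterminator="\n"
)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Because the data lives in the code itself, whoever is helping you does not need to download or install anything to follow along.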

I’d suggest using the seaborn data visualization package, which comes with some standard datasets. Some of these are pretty large, so if you’d like to make your MRE more “minimal” you could even just take a few lines of the dataset like so:

There’s more to a good MRE than a dataset, so check out this post for more tips.

5. … then hit the forums

OK, you’ve searched the web, checked the help documentation, tried stepping through your code and written an MRE. It’s time to hit the forums.

There are so many great user forums out there; r/Python and StackOverflow come to mind. But you want to do your homework beforehand; the latter in particular is notorious for chewing out unprepared posters. It’s understandable — people aren’t there to provide pro bono consulting — but it can feel a little jarring.

More recently, groups have sprung up on Slack and Discord. These are also great places to network, get and give help.

6. What else?

As mentioned previously, it’s smart to have go-to resources or procedures so that when you get blocked you don’t drift aimlessly. At the same time, there’s almost always a better way of doing anything so you want to keep your steps open for improvement.

In that spirit, what go-to steps do you take when you need help in Python? Please share them in the comments.

Video notebook

The notebook used in the video demonstration follows.


Teaching coding: what is pair programming?

George Mount — Tue, 01 Jun 2021 09:32:00 +0000

Some say that learning is a team effort, and pair programming makes that an explicit part of learning how to code. Here’s how it works, the pros and cons.

How it works

There are no rubrics or packages needed to teach coding with pair programming. This is a practice that comes from software development where two individuals work together on coding. They trade off on roles, which are:

The Driver

The driver is the one “behind the wheel,” or the keyboard as it were. The driver pushes ahead toward the destination, writing, running and inspecting the code.

The Navigator

We’re mostly used to the “driver” role because that’s what we do when we work by ourselves, right? We push ahead toward our goals.

What makes pair programming different is the presence of a navigator. This person is in co-pilot position, making sure the driver stays on a good course. They help the driver make adjustments to the route mid-journey. Think of this as a real-time code review.

Each owns the project

The driver and navigator are equally in charge of the output! Neither role is more important. The driver and navigator also change places, perhaps trading off on days or activities while learning.


The advantages

Pair programming forces students to think out loud about their work. It also helps students give and receive feedback, which is a critical skill for data analytics, much like most other fields.

The disadvantages

Bringing up the phrase “pair programming” may elicit eye rolls from seasoned programmers, as this Dilbert strip illustrates.

Dilbert takes on pair programming. Source: Dilbert.com

Many programmers find the practice stifling and exhausting. And then there are the personality clashes: if you’ve ever got lost on a road trip with a friend and bickered over what to do… well, imagine that but with code.

These weaknesses can transfer into education as well: students who prefer to work independently may resent pair programming, and some students may struggle to adopt the collegial environment needed for successful pairing.

For these reasons, pair programming may be best used as an occasional aid. That said, these pain points are often the rule, not the exception, in real-life data projects, so early exposure to them and practice dealing with them are not a bad idea.

Learning guide: Introduction to R, one-day workshop

George Mount — Mon, 14 Sep 2020 21:15:00 +0000

The below download is part of my resource library. For exclusive free access, subscribe.
If your organization is interested in this or other analytics training, get in touch.

When I was an undergrad, a professor suggested I learn this statistical programming language called R.

I took one look at the interface, panicked, and left.

A lot has changed in the R world since then, not the least of which was the release of the RStudio integrated development environment. While the universe of R packages continues to grow, and the work can now be done from the comfort of RStudio, the fact remains: learning R means learning to code R.

Many of my students have never coded before, although this is a half-truth: they’ve probably used Excel, which requires a decent amount of functions and references. What Excel doesn’t require, though, is naming and manipulating variables.

R is an ideal choice for first-time data coders: the familiar tabular data frame is a core structure. Operations are designed with data analysis in mind: after all, R is a statistical programming language. (In my opinion, this makes it preferable to Python, which was designed as a general-purpose scripting language, at least as far as learning to code as a data analyst goes.)

I assume no prior coding experience for this workshop. My goals are to equip students to work comfortably from the RStudio environment, ingest and explore data, and make simple graphical representations of data. In particular, students will perform the most common tabular data cleaning and exploration tasks using the dplyr library.

Above all these objectives, however, is my goal to help students not panic over learning R, like I did when I started.

You are welcome to use this learning guide as you see fit.

Download the learning guide here.

Lesson 1: Welcome to the R Project

Objective: Student can install and load an R package

Description:

What is R and when would I use it?
R plus RStudio
Installing and loading packages

Exercise: Install a CRAN task view

Assets needed: None

Time: 35 minutes

Lesson 2: Introduction to RStudio

Objective: Student can navigate the RStudio integrated development environment

Description:

Basic arithmetic and comparison operations
Saving, closing and loading scripts
Opening help documentation
Plotting graphs
Assigning objects

Exercises: Practice assigning and removing objects

Assets needed: None

Time: 40 minutes

Lesson 3: Working with vectors

Objective: Student can create, inspect and modify vectors

Description:

Creating vectors
Vector operations
Indexing elements of a vector

Exercises: Drills

Assets needed: None

Time: 35 minutes

Lesson 4: Working with data frames

Objective: Student can create, inspect and modify data frames

Description:

Creating a data frame
Data frame operations
Indexing data frames
Column calculations
Filtering and subsetting a data frame
Conducting exploratory data analysis on a data frame

Exercises: Drills

Assets needed: Iris dataset

Time: 70 minutes

Lesson 5: Reading, writing and exploring data frames

Objective: Student can read, write and analyze external tabular files

Description:

Reading and writing csv and txt files
Reading and writing Excel files
Exploring a dataset
Descriptive statistics

Exercises: Drills

Assets needed: Iris dataset

Time: 40 minutes

Lesson 6: Data manipulation with dplyr

Objective: Student can perform common data manipulation tasks with dplyr

Description:

Manipulating rows
Manipulating columns
Summarizing data

Exercises: Drills

Assets needed: Airport flight records

Time: 50 minutes

Lesson 7: Data manipulation with dplyr, continued

Objective: Student can perform more advanced data manipulation with dplyr

Description:

Building a data pipeline
Joining two datasets
Reshaping a dataset

Exercises: Drills

Assets needed: Airport flight records

Time: 50 minutes

Lesson 8: R for data visualization

Objective: Student can create graphs in R using visualization best practices

Description:

Graphics in base R
Visualizing a variable’s distribution
Visualizing values across categories
Visualizing trends over time
Graphics in ggplot2

Exercises: Drills

Assets needed: Airport flight records

Time: 70 minutes

Lesson 9: Capstone

Objective: Student can complete end-to-end data exploration project in R

Assets needed: Baseball records

Time: 40 minutes



Renaming all files in a folder in R

George Mount — Mon, 28 Oct 2019 09:50:05 +0000

I hate the way files are named by a camera. While it was cool to learn for this post that DSCN stands for “Digital Still Capture – Nikon,” it means nothing to me!

For this post, I will be renaming the files that I took from Worden Ledges into a more “human-readable” name.

Ready to “automate the boring stuff with R?” Check out my course, R Explained for Excel Users.

Vectorization for the efficiency

I thought that this would take a loop or even an apply() function, but it turns out all that’s needed is a list of the file names. To rename the files, we will simply list all the current files, list the names of the new files that we want, then switch them around.

1. List files in the folder

I have saved these photos under C:/Ledges on my computer. Using the list.files() function, I see them all “listed.”

Actually, this is a vector, not a list, which is its own thing in R. This will make a big difference later on. Too bad that vector.files() doesn’t quite have the same ring!

> old_files <- list.files("C:/Ledges", pattern = "\\.JPG$", full.names = TRUE)
> old_files
 [1] "C:/Ledges/DSCN7155.JPG" "C:/Ledges/DSCN7156.JPG" "C:/Ledges/DSCN7157.JPG" "C:/Ledges/DSCN7158.JPG"
 [5] "C:/Ledges/DSCN7160.JPG" "C:/Ledges/DSCN7161.JPG" "C:/Ledges/DSCN7162.JPG" "C:/Ledges/DSCN7163.JPG"
 [9] "C:/Ledges/DSCN7164.JPG" "C:/Ledges/DSCN7165.JPG" "C:/Ledges/DSCN7166.JPG" "C:/Ledges/DSCN7167.JPG"
[13] "C:/Ledges/DSCN7168.JPG" "C:/Ledges/DSCN7169.JPG" "C:/Ledges/DSCN7170.JPG" "C:/Ledges/DSCN7171.JPG"
[17] "C:/Ledges/DSCN7172.JPG" "C:/Ledges/DSCN7174.JPG" "C:/Ledges/DSCN7175.JPG" "C:/Ledges/DSCN7176.JPG"
[21] "C:/Ledges/DSCN7177.JPG" "C:/Ledges/DSCN7178.JPG" "C:/Ledges/DSCN7179.JPG" "C:/Ledges/DSCN7180.JPG"
[25] "C:/Ledges/DSCN7181.JPG" "C:/Ledges/DSCN7182.JPG" "C:/Ledges/DSCN7183.JPG" "C:/Ledges/DSCN7184.JPG"
[29] "C:/Ledges/DSCN7185.JPG" "C:/Ledges/DSCN7186.JPG"

2. Create vector of new files

Now we can name all the new files that we want. For example, instead of DSCN7155.JPG, I want a file name like ledges_1.JPG.

Using 1:length(old_files) gives us a vector of the exact same length as old_files.

I will save these in the folder C:/LedgesR.

> new_files <- paste0("C:/LedgesR/ledges_",1:length(old_files),".JPG")
> new_files
 [1] "C:/LedgesR/ledges_1.JPG"  "C:/LedgesR/ledges_2.JPG"  "C:/LedgesR/ledges_3.JPG" 
 [4] "C:/LedgesR/ledges_4.JPG"  "C:/LedgesR/ledges_5.JPG"  "C:/LedgesR/ledges_6.JPG" 
 [7] "C:/LedgesR/ledges_7.JPG"  "C:/LedgesR/ledges_8.JPG"  "C:/LedgesR/ledges_9.JPG" 
[10] "C:/LedgesR/ledges_10.JPG" "C:/LedgesR/ledges_11.JPG" "C:/LedgesR/ledges_12.JPG"
[13] "C:/LedgesR/ledges_13.JPG" "C:/LedgesR/ledges_14.JPG" "C:/LedgesR/ledges_15.JPG"
[16] "C:/LedgesR/ledges_16.JPG" "C:/LedgesR/ledges_17.JPG" "C:/LedgesR/ledges_18.JPG"
[19] "C:/LedgesR/ledges_19.JPG" "C:/LedgesR/ledges_20.JPG" "C:/LedgesR/ledges_21.JPG"
[22] "C:/LedgesR/ledges_22.JPG" "C:/LedgesR/ledges_23.JPG" "C:/LedgesR/ledges_24.JPG"
[25] "C:/LedgesR/ledges_25.JPG" "C:/LedgesR/ledges_26.JPG" "C:/LedgesR/ledges_27.JPG"
[28] "C:/LedgesR/ledges_28.JPG" "C:/LedgesR/ledges_29.JPG" "C:/LedgesR/ledges_30.JPG"

3. Copy from old files to new files

Now all we’ll do is copy the files from the old file locations to the new. A TRUE output indicates a successful transfer.

> file.copy(from = old_files, to = new_files)
 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[22] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

Now we can open and see that these have more user-friendly names.

One nice thing is that because the original files are named sequentially (i.e., DSCN7155 comes before DSCN7156, etc.), so will our new files (i.e., they become ledges_1, ledges_2, etc.).

4. Clear out the old files

There is no Ctrl + Z on deleting files in R! That’s why I like to copy our files to a new location before deleting the source. We can remove the old files with the file.remove() function.

> file.remove(old_files)
 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[22] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

More than one way to name a file

There are doubtless other (likely even better) ways to do this in R, so how would you do it? One candidate might, for example, be file.path(); however, I found that paste0() did more exactly what I wanted.

The complete code is below.


Writing Code to Read Quotes About Writing Code

George Mount — Thu, 11 Oct 2018 23:10:41 +0000

A recent project of mine has been setting up a Twitter bot on innovation quotes. I enjoy this project because in addition to curating a great set of content and growing an audience around it, I have also learned a lot about coding.

From web scraping to regular expressions to social media automation, I’ve learned a lot collecting a list of over 30,000 quotes related to innovation.

Lately I’ve been turning my attention to finding quotes about computer programming, as digital savvy is crucial to innovation today. These exercises make great blog post material and are quite “meta,” too… writing code to read quotes about writing code. Below I cover the first of what I hope will become a series. For this example…

Scraping DevTopics.com’s “101 Great Computer Programming Quotes”

This is a nice set of quotes but we can’t quite copy-and-paste them into a .csv file as in doing so each quote is split across multiple rows and begins with its numeric position. I also want to eliminate the quotation marks and parentheses from these quotations as stylistically I tend to avoid them for Twitter.

While we might despair about the orderliness of this page based on this first attempt, make no mistake that there is well-reasoned logic running under the page in its HTML, and we will need to go there instead.

Part I: Scrape

To do this I will load up the rvest package for R and SelectorGadget extension for Chrome.

I want to identify the HTML nodes which hold the quotes we want, then collect that text. To do that, I will initialize the SelectorGadget, then hover and click on the first quote.

In the bottom toolbar we see the value is set as li, a common HTML tag for items of a list.

Knowing this, we will use the html_nodes function in R to parse those nodes, then html_text to extract the text they hold.

Doing this will return a character vector, but I will convert it to a dataframe for ease of manipulation.

Our code thus far is below.

#initialize packages and URL
library(rvest)
library(tidyverse)
library(stringr)

link <- c("http://www.devtopics.com/101-great-computer-programming-quotes/")

#read in our url
quotes <- read_html(link)

#gather text held in the "li" html nodes
quote <- quotes %>% 
  html_nodes("li") %>% 
  html_text()

is.vector(quote)

#convert to data frame
quote <- as.data.frame(quote)

Part II: Clean

By gathering our quotes via rvest rather than copying and pasting, we get one quote per line, which makes them easier to store in our final workbook. We’ve also left behind the numerical position of each quote. But some issues with the text remain.

First off, looking through the gathered text, I see that not all text held in li nodes is a quote. This takes some manual intervention to spot, but here I will use dplyr’s slice function to keep only rows 26 through 126 (corresponding to the 101 quotes).

We still want to eliminate the parentheses and quotation markers, and to do this I will use regular expression functions from stringr to replace them.

a. Replace “(“, “)”, and ““” with “”

This is not meant as a comprehensive guide to the notorious regular expression, and if you are not familiar I suggest Chapter 14 of R for Data Science. So I assume some familiarity here as otherwise it becomes quite tedious.

Because “(” and “)” are both metacharacters we will need to escape them. Placing these three characters together with the “or” pipe (|) we then use the str_replace_all function to replace strings matching any of the three with nothing “”.

b. Replace “”” with ” “

The end of a quotation is handled differently as we need a space between the quotation and the author; thus this expression is moved to its own function and we use str_replace to replace matches with ” “.

Bonus: Set it up for social media

Because I intend to send these quotes to Twitter, I will put a couple of finishing touches on here.

First, using the paste function from base R, I will concatenate our quotes with a couple select hashtags.

Next, I use dplyr’s filter function to exclude lines that are longer than 240 characters, using another stringr function, str_length.

The code for Part II is displayed below.

#get the rows I want
quote <- slice(quote, 26:126)

#delete the characters I don't want

charsd <- c("\\(|\\)|“")

quote$quote <- str_replace_all(quote$quote,charsd,"")

quote$quote <- str_replace(quote$quote,"”"," ")

#add hashtags
quote$quote <- paste(quote$quote, "#quote #coding")
#keep only lines under 240 characters
quote <- filter(quote, str_length(quote) < 240)

#write csv
write.csv(quote,"C:/RFiles/tech2quotes.csv")

Finally, find the complete code below.

From web scraping to dataframe manipulation to regular expression, this exercise packs a punch in dealing with real-world unstructured text data — and it comes with some enjoyable reading, too.

I hope this post inspires you to tackle the world of text, and I plan to walk through a couple more of these.


R Functions for Reproducible Data Frames

George Mount — Fri, 10 Aug 2018 22:07:31 +0000

While there are many great resources to get help in R, sometimes you just need a second opinion. Here is where the many Internet help boards come in handy, most notably Stack Overflow.

Start posting on Stack Overflow and you will soon learn the importance of the minimum reproducible example (MRE). Without one, you will likely even be refused “service.”

So, what is an MRE? It is fairly self-descriptive: the smallest possible example that contains all the information necessary (in this case, for someone to help you with your code). Here’s a great walkthrough on the topic written specifically for R coding (fittingly posted to Stack Overflow).

In this example we are focusing on setting up a minimally reproducible data set, in our case a data frame. The above post suggests using R’s built-in data frames to build an MRE, which is a great idea. In fact, it negates the need for what we are going to do, which is sampling from these built-in data frames.

Regardless, I want to point out a cool alternative to build a minimally reproducible data frame in R. We will do this using four R functions: dput and dget, then dump and source.

Dput and Dget

Let’s take the first five rows of the iris dataset. Using dput we will write the data frame iris5 to an ASCII text representation. You could then paste this code (which starts with structure()) into a help forum, and your responder can in turn assign this output to an object (I assigned mine to irismre).

#for example - get first 5 rows of iris dataset

iris5 <- head(iris, 5)

#write to an ASCII text representation 

dput(iris5)

#paste it back and assign to new object      

irismre <- structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5),
                          Sepal.Width = c(3.5, 3, 3.2, 3.1, 3.6),
                          Petal.Length = c(1.4, 1.4, 1.3, 1.5, 1.4),
                          Petal.Width = c(0.2, 0.2, 0.2, 0.2, 0.2),
                          Species = structure(c(1L, 1L, 1L, 1L, 1L),
                                              .Label = c("setosa", "versicolor", "virginica"),
                                              class = "factor")),
                     .Names = c("Sepal.Length", "Sepal.Width", "Petal.Length",
                                "Petal.Width", "Species"),
                     row.names = c(NA, 5L),
                     class = "data.frame")


irismre

If your dataset is big, your dput output might get pretty big. Of course, try to keep your minimally reproducible dataset small; that is the reason you are doing an MRE!
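One way to keep the output small is to sample a few rows and keep only the columns that matter before calling dput. The sketch below is an illustration under assumptions: mtcars stands in for your own data, and the seed, the three rows and the two columns are arbitrary choices.

```r
# Assumptions: mtcars stands in for your own data; the seed, the three rows
# and the two columns are arbitrary choices for illustration
set.seed(42)
rows <- sample(nrow(mtcars), 3)
mtcars_small <- mtcars[rows, c("mpg", "cyl")]

# Print a compact structure() call that can be pasted into a forum post
dput(mtcars_small)
```

Setting the seed means anyone who reruns the code draws the same three rows.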

Rather than printing the ASCII text representation to the console, you could save it to a file instead with the “file =” argument in dput. Then read it back with dget:

#or you can write to a file

dput(iris5, file = "C:/RFiles/iris5.R")


#and read it back
irismre <- dget("C:/RFiles/iris5.R")
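If you do not have a C:/RFiles folder, the same round trip can be sketched with a temporary file. Here tempfile() and the name iris_back are assumptions for illustration, and all.equal() is used to check that the data frame read back matches the original.

```r
# Assumption: tempfile() stands in for a real path such as "C:/RFiles/iris5.R"
iris5 <- head(iris, 5)
path <- tempfile(fileext = ".R")

dput(iris5, file = path)   # write the text representation to disk
iris_back <- dget(path)    # parse it back into a data frame

# all.equal() compares values and attributes within a small numeric tolerance
all.equal(iris5, iris_back)
```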

Dump and Source

In the above example we re-assigned the data frames to objects of our own choosing. With dump and source, R will save and load the objects by their original names. So, in our example we dump the object named “iris5” to a file, and when we load it back with source and list the objects in our environment with ls(), we will see iris5 again, even after removing it from our environment with rm().

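#or use dump and source to keep the same object name

dump("iris5", file = "C:/RFiles/data.R")

#remove the object from the environment
rm(iris5)

#source the file to bring iris5 back, then list objects to confirm
source("C:/RFiles/data.R")
ls()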
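dump also accepts a character vector of names, so several objects can travel in one file. The following is a minimal sketch, assuming a temporary file in place of C:/RFiles/data.R; the second object, cars5, is added here purely for illustration.

```r
# Assumption: cars5 and the tempfile() path are illustrative additions
iris5 <- head(iris, 5)
cars5 <- head(cars, 5)
path <- tempfile(fileext = ".R")

# One file, two objects, each saved under its own name
dump(c("iris5", "cars5"), file = path)
rm(iris5, cars5)

# local = TRUE evaluates the file in the current environment
source(path, local = TRUE)

# Both names are back
c("iris5", "cars5") %in% ls()
```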


The post R Functions for Reproducible Data Frames first appeared on Stringfest Analytics.

A Tour of RStudio

George Mount — Tue, 07 Aug 2018 11:40:04 +0000

In a previous post I explained how to install RStudio, a popular integrated development environment for the R programming language.

Open up RStudio for the first time and it might look like some mad scientist’s Mission Control. In this post I will walk through each pane and what it does. From this you will start to see just what is so “integrated” about this integrated development environment.

0. The blank slate

When you open up RStudio for the first time you should see something like the above. If you do not see a window on the upper-left hand side, open a new script with the keyboard shortcut Ctrl + Shift + N on Windows.

What do all these panes mean? We will cover what each of them does.

1. The console

The console sits in the lower left-hand corner of RStudio. This is where commands are submitted to R to execute.

Here you will see the “>” sign followed by a blinking cursor. Enter your operations here and then press “Enter” to run them.

In my free mini-course I dig a bit deeper into operating in the R console.

2. The script editor

While you can operate directly from the console, it’s often a good idea to write your commands in a script and then send them to the console. This way you can save a long-term record of the code you ran.

a. Commenting code

Another nice feature here is to leave comments in your script by starting the line with the “#” character. These lines will not be executed and are instead notes for the programmer.
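As a small illustration (the variable names and values are made up for this sketch), comments can sit on their own line or trail a line of code:

```r
# This whole line is a comment, so R skips it
radius <- 3             # a comment can also follow code on the same line
area <- pi * radius^2   # only the code before the "#" is executed
area
```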

b. Running code

Place your cursor anywhere in the line of the code you want to send to the console to execute.

To run this code, click the Run button at the top of the script editor. You can also use Ctrl + Enter in RStudio.

I prefer the older shortcut Ctrl + R to run code. To change your keyboard shortcut setting you can go to Tools | Modify Keyboard Shortcuts. Find the selection for “Run Current Line or Selection” and change the shortcut by keying “Ctrl + R” in instead of “Ctrl + Enter.”

You can also run multiple lines of code at the same time by highlighting and running all of them.

Finally, save your R script by going to File | Save or with the Ctrl + S keyboard shortcut.

3. The files/packages/plot pane

Next we move to the lower right-hand side of RStudio.

a. Getting help

The Help tab in this pane is particularly useful. It displays a help file when you type a “?” in front of a function name in the console. (You can also use the help() function to get more information about a given function.)

In this example we will search for the sqrt function. In the help pane we now see R’s documentation on this function.
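For reference, the two ways of asking for help mentioned above look like this when typed into the console. In RStudio both should open the same documentation page in the Help tab; the last line simply runs the function.

```r
# Shortcut form: a question mark in front of the function name
?sqrt

# Function form: same help page, with the topic given as a string
help("sqrt")

# Once you know what the function does, try it out
sqrt(25)
```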

4. The environment

Last but not least, let’s check out the Environment pane on the upper right-hand side of RStudio.

a. The History tab

Here you will see a history of all the commands you have sent to the R console in your session.

You can send a line back to your editor script or to the console. You can also save the entire R history page as a special R History file.

b. The environment tab

This is a list of all loaded R objects.

Right now you will see R has an object called “x.” Where did this object come from? You created this when you copied and ran code from the plot function example.

Objects are largely what make R, R. Everything in R is stored as an object and to access values you must first assign them as objects.

Let’s go ahead and create one more object: one called “result” that holds the square root of 25.

You will see that the new object is added to our Environment tab.

You can clear all objects in your environment with the broom at the top of the Environment tab. If you want to clear just one object, use the rm() function in your R script, for example rm(x).
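Put together, creating, listing and removing the “result” object might look like the sketch below. Run it line by line and watch the Environment tab update as you go.

```r
# Create an object holding the square root of 25
result <- sqrt(25)
result

# List the objects currently in the environment
ls()

# Remove just this one object
rm(result)

# Check whether the name is still listed
"result" %in% ls()
```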

Bonus! RStudio Settings

Under Tools | Global Options you have the choice to change some program settings. While you are not likely to change most of these, here are a couple to note:

Pane Layout: Here you can adjust the layout of panes and windows in RStudio. I like how they are by default, but it is worth experimenting with.

Appearance: This one is fun to play around with. Here you can change the theme and appearance of your RStudio environment. Check out the many editor themes to break out of the normal black-on-white environment. (I am partial to the Merbivore editor theme: not only do I like that the black background cuts down on glare, I also like the Halloween-y fonts used.)

Getting integRated

With this post I hope you are starting to see how the various panes and windows in RStudio connect. For more on R, check out my free mini-course or browse my previous posts on R.

The post A Tour of RStudio first appeared on Stringfest Analytics.

Go Open! Installing External Libraries in Python

George Mount — Mon, 17 Apr 2017 18:21:18 +0000

One of the biggest differences between Python and Excel is that Python is open-source. Microsoft owns and operates Excel. While you can develop your own add-ins, user-defined functions and so on, it is still a proprietary product.

By contrast, anyone can develop almost anything for Python and easily share it — Python is a totally public programming language.

This opens up so many possibilities for using Python. Think of something you can do on a computer, and someone probably has written a Python library to do that.

This open-endedness can also make it tricky for newcomers to manage these libraries and packages.

Fortunately, one package, pip, offers a great way to download other packages (meta, right?).

This video from ProgrammingKnowledge2’s YouTube channel will walk you through how to download pip on a Windows machine. (I am unsure about how this compares to Mac; most of my readers use Windows for its superior Excel software.)

Having this installed on your machine will make working with Python and following along with future tutorials much easier.

The post Go Open! Installing External Libraries in Python first appeared on Stringfest Analytics.
