computing - Stringfest Analytics

Here’s how R and Python think differently about data

George Mount — Sat, 08 Jan 2022 17:04:57 +0000

I’m a big believer that data analysts will derive far more value from their tools when they understand the underlying philosophy and worldview of those tools.

For example, the way that open source code is created and maintained gives it distinct advantages and disadvantages over proprietary software. And if analysts can understand the differences in scope between Power Query and Python based on an ancient fable, they’ll be more likely to make the right choice for their circumstances.

One of the most common questions I get from analysts is “Should I learn Python or R?” I don’t really have a one-size-fits-all answer: it depends on experience, anticipated use cases, personal taste and more.

What I can help show is how R and Python have two very different origin stories, and that this influences how each operates on data.

Most of my audience, and most analysts in general, come to this with Excel experience, so let’s take a look there first:

How Excel thinks about data

This isn’t rocket science, but it’s a helpful example: say you have a named range of data in Excel that you want to multiply by two.

Simple enough! Just pass the my_range reference into a cell formula, multiply it by 2, and you’ll get an output range with each number in the range times 2.

Vectorize all the things

What’s really going on here? Through the magic of named ranges, Excel operated on all of the values in our range at the same time. This operation is known as vectorization, and it has a lot going for it; namely, performance. This is one reason that storing your data in named ranges and tables in Excel leads to faster operations: it’s all done in one swoop.

How does this idea of vectorization play out in R and Python?

R does that too

To walk through how this works in R, take a look at the following Jupyter notebook. (To set up R to run with Jupyter on your machine, check out these instructions.)

Note: the numbered output in the above examples is not typical for R. For example, you will not see this in RStudio…

R was born and raised as a statistical programming language. That means it’s “programmed,” so to speak, to work with data in such a way: if you say multiply a vector by two, it multiplies each value in that vector by two, much like you would in a math problem.

Other similarities exist between R and Excel; for example, the way each program indexes items.

Python does that… with some help

By contrast, Python was built as a general-purpose scripting language to handle error logs, communicate with operating systems, and so forth. It’s a very “computer-friendly” language, which is why you see it in so many contexts ranging from web development to AI. But with that generality, it has a hard time “reading between the lines” of how people actually want to work with data.

You’ll see in the following example that Python doesn’t exactly vectorize out of the box…
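As a minimal sketch of that behavior in plain Python (the name `my_list` is a hypothetical stand-in for the Excel named range, not something from the original notebook):

```python
# Hypothetical stand-in for the Excel named range used earlier.
my_list = [1, 2, 3]

# Multiplying a list by 2 repeats the whole list; it does not double each value.
repeated = my_list * 2
print(repeated)  # [1, 2, 3, 1, 2, 3]

# Doubling each value in plain Python takes an explicit loop or comprehension.
doubled = [value * 2 for value in my_list]
print(doubled)  # [2, 4, 6]
```

The list comprehension gets the Excel-style result, but you have to spell out the element-by-element step yourself.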

Of course, Python simply multiplying the entire list is pretty efficient in its own right. But with the help of a package, we can easily get what we want.
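Here is a small sketch of what that help can look like, assuming the package in question is numpy and that it is installed (the name `my_array` is illustrative only):

```python
import numpy as np

# Convert a plain list into a numpy array (hypothetical example values).
my_array = np.array([1, 2, 3])

# Arithmetic on a numpy array is applied to every element at once.
doubled = my_array * 2
print(doubled)  # [2 4 6]
```

With the array in place of the list, the same `* 2` expression now behaves the way it would in Excel or R.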

Both efficient in their own special way…

Don’t get me wrong, numpy and Python in general are easy to use and learn. But Python wasn’t necessarily designed to work with data in a way that is intuitive to users, like Excel and R were. This is not a “value judgement”; these are all great tools that you would do well to learn more about.

I know multiplying a few numbers by two may not be all that relevant to what you’re looking to do with data. That said, I find this example really cuts to the heart of how the DNAs of R and Python are fundamentally different. R was born and built to do statistical analysis. Python wasn’t, although with the help of packages it works just fine.

What fundamental differences in R and Python (or Excel) have you observed? How do these influence your thinking on them? Let me know in the comments.

If you’d like to continue comparing and contrasting R and Python with Excel for analytics, check out my book Advancing into Analytics.

The post Here’s how R and Python think differently about data first appeared on Stringfest Analytics.

Practice R and Python on the Cloud for Free

George Mount — Mon, 16 Dec 2019 13:24:12 +0000

R and Python, the “dynamic duo” of data science, are both free, open-source programming languages. That means that there’s no “vendor” in the sense that, say, Microsoft owns Excel. This can make getting started with these programs a little trickier: there are several ways to install them, often multi-step, confusing, and resource-intensive.

It would be easy as a brand-new programmer to give up on tools that are so involved even to install: “If that’s hard, just imagine trying to use them!”

Fortunately, free cloud-based applications exist for you to experiment with these programs, no installation needed. This saves you disk space and headaches and allows you to dig into the code — and the possibilities — rather than the logistics.

For R: RStudio Cloud

RStudio Cloud comes from RStudio, vendor of the predominant RStudio integrated development environment. (I use RStudio in teaching my R course.)

Simply create an RStudio account and get started. You can create a new project and run a session of RStudio from your browser. The code will execute on RStudio servers.

Your initial workspace will look like the below. This is a “virtual” instance of the RStudio interface:

If this is the first time you have worked in RStudio, check out my “Tour of RStudio” below.

To continue dabbling with R, check out my posts. Your R session will run just as it would on your computer, but this time RStudio takes care of the software.

Ready to take the plunge into R? Get started with my course, R Explained for Excel Users.

For Python: Google Colaboratory

Google hosts the free Colaboratory service for running Python using a modified Jupyter notebook. The exact “look and feel” of Colab will not be the same as using a code editor like PyCharm (my favorite environment for working in Python) or even a “plain” Jupyter notebook, but the functionality is there, plus you don’t have to deal with maintaining the software and packages.

To access Colab, log into your Google account and check out the Google Colab starter notebook, which includes the below video.

Google Colab gives you direct access to Google’s computing resources, so you can do some pretty serious data work here, as the endorsement from TensorFlow (a popular deep learning package built by developers at Google) suggests. You can even work with your Google Drive files entirely from the cloud.

Conclusion: Get coding fast

If you’d like more practice getting into R and Python via RStudio and Jupyter Notebooks, with the experiences of an Excel user particularly in mind, check out my book Advancing into Analytics: From Excel to Python and R.

More information about Advancing into Analytics, including how to read it for free, is available here.

The post Practice R and Python on the Cloud for Free first appeared on Stringfest Analytics.

The Confidence Interval Economy: Mistakes and Career

George Mount — Sat, 01 Oct 2016 15:45:29 +0000

Excellent piece yesterday by Rob Collie at PowerPivotPro about seeing yourself as a Michelangelo of data.

This is a topic I have discussed many times, and I am totally on board with it. I have argued that you should look at Excel as a medium of expression and a way to find beauty at your cubicle.

Rob points out that back in the day painting was only for the elites.

But once the cost of paint fell, art became possible for the masses.

The same is happening with data. And data is the analyst’s paint.

It used to take massive computing power and technical know-how to analyze data. Now computing power is nearly free.

A corollary is that if computing power is cheap, then the cost of making mistakes is cheap.

In a way, making mistakes is the cost of creativity. Which brings me to the confidence interval.

We live in the confidence interval economy

One of the amazing things I learned in my first statistics class is that manufacturers don’t aim to make every product perfect.

Instead, they agree on an acceptable error rate and confidence interval. They accept that, for the greater good of the business, not everything can be perfect.
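To make that concrete, here is a rough sketch of how such a check might look. The inspection numbers, the 5% tolerance, and the function name `defect_rate_interval` are all hypothetical, and the normal-approximation interval used here is just one common textbook choice, not necessarily what any given manufacturer uses:

```python
from math import sqrt
from statistics import NormalDist

def defect_rate_interval(defects, sample_size, confidence=0.95):
    """Normal-approximation confidence interval for a defect proportion."""
    p_hat = defects / sample_size
    # z-score that leaves (1 - confidence) split evenly across both tails
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / sample_size)
    # Clamp to [0, 1] because a proportion cannot fall outside that range
    return max(0.0, p_hat - margin), min(1.0, p_hat + margin)

# Hypothetical inspection: 12 defective units in a sample of 400.
low, high = defect_rate_interval(12, 400)
acceptable_rate = 0.05  # hypothetical tolerance agreed on in advance

print(f"Estimated defect rate: {12 / 400:.1%}")
print(f"95% interval: {low:.1%} to {high:.1%}")
print("Within tolerance" if high <= acceptable_rate else "Investigate the process")
```

The point is not the exact formula but the mindset: the question is whether the process stays inside an agreed range, not whether every single unit is flawless.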

This is a mindset that spreadsheet reporters (and their managers) need to adopt.

Bean-counting is not bean-predicting

There is the “bean-counting” euphemism. Everything has to balance, or it is wrong.

What if we aren’t bean-counting, but instead “bean-predicting” or “bean-analyzing?” That is a different exercise.

Some analysts want reductive or predictive models to have the same accuracy as full-blown financial reports. But models and reports are not the same thing. Models actually lose relevance and usefulness the more complex they become.

So what does this mean for your career?

A rambling post, but this idea of data as a medium of expression with low cost of making mistakes should shape one’s career path.

1 See your job as a creator

You are a Michelangelo of data. Stop with the “I’m an analytic type. I am not creative.” You are a designer whose medium of expression is the spreadsheet.

2 Use rapid computing to your advantage

Don’t spend too much time building the perfect solution. Use rapid prototyping to your advantage. I don’t wait to write the perfect blog post. Instead I release the “minimum viable product,” and test to see how it does.

If I see positive trends, I expand on them. If not, I pitch the result. Don’t get too hung up on duds or mistakes — it’s part of the process.

3 Find a boss who thinks in confidence intervals, not equilibria

The tricky part. Get a boss who understands how statistics works and the role of the error term. Bean-counting is quite different than bean-predicting.

The post The Confidence Interval Economy: Mistakes and Career first appeared on Stringfest Analytics.