learn to code - Stringfest Analytics

Python for Excel users: How to avoid common pitfalls

George Mount — Sun, 05 Nov 2023 15:42:16 +0000

As an Excel trainer with a focus on data analytics, I have instructed thousands of spreadsheet users in Python. This extensive experience has allowed me to analyze patterns in the learning process.

Below are several frequent errors and misconceptions I’ve observed, especially among Excel users, who often approach problems with a distinct mindset and set of expectations.

Case sensitivity

A prevalent challenge for newcomers to Excel is understanding case sensitivity. Consider the ABS() function as a straightforward example. In Excel, case sensitivity is not an issue; users are free to use any case they prefer.

Python has a similar function, abs(), but this one is case sensitive; ABS() or Abs() will not work.

In Python, this principle applies to all elements: functions, packages, variable names, and more. Python demands precision in the way inputs are defined, and case sensitivity is a key aspect of this precision.

Storing versus printing objects

Excel users often encounter difficulty grasping the distinction between assigning data to an object and displaying the contents of that object. I find it useful to liken the assignment of an object in Python to placing it inside a box. However, placing an item in a box is not equivalent to viewing its contents, which is what the ‘print’ function accomplishes.

Unlike Python, Excel simplifies this process. When you use a function in Excel, the result is immediately visible without an additional step.

Although saving data in an object before printing it might appear burdensome and somewhat repetitive, it is this very concept of data assignment that endows Python with its strength. This approach contrasts sharply with Excel, where calculations are performed on-the-fly and results are generated immediately.

Indentation

Python, unlike Excel, is a programming language that requires proper indentation to execute code blocks correctly.

Indentation does not apply to Excel formulas, as they are not written in a language that relies on indentation to organize code. Excel formulas are typically confined to single-line expressions entered directly into cells to perform calculations or manipulate data. These formulas lack the code blocks or sequences that in programming require indentation for structural organization.

Excel users generally do not encounter issues with indentation while learning elementary concepts like lists and Pandas DataFrames. The challenge arises as they advance to more complex subjects such as functions and loops, where understanding and applying indentation becomes crucial for reconciling the differences between these two environments.

Zero-based indexing

Python’s counting method may initially perplex Excel users due to its use of zero-based indexing. In Python, the indices of items in a sequence start at zero, unlike Excel’s INDEX() function, which starts at one. This notebook illustrates the differences between Excel’s INDEX() function and Python’s indexing operations:

You can refer to the source post in its entirety here:

Zero-based indexing: What it is and when you’ve seen it before

How packages work

Packages are essential to Python, especially for data analysis, and are a concept that new Excel users often find challenging. Specifically, they struggle with differentiating the workflow of installing packages from loading them.

To explain the availability, installation, and loading of packages, I find a smartphone analogy useful:

How open source packages are like smartphone apps

Like a smartphone equipped with essential utility apps such as a timer or calendar, Python includes a suite of base packages in its Standard Library. For more specialized tasks, however, you may need to install additional packages from a repository, similar to an “app store.”

Once a package is installed through a one-time download, it still requires loading every time you begin a new session before you can use it.

This process is markedly different from the typical Excel workflow. In Excel, all functions and features are readily available upon startup, except for add-ins, which can be installed and loaded visually as required.

How and where to store data

Another significant distinction between Python and Excel is their respective approaches to data access and storage.

Excel stores data directly within the workbook. When you enter data into Excel, it is embedded into the .xls or .xlsx file, making everything accessible within this single container. The workbook’s structure and its data are integrated, allowing you to move, share, or backup the entire file as a complete package of data and structure.

Conversely, Python scripts or programs, such as Jupyter notebooks, usually do not encapsulate data. Instead, they serve as tools to process, analyze, and manipulate data located externally. This data might be stored in a local file like a .csv, .txt, or .json, retrieved from a database, or sourced from an API or the web.

Python scripts act as a guide on how to interact with data, rather than serving as its repository. The data is kept in its original location, with the Python code reading from and writing to that place. This decoupling facilitates more flexible and automated data processing, as the same script can be adapted for various datasets by simply altering the data inputs.

In essence, Python offers a more decentralized experience. Scripts, programming commands, and files exist separately, contrasting with the integrated nature of Excel.

How directories work

Regarding the storage of data in files external to Python, there’s a requirement to import those files into Python for utilization. In Excel, importing data is straightforward: you simply use a point-and-click interface to find the desired workbook.

In Python environments, however, you often need to type out the file path to the file you want to use. Understanding directories is crucial in this context. It’s important to know your working directory, which is the folder where your notebook or script is active, as this location is referenced when accessing or saving files. You can manipulate the working directory and create new directories within Python using code or command-line instructions, which helps keep your files orderly and retrievable.

Excel users might initially find concepts like working directories, absolute, and relative file paths challenging. I recommend practicing with the path.isfile() function from Python’s os module to become proficient in verifying file existence across the file system. This is demonstrated in Chapter 11 of my book Advancing into Analytics where I discuss these concepts further.

What were your pitfalls?

These common challenges are what I’ve observed from my extensive experience teaching Python to Excel users. However, as the saying goes, “individual results may vary.” I’m curious to hear about the specific hurdles you encountered when transitioning from Excel to Python and the strategies you employed to overcome them.

Share your experiences in the comments below. I look forward to learning about your unique journey and exchanging tips and tricks that could benefit others.

The post Python for Excel users: How to avoid common pitfalls first appeared on Stringfest Analytics.

What data analysts signal by learning to code

George Mount — Sat, 29 Jan 2022 19:43:00 +0000

Should data analysts learn to code? This is a question “we” at the blog discuss quite often. In our estimation, the marginal benefit for the typical data analyst exceeds the marginal cost — if they’ve worked with Excel, they understand functions and object names to some extent, so making the jump isn’t as far-fetched as it may seem.

“But I’ll never use it!”

Now, a typical objection for analysts learning to code is that “they’ll never use it.” fair enough. Who didn’t ask (or think of asking) “when are we ever going to use this?” as a student.

To be fair, some analysts do just fine work without any code. If it ain’t broken, don’t fix it (or add any more downloads). Even in these cases, learning a bit about code will provide helpful context to how computers work and what good software does.

Other analysts might be interested in learning to code, but figure the window is closing on its usefulness. After all, won’t AI and other advanced technology negate any of the technical skills a current analyst needs?

I don’t think there will ever be a time where an analyst doesn’t need to understand a bit about computing. But let’s focus on the here and now. Microsoft is dumping millions of dollars so that data analysts can work with R and Python alongside its own BI tools. Would they be doing this if coding has no future?

What I’m trying to say here is that “I won’t learn to code because I’m never going to use it” may not be as airtight as it seems.

But where things really break down is (naturally) in the economics.

Employers don’t know you from Adam

Now if you’re reading this blog, you’re probably the ideal data analyst every employer is looking for. But they don’t know that. In fact, there’s a lot about your potential performance that they are pretty left in the dark about.

You may know that you don’t particularly enjoy learning new things or following complicated procedures, but an employer can easily miss out on these traits over the relatively brief hiring process.

In other words, there is information asymmetry in the employer/employee dynamics: you know things the employer doesn’t, and could possibly use that to your advantage (by getting hired when you really suck).

Break the asymmetry with signalling

Employers, being aware of this, look for sources of information that signal the underlying traits they’re looking for. After all, they can’t assess if you’re a decent analyst given the scarce resources available at the time of hiring, due to information asymmetry. This signalling is meant to break the asymmetry.

Learn more about information asymmetry at Marginal Revolution University:

Education signalling has been around

If this sounds crazy to you, think about a college diploma. How many of the classes listed on your transcript are immediately valuable to an employer? And even if they were, couldn’t you just say you know that stuff and the employer have no idea (information asymmetry)?

The diploma acts as a signal that you have at least some of the consistency and resilience that comes with a full-time job. These are again those types of traits that are easily lost under information asymmetry, that matter quite a bit.

Learning to code is an analyst’s signal

Now here’s how I see this relate to data analytics careers.

As mentioned, a lot of analysts aren’t using code in their current roles. This could be because IT doesn’t want them to, they’ve bought some fancy low/no-code data software, whatever.

Given this reality, it may seem silly for a data analyst to put their effort into learning how to code. For reasons stated earlier, I don’t think that’s true. But regardless, it’s still a great signal. You see, employers do want analysts who can:

Teach themselves new skills
Work effectively with computing
Take their professions seriously

These skills aren’t necessarily about the coding itself, or how coding can be used as an analyst, but what learning to code says about you as an analyst — or, more geekily, how it reduces the information asymmetry.

Knowing how to code signals that you’re in it to win it, even if you won’t directly use those skills much…

“But couldn’t I fake it?”

All this talk about the benefits of expressing your coding knowledge raises the question: couldn’t you just say you know how to code? Isn’t that just more information asymmetry?

Enter more signalling mechanisms: certifications, a portfolio, blog posts, etc. I’d say most employers are going to resonate most strongly with certifications, as they leverage the brand equity of an authoritative outside institution.

Learn to code, even if you don’t “use” it

For these reasons, I strongly encourage data analysts to learn a bit about coding, especially those on the hiring market. While this is anecdotal, I have seen data analysts start to bulk up their coding skills and soon after see a spike in employer interest.

Still not convinced? Burning Glass found that jobs requiring coding paid 35% more than those that don’t. So maybe treating learning to code as a mere signal doesn’t do it justice for your career. Perhaps if you learn to code, you’ll be put on more interesting and lucrative projects; who knows?

Now what?

If my rambling has you interested (or maybe it’s the Burning Glass statistic?) in learning to code as a data analyst, you’re in the right place. Be sure to subscribe to my newsletter below for access to my analytics learning resource library.

I look forward to seeing what you do with this knowledge. Happy signalling!

The post What data analysts signal by learning to code first appeared on Stringfest Analytics.

Why data analysts should learn to code

George Mount — Sat, 10 Apr 2021 17:16:42 +0000

When many hear “data analytics” these days, they think of graphical user interface (GUI)-driven business intelligence (e.g. Tableau), data warehousing (e.g. Snowflake), or data preparation (e.g. Alteryx) platforms. These tools have their place (some more than others) in the analytics stack.

But rather than focus on these tools in my book Advancing into Analytics, I teach readers the two major data programming languages, R and Python. These are often considered data science tools. Many successful analysts don’t know them and don’t see the need. But I believe data analysts have great reason to learn how to code. Here’s why:

It’s a knowledge investment that pays off

If I had a dime for every time I mentioned that analysts spend 50 to 80 percent of their time preparing data, I might not need to write a blog.

So, how do we lighten that load? Traditionally, Excel wonks have made great use of keyboard shortcuts to speed up their workflow. UX research does indeed indicate that using the keyboard shows more productivity gains than using a mouse. Let’s extrapolate that to understand that coding is generally more productive than pointing-and-clicking. Of course, it takes longer to learn the former. So this becomes a break-even decision of code versus GUI.

For the early stages of a project, or one-off needs, a GUI could be fine. But there’s something to be said for “codifying” a project such that it can be automated. In a job with this much grunt work, it’s a learning investment that pays off.

Maybe future technologies can use NLP or augmented reality to provide other options than coding versus GUI. I’m pretty impressed, for example, with the AI-embedded tools of Power BI: start typing what kind of chart you want, and it’ll make its best guess. More stuff like this, and maybe the productivity gains of code versus GUI aren’t so clear. But for now, the calculus is still favorable to learning how to code.

GUI tools are running it under the hood anyway

Now, the choice between code and GUI isn’t always so clear-cut. In fact, VBA and now Power Query offer some menu-driven tools to generate syntax. Some business intelligence tools are offering the same for machine learning, usually powered by R or Python behind the scenes.

I don’t know about you, but I nearly always have some requirement that can only be accomplished by coding in these frameworks. Every data project is different — have you ever struggled to search-engine the answer of some task you’re looking to do? GUIs tend to limit your options with the tradeoff that there’s a lower barrier to entry. This isn’t always a tradeoff that rank-and-file analysts can make.

Maybe you can think of code as the engine powering your GUI under the hood. You are a long-haul analyst and need to be able to pop the hood and make the necessary adjustments when the GUI just can’t seem to ignite the fuse. Some coding knowledge is important, even if you’re working with tools that generate it for you.

It’s exposure to open source communities

Origin stories matter, and it’s no different for software. Many low- and no-code tools are proprietary. They may be easier to use and harder to break, but someone’s dropping a lot of money for that convenience and support.

On the other hand, many data programming languages are open source. This means that anyone is free to build on, distribute, or contribute to the software. In other words: open source is great freedom and great responsibility. Closed versus open source offer opposing worldviews, and it’s important to follow the implications of either.

Every analyst should have some direct exposure to open source due to the major impacts it’s had on technology and data in the last decade. What are the pros and cons of open source? What is a package? What is a pull request? How do companies make money off open source, anyway? Analysts should be able to answer these. They can do it through direct experience in the open source world, by learning R and Python.

I see too many data analysts totally commit to the offerings of one proprietary vendor. This tends to induce myopia in what an analyst knows about (They may start to think their vendor invented the left outer join, which has been around for decades). Committing to one closed source vendor leaves so much off the table in today’s analytics landscape.

Hey, if Microsoft is playing nice with open source, then maybe you should too.

It’s the surest way to reproducible research

The idea of reproducibility in data is that given the same inputs and processes, the same outputs should result time and again.

Now, a lot of people have piled onto Excel because it’s not always reproducible. It’s easy not to know if someone deleted a column or where exactly originally someone got the data to make a graph from, before they started chopping it off into worksheets. (Power Query busts many of these Excel myths, but let’s overlook that for now.)

The solution to this downside is often to adopt some kind of expensive business intelligence tool. The irony is that while BI tools have their benefits, reproducibility isn’t necessarily one of them: they, like Excel are GUI-driven, and it can be hard to track back to the original inputs.

Programming languages like R and Python offer a universe of tools for reproducible research. If, for example, you need to conduct some statistical analysis of a dataset and report your results, I couldn’t imagine doing this anymore without coding. “Show your work” is some solid advice from grade school, and code is a great way to do that in analytics.

Advancing into coding

Yes, learning to code is a knowledge investment; it is learning a new language, after all. But like learning any language, coding will open new doors for data analysts.

If you want the least friction to open these doors, I suggest you check out my book, Advancing into Analytics: From Excel to Python and R. You may not associate these as data analytics tools, but for the reasons explained above, I find them indispensable for data analysts.

Learn more about Advancing into Analytics: From Excel to Python and R

The post Why data analysts should learn to code first appeared on Stringfest Analytics.