Stringfest Analytics: Analytics & AI for Modern Excel (https://stringfestanalytics.com)

What is the role of Python in modern Excel?
https://stringfestanalytics.com/what-is-the-role-of-python-in-modern-excel/ (Tue, 21 Feb 2023)

Between Power Query and Power Pivot alone, “modern” Excel boasts so many features that it’s impossible to learn them all. Add to that Office Scripts, Power BI, LAMBDA() and more and it’s easy to get overwhelmed.

One “Excel-adjacent” tool that can be particularly intimidating for prospective learners is Python. Shouldn’t this be the last tool on the learning list, many Excel users think, given that it’s not even a Microsoft product and it literally requires learning a new language to use?

While Python may not be the best choice for every Excel user, it’s worth serious consideration for those looking to build complex automations, version-controlled projects and other advanced development products. Here’s the role of Python in modern analytics and modern Excel.

A growing stack requires glue

When I started as an analyst, my choices started and ended with Excel. All data, reporting, dashboards, everything was under that one green roof.

[Image: "10 year data analyst challenge" meme]

Fast forward a few years and there’s Power BI, Office Scripts, Jupyter Notebooks and more. This broadening of the analyst’s tech stack is right in line with wider trends in tech: a move away from one single, monolithic application to a loosely connected suite of specialized products.

To make this architecture work, a “conductor” or “glue” is needed. Whether it’s reading a dataset in from one source and visualizing it in another, or deploying a machine learning model from the cloud to an end user’s dashboard, Python is a great choice for this glue. It’s one of the rare languages used for everything from simple amateur-written scripts to enterprise-level applications, and it works smoothly with a variety of operating systems and other programming languages.

Microsoft has acknowledged and celebrated Python’s role as a “glue” language, and it’s already available for a variety of purposes in Azure, Power BI, SQL Server and more. And while Python isn’t yet officially supported in Excel, learning it still has advantages. Plus, think of all the other places you’ll be able to use Python… and who else is using it!

Network effects mean faster development time

“Everybody’s doing it” isn’t usually a good reason to engage in something… but, in the case of programming languages, it may be a good enough reason.

Have you ever heard of network effects? It’s the idea that the value of something grows as more people use it. Programming languages have network effects — the more programmers there are, the more code is shared, the more code there is to use and then build on… it’s a virtuous cycle.

Because Python is such a neutral “glue” language, it’s been adopted in professions as wide-ranging as database management, web development, data analytics and more. That means there’s a good chance that no matter where your Excel project takes you, or what tools you need, someone you collaborate with also “speaks” Python.

For example, maybe you develop an inventory tracker or other application using Excel. The program gets too unwieldy for a workbook, or it becomes so popular that your organization wants it to become a standalone web program. The turnaround time for this project will be much faster if the existing code is already in Python.

I find it unfortunate that data analysts and other tech professionals often work with a completely different set of tools — it can make collaboration slow and tedious. By adopting a shared language, the network effects expand and development time drops.

Bring modern development to Excel

“Software developer” has become a coveted job title in recent years, almost as sexy as “data scientist.” Sadly, the methods and best practices developed by this profession have not caught on with BI or VBA developers. At an organizational level, doesn’t it seem like trouble to have two tech professionals, both called “developers,” going about their work with totally different methodologies?

Python offers the modern Excel developer a way to implement these best practices. These include the following:

Unit testing

Most programming languages offer automated unit tests as a way to confirm that code is performing as expected. Unfortunately, base Excel and VBA do not. There are a few workaround tools here, but Python is a solid candidate for unit testing with the network effects to boot.

Automated unit testing makes applications more reliable and less likely to break — particularly useful for Excel workbooks that users with a variety of technical backgrounds might pick up.
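Moving calculation logic out of worksheet formulas and into plain Python functions is what makes automated testing possible. Here is a minimal sketch in the pytest style, where the `tiered_commission` function and its rates are hypothetical stand-ins for logic that might otherwise live in a nested IF() formula:

```python
# A hypothetical calculation that might otherwise be buried in a
# worksheet formula. As a plain function, it can be tested directly.
def tiered_commission(sales):
    """Return the commission owed on a sales figure."""
    if sales < 0:
        raise ValueError("sales cannot be negative")
    if sales <= 10_000:
        return sales * 0.02
    return 10_000 * 0.02 + (sales - 10_000) * 0.05

# pytest-style tests: run `pytest` against this file and each
# test function below is discovered and checked automatically.
def test_low_tier():
    assert abs(tiered_commission(5_000) - 100.0) < 1e-9

def test_high_tier():
    assert abs(tiered_commission(15_000) - 450.0) < 1e-9

def test_negative_sales():
    try:
        tiered_commission(-1)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")
```

Once tests like these exist, any future change to the commission logic gets checked in seconds, which is exactly the safety net a shared workbook lacks.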

Version control

Another development tool du jour that Excel lacks is a version control system. Version control tracks changes in a repository, allowing users to, among other things, see who contributed what and when, and to revert to older versions.

If you’ve ever pulled your hair out over determining the difference between budget-model-final.xlsx and budget-model-FINAL-final.xlsx, then you understand why version control might be helpful.

Jeff Foxworthy you might need version control meme

 

Excel does provide some basic version control-esque features, like the ability to view version history in OneDrive and the Spreadsheet Compare add-in, but it lacks many of the capabilities you would gain by moving all code production to Python.
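One reason version control works so well for code is that code is plain text, so every change produces a readable diff; an .xlsx file, by contrast, is a zipped binary that diffs poorly. A small sketch using Python's standard-library difflib, with two hypothetical versions of a `margin` function:

```python
import difflib

# Two versions of the same model logic, as plain text.
v1 = """def margin(revenue, cost):
    return revenue - cost
""".splitlines(keepends=True)

v2 = """def margin(revenue, cost):
    return (revenue - cost) / revenue
""".splitlines(keepends=True)

# A unified diff marks removed lines with "-" and added lines with "+".
diff = "".join(difflib.unified_diff(
    v1, v2, fromfile="model_v1.py", tofile="model_v2.py"))
print(diff)
```

The printed diff shows, line by line, exactly what changed between versions; this readable record is the raw material a system like Git stores for every commit.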

Package development and distribution

The previous two features, while useful for a serious developer, may not be all that interesting for a typical Excel user. Perhaps you are looking for a more immediate reason to learn Python as part of your day-to-day analysis tasks… is it still worth learning?

And to that, all I have to say is one word: packages.

Don’t get me wrong, I love building things on the computer. But if someone’s already built something that does exactly what I want, I’m not reinventing the wheel. Python’s features for building and sharing packages, particularly through the Python Package Index, open a whole world of tools in a way that would be very difficult to replicate through Excel add-ins or VBA modules.

Not only that, but the pip package manager and other tools provide methods for installing and managing these packages.

Whether you’re looking to collect data from the Census API, describe an image, or just have a function to generate descriptive statistics, access to Python packages is well worth the price of learning Python code. Some of these packages are even designed to work well with Excel.
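As a small taste of what packages buy you, here is the one-line descriptive statistics example sketched out, assuming pandas has been installed (for instance with `pip install pandas`); the sales data is made up:

```python
# Requires the third-party pandas package: pip install pandas
import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "South", "North", "South"],
    "units":  [120, 95, 140, 80],
})

# One method call yields count, mean, std, min, quartiles and max.
print(sales["units"].describe())
```

Replicating that summary in base Excel would mean a half-dozen separate formulas; here it comes for free with the package.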

Python and the future of Excel

In an AI-driven world, it can seem like learning to code is exactly the wrong choice to stay relevant. Ironically, however, as more data gets generated in more formats (including some AI-powered), the need to code may be more relevant than ever.

Python is an integral part of the modern analytics stack, and the possibilities for integrating it with Excel are only growing as Excel itself takes on a variety of AI-driven features.

The post What is the role of Python in modern Excel? first appeared on Stringfest Analytics.

The dangers of being a one-platform data analyst
https://stringfestanalytics.com/the-dangers-of-being-a-one-platform-data-analyst/ (Wed, 09 Mar 2022)

I love Excel. It’s a bedrock of my blog and largely my “claim to fame” as an authority. That said, early on I intended this not to just be an Excel blog, because I saw the benefits of cross-training and versatility for bigger-picture thinking.

If you’re looking to improve your data skills, I encourage you to take this same approach and not get too married to any one tool or vendor. Of course, you can’t become an expert in everything, and specializing in certain areas may make sense given your circumstances. But if every analytics tool you are using or learning comes from one vendor, that could be a problem.

Here’s why:

You’re not diversified

This one’s most apparent, but worth stating: by focusing on one software program, you are staking your future more firmly on the future of a specific vendor or community.

Sometimes, this works out fine: if you get into a program as an early adopter, you can ride the wave of demand to some handsome rewards.

Myself, I would prefer not to be entirely invested in any given tool or vendor. Maybe you’re different, and you feel very strongly about being an expert in a specific platform.

What I’m saying is that this reason has more to do with personal preferences and calculated risks.

There are, however, more universal reasons not to focus on one platform.

You see things from one angle

We often think that dabbling in a broad range of concepts and tools is an inefficient way of learning, but that’s not always true. Research tells us that we learn new ideas by relating them to what we already know. These connections are often most insightful when they come from disparate areas.

Now, this doesn’t mean that you need to learn everything about data, ever. But it does suggest that you approach the field from different angles, and work to bridge the connections.

Many of us come into the data world from spreadsheets. Go learn other tools (and not just Power Query! We’ll get into that in a second) and connect the dots: how do you do the equivalent of a PivotTable in R? A VLOOKUP() in SQL?
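For the SQL half of that question, here is a minimal sketch of the VLOOKUP() equivalent, a join on a lookup key, using Python's built-in sqlite3 module; the tables and values are made up:

```python
import sqlite3

# An in-memory database standing in for a "good old-fashioned SQL database."
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders   (order_id INTEGER, product_id INTEGER, qty INTEGER);
    CREATE TABLE products (product_id INTEGER, name TEXT);
    INSERT INTO orders   VALUES (1, 10, 3), (2, 20, 1);
    INSERT INTO products VALUES (10, 'Widget'), (20, 'Gadget');
""")

# The VLOOKUP() equivalent: a LEFT JOIN pulls the product name into
# each order row by matching on the lookup key.
rows = con.execute("""
    SELECT o.order_id, p.name, o.qty
    FROM orders o
    LEFT JOIN products p USING (product_id)
    ORDER BY o.order_id
""").fetchall()
print(rows)  # [(1, 'Widget', 3), (2, 'Gadget', 1)]
```

Same idea, different vocabulary: the lookup key plays the role of VLOOKUP()'s first argument, and the joined column plays the role of the returned value.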

Once you can see how different tools approach the same problem, you’re able to tease out universal standards and methods of data analysis. This gives you the range to learn even more methods.

This ability to triangulate data methods often gives you a better look into the field as a whole than if you learn just one particular tool — or, as we’ll dig into next, vendor solution.

Your vendor didn’t invent data analysis

Power Query and Power Pivot are game-changers for Excel and bust many of the myths about the tool’s limitations.

That said, while the technology of these tools may be new, the concepts are not. Left outer joins have been around for decades, along with the relational data model. These are topics that are probably best learned outside of the Microsoft platform and with a good old-fashioned SQL database.

I’ve heard it said that Power Query and Power Pivot were built so that end-users don’t need to learn SQL to do SQL-like things. This may be a good arrangement for Microsoft, but I’m not sure it’s good for users.

The problem is that users dig in on one platform such that they lose context in the field as a whole. Microsoft did not invent the data model, but I sometimes wonder if novice Power Pivot users think they did.

Solution: Buying into the stack

The most fertile data work often comes when combining multiple data tools, or blending the slices of the data analytics stack.

As of this writing, no one vendor or software solution covers every slice of the stack best. Microsoft has come a long way with the Power Platform. But even Microsoft admits it doesn’t do everything best, which is why it’s become so much more collaborative with open-source tools like R and Python.

I would suggest data analysts not get too hung up on being an expert in any one vendor’s platform, but instead build the context to navigate and blend tools. This stack will take some iterating, and it will mean crossing vendors and products. However, it will provide a diversified depth and breadth of knowledge that rigid specialization won’t.

What do you think? Should analysts focus on one vendor or not? What’s the right approach to diversify? Pros and cons? Let me know in the comments.

The post The dangers of being a one-platform data analyst first appeared on Stringfest Analytics.

What is an open source software distribution?
https://stringfestanalytics.com/what-is-open-source-distribution/ (Tue, 13 Jul 2021)

I remember when I first got into Python and heard about Anaconda. It wasn’t really clear to me if Anaconda was a “dialect” of Python code, a really awesome suite of tools for working with Python, or something else.

[Image: Anaconda Distribution news release, captioned “What is a distribution, anyway?”]

Turns out that Anaconda is first and foremost a distribution of Python… and to understand what that means, you need a basic understanding of open source software in particular.

Open source is a license to redistribute

Open source software means that anyone is free to use, repurpose, modify and redistribute it (see where this is going?).

In software development more generally, a “distribution” is how software gets from the developer to the end user. Some distribution channels, such as Office 365, are proprietary. You couldn’t just make copies of Excel, for example, add some features, and give it away or resell it. But because Python is open source, this practice is permissible.

The “official” Python code base is available from the Python Software Foundation at python.org. While you could download it from there, it’s common, especially in the data community, to do so from Anaconda instead. Essentially, this for-profit company redistributes the Python code with various added features and services; hence, you’ll often hear it referred to as the Anaconda “distribution” of Python. You can download the free Individual Edition on Anaconda’s website.

Why Anaconda?

Curious why so many people prefer to download the Anaconda distribution? First, many of the most common packages for working with data in Python come pre-installed with Anaconda. If you’re not familiar with the concept of packages, check out this post.

Anaconda also comes with many popular applications for working with Python code, such as the Jupyter Notebook. To learn more about Jupyter and Python, check out this post.
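If you're ever unsure which Python you're actually running, a quick standard-library check helps; an Anaconda install will typically show an "anaconda3" (or "miniconda3") folder in the interpreter path:

```python
import sys

# The path of the interpreter currently running; an Anaconda install
# typically includes an "anaconda3" or "miniconda3" folder here.
print(sys.executable)

# The version string; Anaconda builds usually identify themselves,
# e.g. "3.11.5 | packaged by Anaconda, Inc. | ..."
print(sys.version)
```

This is a handy first debugging step when a package seems "missing": it may simply be installed into a different distribution than the one you are running.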

Distribution received… now get coding!

Legendary computer scientist Alan Kay once said that “point of view is worth 80 IQ points.” The ability to put Anaconda into perspective with Python and open source in general will make your development choices that much clearer.

[Image: Advancing into Analytics cover]

But nothing teaches you more about code than trial and error at the keyboard. If you’re coming into Python from Excel, check out my book Advancing into Analytics: From Excel to Python and R.

The post What is an open source software distribution? first appeared on Stringfest Analytics.

Installing RStudio
https://stringfestanalytics.com/installing-rstudio/ (Sat, 18 Mar 2017)

First time using R? Check out my free mini-course, “5 Things Excel Users Should Know About R.”

R is a free statistical programming language that is very popular and powerful. I like to use it with RStudio, a program which allows for a smoother user experience along with some enhanced back-end capabilities to base R. The intro-level version of RStudio is also free.

Download instructions:

Here is a helpful video from Udacity on downloading RStudio for Windows (Mac will be very similar). Some screenshots and notes follow.

1. Download R from the R Project homepage.

CRAN stands for the “Comprehensive R Archive Network,” a series of non-profit organizations around the globe hosting mirrors of the base R code. R is free, and this distribution model underscores the open-source nature of the “R project.”

2. Download RStudio.

The base R code you download from CRAN comes with its own graphical user interface (GUI).

For reasons I discuss in my course, RStudio is a preferred environment for working in R. RStudio is proprietary software that runs “on top” of base R (remember that R is open-source, so anyone can develop on top of the code as they like), but fortunately a free community version is available.

Once you’ve got RStudio on your machine, check out my “Tour of RStudio.”

Important – you must download both R and RStudio. RStudio needs the base R code to function.

Until next time…

Ever used R? Have ideas on how it could help you? Questions? Comments? Let me know.

If you are brand-new to learning R, check out my free mini course, “5 Things Excel Users Should Know About R.” You can also check out my blog posts on R using the “rstats” tag.

I hope to start sharing some of the cool things I have learned to do in R, but installing it is a good first step.

The post Installing RStudio first appeared on Stringfest Analytics.

Go Researching Waterfalls
https://stringfestanalytics.com/go-researching-waterfalls/ (Thu, 15 Sep 2016)


One of my courses this year is a seminar on information systems. Before each class, we each write a “conversation starter” incorporating our thoughts on the papers.

A topic that has greatly interested me lately is waterfall vs. agile development. “Waterfall” projects hinge on an all-or-nothing strategy: let’s spend weeks and months developing something, then release it in one fell swoop (like a waterfall building at the top and then crashing down).

In contrast, agile looks at how to release smaller bits of product to be tested by the audience and incorporated back in an iterative process.

I argue that academic publishing is still built largely on the waterfall approach (let’s spend all our time on one big paper submission and hope it gets in) rather than the lean approach (let’s use digital media to develop a minimally viable research proposal).

I develop this (and other questions of IS research) below:

——-

Keen in “MIS Research: Reference Disciplines and a Cumulative Tradition” offers a definition of the field from MIT’s Center for Information Systems Research: “a study of the effective design, delivery, and usage of information systems in organizations.” Given that widespread computing in organizations is new, we should not be too surprised that MIS is still finding its bearings as a research discipline. But is this novelty stifling the development of a unique discipline with its own frameworks and agendas?

Yes and no. The adoption of information systems has made good organizations better and bad organizations worse. Information systems can deliver timely, actionable data or keep bright people mired in the weeds of information. So these problems of information and organization are not new – but digitization of information has exacerbated the effects.

IS as a discipline can answer how organizations can use information as an asset – but how do we do this? This requires that IS frames itself both inside and outside academia.

My first job after graduation was in demand planning at a national specialty retailer. I was ready to be the best forecaster in the company’s history. Exponential smoothing, EOQ – you name the theory, I would be using it.

It turns out that very little of my job was built on theories of finance and economics but instead on principles of information. What data do we need? How are we going to get it? Much of my time was spent on fruitless reporting, unheeded requirements analyses, and so forth. I ended up leaving the demand planning job to go into operations finance at a public hospital, where the information problem was so much worse! It was this frustration with how companies use information that in part led me to the program.

A theme of the readings is how to place IS within the “practitioner-scholar” continuum. Can research be rigorous and not esoteric? How can IS establish itself as a discipline with a well-defined research agenda while still benefitting non-academics? One model I see here is economics. As the field has become more rigorous, it has been less understood by practitioners. I have noticed that many of the pioneers of MIS have a background in economics. Maybe they were frustrated with the way things were going in that profession? I was – when looking at graduate programs, so little of the research I saw appeared as useful social science.

So maybe we need to model IS on economics with caution. But the comparison with IS to law and medicine in Davenport and Markus can only go so far. The boundary between professional and user is not as distinct in IS. Most people with direct contact with the legal or medical professionals are themselves lawyers or doctors.

Not so with information – everyone is in contact with management information systems. And most of them do not want to be a data architect. They want to get back to their jobs of being an accountant, a non-profit coordinator, a teacher, etc. Those people may also have valuable insights on information and organization, while perhaps not being so knowledgeable about the “IS artifact.” So it is more difficult to see IS as a profession than law or medicine.

This tension of how to fit IS inside management research becomes even more difficult with Big Data. There is an implicit conclusion here that “big = more.” But Ackoff had a great retort to this, even sixty years ago. Do managers really need all the information they think they do? Many managers, under the false pretense of doing “big data,” have simply added more variables to their models. And the more complex these calculations become, the more effort it takes to maintain them. Again, we see people without the necessary skill (or passion) doing “shadow IS” work under a misunderstanding of Big Data.

So, how do these tensions resolve? One meeting point between academia and trade could be publishing. While peer-reviewed journals absolutely have their place, maybe we need to look at media to help the profession outside academia. I see traditional publishing as a “waterfall” form of development where all the work is grouped into one result, which will either be acceptance or rejection. Can we adopt more “agile” based research development via digital media, consulting, practitioner-focused conferences, etc.? Maybe the breakthroughs won’t come from a blog post, but they could be a place for non-academics to collaborate. This allows practitioners to learn from and collaborate with academic research, while setting the boundaries and guidelines for IS academics.

The post Go Researching Waterfalls first appeared on Stringfest Analytics.

The Cognitive Dissonance of Excel
https://stringfestanalytics.com/the-cognitive-dissonance-of-excel/ (Tue, 02 Aug 2016)


Oz du Soleil has a new daily vlog series in August, “Excel Ignited.” He’ll start from the basics, diving into what Excel is for and how to think about data. 

In the last lesson, he discussed “What is Excel?”

Sounds pretty basic, but it really got me thinking. 

Excel is a spreadsheet tool. Think about that word…spreadsheet. That is, it is a digital piece of paper. 

Spreadsheets were used to keep ledgers, track inventory, etc. We’ve digitized this work in Excel. 

As such, it is not really a program. A program is a start-stop procedure: “fold this spreadsheet into X dimensions, walk to Y location, and deliver to Z.” 

It’s much easier to think in terms of a sheet of paper than in terms of a program written in computer language.

Excel, while a powerful digital program, retains elements of its analog origins.

Readers — am I crazy? Any thoughts here? It’s an idea I’m still developing, so I welcome your feedback — in fact, I love blogging because it allows me to work out ideas with my audience. 

The post The Cognitive Dissonance of Excel first appeared on Stringfest Analytics.

How Gall’s Law Explains Crappy Spreadsheets
https://stringfestanalytics.com/how-galls-law-explains-crappy-spreadsheets/ (Thu, 19 May 2016)

The other day I came across Gall’s Law, developed in a book on systems theory by pediatrician John Gall (polymath!). It’s fascinating. Here is Gall’s Law in all its fitting simplicity:

“A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work.” – John Gall


I love this idea. It’s especially appealing to my economics background — our advanced economy could only have evolved from a simpler economy. A complex economy designed from scratch (presumably by a central planner) would never work.

And now, of course, I have to apply this concept to spreadsheets.

One of analysts’ most consistent frustrations is the inherited Franken-spreadsheet.

Maybe the model was meant to look at sales of a particular product line, and it morphed into a multi-factor budgeting tool. The only person who knew how this thing worked was your predecessor. And now you’re left with nothing but some half-assed instructions.

Are bad spreadsheets a result of Gall’s Law?

Computers seem to be one of the few areas of human endeavor where it’s better to start things from scratch than to build on imperfections.

If your house needs plumbing or foundation work, do you demolish things completely? Your essay is incoherent at parts and you delete it all? Probably not. But most spreadsheets and other computer developments are best discarded if flawed.

Most Franken-spreadsheets got that way because they were designed to do something far simpler than they are doing now.

Gall’s Law and the antifragile corollary

I think Gall’s Law needs a corollary. What if a complex system evolved from a simpler system and still does not work? That system failed to anticipate growth or evolution — not a good thing.

Here’s an interesting insight from Joshua Porter at Bokardo on using Gall’s Law to design evolution into your software:

The overall effect of Gall’s Law is that most software would start off simple and evolve over time. So we wouldn’t end up with the software we imagined, but the software that managed to live through the early use and subsequent selection process. Accepting this as a rule, could we somehow plan for this evolution even though we don’t know what it will bring? Can we plan for this change? I think so, by building in feedback and reporting mechanisms and merely acknowledging to change the design based on such feedback.

This ability to adapt and thrive amid change reminds me somewhat of Nassim Taleb’s antifragility.

I’m going somewhere with this — before Gall’s Law turns this into a complex mess.

Design spreadsheets not as a static point-in-time set of data, but as a system that is resilient to evolution. Think of reports as living things, not butterflies to be mounted to the wall.

And, if you can’t keep it simple, at least start it simple!

How can you see Gall’s law influencing your spreadsheet modelling or other data analysis?

 

The post How Gall’s Law Explains Crappy Spreadsheets first appeared on Stringfest Analytics.
