How to visually audit Excel workbooks with Python
Stringfest Analytics (https://stringfestanalytics.com/how-to-visually-audit-excel-workbooks-with-python/), Mon, 09 Mar 2026

In a previous post, I built a simple Python tool to audit an Excel workbook. The goal there was fairly straightforward: scan the model programmatically and flag common structural issues such as hard-coded numbers in formulas, constants living outside designated input sheets, external links, and formulas that drift out of pattern down a column.

That approach gives us something useful: a list of potential problems. But lists of warnings only get you so far. They tell you what might be wrong, but they do not help you understand how the model itself is structured.

In this post I want to take the idea a step further and show how to visualize the structure of an Excel model using Python. Instead of just printing audit findings, we start thinking about the workbook as a network of dependencies. Once you look at it that way, you can actually map the model and make its logic visible.

If you want to follow along, you can download the demo workbook and the accompanying notebook below.

 

The notebook and full output are presented below. An explanation of what the script is doing and why these checks matter follows afterward.

At a very high level, the notebook treats the spreadsheet less like a grid and more like a system of connected calculations. That shift in perspective matters because many spreadsheet problems are not obvious when you look at cells one by one. They become clearer when you look at how information flows through the model.

With that in place, let’s walk through the code step by step and look at what each part is doing and what it tells us about the structure of the workbook.

Step 1: Loading the workbook as structured data

The first important move in the notebook is this:

import openpyxl

file = "demo_financial_model.xlsx"
wb = openpyxl.load_workbook(file, data_only=False)
print("Sheets:", wb.sheetnames)

When this runs, Python opens the workbook and reads it as a structured file rather than as a live Excel interface. That distinction matters.

Inside Excel, you are usually interacting with the model visually. You click cells, inspect formulas, trace arrows, and move around sheet by sheet. Python approaches the workbook differently. It reads the file as data. That means it can inspect every worksheet, every cell, and every formula systematically.

In this example, the workbook contains three sheets:

['Inputs', 'Calculations', 'Outputs']

That already tells us something useful about the intended design of the model. The workbook appears to be organized using a familiar structure:

  • Inputs for assumptions
  • Calculations for the working logic
  • Outputs for final results

That separation is common in well-organized analytical models, and it gives the rest of the notebook something to work with. The code can now ask whether the workbook is actually behaving the way that structure suggests it should.

Step 2: Looking for hardcoded constants outside the Inputs sheet

The next section of the notebook scans the workbook for numeric constants that appear outside the Inputs sheet:

hardcoded_cells = set()
input_sheets = ["Inputs"]

for sheet in wb.sheetnames:
    ws = wb[sheet]
    for row in ws.iter_rows():
        for cell in row:
            if cell.data_type == "n":
                if sheet not in input_sheets:
                    if cell.value is not None:
                        hardcoded_cells.add(f"{sheet}!{cell.coordinate}")

This part is based on a simple modeling principle: in many structured workbooks, important assumptions should live in dedicated input areas, not be scattered across calculation sheets.

When assumptions are centralized, they are easier to review, update, and explain. If key numbers start appearing directly inside the calculation layer, the model becomes harder to trust. A reviewer has to ask: is this value intentional, or did someone overwrite a formula?

In this example, the notebook reports:

Hardcoded cells: {'Calculations!A3', 'Calculations!A4', 'Calculations!A2', 'Calculations!C2'}

This ties directly back to the previous post. There, I already discussed that A2, A3, and A4 were not especially concerning because they are just year labels. But Calculations!C2 was more significant. It represented a hardcoded revenue value in a place where I would normally expect a formula driven by inputs such as units and price.

That is the first reason this notebook matters. It is not just building a pretty visualization. It is carrying forward actual audit findings from the earlier script and placing them into the structural picture of the model.

Step 3: Expanding Excel ranges so the relationships are real

Next, the notebook defines a helper function called expand_range():

from openpyxl.utils import (
    coordinate_from_string,
    column_index_from_string,
    get_column_letter,
)

def expand_range(sheet, start, end):
    col1, row1 = coordinate_from_string(start)
    col2, row2 = coordinate_from_string(end)
    col1_i = column_index_from_string(col1)
    col2_i = column_index_from_string(col2)
    cells = []
    for col in range(col1_i, col2_i + 1):
        for row in range(row1, row2 + 1):
            coord = f"{get_column_letter(col)}{row}"
            cells.append(f"{sheet}!{coord}")
    return cells

This may look technical, but the idea behind it is very straightforward.

Excel formulas often refer to ranges instead of individual cells. For example, a formula might sum a range like E2:E4. To a human reader, that clearly means three cells. But if we want to build a dependency graph, we cannot treat E2:E4 as one mysterious blob. We need to expand it into its underlying parts:

  • E2
  • E3
  • E4

That is exactly what this function does. If we skip this step, the graph will miss real dependencies and the workbook will appear simpler than it actually is. A model might look like it has fewer connections than it really does, and that could hide the true influence of certain cells.

This is a good example of where Python adds value beyond Excel’s built-in tools. Excel can certainly evaluate a range formula correctly, but Python lets us unpack that formula into explicit relationships that can be counted, visualized, and analyzed.
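To make the expansion concrete, here is a stripped-down, standard-library-only sketch of the same idea, applied to the E2:E4 example. Unlike the openpyxl-based helper above, it only handles single-letter columns, so treat it as an illustration rather than a replacement:

```python
import re
from string import ascii_uppercase

def expand_range_simple(sheet, start, end):
    """Stdlib-only sketch of expand_range(), limited to single-letter columns."""
    # Split each endpoint into its column letter and row number
    (c1, r1), (c2, r2) = (re.match(r"([A-Z]+)(\d+)", ref).groups() for ref in (start, end))
    cols = ascii_uppercase[ascii_uppercase.index(c1):ascii_uppercase.index(c2) + 1]
    return [f"{sheet}!{c}{r}" for c in cols for r in range(int(r1), int(r2) + 1)]

print(expand_range_simple("Calculations", "E2", "E4"))
# ['Calculations!E2', 'Calculations!E3', 'Calculations!E4']
```

The range collapses into three explicit cell references, which is exactly what the graph-building step needs.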

Step 4: Building the dependency graph

This is the core of the notebook:

import re
import networkx as nx

G = nx.DiGraph()

cell_pattern = r"[A-Za-z_]+!\$?[A-Za-z]+\$?\d+"
range_pattern = r"([A-Za-z_]+)!([A-Za-z]+\d+):([A-Za-z]+\d+)"

for sheet in wb.sheetnames:
    ws = wb[sheet]
    for row in ws.iter_rows():
        for cell in row:
            location = f"{sheet}!{cell.coordinate}"
            if cell.data_type == "f":
                formula = cell.value
                ranges = re.findall(range_pattern, formula)
                for r in ranges:
                    rng_sheet, start, end = r
                    expanded = expand_range(rng_sheet, start, end)
                    for ref in expanded:
                        G.add_edge(ref, location)
                refs = re.findall(cell_pattern, formula)
                for ref in refs:
                    ref = ref.replace("$", "")
                    G.add_edge(ref, location)

At a high level, this code scans every formula in the workbook and asks a simple question:

Which cells does this formula depend on?

For every dependency it finds, it adds an edge to the graph. This is where the ideas of nodes and edges become important.

  • A node is a cell
  • An edge is a dependency relationship between cells

If one cell feeds another, the graph captures that as an arrow. So if Calculations!B3 depends on Inputs!B2, the graph includes a relationship like this:

Inputs!B2 → Calculations!B3

That arrow matters because it tells us something about the model’s logic. It says that the value in the calculation cell is not independent. It comes from somewhere else. If the source changes, the downstream cell changes too.

This is one of the biggest conceptual shifts in the notebook. Instead of seeing the workbook as just a set of formulas, we start seeing it as a directed network of cause and effect.

In this example, after building the graph, the notebook reports:

Nodes: 19
Edges: 10

So in this small workbook, Python has identified 19 participating cells and 10 relationships between them.

That may not sound dramatic, but it is the foundation for everything that follows. Once those relationships are represented as a graph, we can ask much more interesting questions than Excel’s built-in tools usually make easy.
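To see what those questions look like, here is a hedged sketch: a plain-Python breadth-first search over a toy edge list (standing in for the real workbook's graph), answering the same question networkx answers with nx.descendants():

```python
from collections import defaultdict, deque

# Toy (precedent, dependent) pairs, shaped like the G.add_edge(ref, location)
# calls above; the real notebook derives these from the workbook's formulas.
edges = [
    ("Inputs!B2", "Calculations!B2"),
    ("Calculations!B2", "Calculations!B3"),
    ("Calculations!B3", "Outputs!B2"),
]

downstream = defaultdict(list)
for src, dst in edges:
    downstream[src].append(dst)

def descendants(node):
    """Every cell that ultimately depends on `node` (what nx.descendants computes)."""
    seen, queue = set(), deque([node])
    while queue:
        for nxt in downstream[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

print(sorted(descendants("Inputs!B2")))
# ['Calculations!B2', 'Calculations!B3', 'Outputs!B2']
```

Change the input cell and every cell in that set is affected, which is precisely the cause-and-effect view the graph gives us.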

Step 5: Making sure disconnected hardcoded cells still show up

After building the graph from formulas, the notebook adds this step:

for cell in hardcoded_cells:
    if cell not in G:
        G.add_node(cell)

If a value is hardcoded and not referenced by any formulas, it may never appear in the graph automatically. That is because the graph is being built from formula relationships. No formula relationship means no edge, and possibly no node.

But from an auditing perspective, that kind of disconnected cell can be extremely important.

Take Calculations!C2 in this workbook. That cell contains a hardcoded revenue value. If nothing is pointing to it and it is not pointing anywhere else, it could easily disappear from the graph unless we explicitly add it. But the fact that it is disconnected is itself informative. It tells us that the value is not participating in the model’s logic the way we might expect.

So this step ensures that hardcoded constants still appear in the visualization, even if they are structurally isolated.

That matters because the absence of a connection can be just as meaningful as the presence of one. A disconnected calculation may indicate a harmless label, an abandoned piece of logic, or a genuine modeling error. The graph should give us the chance to see that.

Step 6: Coloring the nodes so the structure becomes readable

Next, the notebook assigns colors based on each node’s role:

def node_color(node):
    if node in hardcoded_cells:
        return "gold"
    if node.startswith("Inputs"):
        return "lightgreen"
    elif node.startswith("Calculations"):
        return "lightskyblue"
    elif node.startswith("Outputs"):
        return "salmon"
    return "lightgray"

This is not just a cosmetic step. It helps translate the graph into something an Excel user can interpret quickly.

The colors in this example mean:

  • green = inputs
  • blue = calculations
  • red = outputs
  • gold = hardcoded constants outside the input layer

That last category is especially useful because it overlays the audit results from the first notebook directly onto the model structure. Instead of only printing a list of suspicious cells, we can now see where those cells live inside the architecture of the workbook.

This is where the graph starts to become more than just a technical artifact. It becomes an explanatory tool.

For example, a gold node sitting inside the Calculations sheet tells a story. It says: this cell is behaving more like an input, but it is not living in the input layer. That is a structural inconsistency.

That is exactly what was happening with Calculations!C2.

Step 7: Forcing a layered layout that mirrors how models are designed

Then the notebook creates positions for the nodes:

layers = {
    "Inputs": 0,
    "Calculations": 1,
    "Outputs": 2,
}

pos = {}
y_positions = {}

for node in sorted(G.nodes):
    sheet = node.split("!")[0]
    layer = layers.get(sheet, 3)
    y_positions.setdefault(sheet, 0)
    pos[node] = (layer, -y_positions[sheet])
    y_positions[sheet] += 1

This step matters because raw network graphs can get messy very quickly. If we simply let a graphing library place everything wherever it wants, the result often looks like a tangle of arrows.

Here, instead of using a free-form layout, the notebook deliberately places the sheets in columns from left to right:

Inputs → Calculations → Outputs

That mirrors the intended design of the workbook itself.

This is a subtle but important choice. It makes the graph easier for Excel users to read because it aligns with the way they already think about a model. Inputs feed calculations, and calculations feed outputs.

So even though the graph is generated with Python, it still respects familiar spreadsheet design principles. The purpose is not to turn Excel into an abstract math problem. The purpose is to make the model’s structure more visible.

Step 8: Drawing the graph and sizing nodes by influence

The graph is then drawn with this code:

# Node colors come from the node_color() helper defined in the previous step
colors = [node_color(n) for n in G.nodes]

plt.figure(figsize=(12, 10))
sizes = [400 + len(nx.descendants(G, n)) * 100 for n in G.nodes]

nx.draw(
    G,
    pos,
    with_labels=True,
    node_color=colors,
    node_size=sizes,
    font_size=8,
    arrows=True
)

plt.title("Excel Model Dependency Graph")
plt.show()

Here’s the resulting visualization of our model:

[Figure: the model's dependency graph. Inputs appear in green, calculations in blue, outputs in red, and hardcoded constants in gold; node size reflects downstream influence.]

Two things are happening here. First, the notebook is visualizing the dependency network we just built. Second, it is sizing each node based on the number of downstream cells that depend on it. In other words, cells that influence more of the model appear larger.

This matters because not all cells are equally important. Some cells are relatively local. Others sit near the top of the dependency chain and affect many downstream calculations.

Sizing nodes this way helps the graph communicate influence, not just existence.

That is another place where Python adds value beyond Excel’s built-in auditing tools. Excel can tell you the direct precedents or dependents of a selected cell, but Python can help summarize influence more broadly across the model.

Even in a small workbook, this starts to reveal which parts of the spreadsheet are structurally important.
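One way to sketch that broader summary, again using a toy adjacency dict in place of the real graph, is to count reachable cells from each node and rank:

```python
# Toy dependency map: each cell points to the cells it feeds. In the notebook,
# the same information lives in the networkx graph G.
feeds = {
    "Inputs!B2": ["Calculations!B2"],
    "Calculations!B2": ["Calculations!B3", "Calculations!B4"],
    "Calculations!B3": ["Outputs!B2"],
    "Calculations!B4": ["Outputs!B2"],
}

def influence(node, graph):
    """Number of distinct downstream cells reachable from `node`."""
    seen, stack = set(), [node]
    while stack:
        for nxt in graph.get(stack.pop(), []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return len(seen)

nodes = set(feeds) | {d for deps in feeds.values() for d in deps}
ranked = sorted(nodes, key=lambda n: influence(n, feeds), reverse=True)
print(ranked[0])  # the cell the rest of the model depends on most
```

A ranking like this is exactly what the node sizes in the plot encode visually.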

Step 9: Identifying orphan cells

Finally, the notebook looks for orphan cells:

orphans = [n for n in G.nodes if G.degree(n) == 0]
print("Orphan cells:", orphans)

In this example, the result is:

Orphan cells: ['Calculations!A3', 'Calculations!A4', 'Calculations!A2', 'Calculations!C2']

This is one of the most useful outputs in the notebook because it reinforces the earlier audit findings with structural evidence.

The year labels in A2, A3, and A4 are disconnected because they are simply labels. That is fine.

But Calculations!C2 is different. It is disconnected for a more interesting reason: it is a hardcoded value sitting outside the normal assumption flow of the model.

That is exactly what I pointed out in the previous post. There, the script flagged Calculations!C2 as a hardcoded constant. Here, the graph shows us something more: it is not just hardcoded, it is also structurally isolated.

That combination makes the issue much easier to understand. It is no longer just a suspicious number on a list. It is a number that sits outside the model’s normal flow.
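In code, combining the two signals is a one-line set operation. The first two sets below are copied from the outputs above; year_labels is a hypothetical allow-list a reviewer might maintain for known-harmless cells:

```python
orphans = {"Calculations!A3", "Calculations!A4", "Calculations!A2", "Calculations!C2"}
hardcoded_cells = {"Calculations!A3", "Calculations!A4", "Calculations!A2", "Calculations!C2"}
year_labels = {"Calculations!A2", "Calculations!A3", "Calculations!A4"}  # hypothetical allow-list

# Hardcoded AND structurally isolated, minus known-harmless labels
suspicious = (orphans & hardcoded_cells) - year_labels
print(suspicious)  # {'Calculations!C2'}
```

The intersection narrows the review down to the one cell that genuinely sits outside the model's flow.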

How this compares to Excel’s built-in auditing tools

Excel already includes several useful tools for auditing models. Features such as Trace Precedents, Trace Dependents, and Show Formulas can be very helpful when you are trying to understand a particular formula or troubleshoot a specific section of a workbook.

Where these tools become more limited is at the workbook level. They are designed primarily for interactive, cell-by-cell inspection. That approach works well when you already know where to look, but it becomes harder when you want to step back and understand the structure of the entire model.

The Python approach used in this notebook is aimed at a different level of analysis. Instead of examining formulas individually, the script scans the workbook in a single pass, extracts the relationships between cells programmatically, and then summarizes the structure of the model.

In practical terms, this means we can inspect the entire workbook at once rather than reviewing formulas one at a time. It also allows the same logic to be applied consistently across multiple files, makes it possible to combine structural visualization with audit findings, and can automatically surface cells that appear disconnected or suspicious.

The goal is not to replace Excel’s built-in auditing features. Rather, Python provides an additional layer of visibility. Excel remains excellent for manually exploring individual formulas and tracing specific dependencies. Python, on the other hand, is well suited for stepping back, analyzing a workbook systematically, and creating repeatable checks. Used together, the two approaches complement each other very well.

Why this matters now

This kind of workflow is becoming increasingly relevant as spreadsheet generation becomes faster and easier. AI tools can now produce workbook structures, formulas, and even entire models from prompts. As that capability grows, the bottleneck shifts. The challenge is no longer simply producing spreadsheets; it is understanding them, validating them, and maintaining them.

That is where an approach like this becomes useful. With a relatively small amount of code, we can begin to inspect workbook structure automatically, visualize how logic flows through the model, highlight structural violations, and identify cells that fall outside the intended design.

None of this removes the need for human judgment. As the earlier post showed, some audit findings will inevitably be false positives that require interpretation. What it does provide is a faster and more focused way to review a model. Instead of manually searching through a workbook for potential issues, we can use code to narrow down where our attention is most needed.

Want to build tools like this for your team?

This example is intentionally small, but the underlying idea is very practical. Teams that rely heavily on Excel can use Python to build lightweight validation tools, auditing utilities, and analysis workflows around their spreadsheets.

If you want to set up tools like this for your team, strengthen Python capabilities around Excel, or build more reliable analytics workflows, take a look at my How I Work page:

That is where I outline how I help organizations combine Excel, Python, and modern AI tools in ways that are useful, practical, and grounded in real analytical work.

The post How to visually audit Excel workbooks with Python first appeared on Stringfest Analytics.

How to build a simple Excel model auditor with Python
Stringfest Analytics (https://stringfestanalytics.com/how-to-build-a-simple-excel-model-auditor-with-python/), Sat, 07 Mar 2026

With AI assistants and Excel agents now capable of producing entire workbooks from a prompt, we’re entering a world where models can be created in seconds. That’s exciting, but it also raises an obvious question: how do we trust them?

Auditing Excel models has always been difficult. Excel is designed as an instant recalculation engine: formulas update automatically, dependencies stretch across sheets, and logic can be embedded almost anywhere. While this flexibility is one of Excel’s strengths, it also makes models surprisingly hard to inspect systematically.

Many firms sell specialized spreadsheet auditing tools, but today we also have another option. Because Excel files are structured data, we can analyze them programmatically. With Python, we can build lightweight utilities that scan workbooks, inspect formulas, and flag potential issues automatically.

This post walks through a minimum viable example of building a simple Excel model auditor with Python. The goal isn’t to replace professional auditing software, but to demonstrate how easily we can start analyzing spreadsheet structure with code.

 

For this walkthrough, we’ll use a small sample Excel workbook that contains a typical analytical model: a few worksheets, some input values, and formulas that generate outputs across the workbook. Our Python script will open the file, scan the cells, and begin identifying useful auditing information such as where formulas appear, where values are hardcoded, and how calculations are distributed across the model.

Python is a natural fit for this kind of task. It’s excellent at parsing structured files, analyzing patterns like formulas, and producing summaries that would be tedious to generate manually in Excel. As AI-generated spreadsheets become more common, tools like this will likely become just as important as the tools that generate the models themselves.

Understanding the script & its benefits

Before diving into the details of the script, it helps to understand the basic idea. Instead of calculating results the way Excel normally does, this program reads the workbook as structured data and inspects its contents for patterns that often signal modeling problems.

The script uses two Python tools to do this. The openpyxl library allows Python to open an Excel workbook and inspect its sheets, cells, and formulas directly. The re library provides regular expressions, which make it easy to detect patterns inside formulas such as numbers, cell references, or links to other workbooks.

Once the workbook is loaded, the script scans every worksheet and every cell. As it moves through the model, it performs a few simple checks that are common in spreadsheet auditing.

For example, it looks for formulas that contain hardcoded numbers. In many analytical models, important assumptions should live in dedicated input cells rather than being embedded inside formulas. The script removes cell references from each formula and checks whether any numbers remain, which can indicate hidden constants.
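The full script isn't reproduced in this excerpt, but the check it describes can be sketched in a few lines, assuming references look like B4 or $A$1:

```python
import re

def has_hardcoded_number(formula):
    """Remove cell references, then see whether any digits survive."""
    no_refs = re.sub(r"\$?[A-Za-z]{1,3}\$?\d+", "", formula)  # drop refs like B4, $A$1
    return bool(re.search(r"\d", no_refs))

print(has_hardcoded_number("=B4*12"))   # True: the 12 is a hidden constant
print(has_hardcoded_number("=B2*C2"))   # False: only cell references
```

Any digits left after stripping references are, by this rule, numbers embedded directly in the formula.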

It also flags formulas that link to external workbooks, since these can introduce hidden dependencies that make models difficult to reproduce or maintain. The script then checks for numeric values entered directly into calculation sheets. Constants are allowed on designated input sheets, but values appearing elsewhere may indicate accidental overwrites or poor model structure.

Finally, the script looks for inconsistent formulas down a column. Many models apply the same calculation repeatedly across rows, so a formula that suddenly breaks the pattern may signal an editing mistake. By normalizing formulas and comparing their structure, the script can detect these inconsistencies automatically.
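As a sketch rather than the script's exact implementation: normalizing can be as simple as masking the row numbers in each reference, so formulas filled down a column collapse to the same signature:

```python
import re

def normalize(formula):
    """Mask row numbers so fill-down formulas share one signature."""
    return re.sub(r"(\$?[A-Za-z]{1,3}\$?)\d+", r"\1#", formula)

print(normalize("=B2*(1+Inputs!B4)"))  # =B#*(1+Inputs!B#)
print(normalize("=B3*(1+Inputs!B4)"))  # same signature: consistent fill-down
print(normalize("=B4*12"))             # =B#*12, which breaks the column's pattern
```

A row whose normalized formula differs from its neighbors' is a candidate for the inconsistency flag.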

At the end, any potential issues are printed as a simple audit report showing the worksheet location and the formula or value that triggered the alert.

You might wonder why this is useful when Excel already provides tools like Trace Precedents or Show Formulas. Those tools are helpful for inspecting individual cells, but they are still largely manual and workbook-specific. A Python script like this operates differently: it can scan an entire workbook programmatically, apply consistent auditing rules, and produce repeatable results across many files.

Another important point is that everything here runs locally and is entirely rule-based. No AI is involved. The script simply reads the workbook file and applies a set of deterministic checks. That makes it easy to customize, automate, and integrate into larger workflows when auditing or validating spreadsheets at scale.

Understanding the results

At this stage the goal is not a perfect audit, but a quick structural scan that highlights places worth inspecting.

Here’s a quick summary of what the script detected:

  • Several hardcoded constants placed directly in the Calculations sheet (cells A2, A3, A4, and C2). In many models, values like these would typically live on an Inputs sheet rather than inside the calculation layer.
  • A formula containing a hardcoded multiplier (=B4*12 in D4), which could represent a hidden assumption instead of referencing an input cell.
  • A few inconsistent formula patterns down columns in the Calculations sheet, suggesting the formulas may not follow the same structure from row to row.

None of these automatically mean the model is incorrect. But they are exactly the kinds of patterns auditors often look for when reviewing spreadsheets.

Next, let’s open the workbook and look at these flagged cells directly.

Harmless flags: timeline labels

Some of the findings do not concern me too much.

Cells A2, A3, and A4 on the Calculations sheet were flagged as hardcoded constants, but in this case they simply represent year numbers used to label the timeline of the model. In many spreadsheets this is perfectly reasonable. While it would be possible to generate these dynamically using a function such as SEQUENCE() so the timeline expands automatically, that level of flexibility is not always necessary.

Why the Inputs sheet was not flagged

You might also notice that constants on the Inputs worksheet were not flagged by the script. This behavior is intentional and reflects a common modeling convention.

Earlier in the script, we defined which sheets are allowed to contain constants:

input_sheets = ["Inputs"]

Later in the audit process, the script checks whether a numeric value appears in a worksheet that is not designated as an input sheet. If a number appears outside those allowed sheets, it is flagged as a potential hardcoded constant.

This mirrors a common spreadsheet design principle: inputs should be centralized in dedicated input areas, while calculation sheets should primarily contain formulas.

A real modeling issue

However, one finding does appear to represent a genuine modeling problem.

The script flagged the following cell:

Hardcoded constant cell
Value: 25000
Cell: Calculations!C2

In this case, the Year 1 revenue value has been entered directly as a number rather than being calculated from the underlying assumptions. In most analytical models, revenue should be derived from inputs such as Units and Price rather than hardcoded in the calculation sheet.

If those inputs exist on the Inputs worksheet, this value should likely be calculated as:

Revenue = Units * Price

A hardcoded revenue value bypasses the assumptions driving the model and can easily lead to inconsistencies if the inputs change.

A double offender

Another issue appears in cell D4 on the Calculations sheet, which the script flagged twice:

Hardcoded number inside formula
Formula: =B4*12
Cell: Calculations!D4

Inconsistent formula pattern
Formula: =B4*12
Cell: Calculations!D4

This cell violates two different auditing rules.

First, the formula contains a hardcoded constant. The multiplier 12 is embedded directly in the formula rather than referencing a value from the Inputs sheet.

Second, the formula breaks the expected pattern down the column. Earlier rows appear to follow a different structure, but this row switches to =B4*12, which causes the pattern check to flag it.

More importantly, the calculation itself appears incorrect. If this column represents cost, the correct logic would typically be:

Cost = Units Sold * Cost per Unit

Instead, the formula effectively calculates:

Cost = Units Sold * 12

which disconnects the calculation from the cost assumptions stored on the Inputs sheet.

A useful false positive

Still other findings are more nuanced. The script also flagged these formulas:

Inconsistent formula pattern
Formula: =B2*(1+Inputs!B4)
Cell: Calculations!B3

Inconsistent formula pattern
Formula: =B3*(1+Inputs!B4)
Cell: Calculations!B4

At first glance, these appear to be inconsistencies. In reality, the formulas are correct.

These rows are applying a growth rate from the Inputs sheet:

Next Year Value = Previous Year Value * (1 + Growth Rate)

The reason the script flags them is that year 1 typically does not have a growth calculation. The first year is usually a starting value, while later years grow from that base. Because of this, the formula pattern naturally changes after the first row.

From the script’s perspective this looks inconsistent, but from a modeling standpoint it is perfectly reasonable.

The limits of rule-based auditing

This highlights an important limitation of rule-based auditing.

Structural checks can quickly surface patterns that often indicate problems, but they cannot fully understand the intent of the model. As a result, some findings will inevitably be false positives that require human judgment.

In other words, scripts like this are extremely useful for narrowing down where to look. But catching every issue in a complex spreadsheet remains surprisingly difficult.

Conclusion

In the real world, a production-grade spreadsheet auditing system would take years to design, build, and deploy. Large firms that specialize in spreadsheet risk management invest heavily in tools that can analyze complex models, track dependencies, and enforce organizational standards across thousands of workbooks.

But deep down, many of those tools rely on the same basic idea we explored here in just a few minutes: treat the Excel workbook as structured data, scan it programmatically, and look for patterns that often signal modeling problems.

Even a small script can quickly surface issues such as hardcoded assumptions, broken formula patterns, or hidden dependencies. It won’t replace human judgment, but it can dramatically narrow down where to look.

That idea is becoming more important as Excel itself evolves. With AI assistants and Excel agents now capable of generating entire workbooks from a prompt, the challenge is no longer just building models. It’s understanding, validating, and maintaining them.

In other words, the tools that help us inspect and audit spreadsheets may become just as important as the tools that help us generate them. This simple example is just a starting point.

If you’re interested in exploring more workflows that combine Excel, Python, and modern AI tools to build more reliable analytical models, I share a growing collection of guides, experiments, and practical examples inside my membership.

You can learn more about it here:

It’s where I’m documenting many of the patterns and tools I’m experimenting with as Excel, automation, and AI continue to evolve.

The post How to build a simple Excel model auditor with Python first appeared on Stringfest Analytics.

Python in Excel: How to edit the initialization environment
Stringfest Analytics (https://stringfestanalytics.com/python-in-excel-how-to-edit-the-initialization-environment/), Wed, 15 Oct 2025

In an earlier post I talked about the Python in Excel initialization environment:

I likened it to the starting lineup of packages that load every time you open a workbook. It’s what gives you access to pandas, seaborn, and all the other core tools right inside Excel, along with some behind-the-scenes plumbing that lets Python and Excel talk to each other.

As of October 2025, you can now customize this environment. That means you can decide which libraries, helper functions, and even your own custom classes load automatically whenever the workbook opens.

In practice, this opens up a lot of fun possibilities. You can preload machine learning packages, set global formatting so every chart and table looks the way you want, or even create little interactive summaries that pop up when you hover over a cell.

We’ll try a few of these ideas using the mpg dataset, a simple but useful sample of car data. You can follow along with the exercise file below.

 

1. Preload extra packages

The first thing we can do with editable initialization is call in extra packages.

Keep in mind that Python in Excel doesn’t include everything yet. If you want to see which packages are currently available, check out the below post:

For example, we might want to use scikit-learn to run a quick linear regression. We can add it to our initialization so it’s always ready whenever we open the workbook.
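The exact line lives in a screenshot in the original post, but it would be something like this sketch (the specific estimator is my choice):

```python
# In the initialization editor: preload scikit-learn so it loads
# automatically with every workbook
from sklearn.linear_model import LinearRegression

# Then any Python cell can run a quick regression immediately
# (toy numbers here, not the mpg data):
model = LinearRegression().fit([[100], [150], [200]], [30, 25, 20])
slope = model.coef_[0]  # negative here: mpg falls as horsepower rises
```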

Once that line is in the initialization editor, you can use scikit-learn like any other built-in library.

Regression coefficients sklearn

While you might choose to add imports, I wouldn’t recommend removing anything that’s there by default. Some of it might look random, but a lot of it is important plumbing that makes Python in Excel work properly. Packages like excel handle how data moves between cells and Python, and libraries like pandas and numpy are what most of Excel’s Python features rely on under the hood.

2. Define reusable helper functions

The next thing we can do with editable initialization is add a few helper functions that you find yourself writing over and over.

Think of these like the “utility belt” of your Python in Excel setup. They’re small functions you can reuse across your workbook to clean up data, format values, or calculate common stats. Because they’re baked into your initialization, they’re ready to go every time you open the file.

Here’s an example of a couple we might add to this workbook:
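The snippet itself isn't reproduced in the text, so here's a sketch of what the two helpers might look like (the names come from the post; the bodies are my assumption):

```python
import pandas as pd

def format_currency(value):
    """Show a number as dollars and cents, e.g. 1234.5 -> '$1,234.50'."""
    return f"${value:,.2f}"

def kpi_summary(series):
    """Mean, min, and max of a numeric column in one small DataFrame."""
    return pd.DataFrame(
        {"mean": [series.mean()], "min": [series.min()], "max": [series.max()]}
    )
```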

The format_currency() function is great for any data you want to show as dollars and cents — maybe sales figures, costs, or anything you want to make presentation-ready. And kpi_summary() gives you a quick way to summarize your data without writing mean(), min(), and max() in separate cells.

Once these are part of your initialization, you can call them right in Excel like this:

KPI summary Python in Excel

To make sure our new cost-per-mile column is formatted a specific way, we can apply our formatting function right in the formula. Excel can handle formatting too, but this gives us a simple, formula-driven way to control how the output looks from Python.

If we want to be able to change the gas price easily and use it in other parts of the workbook, we can make it a global variable in our initialization. For example, we could add a line like

gas_price_per_gallon = 3.25

in the initialization editor. That way, if the price changes later, we only have to update it once.
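Putting the constant to work, the cost-per-mile calculation mentioned earlier might look like this sketch (the column names are assumptions):

```python
import pandas as pd

gas_price_per_gallon = 3.25  # global constant from the initialization

mpg_df = pd.DataFrame({"mpg": [18.0, 26.0, 32.5]})  # stand-in for the real data
# Dollars of gas burned per mile driven
mpg_df["cost_per_mile"] = gas_price_per_gallon / mpg_df["mpg"]
```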

MPG format currency

3. Set consistent styling defaults

Next, let’s set up a global look for our charts and tables:
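The styling block isn't shown inline, but based on the description of the white grid, deep palette, and talk context, it presumably resembles:

```python
import seaborn as sns

# Global styling defaults: readable grid, balanced colors, larger fonts
sns.set_theme(style="whitegrid", palette="deep", context="talk")
```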

What this does is establish a clean, consistent look for all of your seaborn plots. The white grid background makes it easier to read values, the deep palette gives you balanced colors that work well for presentations, and the talk context increases font sizes so your charts are easier to see on slides or dashboards.

Once you add this to your initialization, every plot you make will follow the same styling automatically. You don’t have to restyle every chart or remember which color palette you used last time.

For example, this scatterplot of horsepower versus miles per gallon uses that same theme without any extra formatting in the code:

MPG vs horsepower scatterplot

4. Create custom classes and tooltips

Last but not least, let’s look at something a little more advanced. So far, we’ve been adding functions, constants, and settings to our initialization. Those are great for quick tasks, but sometimes you want to bundle related things together: for example, a dataset and a few operations that always go with it.

That’s what this class does. It’s a small, reusable object that holds a dataset and a couple of ready-made actions you can run on it.

Here’s what’s happening.
The word class just means we’re defining a little container for data and the functions that go with it. In this case, the class is called QuickStats. When we create a new one, it expects a dataset (mpg_df) to be passed in. Inside the class, that dataset is cleaned up a bit, and a few summary numbers are calculated — how many rows there are, the average miles per gallon, and the average weight.

After that, the class has two built-in actions, called methods.

  • tooltip() creates a small DataFrame you can use as an at-a-glance summary in Excel.
  • chart() creates a quick scatterplot showing horsepower versus miles per gallon.
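The class definition appears only as a screenshot in the original, but from the description it might be sketched as follows (column names come from the mpg dataset; the cleanup details are assumptions):

```python
import pandas as pd
import matplotlib.pyplot as plt

class QuickStats:
    """Bundle a dataset with a few ready-made summaries."""

    def __init__(self, mpg_df):
        # Light cleanup: drop rows missing the columns we summarize
        self.df = mpg_df.dropna(subset=["mpg", "horsepower", "weight"])
        self.n_rows = len(self.df)
        self.avg_mpg = self.df["mpg"].mean()
        self.avg_weight = self.df["weight"].mean()

    def tooltip(self):
        """Small DataFrame for an at-a-glance summary card in Excel."""
        return pd.DataFrame({
            "rows": [self.n_rows],
            "avg mpg": [round(self.avg_mpg, 1)],
            "avg weight": [round(self.avg_weight, 1)],
        })

    def chart(self):
        """Quick scatterplot of horsepower versus miles per gallon."""
        fig, ax = plt.subplots()
        ax.scatter(self.df["horsepower"], self.df["mpg"])
        ax.set(xlabel="horsepower", ylabel="mpg")
        return fig
```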

Once this is in your initialization, you can use it in Excel like this:

QuickStats mpg tooltip

That produces a neat little table card summarizing your data.

Or you can run:

QuickStats chart

to get a chart, automatically styled with the theme you set earlier.

The win here is that you’ve packaged up a repeatable analysis into something that feels like its own tool. You don’t need to re-type formulas or copy and paste code. You just call the class and it does the work for you.

This is a simple example of object-oriented programming, which just means building small, reusable “objects” that combine your data and the actions you take on it. You don’t need to master OOP to benefit from it, but it’s a great way to keep your analyses organized and consistent across workbooks.

Conclusion

Here’s the complete initialization code with everything we covered in this post:

When you edit your initialization script in Excel, make sure to save your changes before closing the editor. Then go to Formulas > Python > Reset > Reset Runtime for the changes to take effect.

That reset step is important because it restarts the Python engine that Excel uses in the background. Until you do that, Excel is still running the previous environment in memory, so your new imports, functions, or variables won’t load.

Editable initialization is one of those quiet but powerful upgrades to Python in Excel. It saves setup time, makes your workbooks more consistent, and helps you keep analysis tools, constants, and styles in one place.

The quick wins are being able to open any file and start working right away, with your preferred imports, formatting, and helper functions already loaded. The main highlight is consistency: you can build a repeatable workflow that looks and behaves the same across projects.


Python in Excel: How to understand the random walk with Copilot https://stringfestanalytics.com/python-in-excel-how-to-understand-the-random-walk/ Mon, 15 Sep 2025 14:00:06 +0000 https://stringfestanalytics.com/?p=15920 As an Excel analyst, you’ve probably had moments where your data seemed to show clear trends: profits rising, sales dropping, or costs climbing steadily. But sometimes, these apparent patterns might actually be random, tricking you into thinking there’s more control and predictability than there really is. That’s the essence of the “random walk.” It illustrates how purely […]

As an Excel analyst, you’ve probably had moments where your data seemed to show clear trends: profits rising, sales dropping, or costs climbing steadily. But sometimes, these apparent patterns might actually be random, tricking you into thinking there’s more control and predictability than there really is.

That’s the essence of the “random walk.” It illustrates how purely random processes can look structured, meaningful, and even predictable. Why should you care? Because knowing the difference between true signals and random noise helps you make better, smarter decisions in your analysis, whether it’s forecasting sales, tracking performance, or spotting real trends.

Fortunately, exploring random walks doesn’t require complicated setups or fancy software: just Excel, Python, and Copilot. Open a blank Excel worksheet, launch Copilot, start Advanced Analysis, and you’re ready to follow along.

Simulating a random walk with coin flips

Let’s take a moment to explore randomness using a simple example. Imagine flipping a coin 100 times. Each heads moves you one step forward (+1), each tails one step back (-1). After every flip, you track your position to see how far you’ve drifted from the start.

“Make a Python script that simulates flipping a coin 100 times. Each flip moves you up (+1) or down (–1). Keep a running total so we can see how far we drift. Put the results in a DataFrame with columns Step and Position, and make a simple line plot of the position over time. Explain in comments what’s happening in plain English.”
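Copilot's generated code varies from run to run, but it typically amounts to something like this sketch:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)          # seed so the walk is reproducible
flips = rng.choice([1, -1], size=100)    # +1 for heads, -1 for tails
position = np.cumsum(flips)              # running total: how far we've drifted

walk = pd.DataFrame({"Step": np.arange(1, 101), "Position": position})
# walk.plot(x="Step", y="Position")      # line plot of the position over time
```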

Copilot responds by creating a neatly organized DataFrame, listing each step alongside your position. It also produces a line chart visualizing your journey through these random flips:

Random walk coin toss results

When you examine this chart, you’ll notice the position line moves unpredictably, sometimes drifting upward, other times downward. Each flip was random, yet it’s natural to try to spot trends or meaningful patterns here. Randomness can trick us into seeing patterns that aren’t really there. By practicing simulations like this, you’ll develop intuition about what’s meaningful and what’s simply noise in real-world datasets.

Simulating stock prices with a random walk

Instead of flipping a coin, let’s start our simulation this time with a stock price of 100, then add a small, random increase or decrease each trading day for one year (252 days). This subtle daily randomness resembles the fluctuations you’d see in actual stock data. Here’s exactly what you can ask Copilot to do:

“Now change the random walk so it looks like a stock price. Start at 100 and add a little random up or down each day for 1 year (252 days). Plot the stock price. Add comments explaining why this looks kind of like a real stock price (sometimes up, sometimes down, but trending).”
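A hedged sketch of the kind of script Copilot produces here (the drift and volatility numbers are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n_days = 252
# Small random daily moves around zero: ~0.05% drift, 1% volatility (assumed)
daily_returns = rng.normal(loc=0.0005, scale=0.01, size=n_days)
price = 100 * np.cumprod(1 + daily_returns)  # start at 100, compound each day
```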

Simulated stock prices random walk

Notice how, like the coin toss simulation, the stock price doesn’t follow a straight line. Instead, it fluctuates over time, occasionally dipping downward, sometimes climbing higher again. Even though each individual day’s change is random, the result looks strikingly similar to actual stock market behavior.

But here’s the thing: One single forecast can only show one possible future. In real life, many outcomes are possible.

Think of the typical forecasts you see in business, like projected revenues, sales goals, or stock prices. They’re often presented as one neat line or a single estimate. But relying solely on one forecast can give a false sense of certainty. It’s easy to overlook that the real world rarely matches that neatly drawn line.

Instead, it’s crucial to embrace uncertainty. By visualizing many different scenarios, you can better understand the full range of possible outcomes. This helps you plan for the unexpected, identify potential risks, and avoid overconfidence in a single prediction.

Creating and visualizing multiple random walk simulations

To truly appreciate this idea, let’s extend our simulation and visualize multiple scenarios at once.

“Simulate 100 random stock price paths instead of just one. Plot them all on the same chart so it looks messy. Add comments explaining why one single forecast isn’t enough — in real life we could end up with any of these.”
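Vectorizing the same idea across 100 paths might look like this sketch (the drift and volatility values are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
n_paths, n_days = 100, 252
returns = rng.normal(0.0005, 0.01, size=(n_paths, n_days))
paths = 100 * np.cumprod(1 + returns, axis=1)  # one row per possible future
# for p in paths: plt.plot(p, alpha=0.3)       # the deliberately messy chart
```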

100 simulated stock prices over 1 year

At first glance, this chart looks chaotic, and that’s precisely the point. Each line represents one possible scenario that might unfold. Some scenarios end positively, some negatively, and others hover in the middle.

Here’s the key insight for you as an Excel analyst: Relying on a single forecast can be risky, because real-life outcomes can vary widely. When you visualize multiple possible futures, you’re reminded of this uncertainty. This helps you make decisions more cautiously and realistically, preparing you for multiple possibilities instead of placing too much confidence in one narrow prediction.

Visualizing the “cone of uncertainty”

The previous chart showed us 100 possible outcomes. But realistically, presenting hundreds of scenarios isn’t always practical or helpful. Instead, let’s summarize all these possibilities by showing a clear range of likely outcomes.

Specifically, we’ll calculate the 5th, 50th (median), and 95th percentiles from these 100 scenarios for every day of our simulated year. This creates a visual “uncertainty cone,” clearly communicating not only what’s possible, but also what’s probable.

Here’s how you’d prompt Copilot:

“From the 100 random paths, calculate the 5th, 50th, and 95th percentiles for each day. Plot a shaded cone (5–95%) with the median line in the middle. Add comments explaining how this cone shows a range of likely futures, not just one line.”
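The percentile math behind the cone is a few lines of NumPy; a self-contained sketch (simulation parameters are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
paths = 100 * np.cumprod(1 + rng.normal(0.0005, 0.01, size=(100, 252)), axis=1)

low = np.percentile(paths, 5, axis=0)    # pessimistic edge of the cone
mid = np.percentile(paths, 50, axis=0)   # median path
high = np.percentile(paths, 95, axis=0)  # optimistic edge
# plt.fill_between(range(252), low, high, alpha=0.3); plt.plot(mid)
```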

Uncertainty cone random walk

Instead of placing too much confidence in a single prediction, this visualization helps you recognize the inherent uncertainty of forecasts. As an analyst, presenting data in this way helps decision-makers understand risk, plan more realistically, and prepare for multiple potential futures—not just the one we hope or expect to see.

Visualizing the distribution of final outcomes

The uncertainty cone showed a broad range of likely outcomes. But let’s focus specifically on the end of our forecast period. If we simulate 500 random stock price paths, where do prices usually end up after one year?

You can prompt Copilot with the following instruction:

“After simulating 500 paths, make a histogram of the final prices after 1 year. Mark the average, the low end (5%), and the high end (95%). Write a short caption in plain English: ‘Most of the time the stock ends up around X, but there’s a chance it could be much lower or higher.'”
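A sketch of the final-price simulation (the summary numbers in your own run will differ):

```python
import numpy as np

rng = np.random.default_rng(2)
# 500 paths of 252 daily returns; keep only each path's last price
final = (100 * np.cumprod(1 + rng.normal(0.0005, 0.01, size=(500, 252)),
                          axis=1))[:, -1]

avg = final.mean()
lo, hi = np.percentile(final, [5, 95])  # low and high ends of the range
# plt.hist(final, bins=30); plt.axvline(avg); plt.axvline(lo, ls="--"); ...
```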

Distribution of stock prices random walk

This histogram summarizes clearly how our 500 simulations played out. The average final stock price lands around 114, shown by the solid vertical line. But notice the two dashed lines: at the lower end (5%), the stock price could drop to around 87, while at the higher end (95%), it could climb to around 146.

Conclusion

As explored in this post, randomness plays a larger role in your data than you might initially expect. A single forecast can feel clear and precise, but it rarely captures the true uncertainty of real-world outcomes. By visualizing multiple scenarios, using uncertainty cones, and examining distributions of potential outcomes, you communicate clearly about risk and uncertainty, helping decision-makers set more realistic expectations.

These simulations have limitations, however. Real-life data often includes specific trends, seasonal variations, or sudden shocks that our simplified random walk model doesn’t fully address. To extend your analysis further, consider incorporating historical patterns into your simulations or exploring more sophisticated forecasting techniques like Monte Carlo analysis.

Embracing uncertainty makes you a more effective analyst by improving your judgment, clarifying your communication, and better preparing you and your team for a variety of potential futures.

If you’d like to explore these concepts directly, download the complete Excel sample solution file below:

 


Python in Excel: How to check Benford’s Law with Copilot https://stringfestanalytics.com/python-in-excel-how-to-check-benfords-law-with-copilot/ Fri, 12 Sep 2025 20:36:22 +0000 https://stringfestanalytics.com/?p=15856 Numbers don’t spread out evenly across digits. In many real datasets, the leading digit “1” shows up about 30% of the time while “9” hardly appears. That’s Benford’s Law. If your numbers don’t follow this pattern when they should, it could mean something odd or even fraudulent is going on. The reason comes from scale. […]

Numbers don’t spread out evenly across digits. In many real datasets, the leading digit “1” shows up about 30% of the time while “9” hardly appears. That’s Benford’s Law. If your numbers don’t follow this pattern when they should, it could mean something odd or even fraudulent is going on.

The reason comes from scale. When values stretch across different magnitudes like dollars, populations, or GDPs, lower digits cover more of the number line. That makes “1” appear more often than “9.”

Excel users should care because this is a quick way to sanity check data. Doing it in plain Excel is messy, but with Copilot and Python in Excel you can generate the code, run the test, and see results right in the sheet.

We’ll be working with GDP data from the World Bank. It spans everything from tiny island nations to the biggest economies, which makes it a perfect test case for Benford’s Law. Download the exercise file below to follow along:

 

Frequency counts of first digit

To check Benford’s Law, I asked Copilot:

“Using GDP data, extract the first digit from each country’s GDP value. Count how frequently each digit from 1 to 9 occurs as the first digit.”
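A sketch of the kind of code Copilot writes for this step (the GDP values here are stand-ins, not the World Bank data):

```python
import pandas as pd

def first_digit(x):
    """Leading digit of a positive number, e.g. 1.93e12 -> 1."""
    return int(str(abs(x)).lstrip("0.")[0])

gdp = pd.Series([1.9e13, 2.1e12, 8.4e11, 1.2e10, 4.3e9])  # made-up values
counts = gdp.apply(first_digit).value_counts().sort_index()
```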

Benford's GDP frequency counts

Copilot returned a simple frequency table. The digit 1 appeared 69 times as the first digit, far more than any other. The counts then tapered off, with 2 showing up 50 times, and by the time we reach 9, it only appeared 8 times.

That “long tail” is exactly what Benford’s Law predicts: in many naturally occurring datasets, smaller leading digits like 1s and 2s are far more common than larger ones like 8s and 9s. Why? Because numbers grow exponentially across orders of magnitude. A country’s GDP is just as likely to fall between 10 and 20 billion as between 80 and 90 billion, but the first digit in those ranges will be very different.

This quick count shows the GDP data behaving as expected. If we saw a flat or unusual distribution, that would be a red flag. Analysts sometimes use this property of Benford’s Law to detect anomalies or even fraud in financial reporting.

Visualizing the actual distribution versus Benford’s Law

Once I had Copilot tally up the first digits, I asked it to take the next step:

“Calculate the percentage distribution of these first digits and compare them visually against the expected distribution according to Benford’s Law.”
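For reference, Benford's expected share of each leading digit d is log10(1 + 1/d), which is easy to compute directly:

```python
import numpy as np

digits = np.arange(1, 10)
benford = np.log10(1 + 1 / digits)  # expected share of each leading digit
# digit 1 -> ~30.1%, digit 9 -> ~4.6%; compare against observed counts / total
```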

Benford's law plot

Copilot generated a side-by-side chart. The blue bars show the observed GDP data; the orange bars are the theoretical Benford’s Law distribution. Notice how closely they track. About 30% of GDP figures start with the digit 1, while fewer than 5% start with 9: almost a perfect fit with Benford’s prediction.

Interpreting the results

To push the analysis further, I asked Copilot to highlight where the GDP data deviates from Benford’s Law.

“Highlight significant deviations between actual first-digit frequency versus Benford’s Law expected frequency, if any. Suggest possible interpretations or concerns from a business auditing perspective.”

Copilot’s commentary pointed out where the GDP data seemed to deviate from Benford’s Law… a few too many 2s and 4s, a few too few 8s and 9s. But are those differences actually meaningful, or just random noise in the dataset?

Testing for statistical significance

That’s where a statistical test comes in. I asked Copilot to run a chi-square test to formally compare the observed distribution against Benford’s Law.

“Perform a statistical test (e.g., chi-square) to quantify the deviation from Benford’s Law.”
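With SciPy, the test itself is only a few lines. In this sketch, only the counts of 69, 50, and 8 come from the post; the other digit counts are filler so the example runs:

```python
import numpy as np
from scipy.stats import chisquare

observed = np.array([69, 50, 35, 27, 22, 18, 14, 10, 8])  # digits 1..9
# Expected counts under Benford's Law, scaled to the same total
expected = observed.sum() * np.log10(1 + 1 / np.arange(1, 10))
stat, p = chisquare(observed, f_exp=expected)
# p above 0.05 -> no significant deviation from Benford's Law
```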

Chi square test on Benford's law data

The result was a chi-square statistic of about 11.44 with a p-value of 0.18. Since that’s well above the usual 0.05 threshold, we don’t have evidence that the GDP data significantly deviates from Benford’s prediction. In other words, the small bumps we saw in the bar chart are well within the range of normal variation.

Conclusion

So what does all this mean? Benford’s Law isn’t just a neat math curiosity. It’s a practical lens for spotting when numbers might not add up. With Copilot and Python in Excel, you can go from raw data to statistical tests without leaving your spreadsheet.

For auditors, that means a quick way to flag suspicious ledgers. For analysts, it’s a sanity check on whether a dataset behaves like it should. And for anyone working with large, messy numbers, it’s a reminder that the patterns in data can be just as important as the numbers themselves.

The takeaway: if your digits line up with Benford’s Law, good… you can have more confidence in your dataset. If they don’t, that’s not proof of fraud, but it is a signal worth investigating. Either way, Excel now gives you the power to test it yourself with just a few prompts.


How I became an Excel trainer… the Stringfest origin story https://stringfestanalytics.com/how-i-became-an-excel-trainer-the-stringfest-origin-story/ Sun, 31 Aug 2025 17:43:38 +0000 https://stringfestanalytics.com/?p=15833 Whenever I get asked the same thing two or three times, I figure it’s worth turning into a post. So I am genuinely surprised I haven’t written about how I became an Excel trainer yet. I’ve written about why my business is called Stringfest but not how I got into this business. So let’s change […]

Whenever I get asked the same thing two or three times, I figure it’s worth turning into a post. So I am genuinely surprised I haven’t written about how I became an Excel trainer yet. I’ve written about why my business is called Stringfest but not how I got into this business.

So let’s change that in this post.

Liberal arts beginnings

I went to a liberal arts college mostly because I loved writing and exploring different ideas. And I did get a great education, which I credit toward my personal and professional happiness. One thing I absolutely didn’t explore during that time, though, was Excel! I might’ve used it occasionally to make a quick plot or run a simple regression, but I never really thought much about it.

One professor kept encouraging us to learn Excel, yet I didn’t really understand why. It was one of those “unknown unknowns.” I couldn’t grasp why office workers relied on it so heavily, mostly because that whole world felt completely alien to me back then.

And let me tell you, that didn’t help when graduation came around. It was the aftermath of the 2008 financial crisis, I graduated in Michigan and was looking for jobs back home in Cleveland. Wrong place, wrong time much? I ended up heading to grad school to get a master’s in finance, thinking I’d get another year to pull myself together and learn some solid “hard” skills.

It kind of worked—I learned all sorts of useful stuff about WACC, DCF, LBOs, the efficient frontier, and so forth—but guess what? I still wasn’t good enough with Excel. Interviews made it painfully clear that not being an Excel wizard was costing me job opportunities. Deep down, I was still a liberal arts kid who wanted to explore ideas more than spreadsheets.

The Excel wake-up call

Eventually, I did get a job, and my lack of Excel skills hit me hard. I was slow, made mistakes, and honestly just didn’t enjoy my job. Being someone who needs to do things optimally, I dove headfirst into learning Excel. I even started blogging about it. Initially, I blogged about lots of things, but the Excel content was what stuck. So I ran with it.

I moved through a few jobs, improving but still seeing firsthand how badly most organizations handled data literacy. Companies had fancy strategies and initiatives, but none of it meant much without strong Excel fundamentals. It was clear business was changing, and digital innovation was going to separate winners from losers. I felt like many companies just weren’t ready.

Back to school (again)

At the same time, I still had an itch for more education (liberal arts habits die hard!), so I saw grad school as a way to transition out of my job and maybe build an online Excel teaching side hustle into something more sustainable. I enrolled in a PhD program in information systems, which was actually pretty interesting in many ways. I learned a ton about digital innovation and research methods.

But, ultimately, the PhD path wasn’t the right fit for me, and I didn’t finish. Probably for the best, as I didn’t want to commit my life to academia, but the move made job hunting even harder afterward. Companies were skeptical about my random career shifts and multiple master’s degrees.

Fortunately, my blog was still kicking (without a ton of motion, admittedly), even though Excel took a backseat during grad school while I taught myself R and Python for statistical modeling and predictive analytics. But post-grad life was tough. Consulting gigs and course sales were sparse, and I nearly gave up.

Thinkful and finding direction

Then, a friend referred me to Thinkful, a now-defunct career accelerator, which was launching a data analytics bootcamp at the time. They initially brought me on for a couple weeks as an Excel subject matter expert, but I ended up staying for three years! I wrote tons of curriculum, helped run the program, and got deep into instructional design. It was awesome.

When that chapter ended, I thought I’d help other organizations build similar data bootcamps. Unfortunately, that idea didn’t have a good product-market fit. Instead, the market wanted Excel. I felt a little typecast as the “Excel guy,” which bothered me because there’s so much crucial analysis work you can’t do in Excel, particularly robust statistical modeling. Still, I needed income, so I stuck with Excel but kept pushing to integrate deeper analytical tools.

The great convergence

Thankfully, the landscape shifted dramatically. Now, with Python integration in Excel, AI tools like Copilot, and automation options like Power Automate and Office Scripts, everything started converging perfectly. Suddenly, I wasn’t forced to choose between Excel and advanced analytics.

I could teach Excel users in finance, accounting, and operations to become citizen data scientists, analysts, engineers, and BI developers right within Excel. My goal became empowering people to handle data efficiently, autonomously, and without waiting around for IT or budgeting for pricey software.

My mission today

This vision informed my content. My first book showed how Python and R could empower Excel users as citizen data scientists, long before Python in Excel was available. My second book focused on modern Excel’s BI toolkit: Power Query, Power Pivot, dynamic arrays, and early AI tools like Analyze Data (now largely superseded by Copilot). Thanks to my liberal arts background and PhD training, I became adept at weaving technical tools into broader professional narratives, whether designing LinkedIn Learning courses or training analyst cohorts at major banks.

Today, when I’m creating content or training materials, I think back to being a new analyst, overwhelmed by chaotic spreadsheets and inadequate skills. I recall how critical statistical literacy felt when making serious decisions. And I empathize with stressed managers who shouldn’t have to worry about IT red tape or software licenses. Those are the things driving my work now.

This post is pretty off-the-cuff, but if there’s interest, maybe I’ll expand it later. For now, at least I have something to point to the next time I’m asked.


Copilot in Excel: How to do social influencer analysis with Python https://stringfestanalytics.com/copilot-in-excel-how-to-do-social-influencer-analysis-with-python/ Wed, 27 Aug 2025 18:16:39 +0000 https://stringfestanalytics.com/?p=15813 Understanding who your most influential customers are—those who consistently recommend your business—is essential for sustainable, cost-effective growth. This approach, called social influencer analysis, identifies customers who act as connectors and referral hubs within your customer base. Historically, performing this type of analysis required complex Excel formulas or specialized analytics software, placing it beyond the reach […]

Understanding who your most influential customers are—those who consistently recommend your business—is essential for sustainable, cost-effective growth. This approach, called social influencer analysis, identifies customers who act as connectors and referral hubs within your customer base. Historically, performing this type of analysis required complex Excel formulas or specialized analytics software, placing it beyond the reach of many everyday Excel users.

With Python’s NetworkX package now directly accessible inside Excel, performing sophisticated influencer analysis has become intuitive and visual. NetworkX quickly reveals referral relationships, measures influencer importance through centrality metrics, and identifies natural customer communities, giving Excel users a significant analytical advantage. Better yet, by leveraging Copilot and generative AI, you can perform this entire analysis simply by using clear, natural-language prompts.

To get started, download the exercise file below, which includes sample customer referral data you can use to practice these techniques.

 

Creating a referral network

Once you’ve loaded your referral data into Excel, you’re ready to visualize the network of customer referrals. Using Excel’s built-in Copilot and generative AI, you can prompt it to create a clear and intuitive network graph:

“Using NetworkX, create a directed network graph where each node is a customer, and each edge represents a referral (from referrer_customer to referred_customer). Visualize the graph clearly labeling all nodes.”
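Under the hood, the generated code amounts to building a directed graph from the referral pairs; a sketch with made-up names:

```python
import networkx as nx

# referrer -> referred pairs (toy data standing in for the exercise file)
referrals = [
    ("Alice", "Bob"), ("Alice", "Carol"),
    ("Bob", "Dave"), ("Carol", "Eve"), ("Grace", "Alice"),
]
G = nx.DiGraph()
G.add_edges_from(referrals)
# nx.draw_networkx(G, with_labels=True)  # labeled network drawing
```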

Directed network graph Copilot

The resulting visualization represents customers as nodes (points), with arrows indicating the direction of referrals from one customer to another. From this directed graph, it’s easy to immediately spot key connectors (customers who refer many others, indicated by multiple outgoing arrows) as well as chains of referrals flowing through the customer network.

At a glance, you can see influential referrers like Alice, Bob, and Grace, each having multiple connections. You can also observe referral chains such as Victor → Uma → Tina → Steve, highlighting how certain customers can indirectly drive growth beyond immediate connections.

This powerful yet intuitive visualization provides immediate insights into referral behavior, identifying influential customers and potential areas to strategically encourage even more referrals.

Identifying the most influential customers

After visualizing the customer referral network, the next step is to quantify which customers are the most influential: those who directly referred the highest number of new customers. NetworkX simplifies this step by calculating out-degree centrality, a measure of how many referrals each customer made relative to the entire network.

Using Copilot, we enter the following clear and simple prompt:

“Use NetworkX to calculate centrality metrics for each customer node in the referral graph. Identify the top 5 customers based on their out-degree centrality, clearly displaying each customer’s name along with their centrality score.”
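A sketch of the centrality step (toy edges; NetworkX's out_degree_centrality divides each node's out-degree by n - 1):

```python
import networkx as nx

G = nx.DiGraph()
G.add_edges_from([
    ("Alice", "Bob"), ("Alice", "Carol"), ("Alice", "Dave"),
    ("Bob", "Eve"), ("Carol", "Frank"),
])
centrality = nx.out_degree_centrality(G)  # out-degree / (n - 1)
# Top 5 referrers by centrality score
top = sorted(centrality.items(), key=lambda kv: kv[1], reverse=True)[:5]
```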

Out degree centrality Copilot

The results show the five customers who are most actively referring new customers: Alice, Carol, Grace, and Bob are tied as top influencers, each with an out-degree centrality score of approximately 0.0952. This indicates these customers have made the most referrals compared to the rest of the network. Quinn also appears as an influential customer with a centrality score of about 0.0476, meaning she’s also actively contributing to customer growth.
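For reference, the calculation Copilot runs probably resembles this sketch. Note that `out_degree_centrality` divides each customer’s referral count by n − 1 (the number of possible referral targets), which is why the scores come out as small fractions; the edges below are made up:

```python
import networkx as nx

# Hypothetical referral edges (referrer -> referred)
G = nx.DiGraph([
    ("Alice", "Bob"), ("Alice", "Carol"),
    ("Bob", "David"), ("Grace", "Quinn"), ("Quinn", "Steve"),
])

# Out-degree centrality: referrals made, normalized by (n - 1) possible targets
centrality = nx.out_degree_centrality(G)

# Top referrers, highest centrality first
top = sorted(centrality.items(), key=lambda kv: kv[1], reverse=True)[:5]
for name, score in top:
    print(f"{name}: {score:.4f}")
```

With seven customers in this toy graph, Alice’s two referrals give her a score of 2/6 ≈ 0.3333, which mirrors how the 0.0952 scores arise in the larger demo network.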

From a business perspective, these influential customers are ideal candidates for rewards, loyalty programs, or ambassador campaigns. Recognizing and incentivizing these individuals can amplify your growth by motivating them to refer even more new customers, boosting organic growth at a relatively low cost.

Detecting customer communities

Next, we want to discover customer communities, or natural groupings within our referral network, using something called a community detection algorithm. This algorithm will analyze how customers are connected through referrals, finding groups who interact more closely with each other than with the broader network. It starts by taking our referral network (where referrals have a direction, from one customer to another) and simplifying it into connections without direction, effectively just looking at who knows who, without worrying about who referred whom first. Then it systematically identifies groups that are tightly interconnected, giving us natural segments within our customer base.

Here’s our prompt:

“Convert our directed referral graph into an undirected graph, and apply a community detection algorithm (such as Girvan-Newman from NetworkX) to identify distinct customer communities. Clearly list and visualize the customers within each detected community.”

Undirected referral network

Excel’s traditional grid can have difficulty clearly showing these community clusters, because the algorithm returns groups as Python lists rather than familiar Excel tables (DataFrames). However, when in doubt, simply ask Copilot for a visual representation.

As shown above, Copilot provided a clear, intuitive graph visually highlighting three distinct communities within the network, each identified with its own color. The blue community, for example, contains customers like Victor, Uma, Tina, Steve, Grace, and Quinn, who strongly interact with each other. Another tightly-knit group (the orange community) includes Bob, Mike, David, Ian, and John, clearly showing internal referral activity. Lastly, customers Alice, Carol, Eva, Henry, Laura, and Kate form a cohesive third (green) community.
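The detection step likely looks something like the following sketch, shown here on a tiny made-up graph with two obvious clusters joined by a single bridge edge:

```python
import networkx as nx
from networkx.algorithms.community import girvan_newman

# Hypothetical directed referral graph: two tight clusters plus one bridge
G = nx.DiGraph([
    ("Alice", "Carol"), ("Carol", "Eva"), ("Eva", "Alice"),
    ("Bob", "Mike"), ("Mike", "Ian"), ("Ian", "Bob"),
    ("Eva", "Bob"),  # the bridge between the clusters
])

# Drop edge direction -- community detection only needs "who knows whom"
U = G.to_undirected()

# Girvan-Newman removes the highest-betweenness edges; take the first split
communities = next(girvan_newman(U))
for i, group in enumerate(communities, start=1):
    print(f"Community {i}: {sorted(group)}")
```

Girvan-Newman works by repeatedly deleting the edge that the most shortest paths cross, so the bridge goes first and the two clusters fall out as separate communities.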

With this visualization, it becomes straightforward to identify how your customers naturally segment themselves into groups, offering immediate insights into targeted marketing opportunities—without needing deep data science expertise.

Summarizing business insights

To wrap up our analysis, let’s ask Copilot to provide a concise summary highlighting key insights and practical recommendations:

“Provide a concise summary of insights gained from this analysis, emphasizing the most influential customers (referrer_customer), clearly defined customer communities, and recommendations for targeted marketing campaigns.”

Copilot returns the following write-up:

Based on these insights, your marketing team should focus efforts on the key influencers identified within each customer community. By engaging and rewarding these influential referrers, your marketing messages will resonate deeply within their groups, naturally spreading through their trusted connections, ultimately increasing your campaign’s impact and effectiveness.

Conclusion

Social influencer analysis provides Excel users with a powerful yet intuitive method to uncover hidden insights within their customer referral data—insights that were traditionally locked behind specialized analytics tools.

However, as powerful as these methods are, they’re not without challenges. A significant limitation in performing this kind of analysis typically involves the initial step: collecting, cleaning, and structuring your referral data. In many organizations, referral data might be scattered across emails, customer databases, or CRM systems, each with inconsistent formats. Analysts commonly tackle these data challenges by standardizing naming conventions, automating data cleaning tasks with tools like Power Query or Python scripts, and establishing clear data-entry protocols to prevent future inconsistencies.

As a next step, consider pairing this influencer analysis with other customer data you already maintain such as demographics, purchasing history, or customer lifetime value. Integrating this additional context can provide even richer insights, allowing for more targeted and effective marketing campaigns. Finally, remember that the value of this type of analysis extends beyond just identifying your top customers. It builds organizational confidence in making data-driven decisions, empowering Excel users across the business to extract deeper value from their existing tools and data.

The post Copilot in Excel: How to do social influencer analysis with Python first appeared on Stringfest Analytics.

Copilot in Excel: How to build ARIMA forecasts with Python https://stringfestanalytics.com/copilot-in-excel-how-to-build-arima-forecasts-with-python/ Tue, 26 Aug 2025 23:46:53 +0000 https://stringfestanalytics.com/?p=14999
ARIMA (short for AutoRegressive Integrated Moving Average) is a classic statistical model for time series forecasting. It works by combining three elements: autoregression (using past values to predict future ones), integration (removing trends to make data stationary), and moving averages (capturing patterns in forecast errors).

For analysts, it’s valuable because it turns historical data into actionable forecasts, helping answer questions like “What’s likely to happen next month?” or “How will sales look next quarter?”

Excel does have built-in forecasting models, such as exponential smoothing, but it does not include ARIMA. Implementing ARIMA in traditional Excel requires complex workarounds or external tools, making it impractical for most users.

With Python and Copilot in Excel, this changes. Instead of wrestling with formulas or switching to another application, you can describe your goal in plain language and let Copilot generate the Python code for you. Python handles the statistical heavy lifting while Copilot explains each step. This makes ARIMA forecasting far more accessible and transparent than it’s ever been for Excel users.

In this post, we’ll walk through an ARIMA forecast on the famous airline passengers dataset, which tracks monthly passenger totals from 1949 to 1960. Follow along with the exercise file below.

 

Explanation of ARIMA

First, we’ll have Copilot clearly explain the ARIMA model components and visualize the data with this prompt:

Explain the components of an ARIMA model and describe how each part captures different aspects of time series data. Then plot the data to visualize trends and seasonality.

Copilot gives us a clear, concise summary:

Checking the data’s suitability for ARIMA

It’s good to have a clear game plan in place if we decide to move forward with ARIMA. But how do we know if ARIMA is actually a good choice for our specific dataset? That’s exactly what we’ll ask Copilot next, with this prompt:

Evaluate the suitability of this dataset for ARIMA forecasting by assessing its characteristics, and discuss whether alternative models should be considered based on the data’s properties.

ARIMA suitability

This explanation is clear and helpful, pointing out important details. Specifically, Copilot notes the strong autocorrelation, meaning past values significantly influence current observations. It also identifies the presence of seasonality and non-stationarity. Non-stationarity means the statistical properties, like mean and variance, change over time… something ARIMA explicitly addresses through differencing.

Copilot also rightly suggests that because of the clear seasonal pattern, we might want to consider a Seasonal ARIMA (SARIMA) or even an Exponential Smoothing model, both of which explicitly handle seasonality. This highlights the importance of carefully matching our forecasting model choice to the unique characteristics of our data.

Checking for stationarity and performing seasonal adjustments

At this point, Copilot has shown us that our dataset has strong seasonal patterns, which could make forecasting challenging if left unaddressed. With this next prompt, our goal is to break down the concept of seasonal differencing into clear, non-technical steps. This helps Excel users grasp exactly how adjusting for seasonality makes the dataset “stationary” for ARIMA to handle effectively.

“Explain what seasonal adjustments mean for this dataset and then apply a seasonal difference to the data step-by-step. Tell me how this helps ARIMA in a way a non-technical Excel user can follow.”

As Copilot explains, seasonal adjustments essentially mean removing regular, predictable patterns, such as spikes in sales every holiday season, from your data. It does this by subtracting the value from the same month of the previous year from each month’s value. By doing this, the repeating seasonal pattern disappears, leaving behind clearer information about trends and unpredictable changes.
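In pandas, that year-over-year subtraction is a one-liner. Here’s a minimal sketch on a synthetic monthly series (the real notebook uses the airline passengers data); the trend contributes a constant 24 per month and the December spike cancels out, which is exactly the point of differencing:

```python
import pandas as pd

# Hypothetical monthly series: a steady trend plus a December spike
months = pd.date_range("1949-01", periods=36, freq="MS")
values = [100 + 2 * i + (10 if m.month == 12 else 0) for i, m in enumerate(months)]
passengers = pd.Series(values, index=months)

# Seasonal difference: subtract the same month from the previous year
seasonal_diff = passengers.diff(12).dropna()
print(seasonal_diff.head())
```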

Creating the ARIMA forecast

Next, we’ll prompt Copilot to build the model. As we do this, it’s important to clearly see how closely our model’s predictions match actual historical data. We’ll use the following prompt:

Fit a simple ARIMA model to the data. Visualize and explain the model summary in simple language. Focus on what Excel users need to know to see if the model is good.

Copilot’s output is exactly what we want: clear, practical, and visual. The dashed orange line (predictions) closely tracks the blue line (actual data), which means our ARIMA model is accurately capturing the pattern in the historical data. This visual check is crucial because it gives Excel users a quick and intuitive way to judge the quality of the ARIMA forecast without needing to dive deeply into complex statistical metrics.

Visualizing the ARIMA forecast

Since this initial check looks promising, we might next explore using Copilot to evaluate forecast accuracy in greater depth.

Recall that earlier, to make our dataset suitable for ARIMA, we performed seasonal differencing. While differencing helped us make the data stable enough to accurately forecast, it also transformed the data away from its original units, making our predictions harder to interpret directly. That’s why it’s essential we now reverse this step, restoring the forecasts back to their original scale, such as actual monthly passenger counts. This way, Excel users can directly compare forecasts to historical data.

We’ll use the following prompt to have Copilot clearly handle this step and complete the forecast:

Use my fitted ARIMA model to forecast the next 12 months of the original (non-differenced) dataset. After forecasting, reverse the seasonal differencing to return the forecasted values to their original scale. Plot these forecasted values alongside the historical data (on its original scale), clearly labeling actual vs. forecasted values. Explain the forecast in a way that Excel users can easily understand.

ARIMA forecast

Copilot’s visual clearly shows the forecast in the familiar context of our original historical passenger counts, making it easy for Excel users to see how the future predictions naturally follow the historical pattern. By reversing the seasonal differencing step, users can confidently interpret forecasts in practical terms, quickly gauging expected trends and seasonal behaviors going forward.
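The reversal itself is simple arithmetic: each restored forecast is the differenced forecast plus the actual value from the same month one year earlier. A toy sketch with made-up numbers:

```python
import pandas as pd

# Hypothetical pieces: the tail of the original series and a 12-step
# forecast produced on the seasonally differenced scale
original = pd.Series(range(100, 124))    # last 24 months of "passengers"
diff_forecast = pd.Series([12.0] * 12)   # forecasted year-over-year changes

# Undo the seasonal difference: forecast[t] = diff_forecast[t] + original[t - 12]
restored = [diff_forecast.iloc[i] + original.iloc[-12 + i] for i in range(12)]
print(restored)
```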

Comparing ARIMA to Holt-Winters visually

We’ve successfully created forecasts with our ARIMA model. But to make sure we’re using the best model possible, let’s compare ARIMA to another popular forecasting method—Exponential Smoothing (specifically the Holt-Winters approach)—which is particularly good at handling clear trends and seasonal patterns like we have in this dataset.

Here’s the prompt we’ll give Copilot to do that:

Now fit an exponential smoothing (Holt-Winters) model with a 12-month seasonal period. Plot its forecast on the same chart as the ARIMA forecast and the actual data so they’re easy to compare. Briefly explain in plain language which model appears to fit better and why that might be the case for this dataset.

Arima vs Holt Winters

If we look closely at the visualization, both forecast lines (ARIMA and Holt-Winters) appear very similar and closely match historical patterns. While Copilot’s explanation gently nudges us toward Holt-Winters—primarily because Holt-Winters explicitly handles trend and seasonality without requiring differencing—it’s tough to confidently pick the “best” model by eye alone.

Comparing ARIMA to Holt-Winters through accuracy metrics

Since these forecasts look pretty close, we can’t make a definitive choice based solely on visual comparisons. To choose confidently, we should dive into summary metrics (such as mean absolute error or mean squared error) next. These numerical measures will clearly quantify which model provides the most accurate predictions.

Evaluate my forecasts by calculating key accuracy metrics like RMSE (Root Mean Squared Error), MAE (Mean Absolute Error), and MAPE (Mean Absolute Percentage Error). Explain each metric simply, focusing on what it means for assessing forecast quality. Then summarize whether my models provide reliable forecasts based on these results.

ARIMA vs Holt Winters model evaluation

Copilot’s summary metrics give us a precise, easy-to-follow comparison. RMSE, MAE, and MAPE all indicate forecast accuracy—the smaller, the better. While visually both models seemed similar, these numbers clarify that ARIMA actually provides slightly more precise forecasts in this case.

This step highlights to Excel users that accuracy metrics can uncover subtle differences between forecasting methods, helping them confidently select the best model for their data.
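The three metrics are straightforward to compute by hand, which helps demystify what Copilot reports. A small sketch with made-up actuals and forecasts:

```python
import numpy as np

# Hypothetical actuals vs. forecasts from one model
actual = np.array([100.0, 110.0, 120.0, 130.0])
forecast = np.array([102.0, 108.0, 123.0, 129.0])

errors = actual - forecast
rmse = np.sqrt(np.mean(errors ** 2))           # penalizes large misses
mae = np.mean(np.abs(errors))                  # average miss in original units
mape = np.mean(np.abs(errors / actual)) * 100  # average miss as a percentage

print(f"RMSE: {rmse:.2f}  MAE: {mae:.2f}  MAPE: {mape:.2f}%")
```

Running the same calculation on each model’s forecasts and comparing the numbers side by side is all the “model selection” step amounts to.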

Conclusion

Working through ARIMA forecasting step by step shows how each component of the model contributes something valuable. Autoregression lets past values guide our predictions, integration helps manage trends and keep forecasts realistic, and moving averages handle unexpected fluctuations smoothly. ARIMA combines these elements effectively, making it versatile for various datasets with clear historical patterns.

Yet as we’ve seen, exploring alternative methods such as Holt-Winters exponential smoothing provides additional context. Holt-Winters explicitly manages trends and seasonality, often making it an excellent choice for data with strong repeating patterns. Evaluating these models side by side underscores the value of testing multiple forecasting techniques. Accuracy metrics like RMSE, MAE, and MAPE further highlight subtle but meaningful differences between models, guiding more confident model selection.

The next logical steps in your forecasting journey include experimenting with more sophisticated variations. Tools like Prophet or even machine learning methods such as gradient boosting or neural networks might also be appropriate when the data becomes particularly complex or when traditional methods fall short. Many of these advanced approaches remain easily accessible directly in Excel through Python and Copilot integration.

By expanding your toolkit to include methods like ARIMA and Holt-Winters, you gain deeper insight into the strengths and limitations of each forecasting technique, enabling you to make better-informed decisions tailored specifically to your data challenges.

The post Copilot in Excel: How to build ARIMA forecasts with Python first appeared on Stringfest Analytics.

Python in Excel: How to do feature engineering with Copilot https://stringfestanalytics.com/python-in-excel-how-to-do-feature-engineering-with-copilot/ Mon, 25 Aug 2025 23:14:37 +0000 https://stringfestanalytics.com/?p=15675
Feature engineering involves transforming raw data into meaningful variables to enhance machine learning models. It includes creating new features, encoding categorical data, and scaling or normalizing values—tasks that significantly boost model accuracy and insights. This practice naturally overlaps with data cleaning, as both involve handling missing values, inconsistencies, and outliers. Feature engineering also aligns with exploratory data analysis (EDA), since insights from EDA often guide effective feature creation.

For Excel users, mastering feature engineering expands the potential for more sophisticated analysis and predictive modeling. While Excel’s Power Query provides powerful data transformation tools, certain advanced tasks, especially those involving complex statistical rules or extensive group-wise transformations, are much easier and more efficient using Python and Copilot, enabling streamlined and scalable operations beyond standard Excel workflows.

In this post, you’ll explore advanced feature engineering techniques with the Palmer Penguins dataset. Download the exercise file below to follow along:

 

Dummy-coding categorical variables

Categorical variables like species or island are common in datasets but can’t be directly interpreted by most predictive models. Typically, these categories need conversion into numeric dummy variables, representing each distinct category as its own binary column. We began our analysis using the following prompt:

“Convert the categorical variables species, island, and sex to numerical dummy variables suitable for modeling.”

Dummy coding Copilot

Copilot quickly generated dummy variables, clearly converting each categorical variable into binary columns for each category (e.g., species_Adelie, island_Biscoe, sex_Male). This numerical format allows our data to be effectively used by predictive models and machine learning algorithms.
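The conversion almost certainly rests on pandas’ `get_dummies`; here’s a minimal sketch on a hand-made slice of the penguins data:

```python
import pandas as pd

# Hypothetical slice of the Palmer Penguins data
df = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Adelie"],
    "island": ["Biscoe", "Biscoe", "Dream"],
    "sex": ["Male", "Female", "Female"],
    "body_mass_g": [3750, 5400, 3800],
})

# One binary column per category level
dummies = pd.get_dummies(df, columns=["species", "island", "sex"])
print(dummies.columns.tolist())
```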

Binning quantitative variables

Sometimes numeric values like body mass provide more insight when grouped into meaningful categories rather than considered individually. To efficiently address this, we gave Copilot the following prompt:

“Bin body_mass_g into three categories (light, medium, heavy) based on quantiles, and add this as a categorical feature.”

Quantile binning Copilot

Copilot quickly responded by categorizing penguin body mass into three intuitive groups—light, medium, and heavy—based on the underlying quantiles of the data. The output clearly indicates the category each penguin falls into, turning numeric complexity into easy-to-understand categorical labels.

By creating quantile-based categories, analysts can quickly identify patterns, make clearer comparisons across groups, and feed simplified, meaningful variables into predictive models—all achieved effortlessly through Python and Copilot integration directly within Excel.
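Under the hood, this is most likely pandas’ `qcut`, which cuts at the data’s own quantiles so each bin holds roughly the same number of rows. A sketch with made-up body-mass values:

```python
import pandas as pd

# Hypothetical body-mass values in grams
body_mass = pd.Series([3200, 3500, 3700, 4100, 4500, 5000, 5400, 5700, 6000])

# Split into three equal-count bins labeled by weight class
mass_category = pd.qcut(body_mass, q=3, labels=["light", "medium", "heavy"])
print(mass_category.tolist())
```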

Creating group-wise statistical measures

Numeric summaries grouped by categories, such as median values by species, are frequently essential features for deeper data insights. To efficiently create these features, we provided Copilot with this prompt:

Calculate the median body mass by species and add this as a new numeric feature to each row of the dataset.

Median body mass by species

Copilot quickly calculated the median body mass for each penguin species and added these values directly into each row as a new numeric feature. This new column allows us to easily identify how an individual penguin compares against the typical body mass of its species, highlighting significant deviations or confirming typical measurements.
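In pandas this is a groupby-transform, which computes each group’s statistic and broadcasts it back onto every row. A minimal sketch with invented penguins:

```python
import pandas as pd

# Hypothetical penguins from two species
df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo"],
    "body_mass_g": [3600, 3700, 3900, 5000, 5400],
})

# Broadcast each species' median back onto every row
df["species_median_mass"] = df.groupby("species")["body_mass_g"].transform("median")
print(df)
```

The key is `transform` rather than `agg`: it returns one value per original row instead of one value per group, so the new column lines up with the dataset.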

Standardizing variables with Z-score scaling

Numeric features in datasets often vary greatly in scale, making them challenging to compare directly. For instance, a penguin’s body mass (measured in grams) naturally has a larger numeric scale than its bill length (measured in millimeters). This discrepancy can distort analyses, especially when performing predictive modeling or clustering.

To address this issue clearly and efficiently, we gave Copilot the following prompt:

Apply standardization (Z-score scaling) to the numeric columns bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g using Python.”

Copilot standardized z-scores

Copilot quickly applied Z-score scaling to our selected numeric features. This transformation converts each numeric value into a standardized score representing how many standard deviations it is from the feature’s average. A standardized value near 0 indicates the measurement is close to the average, while positive or negative values reflect deviations above or below the average, respectively.

Standardization makes these numeric features directly comparable, allowing each to contribute equally and meaningfully to subsequent analyses.
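The arithmetic behind Z-score scaling is just (value − column mean) / column standard deviation, which can be written in one line of pandas. A sketch with made-up measurements on deliberately different scales:

```python
import pandas as pd

# Hypothetical numeric measurements on very different scales
df = pd.DataFrame({
    "bill_length_mm": [39.1, 40.3, 41.1, 44.5],
    "body_mass_g": [3750, 3800, 4100, 4450],
})

# Z-score: (value - column mean) / column standard deviation
scaled = (df - df.mean()) / df.std()
print(scaled.round(2))
```

After scaling, every column has mean 0 and standard deviation 1, so grams and millimeters contribute on equal footing.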

Advanced missing value imputation

Dealing with missing values is a common challenge in data analysis. Excel and Power Query offer basic options for handling missing data, but these tools fall short when it comes to advanced predictive imputation, especially when we want to use similar, complete observations to estimate missing values.

To effectively overcome this limitation, we provided Copilot with the following prompt:

“Perform advanced imputation by predicting missing values in each row based on the most similar complete observations in the dataset.”

KNN imputation

Copilot quickly applied the K-Nearest Neighbors (KNN) imputation method, filling in missing numeric values by finding the most similar penguins in the dataset and using their known measurements to predict missing data. The resulting dataset now has complete observations for each numeric feature, providing a robust foundation for further modeling and analysis.
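The imputation Copilot applies likely uses scikit-learn’s `KNNImputer`. A toy sketch with two numeric columns and one missing body mass; the missing value gets the average of its two most similar complete neighbors:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical numeric columns with one missing body mass
df = pd.DataFrame({
    "flipper_length_mm": [181, 186, 190, 210, 215],
    "body_mass_g": [3750, 3800, np.nan, 5000, 5100],
})

# Fill the gap using the 2 most similar complete rows
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(filled)
```

Here the penguin with a 190 mm flipper is closest to the 186 mm and 181 mm birds, so its missing mass is imputed as their average.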

Check out this post for a little more on k-nearest neighbors with Copilot, Python and Excel:

Conclusion

Feature engineering is foundational for achieving robust, predictive analytics. By embracing Python and Copilot alongside Excel, you’re positioned to execute sophisticated transformations that previously required significant manual effort or were simply unattainable with basic Excel or Power Query techniques alone.

Yet, while Python-powered feature engineering significantly expands your analytical capabilities, it’s essential to remain mindful of potential limitations. Advanced methods require clear understanding and interpretation to avoid unintended biases or misrepresentations in your data. Additionally, ensuring seamless integration between Python-driven transformations and traditional Excel workflows will require thoughtful structuring and documentation.

It is crucial to deeply engage with your data through exploratory and visual methods to verify that the engineered features truly address your analytical needs. Feature engineering inherently involves substantial trial and error. Although Copilot significantly accelerates experimentation and automates repetitive tasks, its effectiveness hinges upon clear and accurate guidance. Without properly defined parameters, Copilot might inadvertently speed up the creation of irrelevant or misleading features.

Moving forward, consider experimenting with these techniques on your own datasets to identify unique opportunities for improvement. Challenge yourself to combine the intuitive visualizations of Excel with the scalable computational power of Python, leveraging Copilot’s AI assistance to bridge gaps and streamline your processes. As your comfort grows, you’ll uncover new strategies to elevate your analyses, enabling more impactful and data-driven decision-making.

The post Python in Excel: How to do feature engineering with Copilot first appeared on Stringfest Analytics.

Python in Excel: How to build decision trees with Copilot https://stringfestanalytics.com/python-in-excel-how-to-build-decision-trees-with-copilot/ Thu, 21 Aug 2025 18:45:39 +0000 https://stringfestanalytics.com/?p=15611
Decision trees are intuitive analytical tools that help us explore and predict outcomes by asking simple, branching questions about our data. They’re popular because they’re visual, easy to interpret, and closely mimic human decision-making processes. However, Excel users historically faced an uphill battle using decision trees. There’s no built-in Excel feature for building them, forcing analysts either to leave Excel entirely or to spend hours painstakingly crafting complex logical formulas. Thankfully, that’s no longer the case.

With the integration of Python and Copilot’s Advanced Analysis directly inside Excel, decision trees have gone from intimidating to effortless. Excel users can now leverage Python’s powerful machine learning capabilities without leaving the comfort of their spreadsheet. Even better, Copilot guides you through each step, helping you understand and interpret results clearly… no coding or advanced statistics degree required.

In this post, we’ll demonstrate how you can easily create and interpret decision trees using Python and Copilot in Excel, helping HR professionals understand factors driving employee attrition. To follow along, download the IBM HR Employee Attrition dataset below. We’ll use it to build a simple decision tree model, evaluate its accuracy, visualize key insights, and turn these findings into actionable business recommendations.

 

Setting the business context

We start our analysis with this straightforward prompt:

“Briefly summarize what this dataset is about and explain why using a decision tree might help an HR analyst better understand attrition.”

Here, we’re taking a moment to step back and ensure we understand the underlying business problem: Why do employees leave, and what can we do about it?

Copilot’s output deliberately moves beyond simply describing what is in the dataset. It clarifies why each factor, such as age, role, income, or satisfaction level, is relevant in an HR context. By highlighting that a decision tree visually separates employees based on traits associated with higher or lower attrition risk, Copilot helps us understand the analytical process itself.

Building the decision tree

For our next prompt, we’ll ask Copilot to build our decision tree model, automatically picking out the most important factors for predicting employee attrition:

“Build a basic decision tree model to predict employee attrition using the most relevant features from the dataset. Explain briefly why the model chose these features as important.”

Decision tree Copilot

In the resulting output, we can see the results clearly listed out, with features like MonthlyIncome, OverTime, and TotalWorkingYears emerging as the top predictors.

But Copilot doesn’t leave us hanging there: it also gives us context. For instance, it points out that employees with lower monthly incomes or who frequently work overtime tend to leave the company more often. This makes sense from a practical HR standpoint, right? Knowing these drivers lets us immediately think about actionable solutions.

With Copilot’s help, we’re quickly moving from raw data toward insights that can genuinely impact employee retention and organizational strategy.
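The model-building step Copilot runs is likely a scikit-learn `DecisionTreeClassifier` plus a look at its feature importances. A sketch on a tiny made-up HR table (the column names follow the IBM dataset; the rows are invented):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical encoded HR data: 1 = works overtime / has left
df = pd.DataFrame({
    "MonthlyIncome": [2500, 2700, 3000, 8000, 9000, 9500, 2600, 8500],
    "OverTime": [1, 1, 1, 0, 0, 0, 1, 0],
    "Attrition": [1, 1, 1, 0, 0, 0, 1, 0],
})

X = df[["MonthlyIncome", "OverTime"]]
y = df["Attrition"]

# A shallow tree keeps the splits easy to read
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Which features drove the splits?
for name, importance in zip(X.columns, tree.feature_importances_):
    print(f"{name}: {importance:.2f}")
```

`feature_importances_` is how the "model chose these features" claim gets quantified: it sums to 1 across features, with larger values for features whose splits reduced impurity the most.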

Testing the decision tree on new data

Here’s our next prompt. We’re now checking how well our decision tree works with new data, something it hasn’t “seen” before:

“Split the data into training and testing sets (80/20). Train the decision tree model on the training set and tell me how accurately it predicts attrition on new data. Briefly explain why this testing step matters for business decisions.”

Why are we doing this? Because a model might be great at making sense of the past (training data) but struggle to handle future or unseen cases. Splitting the dataset lets us simulate real-world scenarios—training the model on one portion of data, then testing it on a smaller, separate portion to gauge how it might perform “in the wild.”

Decision tree accuracy

Copilot’s output here is clear and straightforward. It tells us the decision tree predicted employee attrition accurately about 86% of the time on new, unseen data. Then, importantly, it explains why this matters: testing gives us confidence that our insights aren’t just theoretical.
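The split-and-score workflow probably reduces to a few lines of scikit-learn; here’s a sketch on synthetic data where attrition depends on one feature, so a shallow tree scores well on the held-out rows:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Hypothetical feature matrix: the label follows the second column
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 1] > 0).astype(int)

# Hold out 20% of rows the model never sees during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
accuracy = accuracy_score(y_test, tree.predict(X_test))
print(f"Accuracy on unseen data: {accuracy:.0%}")
```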

Building and interpreting a visualization

For our next prompt, we’ll ask Copilot to create a clear visual representation of our trained decision tree right within Excel, making it easier to understand how different factors influence employee attrition:

“Create a visualization of the trained decision tree right here in Excel. Walk me through what the first few splits mean for HR—what factors most influence employees leaving?”

Decision tree output

Visualizing the decision tree translates abstract numeric data into an intuitive flowchart, making it easy for HR analysts or stakeholders to understand how the model predicts employee attrition. The current visualization shows the first three layers of the decision tree, using color shading to represent the predicted outcomes: orange indicates employees likely to stay, blue indicates those likely to leave, and lighter shades reflect less certainty.

At the highest level, whether an employee works overtime emerges as the most important factor. Employees not working overtime tend to remain at the company, especially those with more years of experience, higher hourly rates, stock options, and greater job satisfaction. Conversely, those who do work overtime show higher attrition rates, particularly if their monthly income is low, they’re unmarried, hold lower-level job roles, or haven’t been promoted recently.

While this visualization stops after three layers for readability, in practice, the decision tree would branch further into additional, more detailed layers. To keep interpretations clear and manageable, analysts typically “prune” or simplify deeper branches. Even without further pruning here, it’s clear from this initial analysis that factors such as overtime, compensation, promotion frequency, and job satisfaction significantly influence employee retention.

As a quick practical note, plots produced by Copilot may initially appear small, making labels hard to read. You can manually adjust the figure size (figsize) in your Python code, or explicitly request Copilot to resize it for clarity. Additionally, while Copilot is helpful for generating visuals, it can’t currently interpret detailed text within images or plots. For assistance interpreting visualizations, I’d suggest taking a screenshot and pasting it into Microsoft Copilot.
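The `figsize` adjustment mentioned above fits naturally around scikit-learn’s `plot_tree`. A sketch using generated stand-in data (the real model would use the HR features):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Hypothetical stand-in data with four generic features
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# A larger figsize keeps the node labels readable
fig, ax = plt.subplots(figsize=(14, 8))
plot_tree(tree, ax=ax, filled=True, feature_names=["f1", "f2", "f3", "f4"])
plt.show()
```

The `filled=True` option produces the color shading described above, with darker nodes indicating more certain predictions.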

Developing business recommendations

After carefully exploring and interpreting the factors the decision tree identifies as important predictors of attrition, the natural next step is to translate these insights into practical actions:

“Based on this decision tree analysis, provide two or three simple, actionable recommendations HR can implement to reduce employee attrition.”

The resulting output directly responds to this goal, providing clear recommendations HR can feasibly implement. This ensures the analysis moves beyond descriptive insights into concrete, effective strategies for improving employee retention.

Conclusion

Decision trees are powerful yet accessible tools, helping Excel users quickly uncover meaningful insights from complex data… no advanced coding skills required. As we’ve seen, integrating Python and Copilot directly into Excel significantly reduces the friction traditionally associated with advanced analytics. Excel users across any business function can now rapidly identify key factors driving outcomes like employee attrition, customer churn, sales performance, and more, turning raw data into actionable strategies.

However, while decision trees provide intuitive visuals and practical recommendations, they come with certain limitations. They simplify complex relationships and might overlook subtle factors or interactions hidden in your data. Their effectiveness also heavily depends on data quality. Errors, gaps, or biases can affect the accuracy of insights.

Looking ahead, Excel users can take this analysis further by comparing decision trees against other analytical methods, such as logistic regression or random forests, to validate results and strengthen confidence in findings. Regularly refreshing your analysis with new data ensures your insights remain relevant, especially as business conditions change.
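One way to run that kind of comparison is cross-validated accuracy across the three model families. The sketch below uses synthetic data as a stand-in for the attrition table, so the scores themselves are illustrative, not results from the post's dataset.

```python
# Compare a decision tree against logistic regression and a random forest
# using 5-fold cross-validation. Synthetic data stands in for real HR data.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=6, random_state=0)

models = {
    "decision tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
}

# Mean cross-validated accuracy per model
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in models.items()}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

If the simpler tree scores close to the ensemble, that is a good sign its splits are capturing the main signal rather than noise.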

Ultimately, while Copilot and Python unlock new possibilities in Excel, the importance of your own domain expertise remains central. Your ability to interpret these outputs thoughtfully, using practical business judgment, ensures recommendations remain realistic, impactful, and tailored to your organization’s goals.

The post Python in Excel: How to build decision trees with Copilot first appeared on Stringfest Analytics.
