machine learning - Stringfest Analytics https://stringfestanalytics.com Analytics & AI for Modern Excel Tue, 17 Feb 2026 20:00:30 +0000

Python in Excel: How to conduct linear regression with Copilot https://stringfestanalytics.com/python-in-excel-how-to-conduct-linear-regression-with-copilot/ Wed, 12 Nov 2025 16:56:15 +0000

The post Python in Excel: How to conduct linear regression with Copilot first appeared on Stringfest Analytics.

Data and tech trends come and go, but linear regression has remained one of the most reliable tools in a data analyst’s toolbox. It helps you identify relationships, test hypotheses, and make predictions with a clear view of how different factors influence outcomes. Whether you’re working in finance, marketing, manufacturing, or real estate, few methods match linear regression for both clarity and ease of use.

For all the benefits, however, Excel’s built-in regression tools were never very user friendly. You had to load the Analysis ToolPak (if you could find it), step through a dated wizard, and then make sense of an output sheet that offered little guidance. Changing your model or presenting the results to others was awkward.

With Copilot, things are much smoother. You can build more advanced models with Python in Excel, understand how they work, and interpret the results directly within your workbook. It’s easier to see what your data is telling you and focus on meaningful conclusions rather than the mechanics.

We’ll explore this using a fuel economy dataset. Download the exercise file below to follow along.

 

If you haven’t used the Advanced Analysis with Python feature yet, take a look at this post:

To get started, we’ll run a very simple linear regression: just one dependent and one independent variable. It’s a good habit to make the scope of your model explicit, even when you’re testing something small. In this case, it makes sense to treat mpg as the variable we’re trying to explain and weight as the factor we think influences it.

Here’s the Copilot prompt I used:

“Run a linear regression in Python where mpg is the dependent variable and weight is the independent variable. Include a summary of the results.”

Copilot automatically fitted the model using the dataset and produced the following regression summary:

Linear regression output

The interpretation is straightforward: as a car’s weight increases, its fuel efficiency tends to decline. The negative coefficient for weight means heavier vehicles use more fuel. The very small p-value confirms the relationship is statistically significant.

This is the classic starting point for regression analysis: one variable at a time, clear direction, and easily interpretable results. From here, we can begin layering in more predictors to see how horsepower, displacement, or cylinder count refine the story.

Adding more predictors and checking model diagnostics

Now that we’ve built our first model, it’s natural to wonder what other factors might influence fuel economy. Weight appears significant, but horsepower and acceleration could also play a part.

As we start refining our models, we need a way to tell if each new version is actually improving. Two standard metrics help with this: R-squared, which shows how much of the variation in mpg is explained by the predictors, and RMSE, which measures the average prediction error in miles per gallon.

Here’s the Copilot prompt:

Fit a multiple linear regression model in Python predicting mpg using weight, horsepower, and acceleration.
Calculate and return the model’s R-squared and RMSE as a small summary table.

Multiple regression model outputs

The R-squared value of about 0.71 means roughly 71 percent of the variation in fuel efficiency is explained by these three variables. The RMSE of 4.22 means the model’s predictions are off by about four miles per gallon on average. It’s a noticeable improvement over our single-variable model and a good sign that we’re moving in the right direction.

Visualizing predicted versus actual values

Once you’ve built a model and reviewed the metrics, it’s important to see how well the predictions line up with reality. A quick visual check often reveals patterns or problems that numbers alone can miss.

“Plot the predicted vs actual mpg values from the model to check how well the regression fits. Include a line showing perfect predictions for reference.”

Predicted versus actual scatterplot

Copilot produced a scatter plot comparing the model’s predicted mpg values with the actual ones. Each point represents a car in the dataset. The red dashed line shows what perfect predictions would look like, where predicted and actual values are exactly equal.

This visualization gives a quick gut check on model performance. The tighter the points hug that line, the stronger the predictive power. And while the model isn’t perfect, it’s doing a solid job of explaining how weight, horsepower, and acceleration interact to influence fuel efficiency.
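A predicted-versus-actual chart like this takes only a few lines of matplotlib. The sketch below uses synthetic values standing in for the model's `actual` and `predicted` arrays; in Python in Excel the plot renders inline rather than being saved to a file.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; Python in Excel displays plots inline
import matplotlib.pyplot as plt
import numpy as np

# Stand-ins for the model's actual and predicted mpg values.
rng = np.random.default_rng(1)
actual = rng.uniform(10, 45, 150)
predicted = actual + rng.normal(0, 3, 150)

fig, ax = plt.subplots()
ax.scatter(actual, predicted, alpha=0.6)

# Red dashed 45-degree line: points on it are perfect predictions.
lims = [min(actual.min(), predicted.min()), max(actual.max(), predicted.max())]
ax.plot(lims, lims, "r--", label="Perfect predictions")
ax.set_xlabel("Actual mpg")
ax.set_ylabel("Predicted mpg")
ax.legend()
fig.savefig("predicted_vs_actual.png")
```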

Interpreting model coefficients

You might be wondering how each variable contributes. That’s where interpretation comes in, and Copilot can help you reason through it, not just calculate.

Here’s the prompt:

“Interpret the coefficients of the model using statsmodels. Which features have the biggest impact on mpg and in what direction? Explain in plain language.”

Show regression coefficients

Copilot returned a summary showing that both weight and horsepower have negative coefficients. This means that as either of these increases, fuel efficiency tends to decrease. Weight has the strongest influence. Each additional unit of weight leads to the largest drop in miles per gallon. Horsepower also lowers mpg, though not quite as sharply.

Acceleration, on the other hand, shows a very small and statistically insignificant coefficient, suggesting it doesn’t meaningfully affect fuel economy in this dataset. In other words, how quickly a car accelerates doesn’t matter much for mpg once weight and horsepower are already accounted for.

Together, these results tell a clear story: heavier and more powerful cars use more fuel, while quick acceleration on its own doesn’t add much explanatory value.

Checking model assumptions

Once you’ve built and interpreted your model, it’s a good idea to run a few quick diagnostics to make sure the basic assumptions of linear regression hold. One of the most important checks is to look at the residuals, or the differences between the predicted and actual values.

Here’s the Copilot prompt:

Plot the residuals of the model. Are they randomly distributed? Is there evidence of non-linearity or heteroskedasticity?

Residuals plot Copilot regression

Copilot produced a residuals vs. predicted values plot. Ideally, the points should be scattered randomly around the zero line. That pattern suggests the model is capturing the data well and that errors are evenly spread across all prediction levels.

In this case, the residuals look mostly random, but there’s a slight funnel shape as mpg increases. That widening spread hints that the model may fit smaller cars a bit more consistently than larger ones, a mild sign of heteroskedasticity. It’s not severe, but it’s worth noting.

Residual plots are one of several ways to check whether your model is behaving properly. You can also look at whether the relationships between predictors and mpg appear roughly linear, whether residuals seem normally distributed, or whether there’s evidence that one error predicts the next. These checks help confirm that the model’s estimates are trustworthy.

Copilot can guide you through these steps, not just by generating plots or statistics, but by explaining what they mean and why they matter. In that sense, it acts less like a calculator and more like a coach, helping you understand the reasoning behind good modeling practice.

Making predictions

Finally, let’s put the model to work in a real-world example. In business settings, the real value of regression often isn’t just understanding relationships. It’s using those relationships to make predictions. Decision-makers care less about the exact slope of a line and more about what it means for future outcomes: how a change in product weight, horsepower, or price might affect performance or profit. A well-built model lets you turn analysis into foresight.

Here’s the Copilot prompt:

“Given a car with 3000 lbs weight, 130 horsepower, and 15 seconds of acceleration, use the regression model to predict mpg.”

MPG prediction Copilot

Copilot returned a predicted fuel efficiency of about 21.6 miles per gallon.

That means for a car with those specifications, the model expects it to travel roughly 21 and a half miles on a gallon of fuel. This is where regression analysis becomes more than just theory. You can use it to estimate outcomes for new observations, guide design tradeoffs, or compare how different features affect performance.

Conclusion

Linear regression remains one of the most practical and interpretable tools in data analysis, and Copilot makes it easier than ever to use inside Excel. Even a simple model can uncover useful insights when built thoughtfully and checked carefully. Metrics like R-squared and RMSE help quantify performance, but visuals and diagnostics often reveal the places where your model fits well and where it struggles.

And in the business world, the real power of regression lies in prediction. The ability to estimate how changes in one factor might influence another turns analysis into something decision-ready.

That said, linear regression isn’t magic. It assumes straight-line relationships and evenly distributed errors, which don’t always hold up with messy real-world data. Outliers, overlapping variables, or curved relationships can throw things off, and that’s where judgment comes in. Copilot can automate the steps, but it still takes a human eye to decide what makes sense.

From here, you might explore adding interaction terms, adjusting variables to handle nonlinearity, or comparing results to more flexible models like decision trees or random forests. You could even use Copilot to test cross-validation or experiment with feature selection to see how stable your model really is.


Copilot in Excel: How to build ARIMA forecasts with Python https://stringfestanalytics.com/copilot-in-excel-how-to-build-arima-forecasts-with-python/ Tue, 26 Aug 2025 23:46:53 +0000

The post Copilot in Excel: How to build ARIMA forecasts with Python first appeared on Stringfest Analytics.

ARIMA (short for AutoRegressive Integrated Moving Average) is a classic statistical model for time series forecasting. It works by combining three elements: autoregression (using past values to predict future ones), integration (removing trends to make data stationary), and moving averages (capturing patterns in forecast errors).

For analysts, it’s valuable because it turns historical data into actionable forecasts, helping answer questions like “What’s likely to happen next month?” or “How will sales look next quarter?”

Excel does have built-in forecasting models, such as exponential smoothing, but it does not include ARIMA. Implementing ARIMA in traditional Excel requires complex workarounds or external tools, making it impractical for most users.

With Python and Copilot in Excel, this changes. Instead of wrestling with formulas or switching to another application, you can describe your goal in plain language and let Copilot generate the Python code for you. Python handles the statistical heavy lifting while Copilot explains each step. This makes ARIMA forecasting far more accessible and transparent than it’s ever been for Excel users.

In this post, we’ll walk through an ARIMA forecast on the famous airline passengers dataset, which tracks monthly passenger totals from 1949 to 1960. Follow along with the exercise file below.

 

Explanation of ARIMA

First, we’ll have Copilot clearly explain the ARIMA model components and visualize the data with this prompt:

Explain the components of an ARIMA model and describe how each part captures different aspects of time series data. Then plot the data to visualize trends and seasonality.

Copilot gives us a clear, concise summary:

Checking the data’s suitability for ARIMA

It’s good to have a clear game plan in place if we decide to move forward with ARIMA. But how do we know if ARIMA is actually a good choice for our specific dataset? That’s exactly what we’ll ask Copilot next, with this prompt:

Evaluate the suitability of this dataset for ARIMA forecasting by assessing its characteristics, and discuss whether alternative models should be considered based on the data’s properties.

ARIMA suitability

This explanation is clear and helpful, pointing out important details. Specifically, Copilot notes the strong autocorrelation, meaning past values significantly influence current observations. It also identifies the presence of seasonality and non-stationarity. Non-stationarity means the statistical properties, like mean and variance, change over time, which is exactly what ARIMA addresses through differencing.

Copilot also rightly suggests that because of the clear seasonal pattern, we might want to consider a Seasonal ARIMA (SARIMA) or even an Exponential Smoothing model, both of which explicitly handle seasonality. This highlights the importance of carefully matching our forecasting model choice to the unique characteristics of our data.

Checking for stationarity and performing seasonal adjustments

At this point, Copilot has shown us that our dataset has strong seasonal patterns, which could make forecasting challenging if left unaddressed. With this next prompt, our goal is to break down the concept of seasonal differencing into clear, non-technical steps. This helps Excel users grasp exactly how adjusting for seasonality makes the dataset “stationary” for ARIMA to handle effectively.

“Explain what seasonal adjustments mean for this dataset and then apply a seasonal difference to the data step-by-step. Tell me how this helps ARIMA in a way a non-technical Excel user can follow.”

As Copilot explains, seasonal adjustments essentially mean removing regular, predictable patterns, such as spikes in sales every holiday season, from your data. It does this by subtracting the value from the same month the previous year from each month’s value. With the repeating seasonal pattern removed, what remains is clearer information about trends and unpredictable changes.
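In pandas, that subtraction is a single `diff(12)` on a monthly series. A sketch on a synthetic series shaped like the airline data (trend plus a 12-month cycle): after seasonal differencing, the repeating cycle cancels out and only the year-over-year change remains.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series with trend + yearly seasonality, like the airline data.
idx = pd.date_range("1949-01", periods=144, freq="MS")
t = np.arange(144)
passengers = 100 + 2 * t + 25 * np.sin(2 * np.pi * t / 12)
series = pd.Series(passengers, index=idx)

# Seasonal difference: subtract the same month one year earlier.
seasonal_diff = series.diff(12).dropna()
print(seasonal_diff.head())
```

Because the seasonal cycle repeats exactly every 12 months here, every differenced value collapses to the pure year-over-year trend change.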

Creating the ARIMA forecast

Next, we’ll prompt Copilot to build the model. As we do this, it’s important to clearly see how closely our model’s predictions match actual historical data. We’ll use the following prompt:

Fit a simple ARIMA model to the data. Visualize and explain the model summary in simple language. Focus on what Excel users need to know to see if the model is good.

Copilot’s output is exactly what we want: clear, practical, and visual. The dashed orange line (predictions) closely tracks the blue line (actual data), which means our ARIMA model is accurately capturing the pattern in the historical data. This visual check is crucial because it gives Excel users a quick and intuitive way to judge the quality of the ARIMA forecast without needing to dive deeply into complex statistical metrics.

Visualizing the ARIMA forecast

Since this initial check looks promising, we might next explore using Copilot to evaluate forecast accuracy in greater depth.

Recall earlier, to make our dataset suitable for ARIMA, we performed seasonal differencing. While differencing helped us make the data stable enough to accurately forecast, it also transformed the data away from its original units, making our predictions harder to interpret directly. That’s why it’s essential we now reverse this step, restoring the forecasts back to their original scale, such as actual monthly passenger counts. This way, Excel users can directly compare forecasts to historical data.

We’ll use the following prompt to have Copilot clearly handle this step and complete the forecast:

Use my fitted ARIMA model to forecast the next 12 months of the original (non-differenced) dataset. After forecasting, reverse the seasonal differencing to return the forecasted values to their original scale. Plot these forecasted values alongside the historical data (on its original scale), clearly labeling actual vs. forecasted values. Explain the forecast in a way that Excel users can easily understand.

ARIMA forecast

Copilot’s visual clearly shows the forecast in the familiar context of our original historical passenger counts, making it easy for Excel users to see how the future predictions naturally follow the historical pattern. By reversing the seasonal differencing step, users can confidently interpret forecasts in practical terms, quickly gauging expected trends and seasonal behaviors going forward.

Comparing ARIMA to Holt-Winters visually

We’ve successfully created forecasts with our ARIMA model. But to make sure we’re using the best model possible, let’s compare ARIMA to another popular forecasting method—Exponential Smoothing (specifically the Holt-Winters approach)—which is particularly good at handling clear trends and seasonal patterns like we have in this dataset.

Here’s the prompt we’ll give Copilot to do that:

Now fit an exponential smoothing (Holt-Winters) model with a 12-month seasonal period. Plot its forecast on the same chart as the ARIMA forecast and the actual data so they’re easy to compare. Briefly explain in plain language which model appears to fit better and why that might be the case for this dataset.

Arima vs Holt Winters

If we look closely at the visualization, both forecast lines (ARIMA and Holt-Winters) appear very similar and closely match historical patterns. While Copilot’s explanation gently nudges us toward Holt-Winters—primarily because Holt-Winters explicitly handles trend and seasonality without requiring differencing—it’s tough to confidently pick the “best” model by eye alone.

Comparing ARIMA to Holt-Winters through accuracy metrics

Since these forecasts look pretty close, we can’t make a definitive choice based solely on visual comparisons. To choose confidently, we should dive into summary metrics (such as mean absolute error or mean squared error) next. These numerical measures will clearly quantify which model provides the most accurate predictions.

Evaluate my forecasts by calculating key accuracy metrics like RMSE (Root Mean Squared Error), MAE (Mean Absolute Error), and MAPE (Mean Absolute Percentage Error). Explain each metric simply, focusing on what it means for assessing forecast quality. Then summarize whether my models provide reliable forecasts based on these results.

ARIMA vs Holt Winters model evaluation

Copilot’s summary metrics give us a precise, easy-to-follow comparison. RMSE, MAE, and MAPE all indicate forecast accuracy—the smaller, the better. While visually both models seemed similar, these numbers clarify that ARIMA actually provides slightly more precise forecasts in this case.

This step highlights to Excel users that accuracy metrics can uncover subtle differences between forecasting methods, helping them confidently select the best model for their data.
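The three metrics themselves are short formulas. A small helper like the sketch below (a hypothetical function name, shown on tiny made-up numbers) makes the definitions concrete: RMSE penalizes large misses more heavily, MAE is the plain average miss, and MAPE expresses the miss as a percentage of the actual value.

```python
import numpy as np

def forecast_metrics(actual, predicted):
    """Return RMSE, MAE, and MAPE for paired actual/predicted values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    errors = actual - predicted
    rmse = np.sqrt(np.mean(errors ** 2))       # root mean squared error
    mae = np.mean(np.abs(errors))              # mean absolute error
    mape = np.mean(np.abs(errors / actual)) * 100  # mean abs. % error
    return {"RMSE": rmse, "MAE": mae, "MAPE": mape}

# Toy example: every forecast misses by exactly 2 passengers.
actual = [112, 118, 132, 129, 121]
predicted = [110, 120, 130, 131, 119]
metrics = forecast_metrics(actual, predicted)
print(metrics)
```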

Conclusion

Working through ARIMA forecasting step by step shows how each component of the model contributes something valuable. Autoregression lets past values guide our predictions, integration helps manage trends and keep forecasts realistic, and moving averages handle unexpected fluctuations smoothly. ARIMA combines these elements effectively, making it versatile for various datasets with clear historical patterns.

Yet as we’ve seen, exploring alternative methods such as Holt-Winters exponential smoothing provides additional context. Holt-Winters explicitly manages trends and seasonality, often making it an excellent choice for data with strong repeating patterns. Evaluating these models side by side underscores the value of testing multiple forecasting techniques. Accuracy metrics like RMSE, MAE, and MAPE further highlight subtle but meaningful differences between models, guiding more confident model selection.

The next logical steps in your forecasting journey include experimenting with more sophisticated variations. Tools like Prophet or even machine learning methods such as gradient boosting or neural networks might also be appropriate when the data becomes particularly complex or when traditional methods fall short. Many of these advanced approaches remain easily accessible directly in Excel through Python and Copilot integration.

By expanding your toolkit to include methods like ARIMA and Holt-Winters, you gain deeper insight into the strengths and limitations of each forecasting technique, enabling you to make better-informed decisions tailored specifically to your data challenges.


Python in Excel: How to do feature engineering with Copilot https://stringfestanalytics.com/python-in-excel-how-to-do-feature-engineering-with-copilot/ Mon, 25 Aug 2025 23:14:37 +0000

The post Python in Excel: How to do feature engineering with Copilot first appeared on Stringfest Analytics.

Feature engineering involves transforming raw data into meaningful variables to enhance machine learning models. It includes creating new features, encoding categorical data, and scaling or normalizing values—tasks that significantly boost model accuracy and insights. This practice naturally overlaps with data cleaning, as both involve handling missing values, inconsistencies, and outliers. Feature engineering also aligns with exploratory data analysis (EDA), since insights from EDA often guide effective feature creation.

For Excel users, mastering feature engineering expands the potential for more sophisticated analysis and predictive modeling. While Excel’s Power Query provides powerful data transformation tools, certain advanced tasks, especially those involving complex statistical rules or extensive group-wise transformations, are much easier and more efficient using Python and Copilot, enabling streamlined and scalable operations beyond standard Excel workflows.

In this post, you’ll explore advanced feature engineering techniques with the Palmer Penguins dataset. Download the exercise file below to follow along:

 

Dummy-coding categorical variables

Categorical variables like species or island are common in datasets but can’t be directly interpreted by most predictive models. Typically, these categories need conversion into numeric dummy variables, representing each distinct category as its own binary column. We began our analysis using the following prompt:

“Convert the categorical variables species, island, and sex to numerical dummy variables suitable for modeling.”

Dummy coding output

Copilot quickly generated dummy variables, clearly converting each categorical variable into binary columns for each category (e.g., species_Adelie, island_Biscoe, sex_Male). This numerical format allows our data to be effectively used by predictive models and machine learning algorithms.
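The standard pandas tool for this is `pd.get_dummies`. A sketch on a toy slice of the penguins columns, producing exactly the kind of binary columns named above:

```python
import pandas as pd

# Toy slice of the Palmer Penguins categorical columns.
df = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap"],
    "island": ["Biscoe", "Biscoe", "Dream"],
    "sex": ["Male", "Female", "Male"],
})

# One binary column per category (species_Adelie, island_Biscoe, sex_Male, ...).
# For models with an intercept, drop_first=True avoids redundant columns.
dummies = pd.get_dummies(df, columns=["species", "island", "sex"])
print(dummies.columns.tolist())
```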

Binning quantitative variables

Sometimes numeric values like body mass provide more insight when grouped into meaningful categories rather than considered individually. To efficiently address this, we gave Copilot the following prompt:

“Bin body_mass_g into three categories (light, medium, heavy) based on quantiles, and add this as a categorical feature.”

Quantile binning Copilot

Copilot quickly responded by categorizing penguin body mass into three intuitive groups—light, medium, and heavy—based on the underlying quantiles of the data. The output clearly indicates the category each penguin falls into, turning numeric complexity into easy-to-understand categorical labels.

By creating quantile-based categories, analysts can quickly identify patterns, make clearer comparisons across groups, and feed simplified, meaningful variables into predictive models—all achieved effortlessly through Python and Copilot integration directly within Excel.
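Quantile-based binning maps onto `pd.qcut`, which cuts a numeric column into equal-frequency groups. A sketch on made-up body-mass values:

```python
import pandas as pd

# Made-up body masses in grams.
body_mass = pd.Series([2900, 3300, 3700, 4100, 4500, 4900, 5300, 5700, 6100])

# Three equal-frequency bins split at the 33rd and 67th percentiles.
mass_category = pd.qcut(body_mass, q=3, labels=["light", "medium", "heavy"])
print(mass_category.value_counts())
```

Because `qcut` splits on quantiles rather than fixed widths, each bin ends up with (roughly) the same number of penguins.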

Creating group-wise statistical measures

Numeric summaries grouped by categories, such as median values by species, are frequently essential features for deeper data insights. To efficiently create these features, we provided Copilot with this prompt:

Calculate the median body mass by species and add this as a new numeric feature to each row of the dataset.

Median body mass by species

Copilot quickly calculated the median body mass for each penguin species and added these values directly into each row as a new numeric feature. This new column allows us to easily identify how an individual penguin compares against the typical body mass of its species, highlighting significant deviations or confirming typical measurements.
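In pandas this is a one-liner with `groupby(...).transform(...)`, which computes the statistic per group and broadcasts it back to every row. A sketch on toy data:

```python
import pandas as pd

# Toy penguin rows.
df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"],
    "body_mass_g": [3700, 3900, 5000, 5200, 5400],
})

# transform() broadcasts each species' median back onto its rows.
df["species_median_mass"] = (
    df.groupby("species")["body_mass_g"].transform("median")
)
print(df)
```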

Standardizing variables with Z-score scaling

Numeric features in datasets often vary greatly in scale, making them challenging to compare directly. For instance, a penguin’s body mass (measured in grams) naturally has a larger numeric scale than its bill length (measured in millimeters). This discrepancy can distort analyses, especially when performing predictive modeling or clustering.

To address this issue clearly and efficiently, we gave Copilot the following prompt:

“Apply standardization (Z-score scaling) to the numeric columns bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g using Python.”

Copilot standardized z-scores

Copilot quickly applied Z-score scaling to our selected numeric features. This transformation converts each numeric value into a standardized score representing how many standard deviations it is from the feature’s average. A standardized value near 0 indicates the measurement is close to the average, while positive or negative values reflect deviations above or below the average, respectively.

Standardization makes these numeric features directly comparable, allowing each to contribute equally and meaningfully to subsequent analyses.
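The transformation itself is just subtract-the-mean, divide-by-the-standard-deviation, applied column-wise. Copilot might instead use scikit-learn's `StandardScaler`; the plain-pandas sketch below gives the same idea on a toy slice of the columns:

```python
import pandas as pd

# Toy slice of the penguin numeric columns.
df = pd.DataFrame({
    "bill_length_mm": [39.1, 46.5, 50.0, 38.9],
    "body_mass_g": [3750, 4450, 5700, 3625],
})

# Z-score each column: (value - column mean) / column standard deviation.
z = (df - df.mean()) / df.std()
print(z.round(2))
```

After scaling, every column has mean 0 and standard deviation 1, so grams and millimeters sit on the same footing.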

Advanced missing value imputation

Dealing with missing values is a common challenge in data analysis. Excel and Power Query offer basic options for handling missing data, but these tools fall short when it comes to advanced predictive imputation, especially when we want to use similar, complete observations to estimate missing values.

To effectively overcome this limitation, we provided Copilot with the following prompt:

“Perform advanced imputation by predicting missing values in each row based on the most similar complete observations in the dataset.”

KNN imputation

Copilot quickly applied the K-Nearest Neighbors (KNN) imputation method, filling in missing numeric values by finding the most similar penguins in the dataset and using their known measurements to predict missing data. The resulting dataset now has complete observations for each numeric feature, providing a robust foundation for further modeling and analysis.
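KNN imputation is available out of the box in scikit-learn. A sketch on a tiny made-up table: the missing bill length is filled with the average of its two nearest neighbors, where "nearest" is judged by the columns that are present.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Tiny made-up table with one missing bill length.
df = pd.DataFrame({
    "bill_length_mm": [39.1, 39.5, np.nan, 46.5, 46.0],
    "body_mass_g": [3750, 3800, 3720, 4450, 4400],
})

# Each missing value is replaced by the average of its 2 nearest neighbors,
# measured across the non-missing features (here, body mass).
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(imputed)
```

The 3720 g penguin's two nearest neighbors by body mass have bill lengths 39.1 and 39.5, so the gap is filled with their average, 39.3.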

Check out this post for a little more on k-nearest neighbors with Copilot, Python and Excel:

Conclusion

Feature engineering is foundational for achieving robust, predictive analytics. By embracing Python and Copilot alongside Excel, you’re positioned to execute sophisticated transformations that previously required significant manual effort or were simply unattainable with basic Excel or Power Query techniques alone.

Yet, while Python-powered feature engineering significantly expands your analytical capabilities, it’s essential to remain mindful of potential limitations. Advanced methods require clear understanding and interpretation to avoid unintended biases or misrepresentations in your data. Additionally, ensuring seamless integration between Python-driven transformations and traditional Excel workflows will require thoughtful structuring and documentation.

It is crucial to deeply engage with your data through exploratory and visual methods to verify that the engineered features truly address your analytical needs. Feature engineering inherently involves substantial trial and error. Although Copilot significantly accelerates experimentation and automates repetitive tasks, its effectiveness hinges upon clear and accurate guidance. Without properly defined parameters, Copilot might inadvertently speed up the creation of irrelevant or misleading features.

Moving forward, consider experimenting with these techniques on your own datasets to identify unique opportunities for improvement. Challenge yourself to combine the intuitive visualizations of Excel with the scalable computational power of Python, leveraging Copilot’s AI assistance to bridge gaps and streamline your processes. As your comfort grows, you’ll uncover new strategies to elevate your analyses, enabling more impactful and data-driven decision-making.


Python in Excel: How to build decision trees with Copilot https://stringfestanalytics.com/python-in-excel-how-to-build-decision-trees-with-copilot/ Thu, 21 Aug 2025 18:45:39 +0000

The post Python in Excel: How to build decision trees with Copilot first appeared on Stringfest Analytics.

Decision trees are intuitive analytical tools that help us explore and predict outcomes by asking simple, branching questions about our data. They’re popular because they’re visual, easy to interpret, and closely mimic human decision-making processes. However, Excel users historically faced an uphill battle using decision trees. There’s no built-in Excel feature for building them, forcing analysts either to leave Excel entirely or to spend hours painstakingly crafting complex logical formulas. Thankfully, that’s no longer the case.

With the integration of Python and Copilot’s Advanced Analysis directly inside Excel, decision trees have gone from intimidating to effortless. Excel users can now leverage Python’s powerful machine learning capabilities without leaving the comfort of their spreadsheet. Even better, Copilot guides you through each step, helping you understand and interpret results clearly… no coding or advanced statistics degree required.

In this post, we’ll demonstrate how you can easily create and interpret decision trees using Python and Copilot in Excel, helping HR professionals understand factors driving employee attrition. To follow along, download the IBM HR Employee Attrition dataset below. We’ll use it to build a simple decision tree model, evaluate its accuracy, visualize key insights, and turn these findings into actionable business recommendations.

 

Setting the business context

We start our analysis with this straightforward prompt:

“Briefly summarize what this dataset is about and explain why using a decision tree might help an HR analyst better understand attrition.”

Here, we’re taking a moment to step back and ensure we understand the underlying business problem: Why do employees leave, and what can we do about it?

Copilot’s output deliberately moves beyond simply describing what is in the dataset. It clarifies why each factor, such as age, role, income, or satisfaction level, is relevant in an HR context. By highlighting that a decision tree visually separates employees based on traits associated with higher or lower attrition risk, Copilot helps us understand the analytical process itself.

Building the decision tree

For our next prompt, we’ll ask Copilot to build our decision tree model, automatically picking out the most important factors for predicting employee attrition:

“Build a basic decision tree model to predict employee attrition using the most relevant features from the dataset. Explain briefly why the model chose these features as important.”

Decision tree Copilot

In the resulting output, we can see the results clearly listed out, with features like MonthlyIncome, OverTime, and TotalWorkingYears emerging as the top predictors.

But Copilot doesn’t leave us hanging there: it also gives us context. For instance, it points out that employees with lower monthly incomes or who frequently work overtime tend to leave the company more often. This makes sense from a practical HR standpoint, right? Knowing these drivers lets us immediately think about actionable solutions.

With Copilot’s help, we’re quickly moving from raw data toward insights that can genuinely impact employee retention and organizational strategy.
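Under the hood, the code Copilot writes for this step generally follows a standard scikit-learn pattern. Here’s a minimal sketch using a tiny made-up sample in place of the IBM HR dataset (the column names match the dataset, but the values, and the code itself, are illustrative rather than Copilot’s actual output):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Tiny made-up sample standing in for the IBM HR attrition data
df = pd.DataFrame({
    "MonthlyIncome":     [2500, 9000, 3100, 8200, 2700, 7600, 2900, 8800],
    "OverTime":          [1, 0, 1, 0, 1, 0, 1, 0],  # encoded: 1 = Yes
    "TotalWorkingYears": [2, 15, 3, 12, 1, 10, 2, 14],
    "Attrition":         [1, 0, 1, 0, 1, 0, 0, 0],  # 1 = employee left
})

X = df[["MonthlyIncome", "OverTime", "TotalWorkingYears"]]
y = df["Attrition"]

# A shallow tree keeps the model easy to interpret
model = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)

# Importances sum to 1; larger values mark the features driving the splits
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

The `feature_importances_` attribute is what lets Copilot report which columns matter most, which is how features like MonthlyIncome and OverTime surface as top predictors.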

Testing the decision tree on new data

Here’s our next prompt. We’re now checking how well our decision tree works with new data, something it hasn’t “seen” before:

“Split the data into training and testing sets (80/20). Train the decision tree model on the training set and tell me how accurately it predicts attrition on new data. Briefly explain why this testing step matters for business decisions.”

Why are we doing this? Because a model might be great at making sense of the past (training data) but struggle to handle future or unseen cases. Splitting the dataset lets us simulate real-world scenarios—training the model on one portion of data, then testing it on a smaller, separate portion to gauge how it might perform “in the wild.”

Decision tree accuracy

Copilot’s output here is clear and straightforward. It tells us the decision tree predicted employee attrition accurately about 86% of the time on new, unseen data. Then, importantly, it explains why this matters: testing gives us confidence that our insights aren’t just theoretical.
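The train/test workflow Copilot generates typically relies on scikit-learn’s `train_test_split`. This sketch uses synthetic data with a made-up, deliberately learnable attrition rule so it runs standalone; in the workbook, `df` would come from your Excel table:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic HR-style data with a made-up attrition rule
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "MonthlyIncome": rng.integers(2000, 15000, n),
    "OverTime": rng.integers(0, 2, n),
})
df["Attrition"] = ((df["MonthlyIncome"] < 5000) & (df["OverTime"] == 1)).astype(int)

X, y = df[["MonthlyIncome", "OverTime"]], df["Attrition"]

# 80/20 split; stratify keeps the attrition rate equal in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Accuracy on unseen data: {accuracy:.0%}")
```

Because the model only ever sees `X_train`, the accuracy on `X_test` is an honest estimate of how it would behave on future employees.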

Building and interpreting a visualization

For our next prompt, we’ll ask Copilot to create a clear visual representation of our trained decision tree right within Excel, making it easier to understand how different factors influence employee attrition:

“Create a visualization of the trained decision tree right here in Excel. Walk me through what the first few splits mean for HR—what factors most influence employees leaving?”

Decision tree output

Visualizing the decision tree translates abstract numeric data into an intuitive flowchart, making it easy for HR analysts or stakeholders to understand how the model predicts employee attrition. The current visualization shows the first three layers of the decision tree, using color shading to represent the predicted outcomes: orange indicates employees likely to stay, blue indicates those likely to leave, and lighter shades reflect less certainty.

At the highest level, whether an employee works overtime emerges as the most important factor. Employees not working overtime tend to remain at the company, especially those with more years of experience, higher hourly rates, stock options, and greater job satisfaction. Conversely, those who do work overtime show higher attrition rates, particularly if their monthly income is low, they’re unmarried, hold lower-level job roles, or haven’t been promoted recently.

While this visualization stops after three layers for readability, in practice, the decision tree would branch further into additional, more detailed layers. To keep interpretations clear and manageable, analysts typically “prune” or simplify deeper branches. Even without further pruning here, it’s clear from this initial analysis that factors such as overtime, compensation, promotion frequency, and job satisfaction significantly influence employee retention.

As a quick practical note, plots produced by Copilot may initially appear small, making labels hard to read. You can manually adjust the figure size (figsize) in your Python code, or explicitly request Copilot to resize it for clarity. Additionally, while Copilot is helpful for generating visuals, it can’t currently interpret detailed text within images or plots. For assistance interpreting visualizations, I’d suggest taking a screenshot and pasting it into Microsoft Copilot.
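If you want to resize or rebuild the tree plot yourself, scikit-learn’s `plot_tree` with an explicit `figsize` is the usual approach. This sketch trains a stand-in model on synthetic data purely so it runs on its own; the feature and class names are made up for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # not needed inside Excel; lets the sketch run headless
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Stand-in model on synthetic data; in the workbook you'd reuse the HR tree
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# A larger figsize keeps the node labels readable
fig, ax = plt.subplots(figsize=(16, 8))
annotations = plot_tree(
    model,
    feature_names=["Income", "OverTime", "Years", "Satisfaction"],
    class_names=["Stays", "Leaves"],
    filled=True,
    ax=ax,
)
fig.tight_layout()
```

Bumping the `figsize` tuple (width, height in inches) is usually all it takes to make cramped labels legible.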

Developing business recommendations

After carefully exploring and interpreting the factors the decision tree identifies as important predictors of attrition, the natural next step is to translate these insights into practical actions:

“Based on this decision tree analysis, provide two or three simple, actionable recommendations HR can implement to reduce employee attrition.”

The resulting output directly responds to this goal, providing clear recommendations HR can feasibly implement. This ensures the analysis moves beyond descriptive insights into concrete, effective strategies for improving employee retention.

Conclusion

Decision trees are powerful yet accessible tools, helping Excel users quickly uncover meaningful insights from complex data… no advanced coding skills required. As we’ve seen, integrating Python and Copilot directly into Excel significantly reduces the friction traditionally associated with advanced analytics. Excel users across any business function can now rapidly identify key factors driving outcomes like employee attrition, customer churn, sales performance, and more, turning raw data into actionable strategies.

However, while decision trees provide intuitive visuals and practical recommendations, they come with certain limitations. They simplify complex relationships and might overlook subtle factors or interactions hidden in your data. Their effectiveness also heavily depends on data quality. Errors, gaps, or biases can affect the accuracy of insights.

Looking ahead, Excel users can take this analysis further by comparing decision trees against other analytical methods, such as logistic regression or random forests, to validate results and strengthen confidence in findings. Regularly refreshing your analysis with new data ensures your insights remain relevant, especially as business conditions change.

Ultimately, while Copilot and Python unlock new possibilities in Excel, the importance of your own domain expertise remains central. Your ability to interpret these outputs thoughtfully, using practical business judgment, ensures recommendations remain realistic, impactful, and tailored to your organization’s goals.

Copilot in Excel: How to do K-means clustering with Python https://stringfestanalytics.com/copilot-in-excel-how-to-do-k-means-clustering-with-python/ Sun, 10 Aug 2025 15:04:03 +0000 https://stringfestanalytics.com/?p=14995 K-means clustering is a machine learning technique that groups similar data points into meaningful clusters. For Excel users, it’s a simple but powerful way to uncover hidden patterns like customer segments, market trends, or pricing opportunities without needing advanced stats knowledge. In this post, we’ll explore how to use it with the Windsor housing prices […]

The post Copilot in Excel: How to do K-means clustering with Python first appeared on Stringfest Analytics.

K-means clustering is a machine learning technique that groups similar data points into meaningful clusters. For Excel users, it’s a simple but powerful way to uncover hidden patterns like customer segments, market trends, or pricing opportunities without needing advanced stats knowledge. In this post, we’ll explore how to use it with the Windsor housing prices dataset.

Previously, performing clustering analysis in Excel required complex formulas or manual workarounds. Now, with Python integrated directly into Excel and the ease of prompting through Copilot, this approach is accessible and straightforward. Copilot generates and explains Python scripts for you, streamlining the analysis without messy formulas or VBA.

For this demonstration, we’ll use the Windsor housing prices dataset. You can follow along with the exercise file below:

 

Preparing the dataset

The first step is to clean our data, starting with normalization. K-means clustering struggles when columns are on completely different scales. In our dataset, lot size is measured in thousands of square feet, price is in the tens or hundreds of thousands, and bedrooms is a single-digit number. If we feed this directly into k-means, the larger numbers will dominate the results, even if they’re not the most important features. Let’s start with this prompt:

Normalize this data to prepare for k-means clustering.

Normalize for k-means

With Python in Excel, Copilot automatically recognized that we needed to scale each numeric column so that all features contribute equally. The code it generated uses standard scaling, subtracting the mean from each column and dividing by the standard deviation. This transforms each column to have a mean of 0 and a standard deviation of 1.
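That transform can be written in a couple of lines of pandas. This sketch uses a tiny made-up slice of Windsor-style housing data; in the workbook, `df` would be your Excel table:

```python
import pandas as pd

# Small made-up slice of Windsor-style housing data
df = pd.DataFrame({
    "price":    [42000, 60000, 96500, 38500, 75000],
    "lotsize":  [5850, 4000, 6420, 3880, 8080],
    "bedrooms": [3, 2, 4, 2, 3],
})

# z-score each column: subtract the mean, divide by the standard deviation
df_scaled = (df - df.mean()) / df.std()
print(df_scaled.round(2))
```

Note that pandas’ `std()` uses the sample standard deviation while scikit-learn’s `StandardScaler` uses the population version; for clustering purposes the difference is immaterial.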

Choosing the number of clusters

Once our data is normalized, the next step is deciding how many clusters (K) we should use in k-means. Choosing K is a classic challenge: pick too few, and your clusters will be overly broad; pick too many, and you’ll end up overfitting noise.

To guide this decision, I used the following prompt for Copilot:

Test K-means clustering with K values from 2 to 6 and show me the silhouette score for each K so I can decide the best number of clusters.

The silhouette score is a metric that tells us how well each point fits within its assigned cluster compared to other clusters. It ranges from -1 to 1:

  • Closer to 1 means points are tightly grouped and well-separated from other clusters (good).
  • Around 0 means clusters overlap (so-so).
  • Negative values mean many points are in the wrong cluster (bad).
Silhouette score

Copilot tested k-means clustering for K values from 2 through 6, then plotted the silhouette scores to help identify the best fit. Statistically, K=2 came out on top with the highest score (around 0.32), meaning it gives the clearest separation between groups. As we move to K=3, the score dips, bottoms out at K=4, and then edges upward again for K=5 and K=6, though none surpass K=2’s peak performance.

If we were optimizing purely for the math, K=2 would be the obvious choice. But data analysis is more than just optimizing for metrics. Your real goal is insights that matter. A two-cluster split might be too broad to be useful for many real-world applications. If the goal is to segment customers into personas, categorize products into distinct tiers, or group properties by market potential, a finer segmentation can be more actionable. That’s why I’m going with three clusters.
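The loop behind this comparison is short. Here’s a hedged sketch on synthetic, well-separated blobs, so the scores (and even the winning K) will differ from the Windsor results:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs in place of the scaled housing data
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(-2, 0.4, size=(60, 3)),
    rng.normal(0, 0.4, size=(60, 3)),
    rng.normal(2, 0.4, size=(60, 3)),
])

# Fit k-means for each candidate K and record the silhouette score
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f"K={k}: silhouette score = {scores[k]:.3f}")

best_k = max(scores, key=scores.get)
print(f"Highest silhouette score at K={best_k}")
```

The `best_k` line is just the mathematical winner; as discussed above, the K you actually use should also reflect business judgment.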

Running k-means clustering

With the optimal number of clusters chosen (three, in this case), it’s time to run k-means clustering for real and see how our data is grouped. Here’s what I asked Copilot:

Run K-means clustering with 3 clusters, add the cluster number as a new column to my DataFrame, and display the first 10 rows

Labeled clusters

Copilot used Python to fit the normalized dataset and then assigned each row to one of the three clusters. It then appended a new cluster column to our DataFrame, so we can see exactly which group each property belongs to.
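The sketch below shows the standard pattern: `fit_predict` on the features, then store the labels as a new column. The data here is randomly generated so the example is self-contained:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Randomly generated "normalized" features so the sketch is self-contained
rng = np.random.default_rng(7)
df = pd.DataFrame(rng.normal(size=(30, 3)),
                  columns=["price", "lotsize", "bedrooms"])

# Fit on the scaled features and store each row's cluster label
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
df["cluster"] = kmeans.fit_predict(df[["price", "lotsize", "bedrooms"]])

print(df.head(10))
```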

Analyzing cluster characteristics

Once each property is assigned to a cluster, the next step is to understand what makes each cluster unique. I asked Copilot:

For each cluster, calculate the average price, lot size, bedrooms, and bathrooms

Cluster label summary stats

The results make the differences between the three clusters easy to see. Cluster 0 contains mid-range homes, with average prices around $72,000, moderately sized lots close to 6,900 square feet, and just under three bedrooms. Cluster 1 represents more budget-friendly properties, averaging $53,000 with smaller lots of about 4,000 square feet and slightly fewer bedrooms. Cluster 2 stands out as the premium tier, with average prices near $96,000, larger lots of roughly 6,000 square feet, and more generous space: about 3.6 bedrooms and 1.9 bathrooms.
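Profiling clusters like this is a one-line `groupby` in pandas. The toy numbers below are chosen to echo the cluster averages described above, but they are illustrative, not the actual Windsor results:

```python
import pandas as pd

# Toy labeled rows echoing the shape of the clustered housing DataFrame
df = pd.DataFrame({
    "cluster":  [0, 0, 1, 1, 2, 2],
    "price":    [70000, 74000, 52000, 54000, 95000, 97000],
    "lotsize":  [6800, 7000, 3900, 4100, 5900, 6100],
    "bedrooms": [3, 3, 2, 3, 4, 3],
})

# Average each feature within each cluster
profile = df.groupby("cluster")[["price", "lotsize", "bedrooms"]].mean()
print(profile)
```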

Visualizing clusters

Once each row had a cluster label, I asked Copilot to plot lot size against price, color-coded by cluster, with black X’s marking the cluster centers.

Create a scatter plot of lot size vs price colored by cluster, with cluster centers marked.

Labeled scatterplot

The result shows three distinct groups: blue in the smaller-lot, lower-price range, green covering mid-to-large lots with varied prices, and orange concentrated in the mid-lot, higher-price zone.

This confirms our earlier choice of three clusters. There’s visible separation, with some overlap, and the “middle” segment adds nuance that two clusters would miss. Plotting the data in its original units makes the chart intuitive and keeps the centroids in real-world terms, ready for naming and deeper business analysis.
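Here’s a sketch of that plot on synthetic lot-size/price data. For brevity it clusters directly on the two plotted columns rather than on the full normalized feature set used in the walkthrough:

```python
import matplotlib
matplotlib.use("Agg")  # not needed inside Excel; lets the sketch run headless
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

# Synthetic lot-size/price data with a rough linear relationship
rng = np.random.default_rng(1)
lotsize = rng.uniform(3000, 9000, 150)
price = 20000 + lotsize * 8 + rng.normal(0, 8000, 150)
X = np.column_stack([lotsize, price])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(lotsize, price, c=kmeans.labels_, s=25)
# Black X's mark the cluster centers in the plotted units
ax.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
           marker="x", color="black", s=120, linewidths=2)
ax.set_xlabel("Lot size (sq ft)")
ax.set_ylabel("Price ($)")
ax.set_title("Housing clusters with centroids")
```

Keeping the axes in original units, as the walkthrough does, means the centroids can be read directly as “typical” lot sizes and prices for each segment.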

Translating into business insights

By this point in the workflow, we’ve normalized the data, chosen a cluster count, assigned each property to a group, and explored those groups statistically and visually. That’s all valuable, but it’s still technical. The next step is translating these results into something a non-technical audience can immediately understand and act on. Here’s my next prompt for that:

Based on the clustering results, give me a plain-language summary of what each cluster represents, the key differences between them, and 2–3 recommendations for how a business (such as a real estate agency) could use these insights for decision-making or strategy.

Copilot returned the following insights:

Copilot’s response simplified the technical results into three distinct groups: mid-range homes for typical families, affordable smaller properties for budget-conscious buyers or investors, and premium larger homes for buyers seeking luxury and status.

This step is critical because it transforms abstract analysis into actionable strategy. Without it, we’re left with charts and averages; with it, we have a clear plan for shaping marketing campaigns, setting prices, and guiding property improvements. It’s the bridge that turns data science into business impact.

Conclusion

In this walkthrough, we used Python in Excel and Copilot to run a complete k-means clustering analysis on the Windsor housing prices dataset. We started by normalizing the data to ensure fair treatment of all features, tested different values of K to find the optimal number of clusters, assigned each property to a cluster, and then profiled and visualized those groups.

The advantages of this approach are clear: Python in Excel removes the old manual and formula-heavy pain points, Copilot automates the coding and explanation process, and visualizations make the results more intuitive. It’s fast, repeatable, and accessible to analysts who aren’t full-time data scientists.

However, k-means clustering has its trade-offs. The results can be sensitive to scaling, the choice of K requires judgment beyond the math, and the technique assumes clusters are roughly spherical in shape, which isn’t always the case in real-world data.

From here, you might refine the analysis by experimenting with different features, testing alternative clustering algorithms like DBSCAN or hierarchical clustering, or integrating external datasets for richer insights.

What do you think? Would this approach work for your data? Share your thoughts or questions in the comments, and I’ll be happy to help.

How to understand Advanced Analysis with Python for Copilot in Excel https://stringfestanalytics.com/how-to-understand-advanced-analysis-with-python-for-copilot-in-excel/ Fri, 30 May 2025 19:21:37 +0000 https://stringfestanalytics.com/?p=15382 As an Excel trainer and course creator who often covers more advanced topics, I get a lot of questions about these newer features, particularly Python in Excel and Copilot in Excel. “When should I use these?” “How do I actually use them well?” “Does this mean regular Excel isn’t good enough anymore?” I get it. […]

The post How to understand Advanced Analysis with Python for Copilot in Excel first appeared on Stringfest Analytics.

As an Excel trainer and course creator who often covers more advanced topics, I get a lot of questions about these newer features, particularly Python in Excel and Copilot in Excel. “When should I use these?” “How do I actually use them well?” “Does this mean regular Excel isn’t good enough anymore?”

I get it. And adding another layer to this maze with Copilot and Python all working together inside Excel is wild, and for a lot of people a little anxiety-inducing.

So in this post, I want to talk about when Advanced Analysis with Copilot for Python in Excel is actually the right fit for your workflow, and when it might not be.

Because let’s be clear: not every shiny new AI-powered Excel feature is meant to replace your beloved PivotTables, SUMIFs, or even basic formulas. This isn’t magic. It’s a tool. And like any tool, it has moments where it really shines… and other times where it just doesn’t make much sense.

First off, what exactly is Advanced Analysis with Copilot?

Advanced Analysis is Microsoft’s latest feature that combines Python scripting directly in Excel using Copilot’s AI assistance. The idea is straightforward: you get to run Python code seamlessly within your Excel workbook, guided by Copilot. It’s designed for sophisticated analytical tasks, including time series forecasting, quick machine learning modeling, and data visualizations.

Here’s what’s cool about it:

  • It generates Python code snippets you can run directly in Excel.
  • It covers use cases that Excel traditionally struggles with like time series, machine learning, clustering analysis, advanced visuals, and quick hypothesis testing.
  • It’s auditable. You get to see the Python code it generates. You can review, tweak, or adapt the generated Python code as needed.
  • It uses data that’s already in your workbook. No need to juggle CSVs or open up a separate Jupyter Notebook to do your Python-backed insights.

Here’s what’s less cool:

  • It’s not really designed to be super collaborative or shareable. The analysis is quick, iterative, exploratory… more like your own private analytical playground. Sure, you get the Python code generated by Copilot, but actually working it into more permanent reports or analyses is probably going to take some muscle. Plus, there’s the unavoidable fact that a lot of people on your team may not even know Python.
  • And, drumroll please: this tool isn’t foolproof, isn’t meant for beginners, and definitely shouldn’t be taken as “right” 100% of the time. Like any AI tool, Copilot can, and will, occasionally deliver results that are a bit off-target.

Now let’s dive deeper into the specific scenarios when this really becomes the right tool to use.

When to embrace Advanced Analysis

You need rapid insights, not production-ready dashboards

Advanced Analysis is perfect for quick exploratory questions or what I’d call “back-of-the-envelope” data thinking. This isn’t your polished monthly reporting dashboard you’re sending off to stakeholders. It’s the quick glance into the crystal ball to validate a hunch or a hypothesis.

Maybe you need to rapidly check a correlation between customer churn and pricing, or forecast sales for the next month without setting up an elaborate forecasting model. Advanced Analysis lets you dive into the numbers quickly, guided by Python, to confirm or disconfirm your intuition.

If you get a good result, fantastic! If not, you just move on without spending hours wrestling with complicated models.

You’re exploring advanced scenarios beyond Excel’s traditional capabilities

Some analytical tasks, like clustering customer groups, running a quick regression model, or forecasting seasonal sales, are difficult or impossible in native Excel. PivotTables and slicers are fantastic for simple grouping and summarizing… but not so great for nuanced machine learning problems.

Copilot’s Advanced Analysis bridges the gap. It offers tools like ARIMA models for forecasting, linear regression models for quick predictions, and even K-means clustering right inside Excel.

If these are the types of tasks you frequently encounter, or want to explore, Advanced Analysis could quickly become a new power tool in your toolkit.

You understand a bit about Python (or at least want to learn)

To really get value out of Advanced Analysis, you need some basic familiarity with Python, or at least a willingness to learn as you go. Copilot will generate code snippets for you, but you have to be comfortable enough to understand generally what’s happening.

You don’t need to be a Python wizard. But if terms like “pandas,” “NumPy,” or “Matplotlib” don’t sound at least vaguely familiar, you’re going to find it harder to truly benefit. The beauty is that using Copilot helps you quickly gain exposure to Python. Think of it as having a mentor who gently guides you through the language, step-by-step.

You’re comfortable with ambiguity and quick iterations

Here’s where the analyst mindset becomes crucial. Analysis, by definition, deals with uncertainty. Copilot helps you rapidly test hypotheses and make informed guesses. But yes, sometimes it’ll be wrong, or at least imperfect.

The thing is, human analysts get things wrong too. Copilot doesn’t absolve you from having analytical intuition. Like a driver using Google Maps, you still need enough common sense to avoid driving into lakes or going entirely in the wrong direction.

Ambiguity comes with the territory, and if you’re comfortable with that, Advanced Analysis will feel intuitive and powerful.

When NOT to rely on Advanced Analysis

Okay, so we’ve covered where Copilot shines. But let’s quickly address where it doesn’t.

You primarily need routine reporting or basic calculations

If your job is mainly generating routine reports, basic aggregations, or simple visualizations, the complexity of Python and AI might be unnecessary. A PivotTable or even basic formulas and functions can often get the job done more quickly.

Sure, Copilot could probably help you here too, but you’re likely adding complexity without sufficient benefit.

You’re sharing your workbook widely, immediately

Advanced Analysis generates Python code snippets in your workbook. While you can audit and tweak this code, it’s not exactly packaged neatly for wide distribution. If your colleagues aren’t comfortable with Python, sharing this could be more trouble than it’s worth.

Use Advanced Analysis for your personal exploration, then perhaps transfer insights to native Excel features if broader sharing is required.

You’re uncomfortable with uncertainty in your workflow

Let’s be honest: Copilot is AI. AI, even incredibly powerful AI, sometimes produces off results. If you’re someone who needs to confirm every detail with external sources or deeply authoritative references at every step, Advanced Analysis might feel uncomfortable.

There’s nothing wrong with verifying rigorously, but the biggest advantage of Advanced Analysis (its speed and agility) can be lost if you constantly feel compelled to double-check everything manually. It’s about comfort level: can you trust your intuition and the general plausibility of results, or must everything be confirmed by a secondary source?

Should you dismiss Advanced Analysis if you don’t “need” Python tasks?

Interestingly enough, even if you don’t often tackle traditional Python tasks, Advanced Analysis might still offer real value for your workflow.

Here’s why: Copilot is trained on an enormous amount of data… way beyond just Excel’s built-in functions and formulas. A substantial part of this training corpus is Python code, which is precisely why Copilot excels at translating your analytical needs into Python operations. Python itself is a fantastic execution layer for advanced analytical tasks: large language models (LLMs) alone struggle with direct mathematical computations, but they’re extremely good at generating Python code that can handle those operations effortlessly.

This means that sometimes Copilot’s recommendations might genuinely outperform Excel’s built-in Analyze Data tool or other basic AI features, even for seemingly straightforward tasks. It can surface correlations, insights, or patterns that your usual toolkit might overlook, simply because it draws on a much broader training base.

Beyond that, Advanced Analysis serves as a powerful educational resource. You’ll get rapid, practical exposure to Python concepts and analytical methods that you might otherwise never encounter. It’s a great way to expand your analytical skillset without the friction of diving deep into Python from scratch.

Final thoughts: Embrace the ambiguity, enjoy the speed

In the end, the key factor determining whether Advanced Analysis with Copilot for Python in Excel is right for you is comfort with ambiguity and your analytical intuition. Do you enjoy quickly iterating on ideas, using a tool that offers rapid insights at the expense of occasional uncertainty?

Analysis, after all, is often less about certainties and more about informed probabilities. Just like Google Maps or any other algorithm-driven tool isn’t always perfect, Copilot isn’t infallible. The analyst’s role is interpreting, evaluating, and yes, occasionally correcting the AI’s output.

Advanced Analysis isn’t a silver bullet, but it is an incredibly sharp arrow in your analytical quiver. If you’re ready to quickly test ideas, explore new analytical frontiers, and embrace a bit of ambiguity, this might just become your new favorite tool.

And if you still have questions, or want help integrating Advanced Analysis into your workflow, don’t hesitate to reach out. I’m always here to help you make sense of Excel’s evolving AI landscape and turn these innovations into real-world results.

Advanced analysis with Python in Copilot: How to work with text data https://stringfestanalytics.com/advanced-analysis-with-python-in-copilot-how-to-work-with-text-data/ Tue, 01 Oct 2024 17:52:09 +0000 https://stringfestanalytics.com/?p=14404 Excel has long been a go-to tool for data analysts to crunch numbers and perform data analysis. Traditionally, its ability to handle text data, particularly for advanced analysis tasks like sentiment analysis, was seen as limited. Analysts seeking meaningful insights from text often resorted to more specialized tools or cumbersome workarounds. However, Excel’s data analysis […]

The post Advanced analysis with Python in Copilot: How to work with text data first appeared on Stringfest Analytics.

Excel has long been a go-to tool for data analysts to crunch numbers and perform data analysis. Traditionally, its ability to handle text data, particularly for advanced analysis tasks like sentiment analysis, was seen as limited. Analysts seeking meaningful insights from text often resorted to more specialized tools or cumbersome workarounds.

However, Excel’s data analysis capabilities are rapidly evolving with the integration of Python—a programming language celebrated for its extensive libraries that support data science and natural language processing (NLP). This integration significantly expands Excel’s functionality, making sophisticated text analysis accessible to a broader audience.

Additionally, the introduction of Copilot in Excel marks a significant advancement in text data processing. Copilot employs advanced machine learning algorithms to function as a smart assistant within Excel, capable of automating repetitive tasks, crafting complex data transformations, and delivering insights more efficiently than traditional methods.

This post explores how Python and Copilot now enable advanced text analysis in Excel. We will discuss everything from basic text preparation to complex NLP techniques such as sentiment analysis and predictive analytics. With these tools, Excel is well-equipped to handle detailed NLP tasks.

You can follow along by downloading the exercise file below:

 

This well-known dataset comprises 50,000 movie reviews, each labeled with either positive or negative sentiment. To begin analyzing this data, please start the Advanced Analysis session by following the steps detailed in a previous blog post:

As noted in that blog post, the outputs provided by Copilot and Advanced Analysis may vary, potentially leading to different outcomes with each attempt. For instance, I obtained a set of data visualizations built with Python code, as seen below:

Text analysis visualizations Copilot

Let’s explore each visualization and what it reveals about the data. You might also consider using Copilot for assistance here.

The first chart illustrates the distribution of reviews classified as either positive or negative. This visualization is crucial as it provides a quick view of the overall sentiment balance within the reviews, allowing analysts to discern which sentiment prevails. Such information is essential for understanding general audience reception and can be particularly valuable when building predictive models, as many models perform better with an equal number of records in each category.

The second chart shows the distribution of review lengths, measured by the number of words per review. This chart is insightful as it indicates the typical length of reviews, reflecting the depth of analysis usually provided by reviewers. This information could aid in assessing the comprehensiveness of reviews and might be used to filter out very short, potentially less informative reviews.
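Computing that words-per-review metric is a one-liner in pandas. A tiny made-up stand-in for the 50,000-review DataFrame:

```python
import pandas as pd

# Tiny stand-in for the 50,000-row movie review DataFrame
df = pd.DataFrame({
    "review": [
        "This film was wonderful and moving",
        "Terrible plot, bad acting",
        "An absolute classic, I loved every minute of it",
    ],
    "sentiment": ["positive", "negative", "positive"],
})

# Words per review, ready to histogram
df["word_count"] = df["review"].str.split().str.len()
print(df[["sentiment", "word_count"]])
```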

The third visualization, a word cloud, displays the most frequent words found in positive reviews, with larger sizes denoting higher frequency. While this highlights common themes or terms in positive feedback, helping to understand which aspects of movies are most appreciated, the visualization has several drawbacks. It lacks quantitative precision, as it does not provide exact counts or an easy way to compare the frequency of different terms. Furthermore, words in a word cloud are stripped of context, and common but uninformative words can dominate. The size of words might also lead to misleading interpretations of their significance without considering the context in which they are used.

To address some of these issues, I’m going to attempt to reduce the noise and explore differences in word choice between positive and negative sentiments by asking Copilot to generate the Python code for the following query:

“What are the top 15 most common words for each sentiment?”

Copilot will generate the Python code, and it appears to return the results as a tuple containing DataFrames. If you’re not familiar with this data structure, that’s fine, but understanding it will help clarify why this data isn’t directly visible in your workbook.
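For intuition, the code Copilot writes for a query like this usually boils down to grouping the reviews by sentiment and counting word frequencies. Here is a minimal sketch in plain Python; the toy data, the `top_words` helper, and the simple punctuation stripping are illustrative stand-ins, not Copilot's actual output:

```python
from collections import Counter

def top_words(reviews, n=15, stopwords=frozenset()):
    """Count word frequencies across a list of review strings."""
    counts = Counter()
    for review in reviews:
        for word in review.lower().split():
            word = word.strip(".,!?\"'()")
            if word and word not in stopwords:
                counts[word] += 1
    return counts.most_common(n)

# Toy stand-in for the IMDB data -- the real dataset has 50,000 rows
data = [
    ("A good, good film with a great cast", "positive"),
    ("Great movie, I loved it", "positive"),
    ("A bad film. Awful acting", "negative"),
    ("Bad plot and a bad ending", "negative"),
]

for label in ("positive", "negative"):
    reviews = [text for text, sentiment in data if sentiment == label]
    print(label, top_words(reviews, n=3))
```

Copilot's version would return one DataFrame per sentiment, bundled into a tuple, which is why the result isn't something Excel can spill directly into the grid.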

Most common words by sentiment

If you can't directly see the results, requesting a visualization is a great strategy. A chart lets you clearly observe the outcomes Copilot produced. For instance, we can plot the top 15 words for each sentiment, which makes the data easier both to interpret and to present to others.

Visualized top 15 reviews per sentiment

You’ll notice that while typically positive words like “good” and “great” appear in positive reviews, there are also many neutral words such as “film” and “movie.” Interestingly, even negative reviews frequently use the word “good.” This highlights the complexity and nuance of language and demonstrates how challenging text data can be to work with. The presence of typically positive words in negative reviews reflects the subtle ways people express mixed sentiments.

Given these complexities, text data analysis requires sophisticated approaches to accurately capture and interpret sentiment. This is a prime example of how advanced text analytics, with tools like Copilot, can reveal the nuanced ways people use language.

Now, let’s take this further by building a predictive model. I will ask Copilot to help us create a model that predicts the sentiment of a review based solely on its text. This step will allow us to apply what we’ve learned about text data and explore machine learning techniques to handle sentiment analysis more effectively.

Logistic regression predictive analytics results

Copilot provides data, once again as a tuple, but this time it gets printed directly. It’s a logistic regression model, and we are told that it can predict the sentiment of a review with about 88% accuracy. Interesting! However, it’s crucial for you to validate and sanity-check these results. You can even enlist Copilot’s help for this. For instance, I just asked Copilot to explain some of the assumptions required for using logistic regression to classify text sentiment. It provided valuable insights, and I could ask it to start verifying these assumptions. For example, it checks for balanced samples and even calculates the frequencies by group in the worksheet, allowing you to easily confirm these details.
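To build intuition for what a model like this does, here is a bare-bones, from-scratch sketch: turn each review into word counts, then fit a logistic regression by gradient descent. This is a teaching toy on made-up data, not the pipeline Copilot produced (in practice Copilot typically reaches for scikit-learn with TF-IDF features):

```python
import math

def featurize(text, vocab):
    """Bag-of-words: count how often each vocabulary word appears."""
    words = text.lower().split()
    return [words.count(w) for w in vocab]

def train_logistic(X, y, lr=0.5, epochs=200):
    """Fit logistic regression weights with plain gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            p = 1 / (1 + math.exp(-z))   # sigmoid squashes z into (0, 1)
            err = p - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict(text, vocab, w, b):
    """Probability that a review is positive."""
    z = b + sum(wj * xj for wj, xj in zip(w, featurize(text, vocab)))
    return 1 / (1 + math.exp(-z))

# Toy labeled reviews: 1 = positive, 0 = negative
texts = ["good great film", "great movie good", "bad awful film", "awful bad movie"]
labels = [1, 1, 0, 0]
vocab = sorted({w for t in texts for w in t.split()})
w, b = train_logistic([featurize(t, vocab) for t in texts], labels)

print(predict("good great", vocab, w, b))  # high probability of positive
print(predict("bad awful", vocab, w, b))   # low probability of positive
```

Accuracy figures like the 88% Copilot reported come from scoring a held-out test set, which this sketch omits entirely.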

Check assumptions and test for them text analysis Copilot

The ability to analyze text data truly opens up a new frontier in Excel, but remember, this is advanced territory. It’s important to understand the pros and cons of the methods and techniques used with Python and Copilot for sentiment analysis.

Do you have any questions about text analytics with Python for Copilot in Excel or about the Advanced Analysis feature more generally? Let me know in the comments.

The post Advanced analysis with Python in Copilot: How to work with text data first appeared on Stringfest Analytics.

]]>
How to conduct sentiment analysis in Excel with Azure Machine Learning https://stringfestanalytics.com/how-to-conduct-sentiment-analysis-in-excel-with-azure-machine-learning/ Sat, 01 Apr 2023 15:07:04 +0000 https://stringfestanalytics.com/?p=10501 Traditionally we’ve considered Excel a tool to be used only with relatively small, structured (i.e. rows with columns of consistent datatypes, etc.) data: However, the typical “rules” about what Excel can and can’t do are getting blurrier as more features are added — particularly when it comes to artificial intelligence and machine learning (AI/ML) features. […]

The post How to conduct sentiment analysis in Excel with Azure Machine Learning first appeared on Stringfest Analytics.

]]>
Traditionally we’ve considered Excel a tool to be used only with relatively small, structured (i.e. rows with columns of consistent datatypes, etc.) data:

However, the typical “rules” about what Excel can and can’t do are getting blurrier as more features are added — particularly when it comes to artificial intelligence and machine learning (AI/ML) features. For example, we can use Excel to detect the sentiment found in a series of text reviews. Let’s find out how:

 

This dataset contains restaurant reviews to classify as positive, negative, or neutral. Manually assessing a few reviews isn’t a problem, but it becomes challenging with thousands or more. To automate this task, we’ll use Azure’s text analytics features.

What is sentiment analysis and why use it?

Sentiment analysis is a data analysis tool that uses machine learning algorithms to decipher emotions and opinions in unstructured data. It often categorizes text as positive, negative, or neutral, enabling businesses to improve customer satisfaction and address concerns based on overall sentiment towards a brand, product, or service.

Loading the Azure ML add-in

The first step is to load the Azure Machine Learning add-in to your workbook. From the ribbon, head to Insert > Add-ins > Get Add-ins, then search for Azure Machine Learning:

Insert the Azure ML add-in menu

Click Add, then Continue. You should now see the Azure Machine Learning add-in appear to the right of your Excel session. Click the second option in this menu, Text Sentiment Analysis (Excel Add-in Solver), to continue.

Creating the schema

Azure does all the computing needed for sentiment analysis, but expects the input data to be in a specific format, or schema. Structuring data in a machine-friendly way is essential for machine learning and AI to work accurately and efficiently.

For this particular task, we need to create three column headers in the workbook: tweet_text, Sentiment, and Score. You can find these under the View Schema section of the sentiment analysis menu.

The first column header, tweet_text, is where we input the restaurant reviews. Despite its name, this column header can handle full reviews, not just tweets. Azure’s sentiment analysis add-in will still work effectively with this column.

The Sentiment and Score columns will be populated by Azure’s sentiment analysis add-in.

Sentiment analysis schema setup

Running the sentiment analysis

Now that we’ve set up the schema, we can define the input for sentiment analysis in the Predict section of the add-in. The input area spans cells A1:A9, including the header. Be sure to turn on the “My data has headers” option.

For the output area, set cell B1; the results of the sentiment analysis will start appearing there.

After clicking on the Predict button, you will see columns B and C populated with results.

Output sentiment analysis Azure

Unfortunately, this process can be glitchy at times. If you encounter any issues, double-check the schema and inputs, or try restarting Excel.

Interpreting the scores

As expected, Azure has classified each review as positive, negative, or neutral and recorded it in the Sentiment column. The third column, Score, contains a raw number from 0 to 1 generated by Azure. The higher the score, the more positive the sentiment. These scores are then categorized into negative, neutral, and positive groups.
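The bucketing step itself is simple to reproduce. The cutoffs below are illustrative assumptions only; Azure doesn’t document the exact thresholds it uses:

```python
def label_sentiment(score, neg_cutoff=0.4, pos_cutoff=0.6):
    """Map a 0-1 sentiment score to a category (cutoffs are assumed, not Azure's)."""
    if score < neg_cutoff:
        return "negative"
    if score > pos_cutoff:
        return "positive"
    return "neutral"

print(label_sentiment(0.92))  # positive
print(label_sentiment(0.05))  # negative
print(label_sentiment(0.50))  # neutral
```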

If you’re wondering how Azure generates these scores, they come from a complex machine learning model whose inner workings Azure doesn’t expose.

Automated tools like this are convenient, but they often lack transparency and explainability. Although undeniably powerful, these tools are not infallible. For instance, in the sentiment analysis results, rows 3 and 5 were labeled neutral and negative, respectively, even though reading them suggests they are actually negative and positive.

Apparently mislabeled sentiment analysis items

Moral of the story: use AI to its fullest extent, but don’t let it think for you. After all, it’s got artificial intelligence, but you’ve got the real thing.

AI, unstructured data and Excel

Unstructured data is notoriously difficult to work with, although AI is well suited to working with it. And while Excel is generally meant for structured data, you will likely see more applications for using it with text, images and more. The forthcoming integration of GPT-powered language modeling in Excel with Copilot is a significant step forward.

Have you worked with much unstructured data in Excel? Have you ever used sentiment analysis, and for what? What AI features do you wish would come to Excel? Let me know in the comments.


]]>
How to perform predictive analytics in Excel using the XLMiner add-in https://stringfestanalytics.com/how-to-perform-predictive-analytics-in-excel-using-the-xlminer-add-in/ Tue, 21 Feb 2023 18:07:15 +0000 https://stringfestanalytics.com/?p=10778 When you think of predictive analytics, maybe you picture programming languages like R and Python or cloud platforms like Azure or Amazon AWS. The humble Excel spreadsheet is probably not top of mind — but it can serve its purpose. Let’s take a look at what XLMiner can do for predictive analytics and, by extension, […]

The post How to perform predictive analytics in Excel using the XLMiner add-in first appeared on Stringfest Analytics.

]]>
When you think of predictive analytics, maybe you picture programming languages like R and Python or cloud platforms like Azure or Amazon AWS. The humble Excel spreadsheet is probably not top of mind — but it can serve its purpose. Let’s take a look at what XLMiner can do for predictive analytics and, by extension, AI:

 

Excel as data science wireframe

Predictive analytics models are complicated to build and deploy. There’s a danger to setting forth a creation you don’t really understand (ever heard of Frankenstein?). Excel offers a simple, tactile prototyping tool for constructing them. We’ll use an add-in to do most of the heavy mathematics lifting, while still getting hands-on with the data.

Installing the XLMiner add-in

XLMiner is a free add-in for statistical analysis that works on most versions of Excel, including the web. To get started, head to the ribbon then Insert > Add-ins > Get Add-ins. From the Office Add-ins menu, search for XLMiner and click Add:

XLMiner add-in menu

 

Agree to the Terms &amp; Conditions, click OK, and you should see an XLMiner menu in the sidebar of your screen. As you can see, XLMiner comes with plenty of statistical tools and techniques. Let’s focus on the “mother of all models,” linear regression.

Linear regression on housing prices

We will use price as the dependent variable and lotsize, airco and prefarea as the independent variables. Head to the Linear Regression section of the XLMiner menu, fill it out like so, then click OK:

XLMiner input settings

 

Unfortunately, the drag-and-drop feature in XLMiner can be difficult to use for naming an input range, so you may need to type in the cell references manually.

Don’t forget the assumptions!

It can be tempting to jump right into building models and making predictions as we did, but in practice it’s necessary to explore the data and check whether it meets the assumptions of whatever model you’re using.

To be fair, Python and R provide much more robust environments for these regression checks, but if you’d like that hands-on approach that only Excel can provide, check out my book Advancing into Analytics: From Excel to Python and R. XLMiner can also provide additional outputs to help in checking these assumptions.

Evaluating the results

You should see the following output from XLMiner after running the regression:

XLMiner regression output

 

Here you have typical regression diagnostics such as coefficient p-values, R-square and more. If you’d like to learn more about interpreting these, check out Advancing into Analytics. Aside from these measures, what kind of analytics are we doing? Is this AI?

Does this count as AI?

These days, it can seem like a data product is a snoozefest unless it’s AI-powered. So does linear regression in XLMiner count as AI? Sort of — it depends on how the regression model is used.

AI and predictive analytics

Artificial intelligence is what it sounds like: tasks done by computers that normally would require human intelligence, such as transcribing speech, recognizing images and so forth. Predictive analytics is using data and statistical models to forecast what will happen in the future. Making predictions is a typical human task that can be done “artificially,” but few would expect humans to make predictions the way computers do, by following a strict statistical model. That puts predictive analytics near the AI camp, but not in it.

Linear regression and predictive analytics

To make matters more complicated, what we just did with this regression model wasn’t really predictive analytics, although it could have been.

 

Think about how regression works: we fit a line to our data and decide how well that line describes a relationship between independent and dependent variables. This tells us about patterns in past data, but doesn’t in itself make predictions about the future. That makes basic regression a form of diagnostic analytics, or analysis about why things happened in the past.

As this article from Harvard Business School puts it:

When regression analysis is used to explain the relationships between variables in a historical context, that’s an example of diagnostic analytics. The regression can then be used to develop forecasts for the future, which is an example of predictive analytics.

As the article implies, however, it’s simple enough to turn a regression into predictive analytics — just start making predictions! Let’s try that now.

Making point predictions

To get started, we can predict the value for just one data point, known as a point prediction. For example, what is the predicted sale price for a home with a lotsize of 4,000, no air conditioning, and in a preferred area? Let’s plug in the coefficients and find out:

Making a point prediction

 

Easy enough.
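The arithmetic behind a point prediction is just the fitted equation evaluated at the new inputs. The coefficients below are made-up placeholders to show the mechanics; substitute the values from your own XLMiner output:

```python
# Hypothetical coefficients -- replace with the values from your regression output
intercept = 30000.0
b_lotsize, b_airco, b_prefarea = 5.0, 15000.0, 10000.0

def predict_price(lotsize, airco, prefarea):
    """price-hat = intercept + b1*lotsize + b2*airco + b3*prefarea"""
    return intercept + b_lotsize * lotsize + b_airco * airco + b_prefarea * prefarea

# lotsize 4,000, no air conditioning, preferred area
print(predict_price(lotsize=4000, airco=0, prefarea=1))  # 30000 + 20000 + 0 + 10000 = 60000.0
```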

However, predictive analytics usually entails making a larger set of predictions using formal performance metrics. The first step is often to split the data into training and testing sets — I show how to do that in Excel in this blog post.

XLMiner as a gateway to AI

The power you have in using AI will only grow as you understand its foundations. So while XLMiner isn’t itself an AI-powered tool, it’s a great way to come to grips with how machine learning and predictive analytics work.

What questions do you have about XLMiner and predictive analytics in Excel? How do you see the relationship between AI, Excel and predictive analytics? Let me know in the comments.


]]>
How to create training and testing data samples in Excel https://stringfestanalytics.com/how-to-create-training-and-testing-data-samples-in-excel/ Mon, 20 Feb 2023 20:46:23 +0000 https://stringfestanalytics.com/?p=10781 A common practice in machine learning and predictive analytics is to split a dataset into training and testing subsets. The training subset is used to build the actual model, while the testing is used to check its overall performance on new data. Generally R or Python provide a richer environment than Excel for conducting machine […]

The post How to create training and testing data samples in Excel first appeared on Stringfest Analytics.

]]>
A common practice in machine learning and predictive analytics is to split a dataset into training and testing subsets. The training subset is used to build the actual model, while the testing subset is used to check its performance on new data.

Generally R or Python provide a richer environment than Excel for conducting machine learning, but as I write in Advancing into Analytics, there can be real gains to using spreadsheets as a prototyping and learning tool.

In that spirit, let’s practice splitting a housing prices dataset in Excel.

 

Creating a split column

Many methods exist for splitting a dataset, but the most basic is an entirely random shuffle. We can do that with the help of Excel’s RAND() function. Go ahead and create this column in your housing table. Important: because RAND() recalculates every time the workbook changes, be sure to copy and paste the column as values once you’ve created it; otherwise you’ll run into issues when splitting the datasets.

Copy and paste the splitter column

Creating the split

Typical train-test splits include 70/30 and 75/25. To avoid hard-coding any of these numbers in the workbook, create a split worksheet where the user can key in any ratio. The training ratio will be used to drive later calculations, so it’s a good idea to name this cell split:

Training and testing input cells

 

Next, we’ll use dynamic arrays to create the training and testing subsets.

Creating the training and testing sets

Create a new worksheet called training in your workbook. This is where the training observations (roughly 70% of the data) will go. Ideally we should create a copy of these values while leaving the original intact. The FILTER() function lets us do exactly that.

We are going to filter the housing dataset to include only the records where RAND() is less than or equal to .7. Because RAND() is a uniform or evenly distributed number generator, we should expect about 70% of observations to fall in this range.

Here’s the formula:

=FILTER(housing, housing[rand]<=split)

The FILTER() function does not carry through the column headers from the reference dataset, so go ahead and add those with the formula:

=housing[#Headers]

Looking good! We’ll double-check our observation counts soon. For now, do the same with a testing worksheet, except use the inverse formula:

=FILTER(housing, housing[rand]>split)

Checking our work

Great work splitting your data into training and testing subsets using Excel! One downside to this approach is that RAND() generates different numbers every time, so the results in your workbook will differ from mine. In programs like R and Python, this can be fixed by setting a random seed.
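Here is the same random-cutoff idea in Python, where a fixed seed makes the split reproducible, something the RAND() approach can’t offer. This is a sketch, not a library function; row count 546 matches the housing dataset used here:

```python
import random

def train_test_split(rows, train_ratio=0.7, seed=42):
    """Each row lands in training with probability train_ratio, like RAND() <= .7."""
    rng = random.Random(seed)          # fixed seed -> identical split on every run
    train, test = [], []
    for row in rows:
        (train if rng.random() <= train_ratio else test).append(row)
    return train, test

rows = list(range(546))                # stand-in for the 546 housing records
train, test = train_test_split(rows)
print(len(train), len(test))           # roughly a 70/30 split, same every run
```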

We’ll overlook it for now and simply double-check that the number of rows in the training and testing datasets are approximately 70/30. To do that, I used the COUNT() function to tally up how many ID numbers appear in column A of each worksheet. (Keep in mind that COUNT() does not include text cells, so the column headers are excluded.)

It does indeed look as if the data has been split approximately 70/30. The split isn’t exact: RAND() only falls below .7 about 70% of the time, and 546 rows can’t divide evenly into a 70/30 split anyway.

Check work in train and test datasets in Excel

 

Not perfect, but a start

The FILTER() function makes splitting an Excel dataset relatively straightforward. It’s not the most elegant solution, but it beats what we had to do before dynamic arrays. From here, you can build and evaluate models using tools like the XLMiner add-in, right from Excel.

Have you needed to randomly split your dataset in Excel for machine learning or other reasons? How did you do it? What questions do you have about machine learning in Excel? Let me know in the comments.


]]>