Stringfest Analytics: Analytics & AI for Modern Excel

Python in Excel: How to do feature engineering with Copilot
https://stringfestanalytics.com/python-in-excel-how-to-do-feature-engineering-with-copilot/
Mon, 25 Aug 2025
Feature engineering involves transforming raw data into meaningful variables to enhance machine learning models. It includes creating new features, encoding categorical data, and scaling or normalizing values—tasks that significantly boost model accuracy and insights. This practice naturally overlaps with data cleaning, as both involve handling missing values, inconsistencies, and outliers. Feature engineering also aligns with exploratory data analysis (EDA), since insights from EDA often guide effective feature creation.

For Excel users, mastering feature engineering expands the potential for more sophisticated analysis and predictive modeling. While Excel’s Power Query provides powerful data transformation tools, certain advanced tasks, especially those involving complex statistical rules or extensive group-wise transformations, are much easier and more efficient using Python and Copilot, enabling streamlined and scalable operations beyond standard Excel workflows.

In this post, you’ll explore advanced feature engineering techniques with the Palmer Penguins dataset. Download the exercise file below to follow along:

 

Dummy-coding categorical variables

Categorical variables like species or island are common in datasets but can’t be directly interpreted by most predictive models. Typically, these categories need conversion into numeric dummy variables, representing each distinct category as its own binary column. We began our analysis using the following prompt:

“Convert the categorical variables species, island, and sex to numerical dummy variables suitable for modeling.”

Copilot dummy coding

Copilot quickly generated dummy variables, clearly converting each categorical variable into binary columns for each category (e.g., species_Adelie, island_Biscoe, sex_Male). This numerical format allows our data to be effectively used by predictive models and machine learning algorithms.
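Under the hood, this step usually comes down to a single pandas call. Here's a minimal sketch of the kind of code Copilot tends to generate, using a few made-up penguin rows as a stand-in for the worksheet table (in Python in Excel, the data would come from the `xl()` function instead):

```python
import pandas as pd

# Toy stand-in for the penguins table; in Python in Excel this would be
# something like xl("Table1[#All]", headers=True)
df = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Adelie"],
    "island": ["Biscoe", "Biscoe", "Torgersen"],
    "sex": ["Male", "Female", "Male"],
})

# One binary column per category, e.g. species_Adelie, island_Biscoe, sex_Male
dummies = pd.get_dummies(df, columns=["species", "island", "sex"], dtype=int)
print(dummies.columns.tolist())
```

`get_dummies` drops the original text columns and appends one 0/1 column per category, which is exactly the format most modeling libraries expect.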

Binning quantitative variables

Sometimes numeric values like body mass provide more insight when grouped into meaningful categories rather than considered individually. To efficiently address this, we gave Copilot the following prompt:

“Bin body_mass_g into three categories (light, medium, heavy) based on quantiles, and add this as a categorical feature.”

Quantile binning Copilot

Copilot quickly responded by categorizing penguin body mass into three intuitive groups—light, medium, and heavy—based on the underlying quantiles of the data. The output clearly indicates the category each penguin falls into, turning numeric complexity into easy-to-understand categorical labels.

By creating quantile-based categories, analysts can quickly identify patterns, make clearer comparisons across groups, and feed simplified, meaningful variables into predictive models—all achieved effortlessly through Python and Copilot integration directly within Excel.
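The quantile binning itself is typically a one-liner with pandas' `qcut`. A small sketch with made-up body masses:

```python
import pandas as pd

body_mass = pd.Series([3250, 3800, 4200, 4675, 5400, 6000], name="body_mass_g")

# q=3 splits at the tertiles, so each bin holds roughly a third of the penguins
mass_category = pd.qcut(body_mass, q=3, labels=["light", "medium", "heavy"])
print(mass_category.tolist())
```

Because the cut points are quantiles rather than fixed thresholds, the bins adapt automatically to however the data is distributed.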

Creating group-wise statistical measures

Numeric summaries grouped by categories, such as median values by species, are frequently essential features for deeper data insights. To efficiently create these features, we provided Copilot with this prompt:

“Calculate the median body mass by species and add this as a new numeric feature to each row of the dataset.”

Median body mass by species

Copilot quickly calculated the median body mass for each penguin species and added these values directly into each row as a new numeric feature. This new column allows us to easily identify how an individual penguin compares against the typical body mass of its species, highlighting significant deviations or confirming typical measurements.
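The pandas idiom behind this is `groupby` plus `transform`, which computes one statistic per group and broadcasts it back onto every row. A sketch with toy values:

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"],
    "body_mass_g": [3700, 3900, 5000, 5200, 5600],
})

# transform() returns a value for every row, unlike agg(), which returns one per group
df["species_median_mass"] = df.groupby("species")["body_mass_g"].transform("median")
print(df)
```

With the group median sitting on each row, comparing an individual penguin against its species' typical mass becomes simple column arithmetic.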

Standardizing variables with Z-score scaling

Numeric features in datasets often vary greatly in scale, making them challenging to compare directly. For instance, a penguin’s body mass (measured in grams) naturally has a larger numeric scale than its bill length (measured in millimeters). This discrepancy can distort analyses, especially when performing predictive modeling or clustering.

To address this issue clearly and efficiently, we gave Copilot the following prompt:

“Apply standardization (Z-score scaling) to the numeric columns bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g using Python.”

Copilot standardized z-scores

Copilot quickly applied Z-score scaling to our selected numeric features. This transformation converts each numeric value into a standardized score representing how many standard deviations it is from the feature’s average. A standardized value near 0 indicates the measurement is close to the average, while positive or negative values reflect deviations above or below the average, respectively.

Standardization makes these numeric features directly comparable, allowing each to contribute equally and meaningfully to subsequent analyses.
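Copilot may reach for scikit-learn's `StandardScaler` here or just plain pandas arithmetic; both implement the same formula, (value minus column mean) divided by column standard deviation. The pandas version, on toy values:

```python
import pandas as pd

df = pd.DataFrame({
    "bill_length_mm": [39.1, 46.5, 50.0],
    "body_mass_g": [3750, 4500, 5700],
})

cols = ["bill_length_mm", "body_mass_g"]
# Z-score each column: subtract its mean, divide by its standard deviation
df[cols] = (df[cols] - df[cols].mean()) / df[cols].std()
print(df.round(2))
```

After this transformation every column has mean 0 and standard deviation 1, so grams and millimeters land on the same scale.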

Advanced missing value imputation

Dealing with missing values is a common challenge in data analysis. Excel and Power Query offer basic options for handling missing data, but these tools fall short when it comes to advanced predictive imputation, especially when we want to use similar, complete observations to estimate missing values.

To effectively overcome this limitation, we provided Copilot with the following prompt:

“Perform advanced imputation by predicting missing values in each row based on the most similar complete observations in the dataset.”

KNN imputation

Copilot quickly applied the K-Nearest Neighbors (KNN) imputation method, filling in missing numeric values by finding the most similar penguins in the dataset and using their known measurements to predict missing data. The resulting dataset now has complete observations for each numeric feature, providing a robust foundation for further modeling and analysis.
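In code, this maps to scikit-learn's `KNNImputer`. A minimal sketch with made-up measurements, where one bill length is missing:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "bill_length_mm": [39.1, 39.5, np.nan, 4.5 + 42.0],  # third row is missing
    "body_mass_g": [3750, 3800, 3700, 4500],
})
df.loc[3, "bill_length_mm"] = 46.5  # overwrite the placeholder with a plain value

# Each missing value is filled with the average of its k most similar rows,
# where similarity is measured on the features that ARE present
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(filled)
```

The penguin with the missing bill length has a body mass closest to the first two rows, so its imputed value is their average, 39.3.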

Check out this post for a little more on k-nearest neighbors with Copilot, Python, and Excel.

Conclusion

Feature engineering is foundational for achieving robust, predictive analytics. By embracing Python and Copilot alongside Excel, you’re positioned to execute sophisticated transformations that previously required significant manual effort or were simply unattainable with basic Excel or Power Query techniques alone.

Yet, while Python-powered feature engineering significantly expands your analytical capabilities, it’s essential to remain mindful of potential limitations. Advanced methods require clear understanding and interpretation to avoid unintended biases or misrepresentations in your data. Additionally, ensuring seamless integration between Python-driven transformations and traditional Excel workflows will require thoughtful structuring and documentation.

It is crucial to deeply engage with your data through exploratory and visual methods to verify that the engineered features truly address your analytical needs. Feature engineering inherently involves substantial trial and error. Although Copilot significantly accelerates experimentation and automates repetitive tasks, its effectiveness hinges upon clear and accurate guidance. Without properly defined parameters, Copilot might inadvertently speed up the creation of irrelevant or misleading features.

Moving forward, consider experimenting with these techniques on your own datasets to identify unique opportunities for improvement. Challenge yourself to combine the intuitive visualizations of Excel with the scalable computational power of Python, leveraging Copilot’s AI assistance to bridge gaps and streamline your processes. As your comfort grows, you’ll uncover new strategies to elevate your analyses, enabling more impactful and data-driven decision-making.

The post Python in Excel: How to do feature engineering with Copilot first appeared on Stringfest Analytics.

Python in Excel: How to do k-nearest neighbors with Copilot
https://stringfestanalytics.com/python-in-excel-how-to-do-k-nearest-neighbors-with-copilot/
Sun, 24 Aug 2025
K-Nearest Neighbors (KNN) is a straightforward and powerful analytical tool that helps you make predictions by looking at the most similar cases or “neighbors” in your data. It’s intuitive because it mirrors how we often make real-life decisions: by finding similar past scenarios and expecting similar outcomes.

Historically, though, Excel users faced significant challenges when attempting to implement KNN, as Excel has no built-in functionality for this method. Analysts had to rely either on external software or complex, error-prone spreadsheet formulas.

Fortunately, this hurdle no longer exists. With Python integration and Copilot’s Advanced Analysis directly inside Excel, running KNN models has never been easier. You can now use Python’s predictive power effortlessly within Excel, guided clearly through each step by Copilot.

In this post, we’ll use Python and Copilot in Excel to demonstrate how analysts can quickly build and interpret KNN regression models. Using the popular Auto MPG dataset, we’ll predict car fuel efficiency based on attributes like horsepower, weight, and acceleration. Download the exercise file below to follow along:

 

Setting the business context

We begin our analysis with the following prompt to ensure a clear understanding of our data and why we’re using KNN:

“Briefly summarize what this dataset is about and explain how using K-Nearest Neighbors could help an analyst predict car fuel efficiency (MPG).”

Copilot responds with a concise summary of the dataset and clearly articulates the value of KNN for the business problem at hand.

Preparing your data for KNN

Because KNN relies on measuring the distance between data points, it’s essential that your features are similarly scaled and clean of missing values or extreme outliers. Use the following Copilot prompt to handle this step:

“Check the dataset for missing values and extreme outliers in horsepower, weight, and acceleration. Standardize these features so they have comparable scales. Briefly explain why this step matters for KNN.”

K-nearest neighbors data prep

Copilot confirmed that there are no missing values in any of our chosen features. This is good news: it means every data point is valid and usable for analysis. However, Copilot did identify several extreme outliers: 10 in horsepower and 11 in acceleration. Extreme values like these can significantly impact the accuracy of KNN predictions, as they distort the distance calculations between data points.

Copilot’s output also explains why standardization is critical for our KNN model. Because KNN works by measuring the similarity between observations using distance metrics, features measured on different scales can bias these calculations. Without standardization, larger numeric scales (like weight) would overshadow smaller ones (like acceleration), even if they’re equally important. By standardizing our features, we ensure that each contributes equally when determining similarities and predicting fuel efficiency.
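If you want to see the moving parts, here's a sketch of the kind of code this prompt tends to produce: an IQR-based outlier flag plus scikit-learn standardization, run on a few made-up rows (Copilot may use a different outlier rule, such as z-score thresholds):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "horsepower": [95.0, 90.0, 105.0, 230.0],   # 230 is a clear outlier here
    "weight": [2372, 2264, 2408, 4354],
    "acceleration": [15.0, 15.5, 14.0, 8.5],
})

# Flag outliers with the 1.5 * IQR rule, column by column
q1, q3 = df.quantile(0.25), df.quantile(0.75)
iqr = q3 - q1
outliers = (df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)
print(outliers.sum())

# Put all three features on a comparable scale before running KNN
scaled = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)
```

`StandardScaler` guarantees each column ends up with mean 0 and unit variance, which is exactly what the distance metric needs.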

Building the KNN model

Next, we’re ready to build our KNN model and start making predictions. To do this, we asked Copilot the following prompt:

“Build a basic KNN regression model to predict MPG using horsepower, weight, and acceleration. Briefly explain how this model uses these features to make predictions.”

KNN regression model

Copilot responded clearly, building a KNN regression model that predicts fuel efficiency (MPG) based on three important car attributes: horsepower, weight, and acceleration. The way KNN regression works here is quite intuitive: when we want to predict MPG for a new car, the model searches the dataset for cars with similar horsepower, weight, and acceleration… the “nearest neighbors.” It then averages the actual MPG of these nearest neighbors to estimate the MPG of our new car.

In this case, Copilot specifically selected five similar cars to make each prediction. It also provided us with a measure called mean squared error (MSE), which came out to about 16.26. The MSE tells us how close, on average, our predictions are to the actual MPG values. A lower MSE means predictions are generally closer to reality, while a higher MSE means there’s more room for improvement.

You might be wondering why we’re calling this “KNN regression.” That’s because KNN can be used in two main ways: regression and classification. In our scenario, we’re performing regression since we’re predicting a numeric outcome (MPG). Alternatively, KNN classification predicts categorical outcomes, such as whether a customer churns (“yes” or “no”) or identifying a product category.

If you’re curious or feel unclear about these concepts, feel free to ask Copilot to further clarify the difference between KNN regression and classification… it’s always ready to help you build confidence and understanding as you continue your analytics journey!
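For reference, the model-building step above typically boils down to a few lines of scikit-learn. Here's a self-contained sketch on a toy stand-in for the Auto MPG table (the column name `mpg` and the values are illustrative):

```python
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "horsepower": [130, 165, 150, 95, 88, 90, 113, 200],
    "weight": [3504, 3693, 3436, 2372, 2130, 2264, 2228, 4354],
    "acceleration": [12.0, 11.5, 11.0, 15.0, 14.5, 15.5, 12.5, 8.5],
    "mpg": [18, 15, 18, 24, 29, 27, 25, 10],
})

X = StandardScaler().fit_transform(df[["horsepower", "weight", "acceleration"]])
y = df["mpg"]

# Each prediction is the average MPG of the 5 most similar cars
model = KNeighborsRegressor(n_neighbors=5).fit(X, y)
preds = model.predict(X)
mse = mean_squared_error(y, preds)
print(round(mse, 2))
```

Swapping `KNeighborsRegressor` for `KNeighborsClassifier` is all it takes to move from predicting a numeric outcome to predicting a category.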

Evaluating the KNN model

Next, we want to be sure our KNN regression model genuinely works… not just on cars it’s already seen but also on new, unseen examples. To accomplish this, we gave Copilot the following prompt:

“Split the data into training and testing sets (80/20). Train the KNN model on the training set and evaluate its prediction accuracy on the testing set. Explain briefly why this testing step is important.”

Copilot split the data, trained the KNN regression model on the training set, and evaluated predictions on the test set, resulting in a mean squared error (MSE) of approximately 16.26. This testing step is crucial because it ensures our model reliably predicts MPG for new cars it hasn’t encountered before, rather than just memorizing data it’s already seen.

Notably, this MSE of about 16.26 matches our earlier result, suggesting the model is stable and consistently reliable across different subsets of data.
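The split-train-evaluate pattern itself looks roughly like this; the sketch below uses synthetic data as a stand-in for the standardized Auto MPG features:

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))  # stand-in for standardized hp / weight / accel
y = 25 - 3 * X[:, 0] - 4 * X[:, 1] + rng.normal(scale=1.0, size=200)

# Hold out 20% of the rows; the model never sees them during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)
mse = mean_squared_error(y_test, model.predict(X_test))
print(round(mse, 2))
```

Because the test-set MSE is computed only on held-out rows, it estimates how the model will behave on genuinely new cars rather than on memorized ones.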

Visualizing KNN predictions

Next, we want a clear visual sense of how well our KNN regression model predicts MPG compared to actual values. We used the following prompt:

“Create a scatter plot showing actual versus predicted MPG. Briefly interpret the visualization. Does the model predict accurately across all ranges?”

KNN scatterplot

Looking at our visualization, we see most points cluster closely around the red dashed line, indicating that overall, our KNN model predicts fairly accurately. However, we notice some variation, particularly at the higher MPG range, where predictions deviate a bit further from the perfect prediction line.

This pattern suggests our model does better at predicting mid-range MPG values, while slightly struggling with extreme cases. Understanding these strengths and limitations helps us trust the model’s insights appropriately and identify where improvements might be beneficial.
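The plot itself is a standard matplotlib scatter with a diagonal reference line. A sketch with illustrative actual/predicted pairs (in practice these arrays come from the fitted model):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; Python in Excel shows plots inline
import matplotlib.pyplot as plt
import numpy as np

y_test = np.array([18, 24, 30, 15, 34])   # illustrative actual MPG values
preds = np.array([19, 22, 27, 16, 30])    # illustrative predictions

fig, ax = plt.subplots()
ax.scatter(y_test, preds)
lims = [y_test.min(), y_test.max()]
ax.plot(lims, lims, "r--")  # the perfect-prediction line: predicted == actual
ax.set_xlabel("Actual MPG")
ax.set_ylabel("Predicted MPG")
```

Points on the dashed line are perfect predictions; vertical distance from the line is the prediction error for that car, which makes under- and over-prediction easy to spot by eye.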

Optimizing our model

Lastly, we wanted to fine-tune our KNN model by determining the optimal number of neighbors (k) to use. For this, we gave Copilot the following prompt:

“Run the KNN model again using 3, 5, and 7 neighbors. Identify which number of neighbors (k) produces the lowest prediction error and briefly explain why choosing the right number of neighbors matters.”

Copilot KNN N comparison

Copilot tested our KNN model using three different neighbor counts (3, 5, and 7) and returned the clear table shown in our screenshot. From this table, we see that the mean squared error (MSE) is lowest—about 15.72—with 7 neighbors. However, the MSE at 5 neighbors is very similar (around 16.26), and we’ve already seen consistent, stable results at this number.

While the error with 7 neighbors is slightly lower, sticking with 5 neighbors makes practical sense. A model that’s slightly simpler (fewer neighbors) tends to be easier to interpret and faster to calculate. Given that the difference in accuracy is minimal, choosing 5 neighbors offers a good balance of simplicity, interpretability, and prediction accuracy, ensuring stable and reliable results without unnecessary complexity.
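The comparison is a simple loop over candidate k values, refitting and rescoring each time. A sketch on synthetic stand-in data:

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # stand-in for the standardized features
y = 25 - 3 * X[:, 0] - 4 * X[:, 1] + rng.normal(scale=1.0, size=200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit once per candidate k and record the test-set error
results = {}
for k in (3, 5, 7):
    model = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    results[k] = mean_squared_error(y_test, model.predict(X_test))

best_k = min(results, key=results.get)
print(results, best_k)
```

Too few neighbors makes predictions noisy (each one depends on a handful of cars); too many smooths away real local structure. The loop makes that trade-off visible as numbers.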

Understanding predictions for better decisions

Finally, we asked Copilot to provide an intuitive example to clearly understand how our KNN model makes predictions:

“Select one vehicle from the test dataset and identify its three closest neighbors used in the prediction. Briefly explain how the similarity between these vehicles (in terms of horsepower, weight, and acceleration) helps us intuitively understand why the model predicted its fuel efficiency (MPG) as it did.”

KNN predictions example

Copilot responded by selecting one vehicle from the test set with a horsepower of 69, weight of 2189, acceleration of 18, and actual MPG of 26. It identified the three most similar cars from the training dataset, each with very comparable horsepower (65-67), similar weights (2065-2145), and nearly identical acceleration values (17.8-18). These vehicles had actual MPG values of 33.8, 46.6, and 32.3.

Because these vehicles are so closely matched in their key features, Copilot averages their MPG to make a prediction. This similarity-based reasoning makes KNN highly intuitive. By clearly seeing how similar cars performed, we gain trust in the model’s prediction. It’s essentially making recommendations based on how similar situations turned out in the past. This practical, intuitive insight makes KNN regression especially useful for analysts who value clear interpretability in their models.
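You can reproduce this kind of inspection yourself with the model's `kneighbors` method. The sketch below builds a tiny training set loosely matching the numbers in the example above (feature values are approximated, and the fourth row is a deliberately dissimilar car):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# horsepower, weight, acceleration (approximated from the example above)
X_train = np.array([[65, 2065, 18.0],
                    [67, 2145, 17.8],
                    [66, 2100, 18.0],
                    [150, 4000, 10.0]], dtype=float)
y_train = np.array([33.8, 46.6, 32.3, 12.0])  # MPG of each training car

model = KNeighborsRegressor(n_neighbors=3).fit(X_train, y_train)

query = np.array([[69, 2189, 18.0]])           # the test vehicle
dist, idx = model.kneighbors(query)            # its 3 most similar cars
pred = model.predict(query)                    # mean MPG of those neighbors
print(idx[0], pred)
```

Note that in real use the features should be standardized first (as earlier in the post); on raw values, weight dominates the distance simply because its numbers are largest.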

Conclusion

K-Nearest Neighbors models offer Excel analysts a simple yet powerful tool for predictive analysis, bridging intuitive decision-making with data-driven insights. Thanks to Python and Copilot integration, building and interpreting these models is now accessible directly within Excel. This empowers analysts in marketing, finance, and operations to confidently predict outcomes, test business scenarios, and provide actionable recommendations—all without extensive coding or statistical training.

As with any analytical method, KNN’s effectiveness hinges on data quality and thoughtful parameter tuning. Regularly revisiting your model with new data and adjusting your approach ensures your insights remain practical and impactful.

Looking forward, analysts can explore other predictive modeling methods, comparing results with KNN to reinforce conclusions. The combination of your domain knowledge with these intuitive predictive tools ensures your recommendations remain insightful, actionable, and aligned with real-world business objectives.

The post Python in Excel: How to do k-nearest neighbors with Copilot first appeared on Stringfest Analytics.

Python in Excel: How to transform text with Copilot
https://stringfestanalytics.com/transformtextcopilot/
Fri, 25 Jul 2025
Text transformation and analysis have never exactly been Excel’s strong suit. Sure, native text functions have improved significantly in recent years with new parsing and even regular expression functions, but in my opinion the real game-changer has been the integration of Python in Excel.

Plus, Python’s extensive history with text manipulation, and the sheer volume of tasks it’s been used for, mean there’s far more training data available for Copilot to leverage. This, in my experience at least, translates to Copilot understanding text-related prompts even better than pure Excel tasks. With this combination of Python’s power and Copilot’s clarity, tasks that would have been complex, cumbersome, or outright impossible using Excel alone now become surprisingly straightforward… assuming you clearly know what you’re after and can express it effectively in your prompts.

Let’s dive into some of these use cases for quick wins in Python-powered text transformations right within Excel. You can download the exercise file below:

 

If you haven’t explored the Advanced Analysis features of Python in Excel yet, take a look at this post:

Fuzzy matching

We have a small set of customer reviews, and one useful step here would be to standardize them for spelling and consistency. This scenario comes up frequently: for example, with customer names, product titles, or vendor lists. Imagine you have a messy dataset of manually-entered customer names, filled with typos or variants like “General Electric,” “Gen Electric,” and “GE Corp.” You’d want to match these against a clean, authoritative dataset as your “source of truth.”

Fuzzy matching helps identify approximate matches and standardize or replace those messy entries, improving consistency, accuracy, and data integrity for your analysis.

Luckily, Python in Excel includes a fuzzy matching package called TheFuzz. You might be familiar with fuzzy matching if you’ve used it in Power Query, but unfortunately, Power Query isn’t currently very Copilot-friendly. Python in Excel, however, integrates seamlessly with Copilot, making it a great alternative for this task:

Correct spelling errors in the customer_feedback column. Use the TheFuzz library to match words against a list of correct words in the correct_words table. Replace misspelled words with the closest match if the similarity score is above 80 using fuzz.ratio. Create a new column corrected_feedback with the corrected text.

At the moment, Advanced Analysis can’t directly reference tables other than the one you’re analyzing, so you’ll likely need to copy and paste the values from your correct_words table. From there, Copilot can help turn that pasted data into a working DataFrame and handle the corrections. Once done, you’ll have a corrected_feedback column that Copilot should intelligently use as the “source of truth” going forward.

Just keep in mind: fuzzy matching is probabilistic, so it might not always align perfectly with your judgment!

Corrected feedback
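To see the mechanics without installing anything, note that `fuzz.ratio` is built on the same idea as the standard library's `difflib.SequenceMatcher`, so the sketch below uses that as a stand-in (the word lists and feedback string are made up):

```python
from difflib import SequenceMatcher

# Stand-in for TheFuzz's fuzz.ratio: a 0-100 similarity score
def ratio(a: str, b: str) -> int:
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())

correct_words = ["great", "service", "product", "delivery"]  # the "source of truth"

def correct_word(word: str, threshold: int = 80) -> str:
    best = max(correct_words, key=lambda w: ratio(word, w))
    return best if ratio(word, best) > threshold else word

feedback = "grate servise and fast delivary"
corrected = " ".join(correct_word(w) for w in feedback.split())
print(corrected)
```

One instructive detail: "grate" scores exactly 80 against "great", so a strict "above 80" cutoff leaves it uncorrected. That is the same kind of near-miss that shows up later in this post's bar plot, and a good reminder that the threshold itself is a judgment call.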

Sentiment analysis

Next, we’ll ask Copilot to perform sentiment analysis on the corrected_feedback column, classifying each comment as positive, negative, or neutral:

Perform sentiment analysis on the corrected_feedback column. Classify each comment as positive, negative, or neutral.

Sentiment analysis is valuable because it helps quickly identify customer perceptions and attitudes at scale, allowing businesses to pinpoint areas of success or concern.

While Excel previously offered sentiment analysis through a free add-in, one downside was a lack of transparency: it was difficult to audit or understand precisely how Excel arrived at each sentiment classification.

With Python in Excel, the exact Python code used is generated directly by Copilot, making the analysis transparent, reproducible, and easy to audit or adjust as needed.

Sentiment analysis Copilot
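In practice Copilot typically reaches for a library such as NLTK's VADER for this, but the underlying idea can be shown with a tiny hand-rolled lexicon: score the words, then classify the total. The word lists here are illustrative only:

```python
# Minimal lexicon-based sentiment sketch (toy word lists, not a real lexicon)
positive = {"great", "fast", "love", "excellent", "happy"}
negative = {"slow", "broken", "poor", "terrible", "disappointed"}

def classify(comment: str) -> str:
    words = comment.lower().split()
    score = sum(w in positive for w in words) - sum(w in negative for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(classify("great product, fast delivery"))
```

A real sentiment model handles negation, intensity, and punctuation far better, but the transparency benefit is the same: you can read exactly how each label was produced.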

Keyword extraction

Next, we’ll instruct Copilot to pull out important keywords from each comment in the corrected_feedback column, removing common, less meaningful words known as “stopwords” (such as “the,” “is,” “and,” etc.). The resulting keywords will then be placed in a new column called keywords, separated by commas.

Extract keywords from the corrected_feedback column by removing stopwords and output a new column keywords. Separate the keywords by comma.

Keyword extraction is valuable because it distills longer, free-form comments into concise, structured insights. This makes it easier to quickly identify common themes, emerging trends, and frequently mentioned topics, enabling businesses to rapidly understand customer sentiment, issues, or areas of praise without manually reading every review.

Keyword extraction Copilot
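The extraction step reduces to tokenizing each comment and filtering out stopwords. A sketch with a deliberately tiny stopword set (Copilot usually pulls a fuller list from a library such as NLTK):

```python
import re

# Tiny illustrative stopword set; real lists run to a few hundred words
stopwords = {"the", "is", "and", "a", "was", "it", "very", "to", "i"}

def extract_keywords(comment: str) -> str:
    words = re.findall(r"[a-z']+", comment.lower())  # tokenize, drop punctuation
    return ", ".join(w for w in words if w not in stopwords)

print(extract_keywords("The delivery was fast and the product is great"))
```

The output is exactly the comma-separated format the prompt asks for, ready to land in a `keywords` column.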

Word clouds and frequency visualizations

Next, we’ll instruct Copilot to generate a word cloud from all the feedback text, excluding stopwords. A word cloud visually highlights frequently mentioned words by displaying them in sizes proportional to their frequency. Larger words are mentioned more often.

Create a word cloud from all feedback text, excluding stopwords.

Word cloud customer feedback

Word clouds can be appealing due to their visual impact and ease of interpretation, quickly revealing dominant themes. However, they’re not ideal for precise comparisons or clear numerical insights, as differences in word sizes can be ambiguous. This is why you might prefer a simpler visualization like a bar plot, as prompted here:

Create a bar plot of the most common words in the feedback text, excluding stopwords.

Most common words

When examining the resulting bar plot, you might notice a typo… did you see “grate”? It seems this was intended to be “great,” but fuzzy matching missed it. This highlights the importance of verifying results after automated transformations.
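The frequency counting behind both visuals is the same: tokenize, drop stopwords, count. A sketch with made-up feedback lines, using the standard library's `Counter` (the resulting pairs feed straight into a bar plot, e.g. matplotlib's `ax.bar`):

```python
from collections import Counter

stopwords = {"the", "is", "and", "a", "was", "it"}  # illustrative subset
feedback = [
    "the delivery was fast and the product is grate",
    "grate service and fast delivery",
]

# Flatten all comments into one word list, minus stopwords, then tally
words = [w for text in feedback for w in text.split() if w not in stopwords]
counts = Counter(words).most_common(5)
print(counts)  # (word, frequency) pairs, ready for a bar plot
```

Notice that the uncorrected "grate" shows up in the tally just as it did in the bar plot above, which is why a quick frequency check is a handy way to audit the earlier fuzzy matching step.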

Recap & conclusion

From here, you could consider experimenting further by refining the fuzzy matching threshold to catch more typos, applying additional text preprocessing like stemming or lemmatization for consistency, or exploring more advanced sentiment models tailored specifically to your data or business context.

In this post, you’ve seen how Python in Excel, paired with Copilot, makes advanced text transformations such as fuzzy matching, sentiment analysis, keyword extraction, and visualization accessible and transparent… tasks that used to be difficult, cumbersome, or opaque in Excel alone.

What questions do you have about text transformations with Python in Excel or Python in Excel more broadly? Leave them in the comments below.

The post Python in Excel: How to transform text with Copilot first appeared on Stringfest Analytics.
