Python in Excel: How to create regular expressions with Copilot
Stringfest Analytics — Tue, 18 Mar 2025
https://stringfestanalytics.com/python-in-excel-how-to-create-regular-expressions-with-copilot/

After years of being unable to work with regular expressions, Excel has finally introduced a set of convenient and useful functions for doing so:

However, around the same time, regular expressions also became accessible in Excel as a result of its integration with Python:

The addition of regex support in both Excel functions and Python integration demonstrates the broad relevance of regular expressions across various software environments, emphasizing their versatile role in tackling a shared array of text-processing tasks.

Yet, to echo an old programmer’s quip—having to rely on regex to solve a problem in the first place might just mean you’ve already got a problem to deal with!

As this saying suggests, although regex is a capable solution, it often brings a trade-off of increased complexity or unforeseen snags.

Keeping the trickiness of crafting regex in mind, I’d argue that for most Excel users looking to dabble in basic regex tasks, Python is actually the more approachable option compared to Excel’s built-in functions. This is largely because when you ask Copilot for regex assistance, it spits out Python code rather than Excel function syntax (Copilot just seems to play nicer with Python overall).

Let’s put it to the test with the download file below, which contains some basic text info we’d like to clean up and extract data from based on text patterns (literally regex’s bread and butter!):

 

If you’ve never tried Copilot with Python, such as in Excel’s Advanced Analysis, check out this post to get the basics and start using it:

Let’s kick things off by using natural language to whip up some regex with Python in Excel! I’ll begin by asking for valid email addresses. I’ll tell it where to put the results, switch things up a bit by asking for just the domain instead of the full email addresses, and add that the row should stay blank if no match is found:

Create a new column with the domains of valid email addresses from the Contact Info column. If an entry isn’t a valid email, leave that row blank.

Copilot will now work its magic. Keep in mind this is probabilistic, so your results might look a bit different from mine. You might need to tweak the prompt or adjust things, but in the end, you should see those domains pop up in a new column as asked.

Formula created by Python Copilot for RegEx

The cool part? The code it generates is right there for you to check out and reuse! You can dig into it by looking at the code cell in A42 here, which holds the resulting DataFrame (rows 44-50 give you a preview of the output stored there). If you want more room to explore this generated code, head up to the Formulas tab on the ribbon, click Python, and hit Editor. This will launch a handy text editor where you can view and edit the code.

It’s pretty wild that Copilot built this whole pattern on its own, defined a function, and applied it to your dataset to make a new column… all by itself. Don’t sweat it if you’re not sure you could’ve written out this code as fast as Copilot did… that’s kind of the whole point of generative AI! But if you’ve never seen a function defined in Python or the apply() method used in pandas, this might be a good time to brush up on some Python basics first; otherwise you’ll be hard-pressed to tweak, explain, and understand this code when push comes to shove.
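To make the shape of that generated code concrete, here is a minimal standalone sketch of the same idea. To be clear, this is my own simplified version, not Copilot's actual output: the DataFrame contents are hypothetical stand-ins for the download file, and the email pattern is deliberately basic.

```python
import re

import pandas as pd

# Hypothetical data resembling the Contact Info column from the exercise file
df = pd.DataFrame({"Contact Info": [
    "alice@example.com",
    "call me at 555-1234",
    "bob@mail.work.org",
]})

# A simple (not exhaustive) email pattern; the domain sits in a capture group
EMAIL_RE = re.compile(r"\b[\w.+-]+@([\w-]+(?:\.[\w-]+)+)\b")

def extract_domain(text):
    """Return the domain of the first valid-looking email, else an empty string."""
    match = EMAIL_RE.search(str(text))
    return match.group(1) if match else ""

# Apply the function row by row to build the new column
df["Email Domain"] = df["Contact Info"].apply(extract_domain)
```

The define-a-function-then-apply() pattern is exactly what you'll typically see in Copilot's output, which is why it's worth recognizing on sight.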

Assuming that Python code didn’t intimidate you too much, let’s jump into another example. This time, I want to search for a partial match—meaning I’m checking if a cell contains the pattern, even if there’s other content in there too. Specifically, in our data, one row of the Additional Info column has phone numbers mixed in with other text. My prompt this time will be:

Check if a Contact Info or Additional Info entry contains a valid US phone number as part of the entry. Add a new column called ‘Valid Phone’ that marks True for entries containing valid numbers and False otherwise.

Valid phone number TRUE/FALSE regex

Sometimes the difference between exact match and contains is a big deal, sometimes it’s not, and sometimes you’ll need to tweak your Copilot prompt or the generated Python code to nail what you’re after. But in this case, it’s pretty awesome that we can search for a match across two columns and whip up a true/false flag so fast.
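A contains-style check across two columns can be sketched like so. Again, this is an illustrative simplification rather than Copilot's generated code: the rows are hypothetical, and the phone pattern covers only a few common US formats.

```python
import re

import pandas as pd

# Hypothetical rows standing in for the exercise data
df = pd.DataFrame({
    "Contact Info": ["(555) 123-4567", "alice@example.com", "n/a"],
    "Additional Info": ["call anytime", "meet at the office", "cell: 555.987.6543"],
})

# A simplified US phone pattern: (xxx) xxx-xxxx, xxx-xxx-xxxx, xxx.xxx.xxxx, etc.
PHONE_RE = re.compile(r"\(?\d{3}\)?[-. ]?\d{3}[-. ]\d{4}")

def has_phone(row):
    """True if either column contains the phone pattern anywhere in its text."""
    combined = f"{row['Contact Info']} {row['Additional Info']}"
    return bool(PHONE_RE.search(combined))

# axis=1 applies the function to each row rather than each column
df["Valid Phone"] = df.apply(has_phone, axis=1)
```

Note that search() (a partial match) is what makes this a contains check; an anchored pattern with fullmatch() would give exact-match behavior instead.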

Let’s keep the good times rolling! This time, I want to throw in a bit of wiggle room with our phrase matching. Specifically, I’m after valid websites, and I want to catch them in EITHER of the two columns. Plus, I want to snag URLs whether they’ve got protocols—like the https://www part—or not. So let’s fire up another prompt to bring this to life:

Check if a Contact Info or Additional Info entry contains a valid website URL as part of the entry. Add a new column called ‘Website’ with the found URLs. Support URLs with and without protocols (e.g., https://, www.).

And just like that, we’ve got a shiny new column with the results:

Regex to check for website output

I hope you’re getting the hang of this! Let’s do one more example. Regular expressions can often be used to spot sensitive stuff like a social security number or credit card numbers. In this case, we can basically do a regex-powered find and replace to swap that data with masked data. Let’s give it a shot with the credit card numbers in this data:

Identify credit card numbers (e.g., patterns like ‘4111-1111-1111-1111’ or ‘5500 0000 0000 0004’) in the ‘Additional Info’ column and replace them with a masked version (e.g., ‘XXXX-XXXX-XXXX-1111’). Create a new column called ‘Masked Info’ with the result.

Masked credit cards output

Great work! We now have that data masked, and we could even remove the original column if we wanted as we push this downstream so it doesn’t risk exposing any personally identifiable information.
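Under the hood, this kind of masking is just a regex find-and-replace. Here is a minimal sketch of the idea, using my own simplified pattern rather than Copilot's generated code; it keeps only the last four digits of each card number.

```python
import re

# Matches 16-digit card numbers written as four groups of four digits,
# separated by hyphens or spaces (e.g. 4111-1111-1111-1111)
CARD_RE = re.compile(r"\b(?:\d{4}[- ]){3}(\d{4})\b")

def mask_cards(text):
    """Replace any card number with a masked version keeping the last four digits."""
    return CARD_RE.sub(lambda m: "XXXX-XXXX-XXXX-" + m.group(1), text)

masked = mask_cards("Card: 4111-1111-1111-1111, backup 5500 0000 0000 0004")
```

The capture group around the final four digits is what lets the replacement preserve them while everything else gets swapped for X's.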

There’s just so much you can do with regex—it’s an incredibly powerful tool, and in my view, it’s way more user-friendly to tackle with Python and Copilot than even with Excel’s slick new functions.

I’ll admit, though, sometimes Copilot doesn’t nail it perfectly, especially if you’ve got more advanced scenarios, like multiple matches in a cell where you want to grab all of them, or just the first or last one, and so forth. But that’s no big deal—you can easily get what you want by jumping to another gen AI tool like Microsoft 365 Copilot, tweaking the code, and pasting it right back into Excel. That’s why it’s so important to be comfortable and familiar with Python code: for these copy-paste tweaks, to double-check your work, and more.

What questions do you have about regex with Python, Copilot, and Excel or this power trio in general? Let me know in the comments.

The post Python in Excel: How to create regular expressions with Copilot first appeared on Stringfest Analytics.

How to understand regular expressions in Excel
Stringfest Analytics — Wed, 15 Jan 2025
https://stringfestanalytics.com/how-to-understand-regular-expressions-in-excel/

If you’ve ever encountered the challenge of sorting through data to find or manipulate specific text, you’ll know the potential headaches it can cause. Excel’s integration of regular expression (regex) functions has been a game changer, equipping users with powerful tools to search, extract, and replace text in ways that go beyond the capabilities of functions like FIND() or SEARCH() or even newer dynamic array text functions like TEXTBEFORE() and TEXTAFTER().

Regex becomes crucial when you need advanced pattern matching or more nuanced text manipulation than what simple delimiters and character positions can provide.

As powerful as learning regular expressions can be, I wouldn’t advise spending too much time trying to commit the language to memory. It really is a language of its own, with special characters for matching everything from escapes and new lines to lowercase letters and so on. In fact, regex can be so daunting that it’s the basis for this well-known “joke” among computer programmers: “Some people, when confronted with a problem, think ‘I know, I’ll use regular expressions.’ Now they have two problems.”

Given these considerations, I won’t delve deep into explaining how the regular expressions in the following examples function. Instead, I’ll rely on generative AI tools and other resources to construct the regex strings, which we’ll then apply directly in Excel.

If your regular expressions don’t work as expected at first, that’s okay; regex often involves some trial and error and iteration. That’s part of the unpredictable nature of regular expressions; they’re tough to make foolproof against every possible exception or exclusion that might come up!

We’ll explore a basic use for each function with simple examples, focusing mainly on finding the first string that matches the regex for each case. Keep in mind these functions come with additional parameters we won’t be using here, like options for case sensitivity and whether to return only the first match or all matches within the text.

To follow along, please download the exercise file below:

 

This dataset consists of a few cells that contain diverse types of information like email addresses, phone numbers, and dates. Our objective is to employ regular expressions to extract each specific piece of data. This approach is particularly useful in text mining scenarios. For instance, you might sift through medical transcriptions to pinpoint all the dosage numbers recorded or scan video transcripts to identify all mentioned websites.

Regex starting point

Testing the expression with REGEXTEST()

Let’s begin with the REGEXTEST() function. This function will consistently return TRUE or FALSE, indicating whether the text matches the specified regular expression pattern.

This function is especially useful for validating text data, for instance, verifying if an email or phone number is formatted correctly, or for spotting patterns in text strings that go beyond what simple text functions can manage.

Starting with REGEXTEST() is a smart move to initially verify if your regular expression works before delving into more specific tasks you want Excel to perform with it.

In this example, I will check if each cell contains a valid email address, returning TRUE if it does. At this point, I’m not overly concerned with crafting the regular expression itself; we can leverage various tools for that, including generative AI right within Excel using Copilot (more details on that later!).

=REGEXTEST(A2, "^\w+([\.-]?\w+)*@\w+([\.-]?\w+)*(\.\w{2,3})+$")
REGEXTEST example

Although this function might not appear particularly thrilling, it can be very useful when combined with the FILTER() function. This combination allows you to easily generate a list of all cells where at least one match is found (note that while there might be multiple matches in a cell, this setup won’t list them individually; the next functions we cover can be used for that purpose):

=FILTER(A2:A8, REGEXTEST(A2:A8, "^\w+([\.-]?\w+)*@\w+([\.-]?\w+)*(\.\w{2,3})+$") = TRUE)
FILTER regex combined

Extracting data with REGEXEXTRACT()

The REGEXEXTRACT() function in Excel is designed to pull out the first match of a specified regular expression pattern from a given text. Unlike REGEXTEST(), which only checks for the existence of a match, REGEXEXTRACT() actually returns the matching substring from the text. This makes it invaluable for extracting specific parts of text based on complex patterns, like extracting phone numbers, dates, or any identifiable sequence within larger text bodies.

In this scenario, we’ll extract valid phone numbers from the list. Note that this regex will recognize phone numbers formatted with either parentheses or periods. If we needed to be more specific, we could adjust the pattern accordingly.

We’re assuming that the entries in rows 6 and 7 are not phone numbers with international country codes but rather dates. As you can see, defining what exactly constitutes a “valid phone number” can be quite complex. Regex requires a lot of precision and really forces you to thoroughly consider the problem at hand! That’s one aspect I particularly enjoy about coding and computational thinking: the need for precision.

=REGEXEXTRACT(A2, "\(?(\d{3})\)?[-. ]?(\d{3})[-. ]?(\d{4})")
Regexextract example

One more thing to do here. I really dislike the #N/A errors that are returned here. They’re messy, confusing, and if we attempt any downstream calculations on these cells, like counting populated cells, we’d encounter more #N/A errors.

To address this, we can wrap the function in an IFERROR() to clean it up. I’m hoping future versions of this function will include a parameter for specifying what to do when no match is found, similar to XLOOKUP().
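For instance, wrapping the earlier extraction in IFERROR() so that a blank is returned when no match is found might look like this:

=IFERROR(REGEXEXTRACT(A2, "\(?(\d{3})\)?[-. ]?(\d{3})[-. ]?(\d{4})"), "")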

Regex extract function

Replacing with REGEXREPLACE()

Lastly, the REGEXREPLACE() function is designed to replace text that matches a specified regular expression pattern with another string. It scans through the provided text, identifying matches to the pattern, and substitutes each match with the designated replacement text. In this example, we’ll search for any dates within the cells and replace them with the word “Date”.

=REGEXREPLACE(A2, "\d{2}-\d{2}-\d{4}", "Date")
Regexreplace example

This approach is particularly effective for data masking. For instance, if you have Social Security numbers or other sensitive information you wish to conceal, you could employ regular expressions to mask them with a placeholder like XXX-YY-ZZZZ.

Regular expressions and Python + Copilot

I mentioned earlier in this post that regular expressions are far from new in computer programming. Most programming languages include built-in regex capabilities, and Python is no exception.

Thanks to Python’s integration with Copilot in Excel, Python is often, in my experience, the preferred method over native Excel regex functions when you ask Copilot to generate regular expressions. This approach works well, but it does require some knowledge of Python and prompt engineering. For example, in this case, I’m going to request all valid email addresses from our original data source:

This approach will filter out only the valid records, similar to what we did with FILTER() and REGEXTEST(). But what if that’s not what we wanted, and instead, we aimed to keep all records while creating a new column to extract the first valid email address from each? Simply use those exact words in your request to achieve what you need!
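For comparison, the Python equivalent of that FILTER() and REGEXTEST() combination might look like this sketch. The column name and data here are hypothetical stand-ins, and the email pattern is a simplified, anchored one.

```python
import pandas as pd

# Hypothetical column standing in for the original data source
df = pd.DataFrame({"Data": [
    "alice@example.com",
    "555-123-4567",
    "bob@work.org",
]})

# fullmatch() anchors the pattern: the entire cell must look like an email
valid = df[df["Data"].str.fullmatch(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")]
```

Filtering a DataFrame with a boolean mask like this plays the same role as FILTER() does on a spreadsheet range.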

Python Copilot REGEXEXTRACT similar

So, does this mean regular expression functions in Excel just aren’t needed or are out of date? Not really. There’s definitely still a mental barrier or stigma for many users when it comes to opting for Copilot/Python. Perhaps it’s not readable enough for some, or they might not have the necessary tools in their version of Excel. Therefore, the native regex functions in Excel are there for you to use, and they perform very well in the Excel environment.

What questions do you have about regular expressions? Can you think of any immediate use cases? What do you perceive as the trade-offs between using regex in Excel versus regex in Copilot with Python? Share your thoughts in the comments.

Resources

For additional resources and insights into regular expressions and regex functions in Excel, explore the following resources from Microsoft:

The post How to understand regular expressions in Excel first appeared on Stringfest Analytics.

Review: Excel TV’s Data Science with Power BI and R
Stringfest Analytics — Fri, 12 Oct 2018
https://stringfestanalytics.com/exceltvpowerbircoursereview/

I have had a long history with Excel TV and, like Excel TV (and Excel itself) the channel has changed over the years.

Gone are their regular live-streaming interviews with leading Excel authorities. Excel TV’s main product is now online courses. These are pre-recorded classes taught by the same caliber of talent as that of the interviews.

Excel TV’s first course on dashboards remains among my favorites, and I was thrilled to learn that in its latest course Excel TV turns its attention to the world of Big Data and the Microsoft BI ecosystem writ large.

The course, entitled “Data Science with R and Power BI,” is a well-researched course on combining these applications to deliver insight from data.

Course basics

Course instructor Ryan Wade.

This course is taught by Ryan Wade who has over 20 years of experience in business intelligence. It is delivered on Excel TV’s easy-to-use course platform; if you are a student of its other courses, it is easy to navigate between them.

Lectures are delivered via screencast video with crisp audio and visuals and include a link to download all source code. Ideally this would include source data as well to reproduce the results, but I have been able to modify the supplied code to apply to my own data.

Understanding R’s place in Microsoft’s World

Probably the biggest strength of this course is how it clearly positions R as a tool to use within the Microsoft ecosystem. This pairing should serve as no surprise to astute Microsoft watchers as Microsoft has for years maintained its own distribution of R with Microsoft R Open and for some time has made R visuals available in Power BI.

Ryan makes full use of the Microsoft BI stack from using its R distribution to using the R Tools for Visual Studio development environment to using SQL Server to store data to (obviously) Power BI to present it. There is a lot going on between these various applications and an outside primer on SQL Server and Power BI might be useful.

Pain points defined and accounted for

Every application has its strengths and weaknesses, and it appears that by incorporating R so handily into its BI stack, Microsoft has tacitly noted some places where R can fill in some gaps.

Ryan does a great job at explicitly identifying and providing examples of these “pain points.” For example, much of the course focuses on mining unstructured text data from the web and on using regular expressions to clean text, weaker points in Power BI.

The course also includes solid introductions to the popular ggplot2 package for data visualization and the dplyr package for data manipulation.

Of course, R itself is not without its problems, one of which is memory management and capacity. For this, Ryan shows how to use Microsoft’s SQL Server to overcome this pain point, and soon enough you will have integrated R with SQL Server, Power BI, and Visual Studio on your computer. This is a very sensible and well-constructed BI stack.

Meeting in the middle

 

As alluded to before, this course’s curriculum aims at getting you started in data science at the intersection of R and Power BI. I illustrate this with the above Venn diagram. What I hope to show is that this course is not best suited as an introduction to R or Power BI, but rather as an introduction to using these tools together (plus SQL Server, I would add). While the course does go into some basic data types in R, novices might have difficulty comprehending the videos and code. This holds true to a lesser extent for Power BI.

At the risk of a shameless plug, for a more comprehensive introduction to R for the Excel user, I suggest (wait for it) my own course, “R Explained for Excel Users.” Here you will get a more brass-tacks introduction to R which will leave you in a better position to tackle more advanced courses such as Excel TV’s.

On the whole, I recommend Excel TV’s Data Science with R and Power BI. The ability to construct data science applications using a combination of tools such as in this course is quite powerful and impressive, and the course does a nice job of tailoring a curriculum to this specific use case.

Ready to get started? Learn more about the class here.

The post Review: Excel TV’s Data Science with Power BI and R first appeared on Stringfest Analytics.

Writing Code to Read Quotes About Writing Code
Stringfest Analytics — Thu, 11 Oct 2018
https://stringfestanalytics.com/writecodereadquoteswritecode/

A recent project of mine has been setting up a Twitter bot on innovation quotes. I enjoy this project because in addition to curating a great set of content and growing an audience around it, I have also learned a lot about coding.

From web scraping to regular expressions to social media automation, I’ve learned a lot collecting a list of over 30,000 quotes related to innovation.

Lately I’ve been turning my attention to finding quotes about computer programming, as digital savvy is crucial to innovation today. These exercises make great blog post material, and they’re quite “meta,” too… writing code to read quotes about writing code. I will cover the first of what I hope to make a series below. For this example…

Scraping DevTopics.com’s “101 Great Computer Programming Quotes”

This is a nice set of quotes, but we can’t simply copy and paste them into a .csv file: doing so splits each quote across multiple rows, and each begins with its numeric position. I also want to eliminate the quotation marks and parentheses from these quotations, as stylistically I tend to avoid them for Twitter.

While we might despair about the orderliness of this page based on this first attempt, make no mistake: there is well-reasoned structure running under the page in its HTML, and we will need to go there instead.

Part I: Scrape

To do this I will load up the rvest package for R and SelectorGadget extension for Chrome.

I want to identify the HTML nodes which hold the quotes we want, then collect that text. To do that, I will initialize the SelectorGadget, then hover and click on the first quote.

In the bottom toolbar we see the value is set as li, a common HTML tag for items of a list.

Knowing this, we will use the html_nodes function in R to parse those nodes, then html_text to extract the text they hold.

Doing this will return a character vector, but I will convert it to a dataframe for ease of manipulation.

Our code thus far is below.

#initialize packages and URL
library(rvest)
library(tidyverse)
library(stringr)

link <- c("http://www.devtopics.com/101-great-computer-programming-quotes/")

#read in our url
quotes <- read_html(link)

#gather text held in the "li" html nodes
quote <- quotes %>% 
  html_nodes("li") %>% 
  html_text()

#confirm the result is a vector
is.vector(quote)

#convert to data frame
quote <- as.data.frame(quote)

Part II: Clean

Gathering our quotes via rvest rather than copying and pasting, we get one quote per line, making it more legible to store in our final workbook. We’ve also left behind the numerical position of each quote. But some issues with the text remain.

First off, looking through the gathered selection of text, I see that not all text held in the li node is a quote. This takes some manual intervention to spot, but here I will use dplyr’s slice function to keep only rows 26 through 126 (corresponding to the 101 quotes).

We still want to eliminate the parentheses and quotation markers, and to do this I will use regular expression functions from stringr to replace them.

a. Replace “(“, “)”, and ““” with “”

This is not meant as a comprehensive guide to the notorious regular expression, and if you are not familiar I suggest Chapter 14 of R for Data Science. So I assume some familiarity here as otherwise it becomes quite tedious.

Because “(” and “)” are both metacharacters we will need to escape them. Placing these three characters together with the “or” pipe (|) we then use the str_replace_all function to replace strings matching any of the three with nothing “”.

b. Replace “”” with ” “

The end of a quotation is handled differently as we need a space between the quotation and the author; thus this expression is moved to its own function and we use str_replace to replace matches with ” “.

Bonus: Set it up for social media

Because I intend to send these quotes to Twitter, I will put a couple of finishing touches on here.

First, using the paste function from base R, I will concatenate our quotes with a couple select hashtags.

Next, I use dplyr’s filter function to exclude lines that are longer than 240 characters, using another stringr function, str_length.

The code for Part II is displayed below.

#get the rows I want
quote <- slice(quote, 26:126)

#delete the characters I don't want
charsd <- c("\\(|\\)|“")

quote$quote <- str_replace_all(quote$quote, charsd, "")

quote$quote <- str_replace(quote$quote, "”", " ")

#add hashtags for Twitter
quote$quote <- paste(quote$quote, "#quote #coding")

#filter out lines longer than 240 characters
quote <- filter(quote, str_length(quote) < 240)

#write csv
write.csv(quote, "C:/RFiles/tech2quotes.csv")

Finally, find the complete code below.

From web scraping to dataframe manipulation to regular expression, this exercise packs a punch in dealing with real-world unstructured text data — and it comes with some enjoyable reading, too.

I hope this post inspires you to tackle the world of text, and I plan to walk through a couple more of these.

The post Writing Code to Read Quotes About Writing Code first appeared on Stringfest Analytics.
