Grubhub Bytes

Back of House: Meet The People Behind the Tech, Product, and Design of Grubhub

Tue, 25 Feb 2025

Ky Svedlow on the startup days of Seamless, how a prior career as an executive chef influences his approach to work, and the advice he’d give anyone hoping to transition into tech.

I was excited to sit down with Ky Svedlow, UX Writer II, not only because Ky is one of our longest-tenured employees and has witnessed, as he says, “every form a company can be,” but because he so clearly cares about everything — from championing small restaurants in New York City to making the case that you’re right for a role based on your skill set as opposed to a title on a resume. You can view our open roles here!

Ky Svedlow

Michelle Koufopoulos: You’ve been at Grubhub for almost 14 years, which is an incredible amount of time! I know you’ve worn many different hats over those years. Can you talk me through your career path as you’ve made your way to UX?

Ky Svedlow: Food and hospitality have always been central to my life. Before coming to Grubhub, I had spent the previous 10 years working in fine dining and running restaurants as an executive chef. After a kitchen injury forced an early retirement from hospitality, I had to pivot my career. By chance, a friend working at SeamlessWeb reached out and encouraged me to apply to the Partner Relations team they were building. It was SeamlessWeb’s first team that would interface directly with restaurants, so I fit in perfectly.

As you’ve said, I’ve worn a lot of hats, and that was due to the needs of the company. Tech in New York in the 2010s was a space that was changing very quickly. The company saw my strengths, even as an entry-level hire, and moved me over to the Operations side, where we were shipping out tablets to restaurants — our first restaurant-facing product. That’s how I got involved with Product — not by building it, but by actually getting it into the hands of restaurants. I spent years in the Grubhub for Restaurant space where I helped build technical support teams. I did a lot of technical writing for our restaurant guides, along with troubleshooting anything that was merchant facing. That work allowed me to transition into product and UX writing.

My whole career has been blocks building on each other in a natural progression. At Seamless, everything we were doing was novel. Each individual worker was learning at the same time as the company was. I’ve seen us evolve through the startup stage to our first major merger with Grubhub, then COVID, an acquisition by a European food delivery company, and now Wonder. That’s about as many iterations of a company as one could ever expect to see, just in the last decade alone.

Michelle: Through all of those very different iterations of Grubhub, what’s kept you most excited to be here?

Ky: Supporting restaurants is genuinely important to me. I believe in our mission of creating visibility for these small, mom and pop type of restaurants around the city that previously didn’t have a platform or the means or knowledge to grow their business. As we’ve grown the tools we provide to restaurants, both big and small, it’s been rewarding to see those same restaurants grow and succeed. That mission has really been both my anchor in terms of what keeps me at Grubhub and my north star guiding me in my work.

Michelle: Do you think your background as an executive chef influences the way you approach that work?

Ky: Coming here, the only foundation I had was my executive chef life and the chef methodology for approaching work, which I quickly found out is easily applied to business. A professional kitchen is a beautiful, choreographed dance, and if anything breaks, that choreography falls apart and the whole show stops. The chef methodology centers around mise en place — meaning, everything in its place. It’s how you set yourself up for a proper kitchen flow with high efficiency. For example, it’s as simple as having your onions chopped and all ingredients prepped, so that when you’re ready to pull the trigger and start cooking, you have cut out the prep work and can get down to creating. I take the same approach to my workflows.

In my current role as a UX writer, I have to move across all the different verticals that we cover, and I’m grateful that I’ve been in positions where I get a 30,000-foot, holistic view. It’s easier to pull back from the grill and look at the whole scene and see what’s broken. I look at the entire job as if it’s just cooking.

Michelle: What’s the project you’re proudest of?

Ky: Launching Grubhub for Restaurants (GFR), our merchant-facing product, in 2010 was a huge deal. Prior to that, we were relying on restaurants using old desktop computers or fax machines to receive orders, which had many possible points of failure. GFR still lives on today and continues to evolve as merchants’ needs do, but I was there for day zero, boots on the ground, going restaurant to restaurant meeting people. That was a very fulfilling time.

Michelle: Do you have any advice for those looking to break into tech who, like you, may not have a “conventional” background?

Ky: I try to get people not to think about the titles or roles that they’ve had, but to focus on their skill set and how they can reapply those skills to a different job. I came into the company on paper having no business being in tech. You don’t have to have been a marketer or a product manager. These are positions that didn’t exist fifteen years ago. Once you’re in that seat having the interview, it’s up to you to sell yourself that you’re right for the role. Everyone’s the storyteller of their own life.

Michelle: What’s something that we might be surprised to learn about you?

Ky: I own 500 pairs of sneakers. Over 500. I lost count.

Michelle: Whoa. You live in a Brooklyn apartment! Where do you keep 500 sneakers?

Ky: They’re very strategically hidden, but mostly in my bedroom. I have these two giant metal kitchen racks. I also have 700 VHS tapes.

Michelle: That is quite a collection! Why VHS?

Ky: As a child, walking around the video store was really my first art gallery, and VHS art was always fascinating to me. I also love that VHS tapes degrade over time. Every time you watch a tape, that’s the only time that tape is going to look like that. The next time you look at it, it’s going to change. It’s going to be different. They’re kind of alive. Kind of like a bottle of wine.

Michelle: Sneakers, VHS tapes — anything else?

Ky: Kitchen knives, candles, computers. I still have all my iPhones, going back to iPhone 1.

Michelle: Is this actually very intentional hoarding?

Ky: It’s kind of a combination of doom boxes and I just don’t want to let my things go.


Back of House: Meet The People Behind the Tech, Product, and Design of Grubhub

Tue, 18 Jun 2024

Sayoko Yoshida On Our Vibrant Design Culture, User-Centric Focus, and How Grubhub Transformed the Way She Thinks About Food
Sayoko Yoshida, Director of Design

We recently caught up with Sayoko Yoshida, director of design at Grubhub, to discuss the evolution and philosophy behind Grubhub’s design team. Sayoko reflects on the user-centric approach that drives all design work, the dynamic growth of the team over the last decade, how her work has impacted how she thinks about food and design on a personal level, and the exciting projects that are shaping Grubhub’s customer experience.

As the team continues to expand, we are excited to announce two open design roles, offering a unique opportunity for hungry professionals to make a meaningful impact on our team and products at Grubhub!

Michelle Koufopoulos: Can you talk me through Grubhub’s design philosophy? What makes us special and how has that evolved over time?

Sayoko Yoshida: The number one goal for the design team is to do right by the customer and address their needs. User-centricity is at the core of how we operate. We are always focused on creating the best possible customer experience, whether for Consumers, Merchant Partners, Couriers, or our Care team. We believe that by building long-term relationships with our customers, we can drive business growth as a result.

Our design organization consists of a variety of functions, including Product Design, System Design, UX Research, UX Writing, and Illustration. Over the past decade, I’ve witnessed these functions evolve and grow, contributing to the innovative and user-centric solutions we create.

As far as what makes us special, it’s our people. Recently, we had a design offsite, and it was incredibly meaningful to be together in one space to exchange ideas and collaborate. The diversity and talent within our team is remarkable; each member brings unique expertise and perspectives that enrich our work and design culture at Grubhub.

M: A decade is a long time to be at a company in this industry!

S: Yes! It’s been really interesting to see the evolution of the company, as well as the design team, over time. When I started, it was just Product Design and UX Research, and the teams were quite small. We have matured as an organization, and our team members now have opportunities to grow either as individual contributors (IC) or as managers, thanks to the leveling system in place.

M: Whenever I walk by the Product Design team in New York, it’s really clear that there’s real camaraderie and dedication there — you all seem to genuinely enjoy working with one another.

S: Thank you for calling that out. We have an incredibly high retention rate in Consumer Design — close to 100 percent for three years now. The team will even hang out with each other after work. I think we’re very fortunate to have a high caliber of talent and a strong design culture, where we all really care about what we do and get along well with each other.

M: Do you have a favorite project you’ve worked on over the years?

S: Our collaboration with Amazon has been exciting, especially having just announced that Amazon customers can now order on Grubhub without leaving Amazon.com or the Amazon Shopping app. The project started over a year ago and required a company-wide cross-functional collaboration. It’s especially exciting for merchants who now have greater visibility given Amazon has tens of millions of customers.

Another favorite project of mine was the redesign of our native app, which required close partnership with tech and product. There was a month of brainstorming, where everybody from the design team contributed, and we imagined what the new app would look like.

The level of collaboration and unity across the PDE organization is something else that really makes us special. Over the years, the design team at Grubhub has earned the organization’s trust by demonstrating that we can drive business through a customer-centric approach. Earning that trust provides the opportunity to influence strategy.

M: Speaking of influence, has working within Grubhub for so long made you think differently about any aspects of food or design?

S: Oh, definitely. When I first joined, I just saw food as sustenance and didn’t pay much attention to it beyond that. Then I started designing menu pages, learning about the taxonomy of food, and began to realize how rich food is in culture, how it’s tied to language and history, and how much the ingredients matter.

I started collecting cookbooks because I wanted to understand how food is presented visually — there is a visual language of food. I was intrigued by how to use photography effectively, make food appear the most appetizing, and present it attractively. From there, my interest in cooking grew in my personal life.

In my current role, my focus has been on developing a strategic vision for the design function that aligns with the company’s goals and objectives. This involves understanding market trends, user needs, emerging technologies, and business goals to create a cohesive design strategy.

Additionally, I put a lot of thought into designing the team structure, ensuring that people with the right skill sets are in the right teams so that we are set up for success.

M: Is there anything else you’d like folks internally or externally to know about design at Grubhub?

S: We currently have an open Senior Design Manager position in Loyalty and Monetization. The Loyalty team focuses on building long-term relationships with Grubhub+ members by continuously improving our offerings and bringing value. The Monetization team develops an advertising platform that enables our partner brands to increase their visibility on the app, reach more customers, and grow their business.

As we’ve discussed, the design team collaborates closely with Product and Engineering, offering many opportunities to make a strategic impact on our products. Additionally, from a career development perspective, there is significant room for a design professional in this role to grow individually.


Forecasting Grubhub Order Volume At Scale

Thu, 19 May 2022

By William Cox, Gayan Seneviratna, and Hugues Demers

At Grubhub, we’re continuously working to provide the best experience for diners, and key to this experience is getting your food when you expect it. So how do we make that happen?

The first step is to plan well in advance so we can make sure there are enough couriers available to pick up and deliver orders. If there are too few couriers on the road then diners are left waiting for their food, which understandably makes those diners unhappy. On the other hand, if there are too many couriers on the road then some couriers will be left idle, and we’ll have to pay those couriers at a lower rate than what they’d make doing active deliveries — something that makes both Grubhub and our couriers unhappy. So in order to thread this three-way-needle-of-happiness, as it were, we run forecasts of how many orders we’ll receive each half-hour for each region we serve. This amounts to millions of forecasted timeslots, and we do this daily!

Image 2 — Illustration of block scheduling. Couriers choose to work blocks of time to cover all forecasted demand for a region

We schedule our couriers through self-sign-ups for specific blocks of time. This sign-up system ensures our couriers are still eligible for base pay even if they don’t receive any orders, while also allowing them to better plan their days and income. This system is ideal for us as well: we’re better able to control service levels since we know how many couriers will be available at any given time.

How we go about accurately modeling region demand is discussed in the previous post in this series, while this post focuses on how we take that research and put it into production.

Image 3 — Model training takes in many input sources to produce long-term forecasts

Research to Production Cycle

The ability to tightly couple research and production is essential to an effective forecasting system. Because of this, all model development is done on historic data, and all research code is held to a production-level quality standard by our team. At Grubhub, data scientists function as machine learning engineers to support and maintain models from innovation to implementation, and occasionally deprecation.

Grubhub Data Platform

The order volume forecasting system lives inside the broader Grubhub data platform infrastructure. This includes AWS EMR for distributed computation, Azkaban as the workflow scheduler, PrestoDB for querying and joining data sources, and S3 as the underlying data lake. We also make extensive use of SQLite for local data caches on our worker nodes. We deploy our product as a Python package that is scheduled via Azkaban to run daily. Daily Azkaban jobs bootstrap an EMR cluster, install the forecasting package, and start the forecasting job.
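As an illustration of the querying step, a daily job might pull aggregated order history through PrestoDB with the presto-python-client package. This is only a sketch, not our production code; the host, catalog, and table names are invented.

```python
# Hypothetical pull of daily per-region order counts from PrestoDB.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto.example.internal",  # illustrative coordinator host
    port=8080,
    user="forecasting",
    catalog="hive",
    schema="default",
)
cur = conn.cursor()
cur.execute("""
    SELECT region_id, order_date, COUNT(*) AS order_count
    FROM orders
    GROUP BY region_id, order_date
""")
rows = cur.fetchall()  # one (region, date, count) row per region-day
```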

Forecasting Infrastructure

We set out with several goals in mind when designing our forecasting system:

  • Prefer a python(ic) implementation
  • Prefer simplicity
  • Prefer local testing / distributed deployment
  • Make research and deployment as seamless as possible

Because it is essentially the standard language of machine learning, all research and development were done in Python. We judged that the complexity of reimplementing research models into a different language like Java, or importing the prebuilt models into a separate production environment would be too high, and would reduce the rate at which we could iterate on those models. Additionally, we chose to use Dask over Spark as the distributed computation framework largely because it was native Python and allowed for a Pythonic implementation.

At all points in our process, whether it be model development, model deployment, or monitoring and alerting for our product, we try to evaluate whether we’re choosing the simplest solution to a problem. This is a continuous discussion, made easier by having an integrated team that owns the entire forecasting product.

Finally, when designing the system, we wanted the ability to work locally as much as possible. Since our historic data is independent per region and generally fits in memory, it is possible to work locally with a broad selection of forecasting data from a number of representative regions. Dask also makes it easy to deploy distributed work locally using multiple Python processes in a way that is nearly identical to how full production load is distributed among YARN workers across compute nodes. In practice, this means we can move from local testing on a few regions with a few models, to full production load across all models and all regions by simply changing some command-line parameters.
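To make that switch concrete, here is a minimal sketch of the pattern. The `forecast_region` function, the CLI flags, and the cluster parameters are all illustrative, not our actual code.

```python
import argparse
from dask.distributed import Client

def forecast_region(region_id: str) -> dict:
    # Placeholder for per-region feature loading, model fitting, and prediction.
    return {"region": region_id, "forecast": []}

def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--distributed", action="store_true")
    parser.add_argument("--regions", nargs="+", default=["region-a", "region-b"])
    args = parser.parse_args()

    if args.distributed:
        # Full production load: workers spread across the EMR cluster via YARN.
        from dask_yarn import YarnCluster
        cluster = YarnCluster(environment="forecasting-env.tar.gz")  # illustrative
        cluster.scale(200)
    else:
        # Local testing: the same code runs on a handful of local processes.
        from dask.distributed import LocalCluster
        cluster = LocalCluster(n_workers=4)

    client = Client(cluster)
    results = client.gather(client.map(forecast_region, args.regions))
    print(f"produced {len(results)} regional forecasts")

if __name__ == "__main__":
    main()
```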

Image 4 — Parallel Forecasting for many regions and many models

Preparing the Data

Restaurants are grouped into regions based on density and business needs. These regional groupings can change over time, so each daily forecast run requires creating the entire historic feature-set for all current region groups. This means that for every region, for every day that region has been serving deliveries, we have to recompute all the order volume features, a computationally intensive task that must be completed before forecasting. This step can only be completed after all the previous day’s data has been collected and aggregated by upstream systems. The feature-set is then cached for local forecasting workers to use, along with any GPU-based workers. This daily data cache makes local research much easier, as the complete historic feature-set is always available. Even though this dataset continues to grow day-by-day and region-by-region, it is still relatively small from a data standpoint, meaning it can easily fit into memory on a worker machine, or be downloaded locally for research purposes.
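As a sketch of that caching step (table and column names are made up for illustration), the daily feature-set can be written to a SQLite file that both production workers and researchers read the same way:

```python
import sqlite3
import pandas as pd

def cache_features(features: pd.DataFrame, path: str = "features.db") -> None:
    """Write the recomputed historic feature-set to a local SQLite cache."""
    with sqlite3.connect(path) as conn:
        features.to_sql("order_volume_features", conn,
                        if_exists="replace", index=False)

def load_region(region_id: str, path: str = "features.db") -> pd.DataFrame:
    """A worker (or a researcher's laptop) reads back a single region."""
    with sqlite3.connect(path) as conn:
        return pd.read_sql(
            "SELECT * FROM order_volume_features WHERE region_id = ?",
            conn, params=(region_id,))
```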

Dask for Distributed Computation

Generating forecasts for hundreds of different regions for each of our various models breaks down nicely into thousands of parallel forecasting processes. Using Dask and YARN, we distribute this computationally intensive work to hundreds of workers spread across our EMR cluster.

Dask was an easy choice for this: it provides a familiar API, scales well across the cluster while also supporting local distributed computation, integrates with our existing Python codebase, and allows us to design complex applications that can add and remove workers as needed.

Assuring Forecasting Quality

Image 5 — Example forecasts

We have several levels of safeguards in place at Grubhub to help validate each forecast we make. First, we do basic checking to ensure we haven’t produced negative forecasts or all-zero forecasts. Second, we compare each forecasted day to the previously forecasted days, so we can detect when a model fails to converge or when upstream data issues cause wild variations in a forecast. Third, we do basic historic projections for each region so we can detect if a forecast suddenly moves by a large amount. These checks must also account for holidays, as events like Black Friday or New Year’s Day usually see wild fluctuations in order volume.
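A simplified sketch of those first two safeguards follows; the threshold and function shape are illustrative rather than our production values.

```python
import numpy as np

def validate_forecast(forecast: np.ndarray,
                      prior_forecast: np.ndarray,
                      max_rel_change: float = 0.5) -> list:
    """Return a list of detected problems; empty means the forecast passes."""
    issues = []
    if (forecast < 0).any():
        issues.append("negative forecast values")
    if not forecast.any():
        issues.append("all-zero forecast")
    # Large swings versus the previously produced forecast often indicate
    # a model that failed to converge or bad upstream data.
    rel_change = np.abs(forecast - prior_forecast) / np.maximum(prior_forecast, 1.0)
    if (rel_change > max_rel_change).any():
        issues.append("large swing versus the prior forecast")
    return issues
```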

We also use an array of industry standard tools to detect and monitor the health of our forecasting cluster: tools to collect metrics around timing and memory usage, log aggregation, workflow scheduling, failure alerting services, and messaging for convenient updates on each day’s forecast and its rollout.

Finally, we always have a basic backup model — usually a simple average of the past N days — that is computed for each region. Therefore, we not only have a performance baseline to measure our improvements in forecasting, but also have some guarantee of a reasonable forecast for each region even if the other models fail.

Conclusion

Forecasting accurate order volume is vital to the happiness of our diners and couriers alike. It helps us to anticipate demand and allows couriers to schedule their time well in advance.

Forecasting hundreds of regions accurately over a multi-week time-horizon is a difficult task, both in research and implementation. Because of this, we have a tightly integrated forecasting team that handles not only the forecasting research and development but also the production deployment and maintenance of our systems.

In short, next time your driver shows up to your door with your dinner on time, you can thank — among many, many others — our hardworking forecasting team for the work they did several weeks ago!


Systems Thinking and Grubhub: They Go Together like Chocolate and Peanut Butter

Mon, 08 Nov 2021

What do an Airbus A380, a pacemaker, and Grubhub all have in common? All three utilize teams focused on systems thinking, systems engineering, and systems analysis to support development and help companies understand how individual pieces of their product contribute to the whole. On the surface, Grubhub may seem like the odd one out in this example — we are primarily a software system, and in most cases, if we make a mistake the consequences do not include risking someone’s life. However, when you look a level deeper, some similarities and patterns emerge.

In all three cases, people are critical to the system’s success, whether it’s the pilots and mechanics of the A380, the surgeon and nurses who attach the pacemaker, or the restaurants, diners, and drivers who bring life to our platform. There are also hidden complexities that impact how things work: planes need to remain in the air even when the weather gets rough, pacemakers need to work regardless of underlying medical conditions, and diners are still hungry even when a storm rolls in. While (thankfully!) we have fewer cases where something breaking could cause a safety issue, the principles and approaches for dealing with all three problems are remarkably similar. This is where Systems Thinking comes in.

What is Systems Thinking?

At its core, Systems Thinking focuses on the whole¹: it isn’t about the individual pieces, but how they all work together. Whereas many disciplines focus on having significant depth in a specific area, Systems Thinkers have to be able to think paradoxically, focusing on the big picture at the same time as they look at the details.

What does all of this mean for Grubhub? We have two teams that focus entirely on Systems Thinking: a Systems Engineering team that helps define how to set ourselves up for success, and a Systems Analysis team that helps define just what that success means and how we measure it. The members of both teams have a wide range of experiences (e.g., product managers, statisticians, military veterans, software engineers, and restaurant workers), which in turn allows us to examine problems from multiple perspectives. These teams tend to focus on problems that cross multiple product areas either directly (via integration of capabilities) or indirectly (via network effects).

How are these teams different from other teams at Grubhub? Primarily, because of our breadth of focus. We are fortunate to have incredibly talented engineering and product teams who ensure that our software system has the features that our users need in specific areas. By taking ownership of places that involve multiple user types, or multiple products intersecting (such as how we make Grubhub the premier platform for drivers), we can help make space for those teams to do what they do best.

While there are specific techniques that are used in Systems Thinking², we’ve worked to help our colleagues bring Systems Thinking into their thought process more generally, without always having to rely on our two Systems Thinking teams. We’ve done this by working with directors and technical leads to help them understand how we approach problems so that they can bring that approach into their spaces. Through this work, we’ve established a few honorary team members and encouraged a holistic approach to problem solving within the company.

How does the Systems Thinking team contribute to Grubhub in practice? Imagine what a perfect delivery looks like for a moment. While it may be some time until we all own machines able to bring anything into our homes just by asking (“Tea, Earl Grey, Hot”, anybody?), in our current world a perfect delivery means a driver arriving at a restaurant exactly as the food is being packaged. That way, the food wouldn’t have to sit and get cold while waiting for a driver to arrive, and drivers wouldn’t be stuck waiting at restaurants.

On the surface, this should be an easy problem to solve: just make sure a driver is arriving at a restaurant just as the food is ready. However, once you start to peel back its layers, this seemingly straightforward question becomes much more complex.

Different restaurants require different lengths of time to prepare food, so in most cases we need the restaurant to be preparing food before we have a driver assigned. Pre-assigning drivers might help, but this would need to be done in a way that doesn’t slow down our overall delivery process and doesn’t result in drivers having to wait at restaurants.

As Grubhub looked at this specific portion of a delivery, it became clear that this problem was at once both highly impactful to our restaurants, drivers, and diners, and did not have a single “easy” solution.

This is where Systems Thinking comes into play. We joined a team of Product and Tech folks to start working on understanding how often our restaurants or drivers are waiting, what potential long term options we had, and what short term gains were possible. Based on our understanding of the problem space, we developed a plan of attack to help us understand the unknowns and determine how we might approach system changes. We identified a set of experiments that we could run without significant software changes that would allow us to better understand what was happening. In parallel, we began to define what needed to happen at a system level to improve the experience, and define what success would look like. From there, we defined gaps and possible paths forward given our current architecture. Finally, we developed multiple alternative design options and conducted trade studies on all of them to help Grubhub determine the best path forward. The team used these studies to figure out a solution that was feasible in the short term and provided enough room to grow and expand in the long term.

Another area we’ve focused on is determining, through a combination of analysis, simulation, and experimentation, whether a system is doing what we expect it to do. Grubhub has a robust A/B test framework that works very well for decision making in cases where the effects are limited (e.g., do individual users respond to a feature?). However, on the logistics side of the company, because we are trying to change something at the network level (e.g., changing how we prioritize orders for dispatch), A/B testing is insufficient. For those capabilities, we need to look at things a little differently, utilizing techniques such as switchback testing; a rough sketch of the idea follows below.

As there is the potential for many things to impact network-level behavior (storms, parades, holidays, posts on social media), our team has put significant effort into working with the wider org to both specify a testable hypothesis and understand the results. As this practice has become more routine and we’ve helped define what is and is not testable, the Systems Thinking team has helped grow Grubhub’s experimentation capabilities. We’ve done this through a few different paths: making sure the data necessary for analysis is readily available, defining what a “good” experiment looks like so that the broader group can more easily assess whether an experiment meets that bar, and making the analysis of experiments more routine so that teams have quick and easy access to the results.
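As a rough sketch of the switchback idea (the window length and hashing scheme here are assumptions for illustration, not our actual framework): rather than bucketing individual users, an entire region flips between treatment and control in fixed time windows, so network-level effects stay intact within each window.

```python
import hashlib

def switchback_arm(region: str, epoch_seconds: int, window_hours: int = 2) -> str:
    """Deterministically assign a whole region to one experiment arm per
    time window, instead of bucketing individual users as in A/B testing."""
    window_index = epoch_seconds // (window_hours * 3600)
    digest = hashlib.md5(f"{region}:{window_index}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 else "control"
```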

What’s next for Systems Thinking at Grubhub? When our team was originally formed, we focused on analysis within specific areas that had broad systemic impact (for example, the system that estimates how long an order will take to get to your door). As we — and Grubhub — grew and matured, we began focusing on projects that were even broader, covering multiple system areas and helping to identify areas of significant opportunity and risk. We’re an agile, creative bunch, but our team is also small, so our next step is scaling our capabilities so that we can cover a larger swath of the organization. How are we doing this? In part by looking at our tooling and processes, building out features and processes that significantly speed up how we answer common questions and extend core capabilities. We’re also standardizing our outputs and acting as reviewers for our colleagues, with the eventual goal that anyone at Grubhub be able to apply Systems Thinking to their problem space.

[1] Boardman, J., Sauser, B. (2008). Systems Thinking: Coping with 21st Century Problems. United States: Taylor & Francis.

[2] https://medium.com/disruptive-design/tools-for-systems-thinkers-the-6-fundamental-concepts-of-systems-thinking-379cdac3dc6a


Bringing Caregivers Back to Tech: Grubhub Launches First Returnship

Mon, 26 Jul 2021
Photo by @ergonofis on Unsplash

Diversity, Equity, and Inclusion at Grubhub — in our technology organization and across the company as a whole — are values we’re always thinking about and working to improve upon. Over the last few years, we’ve removed subtly biased language from our systems, processes, and job descriptions; diversified hiring panels; and launched seven GrubConnect groups. And, just this past April, we were proud to launch our Grubhub Reconnect Returnship Program in conjunction with Path Forward, whose mission is to empower individuals looking to return to the workforce in technical roles after time spent as caregivers, giving them the tools they need to succeed.

Grubhub’s Reconnect Returnship Program is a 16-week paid returnship for experienced professionals reentering the workforce after spending at least two years off caring for their families.

Earlier this summer, I sat down with Padmaja Ayyagari (Grubhub’s Team Lead for DEI efforts), Caryn Drange (Director of Product Management) and Gargi Gupta (Software Engineer II) to discuss how the returnship came to be, what the biggest takeaways have been, and where they’d like to see the program go in the future.

Michelle: Can you tell me a bit about the origins of our Reconnect Program? What was the process of launching the program like?

Padmaja: The first time I heard about the idea of a returnship program was back in 2018 at the Grace Hopper conference. At these conferences sometimes you tend to drift off, but this was a very engaging conversation. As I listened, I realized that this is an opportunity and a space where Grubhub can invest.

I was excited because I personally experienced struggles reentering the workforce. Back in India, I worked, I had a career, and then I left all that behind when I joined my husband here in the United States. I was a new bride, had a baby I was raising, and I ended up staying home for seven years. I was happy, but I knew there was something more that I wanted. But when I started looking for work again, I did not get a job easily. Eventually, a friend of a friend reached out and said ‘hey, there’s an opportunity I have for you — someone is looking for help with recruiting.’ At that point, I had no idea what recruiting was! I just landed in it. When I found out about Path Forward’s Returnship Program, I wished this kind of program existed 12 years ago. I realized there are so many other women, other people like me, who are looking for a program that helps them get back into the workforce, and this is an amazing opportunity to bring them back.

The idea of a program like this had been discussed a few times, and when we founded the Tech Women Connect ERG last year, Caryn Drange and Gargi Gupta reached out to me, we talked about a returnship program, and we decided it was the perfect time to pursue it, especially given the disproportionate number of women who had been impacted professionally by COVID-19. We pitched the idea of this program to leadership once more, and the rest is history.

Gargi: I’m a mother of two, and I have friends who left the workforce to take care of their young kids who are now exploring their options to return. It’s been really challenging for them because of how employers perceive the gap in their resumes. These women worked before; they had just taken a break to care for their families. This program gives women a supportive transition back into the workforce and it’s something I think every company should consider implementing. Personally, I struggled when my kids were little and I was making a career transition from QA into development, and I know I would have benefitted from a program like this.

Caryn: I’ve been in tech 20 something years, and at Grubhub for almost nine. I worked in financial services for over a decade, and I’ve always been the only woman in the room. Always. Even in college, I was the only woman in my computer science classes, and it was always a challenge to be the only woman. How do you get other women to be in the room with you? This question has always been a passion of mine.

For many years, Grubhub was really small, so there was only so much we could do. I think we started scratching the surface a bit when we started our college recruiting program. We were like, let’s get them early! Catch the women early, let’s find them early.

When I was nominated with Gargi as a co-lead for Grubhub’s Tech Women Connect ERG, I already knew that I was going to go talk to Padmaja about implementing a similar program for tech roles. I didn’t know that you [Padmaja] had transitioned into a DEI role, I thought you were still just doing recruiting, though now I know you do both! You had told me about this Path Forward program and obviously, given my passion for women in STEM, I read a lot of articles and one of the things that always bothers me — and it’s not just women in tech — is how do you get women up the career ladder at higher level positions? A lot of women opt out. Or, they don’t even opt out — they go to other companies because they feel like they’re hitting a wall in their progression. When you told me about Path Forward and its Reconnect Returnship Program, I thought, this is awesome, and we can really sell it [to executive leadership]. And ever since we’ve kicked the program off, it’s been a great experience. I know we’re only scratching the surface with the initial pilot program, but this is another way to bring in talented women at a pace that supports our company’s growth.

Michelle: What was the interview process like for the returnees? Were there takeaways where you thought, oh, we could be doing this in general when we interview candidates?

Caryn: We tried to leverage a mash up between our current process and our internship process. We had a remote coding session, then we did the normal panel of coding, design, and 360. Now we’re working on a doc about how to take what we learned from this and potentially make a specialized interview process.

I think our biggest learning and takeaway is that we didn’t set a clear expectation, or bar, on the interview process. Padmaja and I really had to take the feedback from the interview panels and sometimes challenge it with the hiring managers — like, is that a fair bar? How do you interview for trainability? Just because someone doesn’t have actual skills, what’s transferable? And I think there is an opportunity for our tech org to define those expectations clearly. And as we do that, I think it will make our interview pipeline and process go a lot smoother.

Padmaja did a lot of handholding with the managers to get them to think outside the box. It’s a little bit of a cultural shift. And I think with that cultural shift, it doesn’t mean that we’re lowering the bar. Not at all. It’s just that the bar is defined a little differently.

Ultimately, tech changes so quickly that everyone is learning constantly. By developing this approach to interview for trainability, that’s an amazing pipeline to define.

Michelle: What’s the ideal outcome at the end of the 16 week program?

Padmaja: The ideal outcome is to transition these women into full time employees, and based on their overall performance at the end of their 16 weeks, we’ll be leveling them.

Currently, we are partnering with Path Forward for multiple reasons. The first is that we are very new to this space and haven’t run a program like this before. Path Forward is a leader in the space and has already built out a very robust and scalable framework. They also have outreach and networking capabilities which are very valuable. Ultimately, our goal is to take what we’ve learned through this pilot and expand this program into other departments across the company.

Gargi: For Grubhub, we are hopeful that these women will transition into full time roles at Grubhub. Regardless of whether these women receive and accept full-time offers, they gain valuable experience and will have these 16 weeks under their belt and the name Grubhub on their resume. This experience will count towards their future work.

Caryn: This program is a good thing for our tech organization. It gives people that aren’t necessarily managers experience with coaching and mentoring others and building out that skillset, which is tough to do without actually doing it. You have to learn by doing. And these types of programs give experiences to build leadership skills. I think we don’t talk about that enough. Sometimes we worry more about the overhead cost / time of ramping up a new candidate who has transferable skills vs. the cost / time of hiring for a candidate that already has the exact skills/experience we need. With this pilot program, we will be able to demonstrate the cost / time of returnship hiring vs. the traditional hiring pipeline.

Padmaja: We talk about fungible skills. What are fungible skills? Well, these are fungible skills. These returnees are fungible. They are super adaptable, they are trainable, they are easy to coach, and they are very flexible. They can learn anything and everything. They have so much grit and determination, and they push through. I’ve spoken to at least 150 individuals and every single candidate surprised me. This returnship has given these women an opportunity to feel confident again. I still remember reading Caryn’s article about Imposter Syndrome — that was the first time I’d heard about it. I said to myself, 80%, 90% of women go through this, and we do not realize it, and we do not talk about it. I feel like this program gives a platform for us to air this out. I wish this program existed twelve, thirteen years ago when I was looking for a job!

Caryn: Padmaja, what you said resonated with me and reminded me of something Bob Waite [VP of Engineering] said when we first presented the program to leadership. He said, “Gosh, I wish this was around for my mom. She was a computer programmer, she took time off to have kids, and she couldn’t get back into the workforce, so she had to go and be a teacher,” because that’s what your choices were. That’s what my mom had the choice of being. A teacher, a receptionist, or a nurse. That’s it. Those were the three choices if you wanted a career. Once you took time off, those were your options [at the time]. It was interesting to see our leadership say that they wish we had this concept decades ago.

Padmaja: Definitely. And this program would not have happened without senior leadership’s support across the board and our Tech Women Connect ERG. If they hadn’t championed the project, we’d still be debating it. Every single person who has touched this program has been a really great support in many different ways.

Fourteen weeks into the returnship, we’re very excited with the success of the program thus far and are looking forward to having our returnees finish out their final weeks.

Grubhub is always looking for talented professionals from a wide variety of backgrounds. Learn more about our open positions.


“I See Tacos In Your Future”: Order Volume Forecasting at Grubhub

Thu, 24 Jun 2021

A crystal ball, with a taco emoji in the center.
Photo credit: Michael Dziedzic @lazycreekimages

By William Cox and Gayan Seneviratna

Do you ever wonder how Grubhub is able to assign you a delivery driver in less than a minute? It’s not magic — it’s because we’re always working to ensure there are just enough drivers in your area for all the orders that people are likely to place in a given moment. Alright, you might say. But how can you be sure of how many orders that’s going to be?

A fortune cookie, with the text inside reading “pst… 831”
Photo credit: Elena Koycheva @lenneek

No, it’s not quite that simple.

Today, we’re going to be talking about Demand Forecasting — or, as it’s known at Grubhub, Order Volume Forecasting (OVF). Our goal is to guess how many orders there are going to be in a given area, over a half-hour period. That might seem like a pretty daunting prospect, so let’s start small.

Let’s say you want to guess how many orders there are going to be in Manhattan tomorrow. If that’s the only information you have — a city and a day — you’re going to be shooting in the dark.

But let’s say I told you that yesterday, Manhattanites placed 200 orders. You’d probably guess there’d be around 200 orders again tomorrow. Or maybe you know that tomorrow is Friday, and you know that people order more on Fridays. Maybe you’d increase your guess to 300.

Congratulations, you’ve just performed your first volume forecast! That’s essentially what we do here at Grubhub: use historical order volume, combined with additional predictors, to create accurate and flexible forecasts for our business.

The Data

In practice, we work with much more information than just yesterday’s orders. Our forecasting is built on the entire history of Grubhub orders. In order to understand our data, it’s helpful to take a step back and see what order volume data looks like.

In a standard statistics problem, your unit of analysis (i.e., the thing you care about) is independently and identically distributed, or IID. For example, if you were trying to build a model relating ingredients to food names, your data might look like the following table and the accompanying plot of its values.

An example of tabular data. Various foods are listed with their grams of salsa and cheese
The data in the above table, plotted on the axes of salsa and cheese.

IID means that the order of the rows doesn’t matter. If you guessed that the second item was fondue, it would not affect whether or not the fourth was chimichangas. You could also say that there is no covariance between units of analysis; covariance means that if two values are different in one variable, they’ll be different in others. They vary together.

The data we use in OVF is known as a time-series. It might look something like the following:

A table with columns for the date and the number of orders received.
A plot of the order volumes over time — a time-series.

Time-series have a time-dependent structure. That is, if you change the order, you change the data’s meaning. Furthermore, the values depend on each other; to refer back to our first example, the number of orders placed tomorrow is related to the number that is placed today.

This dependency means that much of the traditional machine learning methodology needs to be tweaked: everything from building a training data set to your choice of model to generating predictions.

Regions

Now that we’ve discussed one time-series, let’s broaden that. Grubhub doesn’t only operate in Manhattan, of course — we have a presence in hundreds of different US markets. These regions are all sized and shaped by customer behavior, which varies from market to market. As a result, the time-series across these regions look very different.

Some regions have only a few orders a day, while others have thousands.

A plot with two time-series with different average values for the y-axis.

Some time-series begin in 2015. Other regions started deliveries as recently as last year.

A plot with two time-series that have different start dates.

And some regions experience major fluctuations around the holidays, while others hold more steady.

A plot with two time-series, with different variabilities.

(We realize the y-axes are unlabeled; we can’t share actual business data without hiring you. Nevertheless, these plots demonstrate the variety of our data.)

In order for our forecasts to be successful, we must create a prediction for each time-series. This requires a flexible approach.

Predictors

By now, we’ve established the main input to OVF: hundreds of order volume time-series. But we also make use of other factors, some of which are intrinsic to the time-series itself. For instance, in that first example, we used the day of the week as a predictor.

Other predictors are called “exogenous.” One example of this is whether or not a given day is a holiday. Most diners are eating home-cooked meals on Thanksgiving and Christmas, so it’s important that we predict fewer orders on those days. We use one-hot encoded variables to identify holidays. That is, for a given holiday variable, the value is 0 unless we’re on that date:

A table with the date and whether or not that day was a holiday, as indicated by a 0 or 1.
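As a sketch in pandas (the dates here are just for illustration), such a flag can be built directly from the date column:

```python
import pandas as pd

dates = pd.date_range("2021-11-20", "2021-11-30", freq="D")
features = pd.DataFrame({"date": dates})
# One-hot holiday indicator: 1 on Thanksgiving 2021, 0 everywhere else.
features["is_thanksgiving"] = (features["date"] == "2021-11-25").astype(int)
```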

The weather would be another example of an exogenous predictor that we consider, as operationalized by variables like snowfall (in inches) or temperature. We then take this information and aggregate it over the time-scale we care about. For instance, the average temperature over a day.

A table with dates and average temperatures.

Note that because of the nature of our data, all of these variables are also going to be time-series themselves. Taken together, the data from Manhattan might look as follows:

A series of three plots. The first is of the holiday data: either 1 when there is a holiday or 0 otherwise. The second is of average temperature, which follows a sinusoidal pattern. The third is order volumes, a copy of the first time-series plot.

Modelling Considerations

By now, we understand our data both conceptually and visually. If you haven’t done time series forecasting before, you might be tempted to dive into the problem using a classical research process. Maybe you’d start with linear regression between time and orders, gradually adding additional predictors or complicating the model you’re using.

But as we mentioned earlier, the nature of time-series data, as well as the implied dependence structure, changes how we approach modelling. The following are considerations we must take into account before we begin time-series forecasting.

Backtest Horizons

First we need to divide our data into training and test sets. In a standard modelling scenario, we would randomly sample the data in some ratio. We’d determine model weights based on the first set, then see if they performed well at predicting the values of the second.

For time-series, that approach is inadvisable, because it isn’t realistic forecasting: in practice, you have all your data up to a certain point and need to predict the next values. A random split scatters test points throughout the training period, which makes the task look more like imputation; what we want is extrapolation.

The ability to compare against real values is key to training. But how are we supposed to test our model’s ability to predict the future if we don’t know what the future will be?

The solution is backtesting. We begin by selecting a point in our past. We train our models on all data up to a given time point. Then we use the model to forecast a horizon outward, say 14 or 21 days out. Crucially, all of these days are known values in our dataset. Finally, we can calculate performance metrics like MAE or RMSE on the residuals (the differences between the known and forecasted values).

A plot showing the right way to train a model, with a split in the dates for training and testing.

Backtesting allows us to compare various models for accuracy, and decide on the best methods to use in production. We can then use the same procedure from the above plot when we forecast, but we’ll be predicting order volumes that haven’t happened yet.
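In code, a single backtest might look like the following sketch, where `fit_predict` stands in for any model that maps a training series and a horizon to a forecast:

```python
import numpy as np
import pandas as pd

def backtest_mae(series: pd.Series, fit_predict, horizon: int = 14) -> float:
    """Hold out the final `horizon` days, train on everything before them,
    forecast forward, and score the residuals with mean absolute error."""
    cutoff = len(series) - horizon
    train, test = series.iloc[:cutoff], series.iloc[cutoff:]
    forecast = np.asarray(fit_predict(train, horizon))
    return float(np.mean(np.abs(forecast - test.to_numpy())))
```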

Extending Predictors

You might have noticed that in the above plot, the weather predictor extends into the testing region. When we forecast new values of the target time-series (order volume), we need to have the predictor values ready at that new time point. For holidays, this is easy — you can just look up the desired date. But for variables like weather, we have to rely on weather forecasts, meaning that our forecast also uses forecasts.

Local vs. Global Models

As mentioned earlier, our data isn’t really a single series, but many different time-series. One of the first questions we ask in time-series modelling is whether we should build our models locally (i.e. one model per region) or globally (across all regions).

Traditionally, local models were the clear choice. These models are able to account for individual patterns in each region. One region might have more orders on July 4th, while another has fewer. Local models can have appropriately different weights for that predictor. Local models also train relatively quickly — with approximately 2,000 days (5 years) per region, most algorithms can optimize weights in less than a minute.

Global models are meant to forecast the future of any time-series in a dataset, because they are trained on all the series in that dataset. This approach has its advantages: the model may be able to learn overall trends in data, like how regions generally react to offered promotions or discounts. This advantage has often been eclipsed by the specificity of local models. However, recent advances in Deep Learning have allowed for better global models. Examples include DeepAR and Transformer models.

Autoregression vs. Time-Indexed Predictors

Generally speaking, there are two ways we account for the time-dependent nature of time-series data.

The first method is autoregression. Autoregression (AR) refers to the use of recent past values to forecast. For example, a simple AR model might forecast that tomorrow’s orders will be the average of today’s and yesterday’s. The relationships used in AR can also get more complicated. A moving average (MA) model uses the error from yesterday’s forecast as a predictor for today’s. Autoregressive forecasts have the advantage of being quick to react to recent trends.

The second method is with time-indexed components. The simplest version of this is a delta function (a one-time spike or drop) for exogenous predictors. For example, we might have a component in our model that drops the forecast by 50% on a holiday. More complicated components can be used for long-term trends. For example, orders rise in the winter and drop in the summer. A long-term sinusoidal component can mimic that rise and fall.

Note that these methods are not mutually exclusive. In fact, many of the forecasting models we use at Grubhub employ both autoregressive and time-indexed predictors. This allows our forecasts to be flexible in accounting for market trends.
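As a toy illustration of combining the two, here is a SARIMAX model from statsmodels fit on synthetic data. This is only a sketch of the pattern, not our production model; the data, the AR order, and the single holiday flag are all invented.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Synthetic daily order counts: a baseline, a weekend bump, and noise.
rng = np.random.default_rng(0)
days = pd.date_range("2020-01-01", periods=400, freq="D")
orders = pd.Series(200 + 50 * (days.dayofweek >= 4) + rng.normal(0, 10, len(days)),
                   index=days)
exog = pd.DataFrame({"is_holiday": (days.strftime("%m-%d") == "12-25").astype(int)},
                    index=days)

# order=(2, 0, 0) supplies the autoregressive part; exog supplies the
# time-indexed (here, holiday) component.
fitted = SARIMAX(orders, exog=exog, order=(2, 0, 0)).fit(disp=False)

# Forecasting needs future exogenous values too (see "Extending Predictors").
future_index = pd.date_range(days[-1] + pd.Timedelta(days=1), periods=14, freq="D")
future_exog = pd.DataFrame({"is_holiday": 0}, index=future_index)
forecast = fitted.forecast(steps=14, exog=future_exog)
```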

Control Models

Now, let’s say that one morning our weather database goes down. Or perhaps a global pandemic invalidates many of the long-term patterns our model trains on. You know, things that might happen. Suddenly, our autoregressive and component-based global deep learning models are of little use.

This is why we have simpler back-up models we can rely on. For instance, a model that for each forecasted day calculates a weighted average of the same day for the previous four weeks.
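A sketch of such a control model follows; the exact weights are an assumption for illustration.

```python
import numpy as np

def control_forecast(history: np.ndarray, horizon: int = 14,
                     weights: tuple = (0.4, 0.3, 0.2, 0.1)) -> np.ndarray:
    """Forecast each future day as a weighted average of the same weekday
    over the previous four weeks, most recent week weighted highest.
    `history` holds daily order counts with the most recent day last."""
    w = np.asarray(weights)
    weeks = history[-28:].reshape(4, 7)[::-1]  # row 0 = most recent week
    weekday_avg = w @ weeks                    # one value per weekday
    return np.tile(weekday_avg, int(np.ceil(horizon / 7)))[:horizon]
```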

This model is also a baseline for developing new models. If a model we create can’t outperform this naive model, we drop development and move on to the next.

Ensembles

We’ve now discussed many different forecasting models. The ones we use, and their properties, are as follows:

A table with the different models Grubhub uses, and their various properties.

You can see that we have five different models, each used to forecast hundreds of regions. But in the end, we need to have a single value for each region’s forecast. To accomplish that, we rely on a technique known as ensembling.

Perhaps unsurprisingly, it turns out that when you combine a bunch of models, you tend to outperform any single model. And that’s exactly what we do — take a simple median of the forecasts from each of these predictors.
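The ensemble step itself is tiny; a sketch with toy numbers:

```python
import numpy as np

# One row per model, one column per forecasted timeslot (toy values).
model_forecasts = np.array([
    [210.0, 305.0, 180.0],  # e.g., an autoregressive model
    [200.0, 290.0, 190.0],  # e.g., a global deep-learning model
    [220.0, 310.0, 170.0],  # e.g., the control model
])
ensemble = np.median(model_forecasts, axis=0)  # -> [210., 305., 180.]
```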

To return to our original question — how do we know how many orders there are going to be in a given area — that’s how! We take that median value and send it to our coworkers in scheduling, who, in turn, make sure there are enough Grubhub drivers in that area to meet demand.

What’s Up Next?

At this point, you should have a pretty good conceptual understanding of how we manage order volume forecasting: what the data looks like, how the models work, and how they combine to get our predictions.

Time series forecasting is a powerful methodology, with applications across industries. In healthcare, it can help track biometric trajectories. Environmental scientists can use it to predict changes to the climate. Quants in the finance industry use it for investments. And as more actors enter the space, research has continued making the models better and better.

But that’s only part of the story. It’s great to have a model, but how do we use it?

Stay tuned for part 2!


“I See Tacos In Your Future”: Order Volume Forecasting at Grubhub was originally published in Grubhub Bytes on Medium, where people are continuing the conversation by highlighting and responding to this story.

]]>
<![CDATA[Building an Inclusive Technology Organization at Grubhub]]> https://bytes.grubhub.com/building-an-inclusive-technology-organization-at-grubhub-7998a19fa624?source=rss----e9fd887c95d7---4 https://medium.com/p/7998a19fa624 Wed, 17 Mar 2021 15:41:35 GMT 2021-03-17T15:41:35.356Z
Photo by Hannah Busing on Unsplash

Building an Inclusive Technology Organization at Grubhub

By: Anthony Cardone and Bhuvana Husain

Millions of people from different backgrounds use Grubhub’s products every day to order from hundreds of thousands of restaurants across the country, and our goal is to bring that same variety of perspectives to our work environment. At Grubhub, we celebrate differences and believe that our diversity strengthens the quality of our products and technology. We’re committed to building an equitable and inclusive technology organization that ensures many voices are heard in our decision-making processes.

Project Overview

As part of our efforts related to diversity, equity, and inclusion, we’ve been working to remove subtly racist terminology from our systems, tools, and processes. The English language contains many colloquialisms that have been around for centuries and which imply, even today, that white means good and black means bad. Terms such as blacklist, whitelist, master, and slave are antithetical to our goal of inclusivity at Grubhub and must be changed.

To that end, we’ve been working throughout 2020 and into 2021 on a project to replace these and other words throughout our systems, services, configuration, user experiences, training materials, documentation, and communications.

How It Started

To kick things off, we identified non-inclusive language and suggested alternative terms.

These lists of subtly racist terms and alternative neutral terms were identified by a group of senior leaders, with some crowdsourcing of ideas from team members across our technology organization. When we announced this project, feedback from our colleagues was extremely positive; several people reached out to say thank you, or that they felt heard and empowered because of this effort. Much of the work was led by Clark Malmgren, a Vice President of Engineering, with support from Technical Program Managers (including the two of us) within the organization.

We consulted several articles published by psychology researchers and professionals as references, including “Blacklists” and “whitelists”: a salutary warning concerning the prevalence of racist language in discussions of predatory publishing, Confronting the Language of Subtle Racism, and Why is ‘black’ always a bad word? | Editorials.

After this, we put together a list of almost two hundred services within our tech stacks and their corresponding codebases, databases, and documentation, all of which needed to be updated. We then proceeded to coordinate with hundreds of engineers around the world to make those updates. We’ve been tracking the updates via spreadsheets and tickets, ensuring that leads continue to plan these changes into their teams’ schedules as needed. We’ve also incorporated the ongoing tracking of these changes into our regular sprint planning processes, as well as our weekly and monthly review meetings.

How It’s Going

This has not been an easy effort — think of how many places the ‘master’ branch of source code is referenced — but it’s a worthwhile one. We’re not done yet, but we’re proud to say that over 85% of our services have been updated, and the rest are scheduled to be updated soon. We’re also working with our vendors to address external company tools and dependencies where we couldn’t simply change the terms (e.g., Contentful, GitHub, and others). Our engineers submitted requests asking those vendors to consider updating their verbiage; GitHub already had a plan in motion, and we’re planning to implement their solution to move our primary branch from ‘master’ to ‘main’.

This project has led to changes among all of our employees, partners, and customers. We’ve evolved how we think and talk about what we do. And we’ve observed team members holding each other accountable in group meetings, code reviews, comments in documentation, and more. One team member suggested to a restaurant partner that they replace the term ‘blackout’ with ‘temp close’. Another employee noted that a campus partner has used our recommendations to stop using the terms ‘whitelist’ and ‘blacklist’ entirely across their own organization.

Doing More to Build an Inclusive Environment

We recognize that changing this terminology is only one part of a multi-year, multi-team, company-wide strategy for greater diversity, equity, and inclusion. In addition to the effort described above, we’ve launched and added to our Voices Council and Employee Resource Groups. Throughout 2020, we also hosted a webinar series focused on inclusion, featuring notable guest speakers such as Bryan Stevenson, Katrina Lake, Linda Johnson Rice, Luvvie Ajayi, and Chloe Valdary. In 2021, we’ve hosted multiple events in honor of Black History Month and Women’s History Month, amplifying voices from underrepresented groups. And all our people leaders are attending inclusive leadership training with industry experts in this field from Paradigm.

We’ve also focused on improving our job descriptions to be more inclusive, changing our sourcing techniques to find more diversity in our candidate slates, and enhancing our interview techniques to better include underrepresented populations. We’re proud to have launched a Reconnect Returnship program with PathForward, targeted at individuals who have at least five years of professional experience and have been out of the paid workforce for at least two years to focus on caring for a child or other dependent.

If you’d like to join us in reinventing the way that restaurants and diners connect, we’d love to meet you! Check out our Careers site to learn more about our team, our culture, and our jobs, and tell us what you’d bring to our table.


Building an Inclusive Technology Organization at Grubhub was originally published in Grubhub Bytes on Medium, where people are continuing the conversation by highlighting and responding to this story.

]]>
<![CDATA[Dask and TensorFlow in Production at GrubHub]]> https://bytes.grubhub.com/dask-and-tensorflow-in-production-at-grubhub-75a3dc923d04?source=rss----e9fd887c95d7---4 https://medium.com/p/75a3dc923d04 Tue, 18 Aug 2020 17:44:35 GMT 2020-08-18T17:44:59.639Z Dask and TensorFlow in Production at Grubhub

We recently caught up with Alex Egg, Senior Data Scientist at Grubhub, about modern data science and machine learning methods to understand the intent of someone using Grubhub Search. As Alex told us,

“Search is the top-of-funnel at Grubhub. That means when a user interacts with the Grubhub search engine, they want to be able to service their request with high precision and recall. One way to do that is to understand the intent of the user. Do they have a favourite restaurant in mind or are they just browsing cuisines? Are they looking for an obscure dish?”

In this post, we’ll summarize the key takeaways from the stream. We covered

  • Classic distributed ETL pipelines with Dask Dataframes,
  • Weak-supervision (labeling) w/ Snorkel & Dask (what a modern combo!),
  • Language Modeling and deployment w/ Tensorflow.

You can find the code for the session here. You can also check out the YouTube video here:

The task

Figuring out user intent is highly non-trivial, particularly when you don’t have labelled data mapping search queries to actual intent. As motivation, Alex asked us a question straight away: if a user types “French” into the search field, what do you think they’re looking for?

Example Query with “French” Intent

There are several approaches to dealing with the challenge of not having labeled data. A common one is hand-labelling, but this can be both time- and resource-intensive: you can ask employees, friends, and family to label data, outsource to crowd-sourcing platforms such as MTurk, or use products like ScaleAI.

A more modern approach is to use weak supervision, where “noisier or higher-level supervision is used as a more expedient and flexible way to get supervision signal, in particular from subject matter experts (SMEs).”

In Grubhub’s case, the team provides a set of target labels, and the options for weak supervision include utilizing:

  • a pre-existing knowledge base (distant supervision),
  • pre-trained models,
  • keyword lists, and
  • heuristics (such as using regular expressions to identify addresses!).

Such weak supervision is embedded in a broader production pipeline, schematized here:

E2E Data Science Workflow

We see therein

  • ETL, powered by Dask,
  • Weak supervision (Snorkel and Dask),
  • Modeling, using Keras and TensorFlow,
  • Nbdev, a key piece of infrastructure that allows Grubhub to integrate quickly and makes deployment easy, and
  • Model deployment and maintenance, using TensorFlow

Let’s now go through each of these in a bit more detail!

ETL, Dask DataFrames, and Weak Supervision with Snorkel

Snorkel Workflow. Image from Ratner et al., 2019.

At Grubhub, Alex and his team see so many queries (millions per day) that pandas DataFrames aren’t enough for even the basic model they want to build from them. For this reason, they use Dask DataFrames.

For this session, Alex used an Amazon EC2 P2 instance: he had a GPU and 4 CPUs and the plan was to get a Dask local cluster up and running on them (and he succeeded!). For a bit more context, all ETL was from a data lake and powered by Dask.

As stated above, for the weak supervision he required a label set to map search queries to.

He also required a set of rules / heuristics to interrogate the most likely label for each search. Here you can see the function that determines how likely it is for a query to be an address

import re

from snorkel.labeling import labeling_function

# LABEL_MAPPING and ABSTAIN are assumed to be defined elsewhere in the
# session's notebook, e.g. LABEL_MAPPING = {"ADDRESS": 0, ...}; ABSTAIN = -1

@labeling_function()
def address(x):
    # Vote ADDRESS when the query looks like a US street address.
    query = x["query"]

    exp = r'\d{1,4} [\w\s]{1,20}(?:street|st|avenue|ave|road|rd|highway|hwy|square|sq|trail|trl|drive|dr|court|ct|parkway|pkwy|circle|cir|boulevard|blvd)\W?(?=\s|$)'
    regex = re.compile(exp, re.IGNORECASE)

    if regex.search(query):
        return LABEL_MAPPING['ADDRESS']
    else:
        return ABSTAIN

Note that the function can also ABSTAIN. This speaks to the fact that, for any given query, we essentially get all the functions for the different labels to vote on what the most likely label is.

There were several steps involved in getting this up and running, and we got to see the Dask dashboards in action while extracting the search query texts from the data lake here. Hearing Alex and Matt geek out over the real-time value of the dashboards was a lot of fun (for Hugo, anyway):

Then Alex performed weak supervision with Snorkel: the rules / functions (called “labeling functions” in the Snorkel docs) vote, and a label prediction is made. Note that the outputs are soft predictions / probabilities, and that you need to threshold them to get the actual prediction.

We saw rules / heuristics such as the gibberish detector (can you guess what this does?) and entropy detector (detects accidental searches such as ‘cdcssdcads’).

We also saw the labeler in action with the example “sushi”, as a sanity check:

Example Labeling Function judgments for query “sushi”

Note that there is disagreement (conflict) between labeling functions here and that this is fine and natural. How do you deal with this? Well, Snorkel has sophisticated methods to do this for you, which you can find out more about here (the relevant section starts at “Our goal is now to convert the labels from our LFs into a single noise-aware probabilistic label per data point”)!

Determining User Intent across a huge dataset

It was then time to use our labeling functions to determine user intent across a large number of queries. Essentially, the task here is to map these labeling functions across our huge dataset with Dask! (Side note: Matt was pleasantly surprised to find that Snorkel has a Dask sub-module!) In the process, we once again saw the Dask workers in action here, and it was as hypnotic as ever.

Matt also showed us the profile plot and pointed out where you can see SSL, decompression, boto, other AWS internals, and S3 data (likely Parquet):

Dask Profile Plot

We wrapped up this section by converting the output soft labels / probabilities to hard labels in order to then train the model and deploy it.
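
Putting those pieces together, here is a sketch of the flow using Snorkel’s Dask applier and label model; queries_ddf (a Dask DataFrame of search queries) and the extra labeling-function names are illustrative:

from snorkel.labeling.apply.dask import DaskLFApplier
from snorkel.labeling.model import LabelModel

lfs = [address, gibberish, entropy]  # labeling functions like the one above

applier = DaskLFApplier(lfs)
L = applier.apply(queries_ddf)  # label matrix: one vote per (query, LF)

label_model = LabelModel(cardinality=len(LABEL_MAPPING))
label_model.fit(L)
probs = label_model.predict_proba(L)  # soft labels / probabilities
hard_labels = probs.argmax(axis=1)    # threshold to hard labels for training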

Training the model using Keras

Before jumping into training immediately, like any good data scientist, Alex checked out the data to see what was going on. First, he noticed a serious class imbalance: the data was skewed, likely according to some sort of power law.

We discussed methods for dealing with such imbalances, including downsampling and weighting the loss function (for example, according to class weights and/or counts). Alex used a Dask utility truncate_tail that he wrote to downsample the data, mentioning that we would also weight the loss function later.

We discussed the type of models we could use:

  • Naive Bayes generally works pretty well out of the box but doesn’t care about order,
  • Subword tokenization (creating sets of n-grams) is relatively lightweight and considers character order, so Alex went with this, combined with a logistic regression model.

He used Keras to pass the n-gram embeddings into a pooling layer (which collapses all of the token embeddings into a single fixed-length vector), and then we watched the epochs roll in as he trained the model and plotted the learning curves.
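
Here’s a sketch of that style of model in Keras; the vocabulary size, embedding dimension, and class count are illustrative, and the pooling layer shown averages the embeddings:

import tensorflow as tf

VOCAB_SIZE, EMBED_DIM, N_CLASSES = 20_000, 16, 12

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True),
    tf.keras.layers.GlobalAveragePooling1D(),  # pool n-gram embeddings
    tf.keras.layers.Dense(N_CLASSES, activation="softmax"),  # ~logistic regression
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# class_weight counters the power-law imbalance discussed above:
# model.fit(X_train, y_train, epochs=10, class_weight=class_weights)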

We also saw, for example, that the model gave perfect predictions for ‘Tobacco’ searches, even though there was relatively little data in the training set with this label!

We tested a variety of queries and saw that, even when ‘cigarettes’ was misspelt ‘cigaretes’, the model worked correctly (the model is robust to spelling!):

Deploying the model using TensorFlow

Now we came up against a pretty serious challenge: how to get this model into production. For productionized prediction, you would need to reimplement a lot of things in Java, and a great deal would get lost in translation.

An alternative to this is to do everything in TensorFlow or, as Alex said, in a “nice hermetically sealed TensorFlow runtime.” Alex also made the key point that this is not merely a technical question, but inherently a social, cultural, and political challenge: if you want to get your ML models deployed within your org, you had better make it as easy as possible for your colleagues to do so.

To this end,

  • Alex reimplemented the pipeline with the TensorFlow Estimator API,
  • exported the model using the SavedModel format,
  • walked us through the Python client and then the Java client,
  • mentioned that he can then easily turn this into a PyPI package using nbdev (the point of deploying this as a package is so that they can operationalize it using their internal job system), and
  • told us that this is the model that is actually running in production on Grubhub: Dask, Snorkel, TensorFlow, and many other packages for the win!
Example of TensorFlow in-graph ops, which help avoid train/test skew.
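
In current TensorFlow 2 terms (the session itself used the Estimator API), the SavedModel round trip looks roughly like this sketch, with illustrative paths:

import tensorflow as tf

# Export: the graph, including any in-graph preprocessing ops, is
# serialized so that Python and Java clients serve the same logic.
tf.saved_model.save(model, "/tmp/intent_model/1")

# Load and grab the default serving signature.
reloaded = tf.saved_model.load("/tmp/intent_model/1")
infer = reloaded.signatures["serving_default"]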

Originally posted on Coiled.io


Dask and TensorFlow in Production at GrubHub was originally published in Grubhub Bytes on Medium, where people are continuing the conversation by highlighting and responding to this story.

]]>
<![CDATA[“Just What I Needed”: Making Machine Learning Scalable and Accessible at GrubHub]]> https://bytes.grubhub.com/just-what-i-needed-making-machine-learning-scalable-and-accessible-at-grubhub-24734cc4139d?source=rss----e9fd887c95d7---4 https://medium.com/p/24734cc4139d Mon, 03 Feb 2020 16:52:35 GMT 2020-02-03T16:52:35.311Z “Just What I Needed”: Making Machine Learning Scalable and Accessible at Grubhub
Photo by Tiard Schulz on Unsplash

Data scientists at Grubhub develop and deploy predictive models to improve business decision-making, as well as in-app diner, driver, and restaurant experiences. Until recently, taking models from the prototype stage to run as scheduled jobs in production was both challenging and time-consuming, as we lacked suitable infrastructure and standardized tools. This, in turn, resulted in multiple bespoke solutions, duplicated code, and a good deal of model maintenance overhead.

Over the past few months, data scientists and machine learning engineers have partnered to develop our first machine learning platform: a suite of tools designed to democratize and increase the velocity of machine learning model deployments. This post will summarize the need for such a framework, as well as provide a technical overview of our particular implementation and lessons learned during the development process.

The Problem

Our data science teams are embedded in business, product, and technology groups in order to solve problems and implement new solutions with data. This decentralized organization comes with certain advantages, namely faster iteration cycles, but also creates unevenness in technical ability. Over time, some teams have developed their own custom pipelines for model deployment, while others, lacking software engineering expertise of their own, have not benefited from these tools.

This has meant that, in practice, production machine learning at Grubhub has historically been accessible to only a small number of teams. Moreover, the differing stacks on which models were deployed created technical debt and limited our ability to scale solutions across the organization.

Introducing the ML Framework

To address some of these issues, we developed a platform to provide reliable, reproducible machine learning pipelines that simplify the process of training and deploying models for data scientists across the company. This platform has five goals:

  • Lower the barrier to entry for model deployment
  • Reduce time between model development and deployment
  • Minimize technical debt associated with maintaining multiple bespoke solutions for machine learning
  • Codify standards and best practices for developing and deploying models at scale
  • Provide a common forum in which data scientists can collaborate

Projects are built within this framework using a number of Python libraries that have been designed to optimize developer efficiency and improve maintainability and extensibility:

  1. Data Access Objects: This layer manages and tracks access to the data warehouse. The boilerplate mechanics of reading and writing data are abstracted away from data scientists, reducing the number of details we need to worry about. This library encourages efficiency in data management in addition to providing features such as data lineage tracking, query sanitization, and schema validation (a sketch of what such a layer might look like follows this list).
  2. Feature Engineering: We maintain a shared feature pool to which different teams and projects can contribute. Machine learning models will often use the same features, and maintaining these features centrally reduces code redundancy, cuts time to deployment, and improves cross-team collaboration.
  3. Model job/application: The application layer handles all project level configurations, including model specification and parameters, compute resource requirements, training, evaluation, scoring, monitoring, and project builds.
  4. Utility libraries: These repositories store common code and tools shared between projects, from simple classes like a standard logger to more sophisticated interfaces such as a standard PySpark application.
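
As a concrete illustration of the data access layer in (1), here is a hypothetical sketch; the class and method names are ours, not Grubhub’s actual library:

from dataclasses import dataclass

import pandas as pd

@dataclass
class TableDAO:
    table: str
    schema: dict  # column name -> expected dtype

    def read(self, warehouse, query: str) -> pd.DataFrame:
        df = warehouse.run(query)    # boilerplate I/O handled here
        self._validate(df)           # schema validation
        self._record_lineage(query)  # data lineage tracking
        return df

    def _validate(self, df: pd.DataFrame) -> None:
        missing = set(self.schema) - set(df.columns)
        if missing:
            raise ValueError(f"{self.table}: missing columns {missing}")

    def _record_lineage(self, query: str) -> None:
        pass  # e.g., log (table, query, timestamp) to a lineage store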

The modularization of this architecture facilitates sharing of code across different data science projects. It also enforces machine learning best practices, ensures consistency in configuration, and standardizes the way in which models are deployed at Grubhub, enabling greater reproducibility and elevating the quality of our model outputs.

In the next section, we will review the process of deploying a model to illustrate how the various layers of code interact.

Developing and Deploying ML Models

Mapping out the workflow of a data scientist bringing a developed model to production.

We divide our model deployment step into three phases: systems design, feature engineering, and model deployment. These three phases constitute separate compartmentalized workflows and are intended to minimize context switching.

ML Production Design

After a model has been prototyped, typically in a Jupyter notebook environment, the next step involves systems design of the productionized model.

This design step is arguably the most important of the deployment process, as there is a high time cost in switching back and forth between designing, planning, and coding. To help data scientists minimize such context switching, and so that they can better predict their workload, we created a series of standard questions to guide data scientists through the planning process. Documenting these decisions and making them available to our data science community results in more optimal solutions, and ultimately makes the development process much smoother.

These are the standard questions that should be answered before productionization:

  1. What are the input data locations and how often are they updated?
  2. How is the data labeled? What’s the risk of getting stuck in a feedback loop?
  3. How are the model features defined? Are there feature engineering job requirements?
  4. What libraries does the model depend on? (e.g. scikit-learn, XGBoost, TensorFlow, etc.)
  5. Will predictions be made online or offline?
  6. What downstream jobs or processes will depend on the output of this job? What are the SLAs (Service Level Agreements)?
  7. What are the compute resources needed to execute the jobs within the agreed upon SLAs?
  8. How often should the model be retrained?
  9. What metrics will be used to evaluate the model? If the model is supervised, is there a prepared holdout set that can be used for validation?
  10. What is the plan for monitoring train and predict jobs, as well as the model performance? What are the consequences of training or prediction job failure?

Once sufficient responses have been documented and reviewed, we can begin building the pipelines. Development involves taking a proof of concept model, often tested on small-scale, sampled data, and translating it to work at scale. This process typically consists of one or more PySpark applications that handle everything from feature engineering and model training to validation and scoring.

Feature Engineering

Model features are computed in PySpark applications that run on a periodic scheduled basis. The primary benefit of computing features in jobs separate from model training or prediction is to take advantage of existing features built in previous projects. The ML Platform team maintains a series of PySpark applications and UDFs (User Defined Functions) that can be easily extended to include new features not already part of the existing feature pool.
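
A sketch of what one of these scheduled feature jobs might look like; the paths, columns, and the feature itself are illustrative:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("diner_features").getOrCreate()
orders = spark.read.parquet("s3://example-bucket/orders")

# Example feature: each diner's 28-day order count, written to the
# shared feature pool so other projects can reuse it.
features = (orders
            .where(F.col("order_date") >= F.date_sub(F.current_date(), 28))
            .groupBy("diner_id")
            .agg(F.count("*").alias("orders_28d")))

features.write.mode("overwrite").parquet("s3://example-bucket/feature_pool/orders_28d")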

Model Deployment

This phase includes, at minimum, developing a reproducible job to train the model, and if applicable, a job that will make predictions on new data. The training job may or may not run at the same cadence as the prediction job, and will typically produce a serialized model artifact that is stored and versioned.
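
A sketch of that final serialization step, assuming a scikit-learn-style model object and a date-based version scheme (both illustrative):

import datetime
import os

import joblib

version = datetime.date.today().strftime("%Y%m%d")
artifact_path = f"models/order_classifier/{version}/model.joblib"
os.makedirs(os.path.dirname(artifact_path), exist_ok=True)

joblib.dump(fitted_model, artifact_path)  # fitted_model: the trained estimator

# The prediction job later loads the pinned version:
# model = joblib.load(artifact_path)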

Once a job has been successfully tested in the development environment, and the output validated, it can then be deployed to production, following a code review process. Deploying an offline model for batch prediction involves scheduling the feature engineering and model training/scoring jobs in Azkaban, a scheduler tool we use to manage computing resources and execution of the model run. For online models, the training job is also scheduled in Azkaban, but the trained model is converted to an object that can then be deployed in our applications for real-time scoring.

Lessons Learned

Here are a few key takeaways we learned while building the first version of our ML Framework:

  • Establishing new standards and processes can uncover previously hidden tech debt in legacy projects

Legacy projects that seemed to be running smoothly were, once audited, discovered to be in need of unplanned updates. For example, we learned that a model had not been re-trained since its initial deployment, and that the code needed to do so lived in a Jupyter notebook outside of the deployment pipeline.

Tech debt is an unavoidable side effect of developing code in a changing environment, and it is crucial to account for such maintenance when planning out project deliverables.

  • Spark and distributed computing have a high learning curve

This might seem obvious, but it bears repeating. At Grubhub, much of the source data we use to prepare model training datasets is large enough that processing it necessitates a distributed computing tool like Spark.

Gaining proficiency in Spark, given its complexity, can be a significant time commitment, and creating data models, as well as writing and debugging Spark jobs, is often a bottleneck during a project’s development. We have somewhat mitigated this friction through new tooling and training sessions, and we are continuing to work to lower the barrier to entry even further.

  • Python dependency management is hard

Python has a wealth of packages maintained by the open-source community, making it simple to experiment with a variety of models and algorithms. However, when it comes to creating reliable, automated builds of a repository that has many dependencies, it is easy to fall into dependency hell if package versions are not tightly controlled.

Fortunately, we have adopted tools like pip-compile to auto-generate requirements files, and we are working with the infrastructure team to integrate containers into the dev and build process.

What’s to Come

We’re thrilled that what we’ve built so far has contributed to streamlining the machine learning and model development process at Grubhub, and we plan to democratize machine learning — and better support the needs of our business — even further.

Here are a few of the projects we have planned:

  • Integrated UIs for monitoring and data/model visualization

Currently, much of the metadata associated with model development exists in disparate sources like configuration files, database tables, and monitoring tools. We plan to integrate and surface such information to make it easy for anyone in the company to search past experiments, view model validation reports, and share data visualizations. These dashboards will also enable us to monitor model performance and feature drift over time.

  • A Feature Metastore service for improving discoverability and sharing of existing features

We want to expand on existing efforts to consolidate feature engineering work by making a Feature Metastore service that will enable discovery and reuse of existing features as well as publishing of new features.

  • More automation

A good deal of work has already been done to abstract away the most common aspects of data pipeline applications, but there is still plenty of boilerplate code that needs to be written for a new project. We would like to make it possible for new projects to be created on the fly from templates using a few criteria as inputs.

Special thanks to Robert Doherty, Damon Mok, and Michelle Koufopoulos for their work on this blog post.


“Just What I Needed”: Making Machine Learning Scalable and Accessible at GrubHub was originally published in Grubhub Bytes on Medium, where people are continuing the conversation by highlighting and responding to this story.

]]>
<![CDATA[Search Query Embeddings using query2vec]]> https://bytes.grubhub.com/search-query-embeddings-using-query2vec-f5931df27d79?source=rss----e9fd887c95d7---4 https://medium.com/p/f5931df27d79 Mon, 04 Nov 2019 16:57:06 GMT 2019-11-04T16:57:06.389Z Query2vec: Search query expansion with query embeddings
query2vec: Latent Query 3D Embedding Space for “Maki Combo” search query

Discovery and understanding of a product catalog is an important part of any e-commerce business. The traditional — and difficult — method is to learn product interactions by building manual taxonomies. However, at Grubhub we leverage recent advancements in Representation Learning — namely Sequential Modeling and Language Modeling — to learn a Latent Food Graph. With our strong and scalable understanding of the product catalog, we are able to power better search and recommendations — and in a much more sustainable fashion — than if we were maintaining an expensive handmade taxonomy.

Figure 1: An example food knowledge graph. Yellow nodes are Dish types, grey nodes are Cuisine types, and white nodes are particular subcategories.

Knowledge Graphs

E-commerce companies have the difficult challenge of understanding their inventory catalog, especially when the catalog can grow unbounded, and where new data (for example, restaurants and menus) is unstructured. The goal is to be able to systematically answer questions such as:

  • What Dish type is Dan Dan Noodles?
  • What Cuisine type is Dan Dan Noodles?
  • What are some trending Asian noodle dishes?
  • What are 3 semantically similar restaurants to Le Prive NYC? (cuisine-level semantics)
  • What are some related cuisines to Udon Noodles? (cuisine graph traversal)
  • What are menu items similar to Blueberry Pancake Breakfast? (semantic matching)
  • What are some synonyms for Pierogi? (cross-lingual query expansion)
  • What are 3 dishes similar to Kimchi-jjigae? (dish-level semantics)

At Grubhub, it is instrumental to our business to be able to answer taxonomic questions about any item in our inventory. The typical way to answer these questions is to build a product Knowledge Graph. These graphs tend to be rule-based systems, and they come with certain flaws: namely, they’re expensive, and can be difficult to scale and maintain.

An example knowledge graph is depicted above in Figure 1. Consider the query item, “Dan Dan Noodles.” Without a knowledge graph, it might be difficult to know where this item sits in the universe of food. Using fuzzy matching, a machine can possibly infer from the name that the dish has noodles. However, it cannot know that it is Chinese cuisine from text matching alone. We might be able to develop a set of rules with lookup tables to store the cuisine relationship, but such a system would likely be difficult to maintain, and it wouldn’t be able to handle all cases.

Currently, Grubhub has a rule-based knowledge graph system in place, although it faces some of the same issues described above: accuracy, scalability, expense, and maintenance. In order to overcome these issues, our hypothesis is that we can augment the current system using recent advances in Representation Learning and Neural Language Modeling.

Representation Learning

Representation Learning is a special application of Deep Learning in which latent representations of items (latent meaning they cannot be observed directly) are learned from data. Given these representations of items in the Grubhub universe — for example, menu item dishes and cuisines — we can start making connections between these items, and therefore make relationships. In other words, these representations can be viewed as a latent food knowledge graph.

Some popular techniques for representation learning are:

  • Unsupervised Pretraining
  • Transfer Learning and Domain Adaptation
  • Distributed Representations

In this post, I’m focusing on the most popular techniques: Distributed Representations and Transfer Learning, which is mostly Neural Language Modeling in practice.

Language Models

Due to the large and unbounded scale of our catalog, we must leverage automation in order to understand our unstructured text menu data. If presented in a certain way, language models can learn the semantics of natural human language. Our hypothesis is that if we can train a language model on our unstructured menu data, then we can gain an understanding of menus and restaurants.

The output of most language models is a set of embeddings: dense vectors that typically describe words. For example, the words “chicken” and “roast” should be close together in the three-dimensional embedding vector space shown below in Figure 2.

Figure 2: Two example menu embeddings. Relations are projected into three dimensions.

Language Models are often trained generically on a few common tasks, and their ancillary embeddings are then used for transfer learning or domain adaptation for a specialized downstream objective such as classification or ranking. Popular language modeling tasks include: Neural Machine Translation and Next Word Prediction. A common and simple implementation of next word prediction language modeling is the class of word2vec algorithms.

Word2vec

The word2vec class of algorithms has been instrumental in driving innovation in neural language modeling. Due to its simple structure and interpretable outputs (embeddings), it has proven popular in both industry and academia.

The Word2vec algorithm maps words to semantic distributed vectors. The resulting vector representation of a word is called a word embedding. Words that are semantically similar correspond to vectors that are close together. That way, word embeddings capture the semantic relationships between words.

The algorithm learns semantic representations (embeddings) of language based on the Distributional Hypothesis, which states that words occurring in the same contexts tend to have similar meanings. Therefore, if we can generate pairs of words that appear in the same context and then learn to predict those pairs in a classification setting, we have a model of language. Consider the example text below from a real Grubhub restaurant menu:

Magret de Canard: Roasted parsnips and celeriac puree with gastrique.

After Preprocessing Normalization:

magret canard roasted parsnip celeriac puree gastrique

The algorithm then runs a window over the sentence and makes pairs of words:

  1. ( magret CANARD roasted ) parsnip celeriac puree gastrique
  2. magret ( canard ROASTED parsnip ) celeriac puree gastrique
  3. magret canard ( roasted PARSNIP celeriac ) puree gastrique
  4. magret canard roasted ( parsnip CELERIAC puree ) gastrique
  5. magret canard roasted parsnip ( celeriac PUREE gastrique )

List 1: window size 3 with center-context pairs. The uppercase word is the center and the parenthesized words mark its context window.

Table 1 below shows an example of some of the possible generated pairs.

  • Context, Target
  • canard, magret
  • canard, roasted
  • roasted, canard
  • roasted, parsnip
  • parsnip, roasted
  • parsnip, celeriac
  • celeriac, parsnip
  • celeriac, puree

Table 1: Window size 3 pairs (not exhaustive).
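
That windowing is simple to sketch in Python, using the normalized menu sentence above:

tokens = "magret canard roasted parsnip celeriac puree gastrique".split()

pairs = []
for i, center in enumerate(tokens):
    for j in (i - 1, i + 1):  # context positions for a window of size 3
        if 0 <= j < len(tokens):
            pairs.append((center, tokens[j]))

# pairs[:4] -> [('magret', 'canard'), ('canard', 'magret'),
#               ('canard', 'roasted'), ('roasted', 'canard')]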

These pairs are then passed through the skipgram architecture below in Figure 3. For example, if the context word is “canard,” and the target word is “roasted,” then the algorithm will learn the conditional probability of “roasted” given “canard” by minimizing a standard Maximum Likelihood Loss Function in Equation 1.

Equation 1: Skipgram probabilistic interpretation: the softmax probability distribution over word pairs in the same context (Table 1).
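
For reference, the standard skip-gram formulation from the word2vec literature, which Equation 1 depicts, can be written as:

P(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{W} \exp\left({v'_{w}}^{\top} v_{w_I}\right)},
\qquad
\mathcal{L} = -\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\; j \ne 0} \log P(w_{t+j} \mid w_t)

where v and v' are the input and output embedding vectors, W is the vocabulary size, c is the window size, and T is the number of words in the corpus.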

It should be clear that after going through all the pairs on a large dataset, words such as “canard” and “parsnip” will be clustered around the word “roasted” in the embedding space in a French cuisine cluster.

Figure 3: Skipgram Architecture where “roasted” should predict “canard” in the softmax layer.

An important point to highlight is that we call this a content-based embedding, meaning it is based solely on curated text content. At Grubhub, however, most of our data comes from user-feedback logs, which we collect and call “behavioral data.”

Embedded Search Queries and Behavioral Feedback Data

When you search and browse on Grubhub, you are providing implicit and explicit feedback on your preferences. More importantly, you are giving Grubhub feedback on how particular items are related. For example, if you search for the delicious French dish “Magret de Canard” and convert on the restaurant Le Prive, and if someone else searches for the cuisine “French” and clicks on Le Prive, then at scale there is strong collaborative feedback that Le Prive offers French cuisine and that Magret de Canard is French cuisine.

Below in Table 2 are example user search queries that are clearly windows into user preferences.

  • User A, shepherd pie italian sambar irish stew shepherd pie Irish stew
  • User B, seafood lie stella keto
  • User C, ice cream el salvadoran dim sum octopus cereal bella tres leche grill shrimp chocolate

Table 2: Behavioral Data Example search tokens for three users.

You might wonder: if we can embed words, what else can we embed? Restaurants, menu items, even search queries? In order to do so, we’d have to adapt the Distributional Hypothesis from language modeling to a non-language setting. The goal of the Distributional Hypothesis is simply to provide a heuristic for generating pairs of items (words) from sequences (sentences). But we’re in luck: skipgram pairs are automatically generated by user behavioral feedback whenever users enter a search query and convert on a restaurant.

The difference between this and word2vec is that the primitive atomic unit we are embedding isn’t a word, but a search query. Therefore this isn’t a language model, but a query model.

Query Expansion

There are many applications of our latent food graph, but the one we’ll be highlighting in this post is an application in Grubhub search, particularly Query Expansion.

To understand the importance of Query Expansion, consider this illustration of the Grubhub search pipeline below in Figure 5.

After a query is submitted by the user, it is preprocessed using standard normalization techniques, then sent to the next stage, where the query is prepared for the search backend. This step includes Query Expansion, where we hope to boost the recall of the query. Moving on to Candidate Selection, full-text search is performed using exact and inexact text matching. Finally, candidates are processed with a high-precision ranker according to a number of criteria, such as revenue, relevance, and personalization, before being prepared for the presentation layer.

Figure 5: Grubhub Search Pipeline. During the Query Building phase, Query Expansion helps to generalize the user’s intent.

The goal of Query Expansion is to generalize the user’s intent in order to increase recall. Recall is an Information Retrieval metric that quantifies how well a search system finds all relevant items. A query expansion system is useful in two cases: long-tail queries (uncommon or very specific, like “blintz”) and small markets (where inventory is limited).

Even in large markets like New York City, certain queries cannot be serviced — for example, a regional cross-lingual name such as “blintz,” which in New York City would be called a pierogi. There are plenty of blintzes served in New York, just not by that name. In a small market, there may be only three restaurants available, and even if we can’t match the user’s query exactly, we want to show something instead of nothing. In both cases, it is useful to generalize the user’s intent, or in other words, expand the query.

Consider the theoretical query expansion example below for Dan Dan Noodles:

Figure 6: A good application of query expansion (theoretical).

If this was a large market like New York City, or a small market like Barstow, CA, the effect of this expansion is higher recall and a better search experience. In New York, the user will find their exact noodles and in Barstow, they’ll probably find some type of Asian cuisine.

How does query expansion work? The classical technique uses synonyms.

Classic Query Expansion

A contrived example of classical query expansion, using a thesaurus to find query synonyms, is highlighted below for the example query “Dan Dan Noodles”:

Figure 7: A contrived example of classical query expansion.

The only common English noun in the query is “noodle,” which is also an expression for a human brain. Obviously, this is a bad expansion, and it serves to show some of the difficulties in using a thesaurus, especially in domain-specific situations like food.

More robust techniques leverage user feedback data and Representation Learning.

Semantic Query Expansion

By clustering user search queries around converted restaurants and using Representation Learning, we can, in the word2vec fashion, learn a query model. That is, through behavioral feedback, we can gain a semantic understanding of query intent.

If we revisit the contrived “Dan Dan Noodles” example from Figure 7, we can now apply the same experiment to a semantic model. Figure 8 shows a screenshot from a tool that we use to QA our embeddings — the TensorFlow Embedding Projector. The input query is under the “Search” field, and the ten nearest neighbors are listed below.

Figure 8: 10 NN Results for “Dan Dan Noodles” on Semantic Query Model. Note that the strange spellings are a result of the text normalization process.

As you can see, the results of the expansion are very close to the theoretical baseline in Figure 6. The model did not make the same mistake as the classical approach in Figure 7, which relied on lexical matching.

Query2vec

The model in Figure 10 is called query2vec, as it embeds search queries. The training routine takes pairs, as with any skipgram architecture. However, in this case it is not working at the word level: the contexts are search queries, and the targets are restaurants. This is visualized in Figure 9.

Figure 9: Converted queries and their respective restaurants.

After the query2vec training routine is complete, the output artifact — an embedding matrix (query_embedding in Figure 10) — is used to make expansions. To perform query expansion, a K-Nearest Neighbor routine must be performed in the embedding space.

Figure 10: Prototypical skipgram architecture. In practice, the softmax layer at the end of the network is replaced with some type of approximation, like NCE, for performance reasons.

We generated one year of (search query, restaurant) pairs and then trained the model, built on the TensorFlow architecture in Figure 10, until the loss stopped decreasing. Tunable model hyperparameters relate to the NCE loss and the dimensionality of the embeddings.
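
A sketch of that architecture in TensorFlow, with illustrative vocabulary sizes, dimensions, and variable names (query_embedding corresponds to the embedding matrix mentioned above):

import tensorflow as tf

NUM_QUERIES, NUM_RESTAURANTS, EMBED_DIM, NUM_SAMPLED = 100_000, 50_000, 64, 32

query_embedding = tf.Variable(
    tf.random.uniform([NUM_QUERIES, EMBED_DIM], -1.0, 1.0))
nce_weights = tf.Variable(
    tf.random.truncated_normal([NUM_RESTAURANTS, EMBED_DIM], stddev=0.1))
nce_biases = tf.Variable(tf.zeros([NUM_RESTAURANTS]))

def loss_fn(query_ids, restaurant_ids):
    # Look up the embedding of each context query in the batch.
    embedded = tf.nn.embedding_lookup(query_embedding, query_ids)
    # NCE approximates the softmax over all restaurant targets.
    return tf.reduce_mean(tf.nn.nce_loss(
        weights=nce_weights, biases=nce_biases,
        labels=tf.expand_dims(restaurant_ids, 1),
        inputs=embedded,
        num_sampled=NUM_SAMPLED, num_classes=NUM_RESTAURANTS))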

The model’s embeddings are then exported into an Approximate K-Nearest-Neighbour index for fast runtime KNN lookup. For QA, we explore the embeddings with the TensorFlow Embedding Projector.
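
Continuing from the sketch above, here is what that export might look like with a library such as Annoy (the post doesn’t name the actual index implementation, and query_id_for is a hypothetical helper mapping a normalized query string to its row in the embedding matrix):

from annoy import AnnoyIndex

index = AnnoyIndex(EMBED_DIM, "angular")  # cosine-style distance
for i, vector in enumerate(query_embedding.numpy()):
    index.add_item(i, vector)
index.build(10)  # 10 trees: a speed/accuracy trade-off

# At runtime, expand a query via its 10 approximate nearest neighbors.
neighbor_ids = index.get_nns_by_item(query_id_for("dan dan noodles"), 10)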

To highlight the expressive power of query embeddings, these are examples of expanded queries as annotated screenshots from the Embedding Projector:

Example 1: Semantic Lookup of “alcohol.” A simple lexical match couldn’t do this.
Example 2: Observe the model has a semantic understanding of Italian cuisine, as if it referenced a food knowledge graph.
Example 3: Observe the perceived Food Knowledge Graph lookup with various types of Japanese ramen.
Example 4: Observe the cross-language understanding: Kimchi Jjigae is a traditional Korean stew; results show many other Korean dishes.
Example 5: Mediterranean cuisine understanding. Note that it also catches the typos.
Example 6: Query for Burmese Cuisine comes up with a strong match for “tea leaf salad” and “burma.”
Example 7: Query for Asian returns Chinese, Thai, Vietnamese, and Japanese cuisine.

This query expansion project, along with other Representation Learning projects such as proper language models and recommender systems, helps us to understand and explore the Grubhub universe even without a proper food graph.

I’ll close with two final annotated examples. Below, in Figure 11, the nearest neighbors of “udon” map almost 1:1 onto a reference hand-made food graph. This is an exciting proposition, as it paves the way for answering the tough questions users want answered when they search for food.

Figure 11: Latent Food Graph: Query for “udon.” The graph would use query expansion to rewrite the query to include related terms such as “ramen,” “soba,” and “Japanese.”

A similar experiment, in which we map nodes from a proper Mediterranean Food Graph to results from our model, is depicted below in Figure 12.

Figure 12: Latent Food Graph for “mediterranean” query.

As you can see, the model learns the relationships at both the semantic cuisine level and the dish level.

Applications and Future Work

To recap, Representation Learning at Grubhub drives search, personalization and general product understanding, and we are excited to share more of our breakthroughs and novel applications in the future.


Search Query Embeddings using query2vec was originally published in Grubhub Bytes on Medium, where people are continuing the conversation by highlighting and responding to this story.

]]>