Paula Leonova

My MLOps Journey - Getting Started

2021-06-24T00:00:00+00:00

This is a somewhat lengthy post, but I wanted to give you, the reader, a bit more background on how I approach learning, specifically MLOps. My hope is to either reduce the cold start problem of finding your first resource in the area or ease some anxiety around the perceived linearity of learning. Getting your feet wet matters, even if it means just dipping your toes in (my sad attempt at a metaphor for how it is better to try something, even if it is very small, than to stand on the sidelines because you are afraid of jumping in headfirst; though, when it comes to swimming, sometimes the latter is actually easier, but I digress).

Below, I’ve outlined three steps that I have been following and the resources I have found invaluable in my learning journey.

Step 1: Familiarize myself with the MLOps landscape

Finding a good starting learning resource to reference makes a whole lot of difference. However, this can take time: there are lots of resources and the first one may not always be a good fit, so I tend to jump around a bit in the beginning between different resources until I find something that I can sink my teeth into. My go-to strategy to get a solid foundation in a topic is to audit online courses and read through O’Reilly textbooks, which I supplement with blog posts, tutorials and eventually papers. Sometimes I do the latter in order to get a better lay of the land.

Initially, in my Machine Learning Operations journey, I was following Goku Mohandas’ MLOps lessons as he was posting them. I came across his website madewithml.com before the pivot from an AI forum aggregator of sorts to a more MLOps focused course. There were certain topics I wanted to jump to that he hadn’t yet posted about, so I started looking around for additional resources (at time of writing this, he has posted all the lessons and I would highly recommend his free course).

It wasn’t too much later that DeepLearning.ai announced their newest MLOps specialization. I had, for a brief period, however, actually taken a small pause from my initial MLOps goal while Goku uploaded the remaining lessons, to focus on NLP - another topic I was particularly interested in. Thankfully there is no shortage of material to learn in data science (I say that both sarcastically and enthusiastically). A while back I had bookmarked Stanford’s CS 224: Natural Language Processing with Deep Learning, so I started going through the material. I wanted to supplement the class with another one and also started taking DeepLearning.ai’s NLP courses, which is where I heard about the launch of the new MLOps specialization. Andrew Ng was teaching the first course in the MLOps series and since I am a big fan of his teaching style, which tends to focus on practical hands-on experience with just enough theory so you understand the formulas and tradeoffs at a high level, I was really excited about the course. At this point, I got back on track with my MLOps journey.

Step 2: Get to coding as soon as possible

The labs in Introduction to Machine Learning in Production, the first course of the DeepLearning.ai MLOps specialization, were great as they focused on getting a model up and running and were super well documented. I pretty much immediately jumped to these after only an hour of watching the material so I could see how the course content was applied to a problem (I would highly recommend reading through and running the Ungraded Lab Part 1 - Deploying a Machine Learning Model notebook).

I also searched for additional tutorials to see what other approaches there were. I found this fantastic YouTube tutorial by Mike Nemke: Rapidly build & deploy an NLP / Machine Learning App with Poetry, FastAPI, Docker, Spacy & GCP and his accompanying blog post here, and coded along with the step-by-step video. I thought that it was very easy to follow and provided just enough information about the various tools used that you knew how they fit into the overall picture. Since the video is a little older, a few dependencies needed to be handled differently, but otherwise you can get an NLP model up and running on the Google Cloud Platform within a few hours.

I’ll admit that I often have more questions after following these tutorials and go back to step 1 instead of 3. I can sometimes find myself going down a rabbit hole by trying to get a solid foundation in every topic/concept I encounter. At a certain point, however, once I have enough of the basics down, I need to get to step 3. Inevitably, more questions will arise and it will become an even less linear learning process. The goal, however, is to get a version of my minimum viable product up and running, rather than just following along with a tutorial.

I think trying to have every step in MLOps perfected/fully understood is futile (I think this is true for many topics in Data Science). To make it more manageable from a learning perspective, I think a more iterative approach is best. Get something to work, review reference material, get another part of the pipeline working, look up a new step, and keep optimizing the steps. The key is to do it in parts, so that something is always working, even if it is not optimal at first. After all, “perfect is the enemy of good”.

Step 3: Apply what I have learned so far to my own problem

Once I get a few practical examples under my belt and am aware of a couple of different approaches, I try to fit my own problem into the solutions I’ve found. I don’t try to upend everything at once, but rather modify parts that will help me get closer to solving my problem. Here, I also try to simplify my own problem, perhaps reducing it to its absolute core, so that my model is not the most accurate but requires less computation and fewer steps. In this case, the goal isn’t a perfect model, but rather an MVP.

This is the most important step and where a lot of learning takes place. It’s also the hardest, but the most rewarding, if you keep at it. Getting practical hands-on experience is key, and overcoming the hurdles that you will inevitably encounter will make you a better data scientist, as those are the types of problems you’ll face day to day.

List of MLOps resources in this blogpost:

Also, if there are any resources you have found invaluable in your journey, I would love to hear about them!

How has my life changed since shelter-in-place (in charts)?

2020-05-16T00:00:00+00:00

via theweirdinstruction

I am no stranger to a Netflix season binge accompanied by some takeout and perhaps a bottle of wine. Youtube and I were also quite well acquainted before the shelter-in-place went into effect. I am familiar with the illusion of the short 5 minute video and how easily it turns into just 5 more minutes and before you know it, it’s been 45 minutes.

So how exactly has my life changed?

Exercise, or the lack thereof

Too bad watching someone exercise on Youtube doesn’t count. Walking from the couch to the kitchen back to the couch, then to my desk just doesn’t add up to that many steps (it’s less than 2k steps, but who’s counting).

I realized that I am a destination walker: I need a purpose other than the walk itself. I used to walk to work when I got off the Caltrain in SF, not because I wanted exercise, but primarily because they moved the bus stop one street in the opposite direction of my office. I walked further to avoid paying extra parking fees. I’d walk to the field where pickup ultimate was, and if it had been possible to park any closer to the field, I would have.

Nowadays if I get more than 6K steps, that’s a good day (though that was also a good day before, it was just less effort). Maybe I should consider adopting a dog, that is if there are any left in shelters. I have now walked around my neighborhood more in the last 2 months than I have in the 4 years that I lived here (who knew there was such a nice residential area several blocks over and even a couple of “hidden” parks).

A chart of daily time spent moving with a rolling average.
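For anyone who wants to reproduce the smoothing in a chart like this one, here is a minimal sketch of a trailing rolling average using only the Python standard library. The function name, the window length and the sample numbers are made up for illustration; they are not the values behind the chart.

```python
from collections import deque


def rolling_mean(values, window=7):
    """Return the trailing rolling mean of `values`.

    Positions with fewer than `window` observations are None, so the
    smoothed line only starts once a full window is available.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    buf = deque(maxlen=window)
    total = 0.0
    out = []
    for v in values:
        if len(buf) == window:
            total -= buf[0]  # oldest value is about to be evicted
        buf.append(v)
        total += v
        out.append(total / window if len(buf) == window else None)
    return out


# Made-up hours spent moving per day (not the data behind the chart).
hours_moving = [3.0, 2.5, 1.0, 4.0, 2.0, 1.5, 3.0, 2.0, 2.5]
smoothed = rolling_mean(hours_moving, window=3)
print(smoothed)
```

A longer window gives a smoother line but reacts more slowly to a change such as the start of shelter-in-place.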

So how does my activity level compare to before? If I just look at time spent moving, I spent an average of 3 hours doing some sort of physical activity before, whether it was walking over to the kitchen in the office or snowboarding on the weekend. I am currently spending 2 hours moving, on average (you can only go on so many walks, imo).

Just eyeballing the chart below, the two density plots don’t look all that different, right? At this point, that’s wishful thinking. Sadly, technically speaking, they actually are quite different (running a Welch’s t-test on the data shows that there is actually a significant difference in average time spent moving). This does not come as that much of a surprise.

Comparison of my active hours before and after shelter-in-place.
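For readers new to Welch’s t-test, the sketch below shows what it computes: a t statistic that does not assume equal variances, plus the Welch-Satterthwaite degrees of freedom. It uses only the standard library, so the p-value is a normal approximation (an assumption of this sketch that only holds up with many observations); a statistics package would give the exact value. The function name and the sample numbers are made up and are not the data behind the chart.

```python
import math
from statistics import NormalDist, mean, variance


def welch_t_test(a, b):
    """Welch's unequal-variances t-test for two independent samples.

    Returns (t, df, p_approx). The two-sided p-value uses a normal
    approximation to the t distribution, which is only reasonable when
    df is large (roughly 30 or more).
    """
    na, nb = len(a), len(b)
    va, vb = variance(a), variance(b)   # sample variances (n - 1 denominator)
    se2_a, se2_b = va / na, vb / nb     # squared standard errors of the means
    t = (mean(a) - mean(b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (na - 1) + se2_b ** 2 / (nb - 1))
    p_approx = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, df, p_approx


# Made-up daily hours spent moving, before and during shelter-in-place.
before = [3.0, 4.0, 2.0, 3.5, 2.5]
during = [2.0, 2.5, 1.5, 2.0, 2.0]
t_stat, dof, p_approx = welch_t_test(before, during)
print(f"t = {t_stat:.3f}, df = {dof:.2f}, approximate p = {p_approx:.4f}")
```

With only five points per group the approximate p-value is too optimistic; with a few months of daily records per period the approximation becomes much more reasonable.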

What is interesting, though not really surprising if I think about it more, is how much variability I had before. I definitely had a handful of days where I got fewer steps in than I do now, but I would “make up for it” on the weekend (that’s how exercise works, right). Ultimately, those “make up” days, at least time wise (not accounting for intensity), did result in moving the average up. I had enough of those days to offset the low activity days.

Since I have gotten a bit tired of walking, I’ve recently started to add a bit more variety to my activities (and intensity to my occasional workouts). I have started cycling and, having watched enough YouTube workout videos, actually feel confident enough to give some of them a try (my recommendations: Chris Heria, Natacha Oceane, MadFit and MrandMrsMuscle). I am curious to see the variations of difficulty of my workouts, which I will take a closer look at in a future post.

Hopefully some of these newer habits stick around after we return to some level of normalcy, but who knows, this might be the new normal for a while.

So if I am spending less time moving, what am I spending more time on?

My laptop and smartphone are my new best friends

Let’s be real, they aren’t new friends, we are just getting acquainted on a whole new level.

My previous device usage was like a pendulum; I would have stretches of time that I would spend long hours on the computer and periods where I would achieve a better balance. Now, most days are the same, with an occasional day here and there where I manage to spend fewer than 4 hours staring at a screen.

A chart of daily time spent on a device with a rolling average.

I am surprised that I previously had several days where I spent even more time on the computer than I did during shelter-in-place. I was even more surprised that the number of days which exceeded 10 hours only differed by two days, with 10 days exceeding 10 hours of screen time during shelter-in-place.

I’ve definitely spent a lot of time obsessing over the news, whether it was following along with what was happening in Wuhan on Twitter back in January when it all first started, the hourly updates as the shelter-in-place orders started to take effect, or now, as parts of the country start to re-open.

YouTube, HBO and Netflix have been my escape. I watched so much stand-up comedy (I feel like I owe a special thank you to the following comedians: Sam Morril, Mae Martin, Josh Johnson, Emily Heller, Jim Gaffigan, Taylor Tomlinson, Rachel Mac), finished a handful of shows (Westworld, Unorthodox, Feel Good to name a few) and started watching YouTube videos of tiny homes (I find them therapeutic and love seeing how different people design essentially the same tiny space).

I have been able to complete an online course in this time (here’s my final project: A statistical flow chart) and wrap up the first phase of a semi-automated self tracking project used to generate the charts for this post.

Since my day job entails a lot of computer time, I picked up an old textbook, old being a relative term, on TensorFlow just so I can get a break from my screen.

I’d say I have been both productive and unproductive; in a way, I am my old pendulum self, just within the realm of a device. Whereas before I was spending 67% of my awake time on a device, I am now up to 78% (yikes, that’s a lot!).

Comparison of my device usage before and after shelter-in-place.

This chart, just like the first chart on exercise, just emphasizes how little variety there was day-to-day. And if we run a formal Welch’s t-test on the data (I am certain that is where your mind went after seeing those Gaussian distributions), I am, statistically speaking, spending on average more time staring at a screen now than I was prior to the shelter-in-place.

Surely I am sleeping more…

The sad reality is that despite being within a 5 second walk of my bed at almost any given moment, I am not sleeping any more than I was before.

A chart of daily time spent sleeping with a rolling average.

There is no statistical difference in the average amount of sleep I get now versus what I used to get.

Comparison of my sleep before and after shelter-in-place.

However, it does look like I actually have slightly reduced my restlessness during sleep, which is very surprising because I would have thought otherwise (significant at the 5% level).

Comparison of my restlessness during sleep before and after shelter-in-place.

Summary

To nobody’s surprise, life right now is very different than it was before. How different exactly? In some respects very, and in others not much at all.

Dinners and board game evenings with friends have turned into zoom hangouts and sessions on Tabletopia.
Instead of trips over the weekend, we stay in and work.
An occasional body pump class at the local 24 hour fitness is now substituted by an occasional Youtube fitness workout.
Snowboarding was cut short and swimming pools are not an option so I am trying cycling now.
Insomnia caused by inconsistent bedtimes and anxiety is replaced with insomnia caused by inconsistent bedtimes and other types of anxiety.
Visits to friends’ apartments replaced with walks around our neighborhood and perhaps a balcony conversation with the friends nearby.
Date nights out are now date nights in (watching Westworld or playing a round of Brass Birmingham or 7 Wonders Duel).
Rather than eating out, we cook most of our meals (at least we did initially).
In-person woodworking classes replaced with online courses.
Tri-weekly 3-hour commutes reduced to 5-second commutes.
Visits to see family substituted with more frequent phone calls and an occasional visit in the backyard, 6 feet apart with masks.

Despite all the changes, I feel extremely fortunate that my family is healthy (knock on wood that they stay that way) and that I am able to work remotely.

My daily record of sleep, screen time and exercise.

Why write all this?

Well I got more time on my hands now…

I am trying out new hobbies. I used to journal when I was a kid and I enjoy tracking behavior so I thought I’d combine the two.

Hopefully, if you made it this far (or skipped to the bottom, which is what I would have done), you enjoyed some parts and maybe even found them relatable. Perhaps you even discovered a couple of YouTube channels or comedians, or learned about a new statistical test. :)

How do employers distinguish Data Scientists and Data Analysts?

2018-08-21T00:00:00+00:00

Motivation

Being a data practitioner, I often get asked the question “What is the difference between a Data Scientist and a Data Analyst?” Though I usually answer this question empirically, I decided to take a data-driven approach and build a model to more systematically identify the distinction between these two roles.

Process

What I ended up doing was collecting job descriptions for Data Scientist and Data Analyst roles posted by the big tech companies in Silicon Valley on job boards and training a model on a subset to see if I could accurately predict the remaining titles from just the job descriptions.

For a more in-depth explanation of the process and accompanying code for this project, please see my GitHub repo.

Results

I ended up training a Multinomial Naive Bayes model to predict the job titles. The final model had an ROC AUC of 88%. Unsurprisingly, the top key words/phrases for Data Scientist were machine learning, models and algorithms, while those for Data Analyst were reports, dashboards and Excel.
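To make the idea concrete, here is a minimal from-scratch sketch of a Multinomial Naive Bayes text classifier with Laplace smoothing. It is not the code from the repo: the class, the whitespace tokenizer and the four toy job descriptions are all invented for illustration, and a real project would more likely lean on a machine learning library and a proper train/test split.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


class MultinomialNB:
    """Multinomial Naive Bayes with Laplace (add-alpha) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter(labels)        # label -> number of documents
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(tokenize(doc))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        self.total_words = {
            label: sum(c.values()) for label, c in self.word_counts.items()
        }
        return self

    def log_scores(self, doc):
        """Unnormalized log posterior for each label."""
        n_docs = sum(self.doc_counts.values())
        scores = {}
        for label in self.doc_counts:
            score = math.log(self.doc_counts[label] / n_docs)  # log prior
            denom = self.total_words[label] + self.alpha * len(self.vocab)
            for word in tokenize(doc):
                if word in self.vocab:  # skip words never seen in training
                    count = self.word_counts[label][word]
                    score += math.log((count + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, doc):
        scores = self.log_scores(doc)
        return max(scores, key=scores.get)


# Toy job descriptions, invented for this sketch.
train_docs = [
    "build machine learning models and algorithms",
    "deploy machine learning models in python",
    "create reports and dashboards in excel",
    "maintain dashboards and weekly reports with sql",
]
train_labels = ["Data Scientist", "Data Scientist", "Data Analyst", "Data Analyst"]

model = MultinomialNB().fit(train_docs, train_labels)
prediction = model.predict("develop algorithms and machine learning models")
print(prediction)
```

The per-word smoothed probabilities are also what make the model interpretable: words whose probability is much higher under one title than the other are the kind of key terms listed above.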

Takeaways

Examining the results that my model classified incorrectly actually gives insight into employers and their expectations.

There are several reasons why a company might choose to display a title where the description does not match the role. Lyft, for example, wrote an article on their blog, What’s in a name?, where they explained that they strategically chose to change the title of Data Analyst to Data Scientist to retain talent (in my model, this role comes up as a false negative). Rather than fold both roles into one title, however, they renamed the existing Data Scientist role to Research Scientist. Others have done it to attract talent and get a pool of applicants that are simply drawn to the Sexiest Job of the 21st Century, as proclaimed by Harvard Business Review. Still others have changed the title to Data Scientist but kept the Data Analyst description to attract more skilled workers.

I am curious if other companies will follow this example and broaden the definition of a Data Scientist. Will more companies use these two terms interchangeably or create new terminology like research scientist?

Frequent terms establish the baseline knowledge expected in both roles, while frequent but unique terms per role highlight key differences.

It is no surprise that SQL and analysis show up frequently in both postings (see chart below). What is interesting, however, is that statistics shows up in both. Context is very important; the words that show up surrounding this term provide additional information about the level of expertise required. Upon further inspection, it looks like this term frequently shows up alongside a list of other quantitative degrees, for both roles, so it is less surprising than it was at initial glance.

Another term that appears frequently for both is Python. The frequency with which this word shows up in Data Scientist roles is, however, considerably higher than in Data Analyst JDs. Almost 90% of the Data Scientist JDs contain the key term, while only 60% of the Data Analyst JDs have it, indicating that Python is an expected skill for a Data Scientist.
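A percentage like this is the share of postings that mention a term at least once. A rough sketch of that calculation is below; the helper name and the toy job descriptions are invented for illustration, and it only handles single-word terms.

```python
import re


def share_containing(term, descriptions):
    """Fraction of job descriptions that mention `term` at least once.

    Matches whole words only, case-insensitively; multi-word phrases
    such as "machine learning" would need a different matcher.
    """
    if not descriptions:
        return 0.0
    term = term.lower()
    hits = sum(1 for text in descriptions if term in re.findall(r"\w+", text.lower()))
    return hits / len(descriptions)


# Toy job descriptions, invented for this sketch.
scientist_jds = [
    "Python and machine learning experience",
    "Strong Python, SQL and statistics",
    "Java or Scala for platform development",
]
analyst_jds = [
    "Excel dashboards and SQL reporting",
    "SQL, Python and Tableau",
    "Strong SQL skills and reports",
]

ds_share = share_containing("python", scientist_jds)
da_share = share_containing("python", analyst_jds)
print(f"Data Scientist: {ds_share:.0%}, Data Analyst: {da_share:.0%}")
```

Counting each posting once, rather than counting every occurrence, keeps one long description that repeats a word from inflating the number.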


Surveying the top terms for Data Scientist results in a collection of words that are more technical in nature. Some of the top terms are: machine learning, platform, algorithms, models, Java, programming, development (see chart below). Because data science is more of a mix of statistics and computer science, these terms are not surprising at all.


For Data Analysts, the key words tend to focus more on information dissemination, whether in verbal or written form: reports, reporting, dashboards, communication, verbal. Data retrieval and organization is another theme that comes up: excel, strong SQL, SQL skills, trends.


Applications

For someone who is looking to go into the field of analytics, perhaps focusing on the skills that overlap between the two roles could be a good starting point. However, for someone looking to transition from a Data Analyst role to a Data Scientist role, focusing on the most frequent distinct skills could be of better use.

For employers uncertain what to add to their job descriptions, the above key terms could be used to determine the de facto industry standard and then modify the requirements based on company needs.

Final Thoughts

Though there is quite a bit of overlap between these two roles, there are enough unique key terms that are common for one role but not the other that help to differentiate the roles. As a result, the model did fairly well in distinguishing these two roles. However, as industry standards change and new hybrid roles are created (without modification to the titles), distinguishing these roles will become harder.

To address this role inflation, it might be interesting to instead use an unsupervised clustering model to see what roles get grouped and whether new titles could be derived from these clusters.

Introduction to Data Analysis and Visual Design with Tableau

2018-08-03T00:00:00+00:00

Intro

Back in 2017, I was invited to give a talk at Product School at their San Francisco office about data analysis and visualization in Tableau. Since then I have given a variation of this talk at their other location in Santa Clara and was invited to give the talk at San Jose State University to a class of Business School students.

If you would like to watch the talk, you can either watch it on Facebook here or watch it below. If you would like to review the slides, with the animated instructional GIFs, I’ve attached them below as well. The two dashboard views can be found on my public Tableau page here.

Process

Data Collection
Dashboard Creation
Slide Deck with Instructional GIFs
Presentations

1. Data Collection

I initially set out to use previous work projects as examples but instead opted to collect my own data to make this project more personal. Since I embedded detailed GIFs into my deck, I did not want to use any proprietary data. I chose to track the places that I visited and length of each visit for an entire month.

I tracked how much time I spent: working, eating out, exercising, doing chores, cooking, etc. One of my personal goals was to learn something new about myself and potentially change some of my behavior based on what patterns I saw. I actually did end up changing 2 things after visualizing my data.

2. Dashboard Creation

I created two interactive dashboards to explain the thought process behind data design as well as to show the breadth of visualizations available in Tableau. The first one was a general overview of my month, which is typical of the types of high level dashboards companies use to summarize data. These types of views are used to monitor as well as identify any big changes, which would then require further investigation.

For instance, week 21 jumps out at me immediately as there is a dip in the brown category in the upper right chart. There is also something interesting going on in the bottom left chart with the green spike.

General Overview of My Month

The second one was a deep dive into my activities. This is the dashboard I would use to understand the trends and patterns on a more micro level. I could select a different activity and investigate the dip in work that occurred in week 21. I could also examine the current chart to see what happened with that green spike (hint: this is when I started creating my report and realized I wasn’t working out much, so I immediately went to the gym).

A Deep Dive into My Activities

3. Slide Deck with Instructional GIFs

Other than having a fun data source to work with and something that everyone could relate to, I wanted to put together a presentation that others could refer back to if they got stuck on something. I decided that GIFs would be a good way to help walk people through the steps needed to complete a chart from start to finish.

4. Presentations

In the 20-minute talk recorded above, I presented some general tips for creating data reports as well as how to build in-depth and dynamic dashboards in Tableau. The second part of the talk went into detail about how to build various types of visualizations, where I went over the 5-25 second instructional GIFs that I created. I made this second part as modular as possible so that it would be easy to refer back to later if there were any specific questions about creating the charts. Each of my talks also had a 15-minute Q&A session where I answered more general questions (the first one is recorded above).

Concluding Thoughts

I really enjoyed giving this presentation and hope that those who attended my sessions were able to take something away from my talks.

I also learned one or two things about myself from the data that I collected. I realized that I wasn’t exercising as much as I’d like and so I immediately started to work out more (you can see the green spike in the chart above). Also seeing where I ate out and that it was only within a 2 mile radius of my house made me re-evaluate my decisions.

How to set up and evaluate an AB test

2018-08-02T00:00:00+00:00

So you need to run an A/B test and you need to figure out how many users you need in order to have valid results. What do “valid” results even mean? How do you decide what the proper test to use is, and what does a p-value mean?

In the following Jupyter notebook, which you can access here, I go into how to run an AB test from start to finish. I’ve highlighted which cells you need/might want to edit so that you can just step through the notebook. Think of it as a step-by-step guide or a template that you can plug your own data into.
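To give a flavor of the two questions above (how many users, and what the p-value says), here is a small standard-library sketch for a test on conversion rates. It is not taken from the notebook: the function names, the default significance level and power, and the example numbers are assumptions made for illustration, and both functions rely on the usual normal approximation for proportions.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p_base, p_variant, alpha=0.05, power=0.8):
    """Approximate users needed per group to detect p_base -> p_variant.

    Two-sided test, equal group sizes, normal approximation.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance_sum = p_base * (1 - p_base) + p_variant * (1 - p_variant)
    n = (z_alpha + z_power) ** 2 * variance_sum / (p_base - p_variant) ** 2
    return math.ceil(n)


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates.

    Returns (z, p_value), using the pooled rate for the standard error.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical plan: detect a lift from a 10% to a 12% conversion rate.
n_needed = sample_size_per_group(0.10, 0.12)
print(f"users needed per group: {n_needed}")

# Hypothetical result: 100/1000 conversions in A vs 130/1000 in B.
z, p_value = two_proportion_z_test(100, 1000, 130, 1000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

The p-value here is the probability of seeing a difference at least this large if the two variants truly converted at the same rate; it says nothing on its own about how big or how valuable the lift is.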

Feedback

I wanted to share some of the kind words that my colleagues used to describe the notebook.

“I used a bunch of your code from the AB testing notebook for proportions in a HP testing notebook. It worked pretty smoothly!”
“This is a fantastic notebook. It is thorough and comprehensive. You’ve left no stone unturned.”
“Isn’t it great how fast I can answer all these questions?! (love this notebook)”
“Starting to see some positive results… Thanks again for all your help getting the analysis notebooks together”

The 60 Second Pitch

2015-11-29T00:00:00+00:00

A bit of context

I wanted to share something that I hope one day I can revisit. This was an idea I pitched at a startup weekend that I and several other amazing folks actually ended up working on for a little bit.

The Pitch: learningForest

Imagine 100 people go through a forest, but do not leave any markings on the trees behind. New people entering the forest will be just as lost as those before them.

Now imagine 100 people go through a forest which HAS markings. (1) They do not have to start from scratch. (2) These markings help to guide them through. (3) And as more people go through, pathways begin to form.

When I am learning a new skill, the internet can feel like that unmarked forest. There are so many resources, I don’t always know where to start or what to even search. It can be quite overwhelming.

I want to create a tool that tracks people’s learning pathways AND an online platform that curates, aggregates and displays that information, so now learners don’t have to start from scratch.

Not only would they be able to see how others have learned before them and what websites and tools they found useful, but they could also find fellow learners on the same learning path or reach out to those who are further along in their progress (path).

The more this tool/platform gets used, the better it becomes, just like how the paths in the marked forest become clearer as more travelers go through.

Teaser