<![CDATA[adhadse]]>https://adhadse.com/https://adhadse.com/favicon.pngadhadsehttps://adhadse.com/Ghost 5.2Tue, 17 Mar 2026 08:49:40 GMT60<![CDATA[No webcam? Use your mobile as webcam on Linux.]]>https://adhadse.com/no-webcam-use-your-mobile-as-webcam-on-linux/6728193017de340307ecdf72Mon, 04 Nov 2024 01:42:15 GMT

This mini-guide is divided into two parts. The first part covers the nerdy way I tried first, and the second part a much simpler, more straightforward way.

The basic idea of using a mobile camera as a webcam on Linux is this:

  1. We get a video stream from the mobile; a server on the smartphone streams the feed.
  2. The streamed feed is captured by the computer and dumped into a dummy/virtual camera device.
  3. Other applications can then view the virtual camera as if it were an actual camera.

Part 1. Manually creating virtual device on Linux and dumping a video feed.

Step 1, I needed an Android application which can make my smartphone act as an IP webcam. I chose the "IP Webcam" application.

IP Webcam on Google Play Store

Step 2, I created a virtual device on Linux. We'll install v4l2loopback using the distribution's package manager.

# from RPM Fusion
# On Fedora, you'll only need to install v4l2loopback, not -utils or -dkms
sudo dnf install v4l2loopback

To create a dummy virtual camera, use this set of commands:

# remove the module if it is already loaded
sudo rmmod v4l2loopback

# create a virtual camera device at /dev/video10
# WARNING: you may need to disable Secure Boot for the module to load
sudo modprobe v4l2loopback video_nr=10 card_label="Dummy cam" exclusive_caps=1

# to unload the module later
sudo modprobe -r v4l2loopback

# list video-for-linux devices
v4l2-ctl --list-devices  # this should show you the device 'Dummy cam'

Step 3, start the video server on the smartphone. It should start the server at an address like 192.168.1.204:8080 (the phone's local IP). If you visit this address in a browser, you'll see options like the ones below. The application serves an RTSP stream, which we can then dump into our newly created virtual camera on Linux.


Step 4, I used ffmpeg to dump the RTSP video stream into the virtual camera like this:

# use ffmpeg to stream the rtsp stream to your virtual camera device
ffmpeg -fflags nobuffer \
       -flags low_delay \
       -rtsp_transport udp \
       -reorder_queue_size 0 \
       -i rtsp://192.168.1.204:8080/h264.sdp \
       -fps_mode passthrough \
       -max_delay 0 \
       -copytb 0 \
       -copyts \
       -probesize 32 \
       -analyzeduration 0 \
       -buffer_size 8192 \
       -vcodec rawvideo \
       -pix_fmt yuv420p \
       -threads 4 \
       -thread_type frame \
       -f v4l2 \
       /dev/video10 	# the virtual camera device number

or this one, using TCP transport:

ffmpeg -fflags nobuffer \
       -flags low_delay \
       -rtsp_transport tcp \
       -reorder_queue_size 0 \
       -use_wallclock_as_timestamps 1 \
       -probesize 4096 \
       -analyzeduration 0 \
       -buffer_size 16384 \
       -max_delay 100000 \
       -i rtsp://192.168.1.204:8080/h264.sdp \
       -fps_mode passthrough \
       -vcodec rawvideo \
       -pix_fmt yuv420p \
       -threads 4 \
       -thread_type frame \
       -f v4l2 \
       /dev/video10 	# the same virtual camera device

Part 2. The straightforward way.

I noticed that the video stream from my nerdy way is a bit choppy and has noticeable latency.

I stumbled upon "DroidCam". It's an application similar to the one in part 1, but it differs in how it's supposed to be used: it integrates very easily with OBS.


On Linux, OBS (Open Broadcaster Software) is very easy to install and is used by many live streamers and content creators for video-related tasks.

We can easily obtain OBS on Linux from Flatpak.

flatpak install flathub com.obsproject.Studio

Then to work with "DroidCam", we'll need to install a plugin for OBS:

flatpak install flathub com.obsproject.Studio.Plugin.DroidCam

We'll also need to check if the OBS virtual camera is listed in v4l2-ctl --list-devices. If it isn't, we'll need to add it:

# if you tried method 1, we'll need to add another virtual device at /dev/video0
sudo rmmod v4l2loopback  # unload the module first
sudo modprobe v4l2loopback video_nr=10,0 card_label="Dummy cam","OBS Virtual Camera" exclusive_caps=1,1

v4l2-ctl --list-devices # should show OBS Virtual Camera as /dev/video0 and Dummy cam as /dev/video10

Restart OBS, and you should see "DroidCam OBS" options in "Sources" tab.


Start the DroidCam application on your smartphone and add "DroidCam OBS" as a source.

By default, OBS should pick up the IP address of DroidCam, but in case it doesn't, you can enter it manually when adding the source or edit it afterwards.

And voilà, you have video streamed from your smartphone into OBS.


One last step in this procedure is to start the virtual camera by enabling "Start Virtual Camera" in the "Controls" panel, so that other applications can pick up this stream as a camera device.


Start your browser and fire up a webcam test to check it's working. On Linux, I've noticed Firefox picks it up very easily, but for Chromium-based browsers you may need to disable hardware-accelerated rendering.

That's it for today. This is Anurag Dhadse, signing off.

Edit 0. May 12, 2025

DroidCam suffers from latency when connected via Wi-Fi, especially with a high-quality video source.

Instead, I would recommend connecting via USB. Allow the prompts that appear on your Android device, like these, to enable USB debugging.


Open DroidCam, then on your computer start a terminal and type this command to forward a port on your phone to the PC:

adb forward tcp:4747 tcp:4747

It should output the port number that you forwarded.

Once that's done, head back to OBS and add a "Browser" source; name it whatever you want.


Add in these settings and the URL, and configure your resolution:


and hit "OK". If nothing pops up, try hitting refresh for the Browser source you just added:


Alternatively, you can add the USB DroidCam OBS source by following all the steps above, and instead of adding a new source, opening the properties of the existing DroidCam OBS source and selecting the device with "USB" in its name.

]]>
<![CDATA[Reflect on 2023]]>https://adhadse.com/reflect-on-2023/658fbd4717de340307ecdad9Sat, 30 Dec 2023 09:25:31 GMT

This year I'll start with reflecting on the previous year of my life.

Recollect what I learned, what new habits I developed, articles I published, ..., you get it! The progress, the good things that happened to me, and the areas where I fell behind.

The questions I'll answer will be these three:

  1. What went well this year?
  2. What didn't go so well this year?
  3. What did I learn & what should I work toward?

What went well this year?

Full time opportunity at Kavida.ai. I had started working as a Data Science Intern around December (2022) and this year around May I got an offer to work full time as a Junior Data Scientist.

You won't believe how awesome it was to dive into the world of startups, especially in Artificial Intelligence and Supply Chain. It's like living the dream, seriously. I never really fancied the big tech giants; I'm all about backing the guys with crazy ideas set to shake up the world.

Regular exercise for a healthy life. During my college days, it was hard for me to find or even give some time to my health. It would all be back and forth between college and home. Nonetheless, after my last semester ended I began giving some time to my overall fitness. The goal wasn't to get jacked, but just to get in better shape overall and be filled with energy for the day.

Although I did not keep track of what kind of exercises I did, I should start tracking them as well.
Health is everything. Getting a good amount of sleep and doing regular exercise really improved my level of activity.

I graduated. Yeah. I'm happy that I no longer have to travel 50 km to and from college every day. But I'll always remember my college friends.


Adapting to analog scheduling. Every day, I would start with the schedule I had written last night on paper. This paper would also eventually become my to-do list for the day to keep my mind free of clutter of tasks I had to remember for the day.

Austin Kleon in his book "Steal like an Artist", says:

"The computer is really good for editing your idea, and it's really good for getting your ideas ready for publishing out into the world, but it's not really good for generating ideas."

This notebook isn't meant for generating ideas, but it serves as the central place for ideas and tasks to come back to as I go through my day. Crossing off a task came with a sense of accomplishment.

Here is what worked for me: I would separate the page into two sections. The left section is filled with the schedule. The right side starts with the Highlight (of the day), the major task that has to be finished by the end of the day, and below it a To-do list for anything that comes across my mind as the day goes by.


An Upgrade to my desk. An ultrawide monitor. I always felt a need to work on an ultrawide, especially when I needed to go back-and-forth between tutorial/documentation and my main code editor window.

This was a nice addition and worth the wait. I love it!


What didn't go so well this year?

Consistency. I failed to achieve the level of consistency I wanted. Things improved, but not quite dramatically.

I failed to deliver an article/blog post every week. This was primarily due to my lack of ability to manage my time effectively. I focused more on delivering and completing tasks at my full-time role instead of learning things.

I also failed to post consistently on LinkedIn and share my new findings.

Fewer contributions to open source. Working at a startup is no easy task, and finding time to contribute became a challenge for me. But that's okay. I need to overcome that.

Read fewer books. I managed to read only three books this year:

  • Atomic Habits, by James Clear
  • Steal like an Artist, by Austin Kleon
  • Architecture Patterns with Python, by Harry J.W. Percival & Bob Gregory

What did I learn & what should I work toward?

  1. Focus on learning. That's what life is all about. I would die when I stop learning.
  2. Do small, fast, low-cost experiments, and then scale them up. Experiments are the key to discovery. If one tool doesn't work for you, check out the next. You'll find what you need, or you may create your own.
  3. Bring consistency to the everyday chaos. I almost always remember nothing about the little things I did that day, but I do remember the major things that got finished. If I can bring consistency to my habits every day, I'll slowly get back on track. If I do skip a day, I won't let it slide for more than two.
  4. Keep track of things. I can't improve what I don't measure. Jot down what you're up to, what you're picking up, and anything else worth remembering.

That's it. Wish you a Happy New Year.

]]>
<![CDATA[Move to a 21st century terminal in the Era of Linux]]>https://adhadse.com/move-to-21st-century-terminal/64029164c8984e5debaf44a9Sat, 04 Mar 2023 08:37:37 GMT

Since I switched to Linux last June (2022), I have been particularly inclined towards using the terminal for almost every developer workflow I know: using Git, quickly renaming files, quick edits, or installing new packages/apps.

My terminal helped me wherever I went.

But I felt something was missing. My default shell (the program which lets you interact with your computer in a terminal by executing commands you write) was Bash. It's great, but it implements key features like tab completion, which is very useful to my workflow, in a pretty weird fashion:


Yup. Tab completion throws every possibility out to the output (stdout). What I wanted was something close to an IDE/text-editor-style completion menu.

And maintaining its ~/.bashrc is a nightmare.

What I want is a clean, customizable, yet fast shell that is fun to use and improves my productivity.

The Nushell

Then I found Nushell. A new type of Shell written in Rust.

GitHub - nushell/nushell: A new type of shell

From its config file, conveniently located in the ~/.config/nushell directory, everything about Nushell is customizable, from the prompt to keybindings, tables, and even an external autocompleter.

Nushell is in itself a completely different shell, the idea being that it sees data as having structure to it instead of being a raw stream of bytes. The fact that it's written in Rust makes it even more appealing to use, plus it's Blazing Fast!!!

Even better, Nushell is cross-platform; it works on Linux, Windows, and Mac.

Bash is not bad, but when it comes to user experience, IMO it falls short. Nushell gave me a completely new experience and a fun-to-use terminal. Although it's not POSIX compliant, it works pretty well with external commands. Even if a Nu command clashes with an external command, like ls, just adding ^ in front calls the external command.

Other than that, the config for Nu, easily opened from within nu with config nu, is way easier to read.

There are other options as well:

  1. Z-Shell, good but not in Rust, requires its own plugin manager
  2. Oh My Zsh, filled with plugins
  3. Fish, not POSIX compliant

Again, I might be biased and you might want to try the above options as well.

Getting the Prompt Right

The prompt is the text you see to the left (or sometimes even on the right) when you open a terminal.

I previously used a script in ~/.bashrc to customize the prompt and display info such as the git branch and the Python virtual env or conda environment name, with no icons. Again, a plain minimalist look.

While setting up Nu, I found out there is an easier and yet quite maintainable way to customize the prompt.

Introducing, Starship.

GitHub - starship/starship: ☄🌌️ The minimal, blazing-fast, and infinitely customizable prompt for any shell!

People love to customize it to their liking, and I just wanted a clean minimal look. Starship works not just with Nu, but with almost every mainline shell available.

Plus point, again it's written in Rust!

With this I created my prompt which looked something like this:


Pretty Minimalist I guess, hiding every feature :)

External Completions

As of writing, Nu (0.76.0) doesn't support completions for commands outside of nu and requires an external completer.

There is an issue about parsing man pages and storing the completions in .nu files, since man pages don't really exist on Windows.

For the time being, we can utilize an external autocompleter, and this time, unfortunately, it's not Rust :/

But it's Go! Hey, come on, Go is equally good. Even though it has a garbage collector, that doesn't mean it's a dumpster fire like Java😂.

Let's introduce Carapace.

GitHub - rsteube/carapace-bin: multi-shell multi-command argument completer

It supports a wide range of the default programs you've already worked with, and new ones like gh (the GitHub CLI) and rg (ripgrep) are also supported.

Adding it to Nushell or any shell is also pretty straightforward and clearly documented.

If you don't find your CLI program getting autocompletion, please open a PR to the above GitHub page and contribute. Who doesn't love contributions?

And if you find any bugs or have a new feature you want to work on, strike up a conversation in the Issues tab of any of the above repositories and contribute however you want.

If you want to use my setup for your terminal, the config files are on GitHub, along with instructions and minor details.

ConfigFiles/terminal at master · adhadse/ConfigFiles

Have fun with your new terminal.

This is Anurag Dhadse, signing off.

]]>
<![CDATA[Async/Await, Multithreading, and Multiprocessing demystified]]>https://adhadse.com/async-await-multithreading-and-multiprocessing/63d4ca7fc8984e5debaf43b8Sat, 28 Jan 2023 07:34:04 GMT

Are you also confused by these terms? I was too. But not anymore.

Let's understand them with the help of an example, and along the way I'll explain when exactly you should use each of them in this mini-article, so you don't have to go through every YouTube video or article on the Internet.

Say you have a GUI app, an MVP right now, so single-threaded. This thread is usually called the main thread, which is also the UI thread (all operations requested by the user on the GUI are performed on the main thread).

Currently, our app is Synchronous.

Synchronous

A single thread does its tasks one by one; if any one task takes longer, the other tasks can't proceed until that task is done.

Synchronous/single-threaded is best for small apps, CLI apps, or any program that doesn't do computationally expensive tasks.

Say one task, like fetching from a database (an I/O operation), takes a long time waiting to retrieve the data. The UI will freeze; no buttons will work until the data is fetched.
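The freeze can be sketched in plain Python; fetch_from_database and handle_button_click are hypothetical stand-ins, with time.sleep simulating the blocking I/O:

```python
import time

def fetch_from_database():
    # hypothetical I/O call, simulated with a blocking sleep
    time.sleep(0.2)
    return ["row1", "row2"]

def handle_button_click():
    return "clicked"

start = time.perf_counter()
data = fetch_from_database()   # the "UI" is frozen for the whole 0.2 s
click = handle_button_click()  # can only run after the fetch finishes
elapsed = time.perf_counter() - start
```

Everything runs on one thread, so the button handler cannot run until the fetch returns.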

Asynchronous

The whole process is still single-threaded. BUT, fetching data from the database will now be done asynchronously: the task is handed off to the database, and the main thread just marks the task as async and awaits it. It passes the task a callback function to call when the data is retrieved, and the task returns a promise (in JS) or a future (in Rust), which needs to be polled from time to time by the main thread (in Rust, since futures are lazy) until driven to completion. In the meantime, the main thread goes on executing other tasks and does not wait for that task to complete. To programmers, the code still looks synchronous.

Besides tasks that can be delegated to external hardware, async can also be used for tasks that are OK to run in a single-threaded application: tasks that are basically waiting, which won't make the UI unresponsive (or stop other tasks for more than some waitable period) and are less computationally expensive, even if they take a little bit of time. In Python, async can also be used to delegate tasks to lower-level code (like C/C++) that can do multiprocessing itself.
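Here is the same idea sketched with Python's asyncio (Python coroutines play the role of the promises/futures above; fetch_from_database and refresh_ui are made-up stand-ins). The two waits overlap on a single thread, so the total time is roughly the longest wait, not the sum:

```python
import asyncio
import time

async def fetch_from_database():
    # hypothetical I/O; awaiting lets the event loop run other tasks
    await asyncio.sleep(0.2)
    return ["row1", "row2"]

async def refresh_ui():
    # another waiting task that can proceed concurrently
    await asyncio.sleep(0.2)
    return "ui refreshed"

async def main():
    # both coroutines wait concurrently on one thread
    return await asyncio.gather(fetch_from_database(), refresh_ui())

start = time.perf_counter()
data, ui = asyncio.run(main())
elapsed = time.perf_counter() - start
```

If these were run synchronously, elapsed would be about 0.4 s; here it stays close to 0.2 s.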

Multithreading

If our GUI app is doing some task in the background, such as parsing a GIANT file or encoding/decoding a video (if done on the CPU), which takes a lot of time and can't be delegated to hardware (like the database/disk controller in the previous case), then the UI will freeze and stop responding.

The solution is to spawn another thread, another road to the CPU, that can do tasks independent of what's going on in the thread that spawned it. Both threads (or many) will continue doing their own assigned tasks and remain part of the app's process. Inside each thread, tasks can then be done synchronously or asynchronously as required.

In scenarios like this, use multi-threading. Delegate the task that requires the processor's time and is ULTRA computationally expensive to another thread. Threads also share the same resources as the whole process (the whole app is a single process, which can have multiple threads within it).

Keep in mind, though, that multi-threading is expensive, since spawning a new thread takes some resources, and so does switching between threads. So, if possible, stay with async. Unless you are into coroutines, which are lightweight threads.

Now our UI will be responsive (running on the UI thread), while in the background we do video encoding or any other expensive task (on the worker thread).
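A minimal Python sketch of this, with encode_video as a hypothetical stand-in for the expensive background job: the spawned thread works while the main thread keeps handling (simulated) UI events:

```python
import threading
import time

results = {}

def encode_video():
    # stand-in for an expensive background job
    time.sleep(0.2)
    results["video"] = "encoded"

worker = threading.Thread(target=encode_video)
worker.start()                   # runs independently of the main thread

ui_events = ["click", "scroll"]  # the main thread stays responsive meanwhile
handled = [e.upper() for e in ui_events]

worker.join()                    # wait for the background job to finish
```

Note that both threads share the same memory (the results dict), which is exactly the shared-resources property described above.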

Multiprocessing

In this, we spin up separate processes; each can have many threads but is usually kept single-threaded. They all have their own separate address space, separate memory, and separate resources, and communicate using inter-process communication.

Plus, you can spin them up on multiple machines separate from your local host machine.

This is best left for huge, data-intensive application programs.
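A minimal multiprocessing sketch in Python, assuming a Unix-like system (it uses the "fork" start method to stay self-contained); heavy_compute is a made-up stand-in for CPU-bound work, and each input is computed in a separate process with its own memory:

```python
import multiprocessing as mp

def heavy_compute(n):
    # stand-in for CPU-bound work running in its own process
    return sum(i * i for i in range(n))

# "fork" keeps this sketch self-contained on Unix-like systems
ctx = mp.get_context("fork")
with ctx.Pool(processes=2) as pool:
    # the pool distributes the inputs across worker processes
    totals = pool.map(heavy_compute, [10, 100, 1000])
```

Unlike threads, these workers do not share memory; the pool handles the inter-process communication of arguments and results for you.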

Both multithreading and multiprocessing deal with parallelism: having machine instructions execute simultaneously. Async deals with concurrency: having tasks that don't need the CPU move out of the way and let other tasks proceed.


That's all for today. Hope you learned something new.

This is Anurag Dhadse, signing off.

]]>
<![CDATA[Cascade Design Pattern]]>https://adhadse.com/cascade-design-pattern/63cbff86c8984e5debaf4063Sun, 22 Jan 2023 05:30:11 GMT

Can a machine learning problem be broken into a series of ML problems?

I bet you never thought of that. Sometimes you can, and how and when is the theme of this article.

So, let's understand the problem first.

Say we want to train a model for anomaly detection, or for a task that requires it to predict both usual and unusual activity. Unless some preprocessing or rebalancing is done on the data, the model will not learn the unusual activity, because it is rare. If the unusual activity is also associated with abnormal values, then trainability suffers.

Let's suppose we are trying to train a model to predict the likelihood that a customer will return an item they have purchased. If we simply train a binary classifier, reseller return behavior is hard to capture, because compared to the returns resellers make, there are millions of transactions by retail buyers. We might not know at the time of purchase whether it is made by a retail buyer or a reseller. However, from other marketplaces, we have identified when items bought from us are subsequently resold.

One way to solve this could be to overweight the reseller instances when training the model. But then we won't get the more common retail buyer use case as correct as possible, trading off accuracy on retail buyers to optimize just for the reseller use case.

The best way might be to use the Cascade design pattern, breaking the whole problem into three distinct problems:

  1. Predict whether a specific transaction is by a reseller – reseller or not?
  2. Training one model on sales to retail buyers – retail buyer will return or not?
  3. Training the second model on sales to resellers – reseller will return or not?

Combine the outputs of the three separate models to predict the return likelihood for every item purchased, along with the probability that the transaction is by a reseller.

This allows for different decisions on items likely to be returned depending on the type of buyer, and ensures that the models in steps 2 and 3 are as accurate as possible.

In addition to that, in the first step, we can use rebalancing to address the imbalanced distribution of transactions from retail buyers and resellers.
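The combination step can be sketched as the law of total probability over the two buyer types; the numbers below are made-up stand-ins for the three models' outputs:

```python
def combined_return_probability(p_reseller, p_return_if_reseller, p_return_if_retail):
    # combine the three cascade outputs via the law of total probability
    return (p_reseller * p_return_if_reseller
            + (1 - p_reseller) * p_return_if_retail)

# made-up outputs of the three models for one transaction
p = combined_return_probability(
    p_reseller=0.2,            # model 1: reseller or not?
    p_return_if_reseller=0.5,  # model 3: reseller will return or not?
    p_return_if_retail=0.1,    # model 2: retail buyer will return or not?
)
# p = 0.2 * 0.5 + 0.8 * 0.1 = 0.18
```

Each sub-model only needs to be accurate on its own slice, and the final probability weights them by how likely each slice is.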

But, how do we do this?

Solution

Any machine learning problem where the output of one model is an input to the following model or determines the selection of subsequent models is called a cascade.

For example, a machine learning problem that sometimes involves unusual circumstances can be solved by treating it as a cascade of four machine learning problems:

  1. A classification model to identify the circumstance
  2. One model trained on unusual circumstances
  3. A separate model trained on typical circumstances
  4. A model to combine the output of the two separate models because the output is a probabilistic combination of the two outputs

This might look very similar to an Ensemble of models but is actually different because of the special experiment design required when doing a cascade.

Indeed, the subsequent models after step 1 are not supposed to be trained on splits of the training data made from the ground-truth labels, independently of the model in step 1. Instead, the subsequent models are required to be trained on the outputs of the first model in the cascade, with the ground-truth labels as guidance for the optimization function.

So, the predictions of the first model are used to create the training datasets for the next models.
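A toy sketch of this, with model1_predict_reseller as a hypothetical stand-in for the trained step-1 classifier: the training sets for the next models are partitioned by its predictions, not by the ground-truth buyer type:

```python
# toy transactions: (features, ground-truth "returned" label)
transactions = [
    ({"qty": 40}, 1),
    ({"qty": 1},  0),
    ({"qty": 55}, 1),
    ({"qty": 2},  0),
]

def model1_predict_reseller(features):
    # stand-in for the trained step-1 classifier
    return features["qty"] > 10

# split by model 1's PREDICTIONS, keeping the ground-truth labels as targets
reseller_train = [t for t in transactions if model1_predict_reseller(t[0])]
retail_train = [t for t in transactions if not model1_predict_reseller(t[0])]
```

This way, models 2 and 3 see the same (possibly wrong) routing at training time that they will see at serving time.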

Also, rather than training the model individually, it is better to automate the entire workflow, by using workflow automation frameworks such as Kubeflow Pipelines, TFX, and many others.

Trade-Offs and Alternatives

Cascade is not necessarily a best practice. It adds quite a bit of complexity, can be hard to debug in the case of bad data, and is hard to maintain. Remember, if the data changes, all models in the cascade will need to be retrained.

Also, avoid having, as in the Cascade pattern, multiple machine learning models in the same pipeline. Try to limit a pipeline to a single machine learning problem.

Deterministic inputs

Splitting an ML problem is usually a bad idea, since an ML model can/should learn combinations of multiple factors. For example:

  • If a condition can be known deterministically from the input (article is from a news website vs. from an individual), we should just add the condition as one more input to the model.
  • If the condition involves extrema in just one input (some customers who live nearby versus far away, with the meaning of near/far needing to be learned from the data), we can use Mixed Input Representation to handle it.

The Cascade design pattern addresses an unusual scenario for which we do not have a categorical input, and for which extreme values need to be learned from multiple inputs.  

Single Model

Problems that seem simple enough that a large/medium-sized ML model will be sufficient should stay away from the Cascade design pattern. For these problems, the patterns and combinations are implied by the data itself and can be learned by a single model.

Internal Consistency

The Cascade is needed when we need to maintain internal consistency among the predictions of multiple models.

Suppose we are training a model to predict a customer's propensity to buy, in order to make a discounted offer. Whether or not we make a discounted offer, and the amount of the discount, will very often depend on whether this customer is comparison shopping or not. Given this, we need internal consistency between the two models (the model for comparison shoppers and the model for propensity to buy). In this case, the Cascade design pattern might be needed.

Pre-trained Model

A cascade is also needed when we wish to reuse the output of a pre-trained model as an input to our model.

Say we want to train a model that can convert a page full of mathematical formulas into LaTeX. We might have an OCR model that can do this, but only if given a photo of a single formula, not a page filled with formulas.

We can build a cascade: train a YOLO model to detect the individual formulas on a page and then forward its output to our OCR model. It is critical to recognize that the YOLO model will make errors, so we should not train the OCR model on a perfect training set of photos and corresponding LaTeX formulas. Instead, we should train it on the actual output of the YOLO model.

A common scenario when using a pre-trained model as the first step of a pipeline is an object-detection model followed by a fine-grained image classification model. In that case, Cascade is recommended so that the entire pipeline can be retrained whenever the object-detection model is updated.

Reframing instead of Cascade

Suppose we wish to predict hourly sales amounts. Most of the time we'll serve retail buyers, but once in a while we'll have a wholesale buyer.

Reframing the regression problem as a classification over a range of different sales amounts might be a better approach, instead of trying to get the retail-versus-wholesale classification correct.

Regression in rare situations

The Cascade design pattern can be helpful when carrying out regression where some values are much more common than others. For example, suppose we want to predict the amount of rainfall from a satellite image, and on 99% of the pixels it doesn't rain. In such cases, we can:

  1. First, predict whether or not it is going to rain for each pixel.
  2. For pixels, where the model predicts rain is not likely, predict a rainfall amount of zero.
  3. Train a regression model to predict the rainfall amount on pixels where the model predicts that rain is likely.
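The three steps above can be sketched like this; rain_classifier and rain_regressor are hypothetical stand-ins for the two trained models, and cloud_density is a made-up feature:

```python
def rain_classifier(pixel):
    # step 1 stand-in: predict whether it rains at this pixel
    return pixel["cloud_density"] > 0.7

def rain_regressor(pixel):
    # step 3 stand-in: amount model, trained only on pixels predicted rainy
    return 5.0 * pixel["cloud_density"]

pixels = [{"cloud_density": d} for d in (0.1, 0.8, 0.3, 0.9)]

# step 2: pixels predicted dry get exactly zero rainfall
rainfall = [rain_regressor(p) if rain_classifier(p) else 0.0 for p in pixels]
```

The regressor never has to learn the overwhelming mass of zeros; it only models the rare rainy pixels.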

That's all for today. Hope you learned something new.

This is Anurag Dhadse, signing off.

]]>
<![CDATA[Keyed Predictions]]>https://adhadse.com/keyed-predictions/632e64c3c8984e5debaf392bSat, 31 Dec 2022 07:40:00 GMT

Up until now, you might have only trained a single model that accepts a single input or a few inputs, and probably deployed it so it sends back outputs one at a time as the model service serves each request sequentially.

Now imagine a scenario: you have a file with millions of inputs, and the service needs to respond with a file of millions of predictions. Ah, that's easy.

It isn't! Model deployment services pose scalability challenges, often requiring ML model servers to scale horizontally instead of vertically.


You might think it should be obvious that the first output corresponds to the first input instance and the second output to the second input instance. But for this to happen, a server needs to process the full set of inputs serially, often requiring vertical scaling and becoming expensive to continue in the long run.

Instead, servers, and hence ML models, are deployed in large clusters (horizontally); in the process, they distribute the requests to multiple machines, collect all the resulting outputs, and send them back. Horizontal scaling can be quite cheap, but in the process you'll get jumbled outputs.

Server nodes that receive only a few requests will be able to keep up, but any server node that receives a particularly large array will start to fall behind. Therefore, many online serving systems will impose a limit on the number of instances that can be sent in one request.

So, How do we solve this problem?

Solution

The solution is to use pass-through keys. Have the client supply a key associated with each input to identify each input instance. These keys will not be used as input to the model; they are called pass-through because they pass around, not through, the ML model.

Suppose your model accepts inputs a, b, c to produce an output d. Then have the client also supply a key k along with the inputs, as (k, a, b, c). The key can be as simple as an integer for a batch of requests; even UUID (universally unique identifier) strings are great.
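A toy sketch of the idea: model_predict is a stand-in for the deployed model, and reversing the batch simulates a cluster returning results out of order. The client joins the outputs back to the inputs by the pass-through key:

```python
import uuid

def model_predict(a, b, c):
    # stand-in for the real model; the key never enters this function
    return a + b + c

# client attaches a pass-through key to each instance: (k, a, b, c)
batch = [(str(uuid.uuid4()), i, i + 1, i + 2) for i in range(5)]

# the cluster may return results in any order; simulate with reversed()
jumbled = [(k, model_predict(a, b, c)) for (k, a, b, c) in reversed(batch)]

# the client reassembles the outputs in the original input order by key
by_key = dict(jumbled)
ordered = [by_key[k] for (k, _, _, _) in batch]
```

No matter how the servers shard and reorder the batch, the keys make the join back to the inputs trivial.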


Here is how you pass through keys in Keras

To get your Keras model to pass through keys, supply a serving signature when exporting the model.

For example, in the code below, the exported model takes four inputs (is_male, mother_age, plurality, and gestation_weeks) and also takes a key that it passes through to the output along with the model's original output (the babyweight):

# Serving function that passes through keys
@tf.function(input_signature=[{
    'is_male': tf.TensorSpec([None,], dtype=tf.string, name='is_male'),
    'mother_age': tf.TensorSpec([None,], dtype=tf.float32, name='mother_age'),
    'plurality': tf.TensorSpec([None,], dtype=tf.string, name='plurality'),
    'gestation_weeks': tf.TensorSpec([None,], dtype=tf.float32, name='gestation_weeks'),
    'key': tf.TensorSpec([None,], dtype=tf.string, name='key')
    }])
def keyed_prediction(inputs):
    feats = inputs.copy()
    key = feats.pop('key') # get the key out of inputs 
    output = model(feats) # invoke model
    return {'key': key, 'babyweight': output}

This model is then saved using the Keras model Saving API:

model.save(EXPORT_PATH, 
           signatures={'serving_default': keyed_prediction})

Adding keyed prediction capability to an existing model

To add a keyed prediction capability to an already saved model, just load the Keras model, attach a serving function, and save it again.

While attaching our new keyed prediction serving function, also provide a serving function that replicates the older no-key behavior to maintain backward compatibility:

# Serving function that replicates the old no-key behavior
@tf.function(input_signature=[{
    'is_male': tf.TensorSpec([None,], dtype=tf.string, name='is_male'),
    'mother_age': tf.TensorSpec([None,], dtype=tf.float32, name='mother_age'),
    'plurality': tf.TensorSpec([None,], dtype=tf.string, name='plurality'),
    'gestation_weeks': tf.TensorSpec([None,], dtype=tf.float32, name='gestation_weeks'),
}])
def nokey_prediction(inputs):
    output = model(inputs) # invoke model
    return {'babyweight': output}

And then add our already defined keyed prediction serving function:

model.save(EXPORT_PATH,
           signatures={'serving_default': nokey_prediction,
                       'keyed_prediction': keyed_prediction
})

Trade-Offs and Alternatives

Why can't servers just assign keys to the inputs they receive? For online prediction, it is possible for servers to assign unique request IDs. For batch prediction, the problem is that the inputs need to be associated with the outputs, so a server-assigned unique ID is not enough: it can't be joined back to the input.

What the server needs to do is assign keys to the inputs it receives before it invokes the model, use the keys to order the outputs, and then remove the keys before sending the outputs along. The problem is that ordering is computationally expensive in distributed data processing.
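For intuition, here is a toy sketch (not tied to any real serving framework) of that server-side bookkeeping: keys are assigned, work is fanned out, and outputs are reordered and stripped before the response is sent.

```python
# Illustrative sketch: the server tags each input with its position,
# fans the work out to workers, then restores order using the keys.
from concurrent.futures import ThreadPoolExecutor

def predict_one(keyed_input):
    key, instance = keyed_input
    # stand-in for invoking the model on one instance
    return key, instance * 2

def serve_batch(instances):
    keyed = list(enumerate(instances))          # server assigns keys
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(predict_one, keyed))
    results.sort(key=lambda kv: kv[0])          # reorder outputs by key
    return [output for _, output in results]    # strip keys before replying

print(serve_batch([1, 2, 3]))
```

The sort step is exactly the expensive part the text refers to: in a truly distributed setting, this global reordering does not come for free.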

Asynchronous Serving

Nowadays, many production ML models are neural networks, which involve matrix multiplications; these can be significantly more efficient when done on a hardware accelerator.

It is more efficient to ensure that the matrices are within certain size ranges and/or multiples of a certain number. It can therefore be helpful to accumulate requests (up to a maximum latency, of course) and handle the incoming requests in chunks. Since the chunks will consist of interleaved requests from multiple clients, the key in this case needs to include some sort of client identifier as well.
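A rough sketch of that accumulation idea, with hypothetical (client_id, request_id) pairs as the key and a stub standing in for the batched model call:

```python
# Minimal sketch of request accumulation: requests from several clients are
# buffered into one chunk; the (client_id, request_id) key lets each output
# be routed back to the right caller. All names here are illustrative.
from collections import defaultdict

def run_chunk(chunk):
    # chunk: list of ((client_id, request_id), input) pairs
    # stand-in for one batched model invocation
    return [(key, x + 0.5) for key, x in chunk]

buffer = [(("client_a", 0), 1.0), (("client_b", 0), 2.0), (("client_a", 1), 3.0)]
outputs = run_chunk(buffer)

# route results back per client using the key
per_client = defaultdict(dict)
for (client, req_id), y in outputs:
    per_client[client][req_id] = y

print(per_client["client_a"])
```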

Continuous Evaluation

If you are doing continuous evaluation, it can be helpful to log metadata about the prediction requests so that you can monitor whether performance drops across the board or only in specific situations.

That's all for today. Hope you learned something new.

This is Anurag Dhadse, signing off.

]]>
<![CDATA[Two-Phase Predictions—Hybrid mode of model deployment]]>Did you ever try to figure out how Alexa, Google Assistant, or any always-listening digital voice assistant devices are able to respond to every query of the user, without featuring complex AI hardware?

We already know because of the device constraints, models deployed on edge devices need to balance the

]]>
https://adhadse.com/two-phase-predictions-hybrid-mode-of-model-deployment/632e64a9c8984e5debaf3923Sat, 24 Sep 2022 17:44:29 GMT

Did you ever try to figure out how Alexa, Google Assistant, or any always-listening digital voice assistant devices are able to respond to every query of the user, without featuring complex AI hardware?

We already know that, because of device constraints, models deployed on edge devices need to balance the trade-off between accuracy and size, complexity, update frequency, and low latency.

  • Cloud-deployed models often have high latency, causing a bad user experience for voice assistant users.
  • Privacy is also an issue.

Problems like this are where Two-Phase Predictions can help resolve the conflict.


The idea is to split the use case into two phases, with the simpler phase carried out on the edge device and, when required, the more complex one on the cloud.

For the use case we talked about earlier,

  • We'll have one edge-optimized model deployed on the device, listening to the surroundings for wake words (like "Alexa", "Hey, Google", etc.) to determine whether the user wants to begin a conversation.
  • Upon successful detection, record the audio, and upon detecting the end of the conversation, send it to the cloud for the complex-phase prediction to determine the user's intention.

This implies the two phases are split as:

  1. Smaller, cheaper model deployed on edge device for the simpler task.
  2. Larger, complex model deployed on cloud and triggered only when needed.
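As a rough illustration of that split, here is a minimal control-flow sketch; `on_device_wake_word_prob` and `cloud_intent_model` are hypothetical stand-ins, not real APIs:

```python
# Hedged sketch of the two-phase control flow; both model functions below
# are stubs standing in for the real edge and cloud models.
WAKE_THRESHOLD = 0.8

def on_device_wake_word_prob(audio_frame):
    # phase 1: tiny edge model; here a stub returning a fake probability
    return 0.95 if "hey" in audio_frame else 0.05

def cloud_intent_model(recording):
    # phase 2: large cloud model, invoked only after wake-word detection
    return {"intent": "weather_query"}

def handle(audio_frame, recording):
    if on_device_wake_word_prob(audio_frame) >= WAKE_THRESHOLD:
        return cloud_intent_model(recording)   # a network call in practice
    return None                                # stay offline, keep listening

print(handle("hey google", "what's the weather?"))
```

Note that the expensive call only happens on the rare positive path; the cheap edge model runs continuously.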

Let's try it out!

Phase 1: Building the offline model

We'll need to convert the trained model into a form suitable for running and storing on edge devices. This can be done via a process known as quantization, where the learned model weights are represented with fewer bytes.

TensorFlow, for example, uses a format called TensorFlow Lite to convert saved models into a smaller format optimized for serving at the edge.

This approach is termed post-training quantization. The idea is to find the maximum absolute weight value, \(m\), then map the floating-point range (often float32) \(-m\) to \(+m\) onto the fixed-point (integer) range \(-127\) to \(+127\). This also requires the inputs to be quantized at inference time, which TFLite does for us automatically.

That is, weights go from 32-bit floating-point values to 8-bit signed integers, reducing the model to roughly a quarter of its original size.
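To make the arithmetic concrete, here is a minimal NumPy sketch of that linear mapping (an illustration of the idea, not TFLite's exact quantization scheme):

```python
# Linear quantization sketch: scale by 127/m, round to int8, then
# dequantize to see how small the introduced error is.
import numpy as np

weights = np.array([-0.7, -0.1, 0.0, 0.4, 0.9], dtype=np.float32)
m = np.abs(weights).max()                      # maximum absolute weight
scale = 127.0 / m

quantized = np.clip(np.round(weights * scale), -127, 127).astype(np.int8)
dequantized = quantized.astype(np.float32) / scale

print(quantized)                               # int8 values in [-127, 127]
print(np.abs(weights - dequantized).max())     # the quantization error
```

The worst-case rounding error per weight is half a quantization step, i.e. about m/254.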

To prepare the trained model for edge serving, we use TF Lite to export it in an optimized format:

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()
open('converted_model.tflite', 'wb').write(tflite_model)

To generate a prediction on a TF Lite model, we use the TF Lite interpreter, which is optimized for low latency. On edge devices, platform-specific libraries provide APIs to load the model and run inference.

For this, we create an instance of TF Lite's interpreter and get details on the input and output format it's expecting:

interpreter = tf.lite.Interpreter(model_path="converted_model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

input_details and output_details are each a list with a single dictionary object specifying the input/output specs of the converted TF Lite model, which looks like the following:

[{'name': 'serving_default_digits:0',
  'index': 0,
  'shape': array([  1, 784], dtype=int32),
  'shape_signature': array([ -1, 784], dtype=int32),
  'dtype': numpy.float32,
  'quantization': (0.0, 0),
  'quantization_parameters': {'scales': array([], dtype=float32),
  'zero_points': array([], dtype=int32),
  'quantized_dimension': 0},
  'sparsity_parameters': {}}]

We'll then get a prediction for an example from our validation batch using the loaded TF Lite model as follows:

input_data = np.array([test_batch[42]], dtype=np.float32)
interpreter.set_tensor(input_details[0]['index'], input_data)

interpreter.invoke()
output_data = interpreter.get_tensor(output_details[0]['index'])

It's worth noting here that, depending on how costly it is to call your model, you can change which metric you optimize for when you train the on-device model. For example, precision over recall if we don't care about false negatives.

The main problem with quantization is that it loses a bit of accuracy: it is roughly equivalent to adding noise to the weights and activations. If the accuracy drop is too severe, we may need to use quantization-aware training. This means adding fake quantization operations to the model so it can learn to ignore the quantization noise during training, making the final weights more robust to quantization.

Phase 2: Building the cloud model

Our cloud model doesn't need to be bound by the constraints we faced for the edge-optimized model. We can follow a more traditional approach for training, exporting, and deploying this model. This means we can combine multiple different design patterns, such as transfer learning, a cascade of models, or multiple different models, depending on the requirements of the second phase.

After training, we can then deploy this model to a cloud AI service provider (AWS, GCP, etc.). Or we can adopt a complete pipeline-based training and deployment setup using libraries like TFX.

To demonstrate, we'll pretend a model is already trained and then deploy it on Google Cloud AI Platform.

First, we'll directly save our model to our GCP project storage bucket:

cloud_model.save('gs://your_storage_bucket/path')
This will export our model in TF SavedModel format and upload it to the Cloud Storage bucket

On Google Cloud AI Platform, a model resource contains different versions of your model. Each model can have hundreds of versions. We'll create the model resource using gcloud, the Google Cloud CLI.

gcloud ai-platform models create second-phase-predictor

Then to deploy our model, we'll use gcloud and point AI Platform at the storage subdirectory that contains our saved model assets:

gcloud ai-platform versions create v1 \
  --model second-phase-predictor \
  --origin 'gs://your_storage_bucket/path/model_timestamp' \
  --runtime-version=2.1 \
  --framework='tensorflow' \
  --python-version=3.7

Trade-Offs and Alternatives

There might be situations where our end users have very little or no internet connectivity, making the second-phase, cloud-hosted model impossible to access. How can we mitigate this issue? Beyond that, how are we supposed to perform continuous evaluation, checking that metrics haven't degraded over time and that accuracy isn't suffering on the edge-deployed model?

Standalone single-phase model

In situations where end users of our model may have little or no internet connectivity, instead of relying on a two-phase prediction flow, we can make our first model robust enough that it can be self-sufficient.

To do this, we can create a smaller version of our complex model, and give users the option to download this simpler, smaller model for use when they are offline. These offline models may not be quite as accurate as their larger online counterparts, but this solution is infinitely better than having no offline support at all.

To build more complex models designed for offline inference, it's best to utilize quantization-aware training, whereby we quantize the model's weights and other math operations both during and after training.

Offline support for specific use cases

Another solution for making our application work for users with minimal internet connectivity is to make only certain parts of our app available offline. This might mean making only a few common features work offline, or caching the results of an ML model's prediction for later offline use.

This way, the app works sufficiently offline but provides full functionality when it regains connectivity.

Handling many predictions in near real time

In some other cases, end users of our ML model may have reliable connectivity but might need to make hundreds or even thousands of predictions against our model at once. This is the case with streaming sensor data, where we may be trying to detect some kind of anomaly.

Getting prediction responses for thousands of examples at once will take too much time due to the sheer number of requests and network bandwidth constraints.

Instead of constantly sending requests over the network for anomaly detection, we can have a model deployed directly on the sensors to identify possible anomaly candidates from incoming data and then send only potential anomalies to our cloud model for verification.

The main difference here is that both the offline and cloud models perform the same prediction task, but with different inputs.
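A toy sketch of that sensor scenario, with a simple threshold standing in for the edge model and a stub standing in for the cloud model:

```python
# Illustrative sketch: a cheap on-sensor check flags candidate anomalies,
# and only those candidates are "sent" to the (stub) cloud model for a
# verdict, instead of streaming every reading over the network.
import statistics

readings = [10.1, 9.8, 10.0, 35.2, 10.3, 9.9, 42.0]

mean = statistics.mean(readings)
stdev = statistics.stdev(readings)

def edge_candidate(x):
    # phase 1 on-device: flag anything far from the mean
    return abs(x - mean) > stdev

def cloud_verify(x):
    # phase 2 stub: the real service would run a much larger model
    return x > 30

candidates = [x for x in readings if edge_candidate(x)]  # few network calls
anomalies = [x for x in candidates if cloud_verify(x)]
print(anomalies)
```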

Continuous evaluation for offline models

We can save a subset of predictions that are received on-device. We could then periodically evaluate our model's performance on these examples and determine if the model needs retraining.

Another option is to create a replica of our on-device model to run online, only for continuous evaluation purposes. This solution is preferred if our offline and cloud models are running similar prediction tasks, like in Neural Machine Translation.

That's all for today. Hope you learned something new.

This is Anurag Dhadse, signing off.

]]>
<![CDATA[Ensemble—A Bundle of ML Models]]>https://adhadse.com/ensembles-a-bundle-of-ml-models/631c2cc985e4100f9fa19a77Sat, 17 Sep 2022 13:58:41 GMT

Wisdom of crowd.

Wisdom of crowds is a theory that assumes that the knowledge or collective opinion of a diverse independent group of individuals results in better decision-making, innovation, and problem-solving than that of an individual.

In the machine learning space, when it's hard to build a single model with a substantially lower reducible error, instead of building ever-larger models, we can combine several diverse ML models.

But what reducible error are we talking about?

You see, the error of an ML model can be broken down into two parts:

$$\text{Error of model} = \text{Irreducible error} + \text{Reducible error}$$

The irreducible error is the inherent error in the model resulting from noise in the dataset, bad training examples, or framing of the problem.

And the reducible error is made up of:

$$\text{Reducible error} = \text{Bias} + \text{Variance}$$

The bias is the model's inability to learn enough about the relationships between the model's features and labels. This is due to wrong assumptions, such as assuming the data is linearly separable when it is actually quadratic.

The variance captures the model's inability to generalize on new, unseen examples due to model's excessive sensitivity to small variations in the training data.

A model with high bias oversimplifies the relationship and underfits; a model with high variance learns too much (cramming the training data, in a sense) and is said to overfit.

Our task in modeling is to lower both bias and variance, but in practice this is not fully possible. This is called the bias-variance trade-off.


Ensembling is a solution for this trade-off, applied to small and medium-scale problems to reduce bias and/or variance and help improve performance. As stated above, this involves combining multiple models and aggregating their outputs to generate the final result.

The most common techniques in Ensemble Learning are:

  1. Bagging–good for decreasing variance
  2. Boosting–good for decreasing bias
  3. Stacking

Bagging

Bagging, or bootstrap aggregating, is a parallel ensembling method that uses the same training algorithm for every predictor and trains each one on a different random subset of the training set, sampled with replacement.

When sampling is performed without replacement, it is called pasting.

Aggregation is then performed on the outputs of these models: either an average, or a majority vote in the case of classification.

This works because each individual model can be off by a random amount, so when their results are averaged, these errors cancel out.
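We can check this claim numerically: averaging k simulated predictors with independent noise shrinks the error spread by roughly a factor of sqrt(k).

```python
# Quick numerical check of the "errors cancel out" claim: averaging k
# independent noisy predictors shrinks the error spread by ~sqrt(k).
import numpy as np

rng = np.random.default_rng(0)
true_value = 5.0
k = 25

# each "model" predicts the truth plus its own independent noise
predictions = true_value + rng.normal(0, 1.0, size=(10_000, k))

single_model_std = predictions[:, 0].std()     # spread of one model
ensemble_std = predictions.mean(axis=1).std()  # spread of the average

print(single_model_std, ensemble_std)
```

With k = 25, the ensemble's spread is roughly a fifth of a single model's, which is exactly the variance reduction bagging is after.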

We can also have hard-voting or soft voting when performing aggregation for classification models:

  • If the majority-vote classifier output is selected, then it is hard voting.
  • If the class probabilities from all classifiers are averaged and the highest average probability decides the final classification output, then it is soft voting.
Bagging is good for decreasing variance of the resulting ensemble model

A very popular example of bagging is random forest.

You can very easily create a random forest in Scikit-Learn as:

from sklearn.ensemble import RandomForestRegressor

# Create the model with 50 trees
RF_model = RandomForestRegressor(n_estimators=50,
                                 max_features='sqrt',
                                 n_jobs=-1, verbose=1)
                                 
# and fit the training data
RF_model.fit(X_train, Y_train)

To perform pasting, just set bootstrap=False for RandomForestRegressor/BaggingClassifier.

from sklearn.ensemble import BaggingClassifier

# Create the ensemble with 50 base estimators 
bc_model = BaggingClassifier(n_estimators=50,
                             max_samples=100,
                             bootstrap=False,
                             n_jobs=-1, verbose=1)
                                 
# and fit the training data
bc_model.fit(X_train, Y_train)
BaggingClassifier automatically performs soft voting if the base classifier can estimate class probabilities, i.e., it has a predict_proba() method

Boosting

Boosting refers to any ensemble method that can combine several weak learners to produce a strong learner with more capacity than the individual models.

Boosting iteratively improves upon a sequence of weak learners by training them sequentially, each trying to correct its predecessor.

Boosting works because at each iteration, the next model is fit to the residuals of the previous iteration.
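The residual idea can be illustrated with two shallow scikit-learn trees on synthetic data (a hand-rolled sketch of gradient boosting for squared error, not a production recipe):

```python
# Minimal residual-fitting sketch: the second tree is trained on what the
# first tree got wrong, and their sum predicts better than either alone.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.1, size=200)

tree1 = DecisionTreeRegressor(max_depth=2).fit(X, y)
residuals = y - tree1.predict(X)               # what the first learner missed
tree2 = DecisionTreeRegressor(max_depth=2).fit(X, residuals)

y_pred = tree1.predict(X) + tree2.predict(X)   # sum of the sequence
print(np.mean((y - y_pred) ** 2))              # training MSE of the pair
```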

The most popular boosting methods are AdaBoost (short for Adaptive Boosting) and Gradient Boosting.

Boosting is an effective method to reduce bias of the resulting classifier

Once again, in scikit-learn we can implement it as follows:

from sklearn.ensemble import GradientBoostingRegressor

# Create the ensemble with 50 boosting stages of shallow trees
GB_model = GradientBoostingRegressor(n_estimators=50,
                                     max_depth=1,
                                     learning_rate=1.0,
                                     criterion='squared_error')
                                     
# fit on training data
GB_model.fit(X_train, Y_train)

One important drawback of this sequential learning technique is that it cannot be parallelized. As a result, it does not scale as well as bagging or pasting.

Stacking

Stacking can be thought of as an extension of simple model averaging, where k models of different types/algorithms are trained on the complete dataset. More generally, we could modify the averaging step to take a weighted average of all the outputs.

Stacking comprises two steps:

  1. First, the initial models (typically of different types) are trained to completion on the full training dataset.
  2. Second, a meta-model is trained using the initial models' outputs as features; its task is to best combine the outcomes of the initial models to decrease the training error. Again, it can be any machine learning model.
Stacking works because it combines the best of both bagging and boosting.
The simplest form of model averaging averages model outputs or could be a weighted average of the outputs based on the relative accuracy of the individual models
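Scikit-learn ships a built-in stacker that performs both steps; here is a small sketch on synthetic data, with a plain linear regression as the meta-model:

```python
# Two-step stacking with scikit-learn's StackingRegressor: diverse base
# models (step 1) plus a linear meta-model (step 2). Data is synthetic.
import numpy as np
from sklearn.ensemble import StackingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(300, 3))
y = X[:, 0] * 2 - X[:, 1] + rng.normal(0, 0.1, size=300)

stack = StackingRegressor(
    estimators=[                               # step 1: diverse initial models
        ("rf", RandomForestRegressor(n_estimators=20, random_state=0)),
        ("knn", KNeighborsRegressor(n_neighbors=5)),
    ],
    final_estimator=LinearRegression(),        # step 2: the meta-model
)
stack.fit(X, y)
print(stack.score(X, y))                       # R^2 on the training data
```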

Trade-Offs and Alternatives

Increased training and design time

The obvious downside to ensemble learning is increased training and design time. With ensemble design patterns, complexity increases, since instead of developing one single model we are training k models, possibly of different types if we are using stacking.

However, we should carefully consider whether the overhead of building such ensemble models is worth it, by comparing their accuracy and resource usage with simpler models.

Dropout as bagging

Dropout is a very popular regularization technique in neural networks where a neuron is "dropped" during an iteration based on its dropout probability during training. It can be considered an approximation of bagging: a bagged ensemble of exponentially many neural networks.

Although, it's not exactly the same concept.

  • In the case of bagging, the models are independent, while in the case of dropout, the parameters are shared.
  • In bagging, the models are trained to convergence on their respective training sets, while with dropout, each ensemble member model is only trained for a single step.
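For reference, here is a from-scratch NumPy sketch of inverted dropout (the common implementation), where each forward pass samples a different "thinned" network:

```python
# Inverted dropout sketch: sample a random keep-mask each forward pass and
# rescale the kept units so the expected activation is unchanged.
import numpy as np

def dropout(activations, p_drop, rng, training=True):
    if not training:
        return activations               # no-op at inference time
    keep = rng.random(activations.shape) >= p_drop
    # scale kept units by 1/(1-p) so the expected activation is unchanged
    return activations * keep / (1.0 - p_drop)

rng = np.random.default_rng(1)
a = np.ones(10_000)
dropped = dropout(a, p_drop=0.5, rng=rng)
print(dropped.mean())                    # close to 1.0 in expectation
```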

Decreased model interpretability

For many production ML tasks, model interpretability and explainability are important, and ensembles don't fulfill this requirement well.

Choosing the right tool for the problem

It's also important to keep in mind the problem we were trying to solve in the first place, along with the bias-variance trade-off, and select the right tool: bagging if you want to reduce variance, boosting to reduce bias, and stacking otherwise.

Hope you learned something new.

This is Anurag Dhadse, signing off.

]]>
<![CDATA[Random splitting is Wicked.]]>https://adhadse.com/random-splitting-is-wicked/631c2a9a85e4100f9fa19a6cSat, 10 Sep 2022 00:27:00 GMT

You would have probably seen something like this in every machine learning tutorial out there:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.33, 
    random_state=42)

But there is a problem: it is rare that the rows are independent.

Take, for example, predicting the arrival delays of flights on a particular day: the instances/rows will be highly correlated. This can lead to leakage of information between the training and test datasets.

Plus, unless we set random_state, train_test_split will produce completely different splits every time it is run. This poses a problem when we care about reproducibility in our machine learning workflow.

This is where Repeatable Splitting comes in handy: splitting the data in a way that works regardless of programming language or random seeds, while also making sure that correlated rows fall into the same split.


The solution is to first identify a column that captures the correlation relationship between rows. Then we hash that column and use the last few digits of the hash to split the data.

So, in a time series dataset, where the rows are often correlated, we can take the date column and pass it to the Farm Fingerprint hashing algorithm to split the available data into the required splits.

SELECT
  airline,
  feature_1,
  feature_2,
  feature_3,
  feature_4
FROM
  `timeseries-data`.airline_ontime_data.flights
WHERE
  ABS(MOD(FARM_FINGERPRINT(date), 10)) < 8 -- 80% for TRAIN

Here, we compute the hash using the FARM_FINGERPRINT function and then use the modulo function to find an arbitrary 80% subset of the rows.

This is now repeatable: because the FARM_FINGERPRINT function returns the same value any time it is invoked on a specific date, we can be sure we will get the same 80% of the data each time.
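The same idea works outside BigQuery. FarmHash isn't in the Python standard library, so this sketch substitutes MD5; any stable hash gives the same repeatability guarantee, as long as everyone uses the same one:

```python
# Repeatable hash-based splitting in plain Python. MD5 stands in for
# FARM_FINGERPRINT here; the bucket assignment never changes between
# runs, machines, or languages.
import hashlib

def split_for(date_str, train_buckets=8, num_buckets=10):
    h = int(hashlib.md5(date_str.encode()).hexdigest(), 16)
    bucket = h % num_buckets           # a stable bucket in [0, num_buckets)
    return "train" if bucket < train_buckets else "eval"

print(split_for("2022-09-10"))
```

Every row with the same date lands in the same split, which is exactly the correlation-preserving property we want.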

But, there are some considerations when choosing which column to split on:

  • Rows on the same date tend to be correlated. Correlation is the biggest factor in the selection of column(s) on which to split.
  • date is not an input to the model even though it is used as the criterion for splitting. We can't use an actual model input as the field on which to split, because then the trained model would never see some of that feature's values: with an 80/20 split on the date column, 20% of the date values would appear only in the evaluation data, unseen during training.
  • There have to be enough date values. A rule of thumb is to shoot for 3–5× the denominator of the modulo; in this case (modulo 10), we want 40 or so unique dates.
  • The label has to be well distributed among the dates. To be safe, look at the distribution graph and make sure that all three splits have a similar distribution of labels.

Kolmogorov-Smirnov Test

To check whether the label distributions are similar across the three datasets, plot the cumulative distribution functions of the label in the three datasets and find the maximum distance between each pair.

The smaller the maximum distance, the better the split.
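A small NumPy sketch of that check: compute the empirical CDFs of the label in two splits and take the maximum gap between them (the two-sample KS statistic). The data here is synthetic and purely illustrative:

```python
# Two-sample Kolmogorov-Smirnov statistic from scratch: the maximum
# vertical gap between the two empirical CDFs.
import numpy as np

def ks_statistic(a, b):
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return np.abs(cdf_a - cdf_b).max()

rng = np.random.default_rng(0)
same = ks_statistic(rng.normal(0, 1, 5000), rng.normal(0, 1, 5000))
shifted = ks_statistic(rng.normal(0, 1, 5000), rng.normal(1, 1, 5000))
print(same, shifted)   # similar splits give a much smaller statistic
```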

Trade-Offs and Alternatives

Single Query

We can have a single query to generate training, validation, test splits:

CREATE OR REPLACE TABLE mydataset.mytable AS
SELECT
  airline,
  feature_1,
  feature_2,
  feature_3,
  feature_4,
  CASE(ABS(MOD(FARM_FINGERPRINT(date), 10)))
       WHEN 9 THEN 'test'
       WHEN 8 THEN 'validation'
       ELSE 'training' END AS split_col
FROM
  `timeseries-data`.airline_ontime_data.flights

Random split

If the rows are not correlated, we can hash the entire row of data by converting it to a string and hashing that string:

SELECT
  airline,
  feature_1,
  feature_2,
  feature_3,
  feature_4,
FROM
  `timeseries-data`.airline_ontime_data.flights f
WHERE
  ABS(MOD(FARM_FINGERPRINT(TO_JSON_STRING(f)), 10)) < 8
Duplicate rows will always fall in the same split. If that's not the behavior we want, add a unique ID column to the SELECT query.

Split on multiple columns

It might happen that a combination of multiple columns is correlated, say, the date and the arrival airport. In that case, we can simply concatenate the fields (creating a feature cross) before computing the hash.

CREATE OR REPLACE TABLE mydataset.mytable AS
SELECT
  airline,
  feature_1,
  feature_2,
  feature_3,
  arrival_airport,
FROM
  `timeseries-data`.airline_ontime_data.flights
WHERE
  ABS(MOD(FARM_FINGERPRINT(CONCAT(date, arrival_airport)), 10)) < 8

If we split on a feature cross of multiple columns, we can use arrival_airport (or any other feature used in conjunction) as one of the inputs to the model, since there will be examples of any particular airport in both the training and test sets.

Repeatable sampling

If we wanted to create a smaller dataset out of a bigger one (say, for local development), how would we go about doing it repeatably? Say we have a dataset of 50 million examples and we want a smaller dataset of one million flights. How would we pick 1 in 50 flights, and then 80% of those as training?

What we cannot do is:

SELECT
  airline,
  feature_1,
  feature_2,
  feature_3,
  feature_4,
FROM
  `timeseries-data`.airline_ontime_data.flights f
WHERE
  ABS(MOD(FARM_FINGERPRINT(date), 50)) = 0
  AND ABS(MOD(FARM_FINGERPRINT(date), 10)) < 8
We shouldn't do this!

We cannot pick 1 in 50 rows and then pick 8 in 10 this way: rows whose hash is divisible by 50 are also divisible by 10, so the second condition is always true and no 80% subset is actually selected.

What we can do however is:

SELECT
  airline,
  feature_1,
  feature_2,
  feature_3,
  feature_4,
FROM
  `timeseries-data`.airline_ontime_data.flights f
WHERE
  ABS(MOD(FARM_FINGERPRINT(date), 50)) = 0
  AND ABS(MOD(FARM_FINGERPRINT(date), 500)) < 400

In this query, the 500 is 50*10, and 400 is 50*8 (80% as training).

The first modulo picks 1 in 50 rows and the second modulo picks 8 in 10 of those rows.

For validation, you can change the query as:

  ABS(MOD(FARM_FINGERPRINT(date), 50)) = 0
  AND ABS(MOD(FARM_FINGERPRINT(date), 500)) BETWEEN 400 AND 449 -- (9*50)
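A quick brute-force check of this bucket arithmetic over hash residues confirms the 80/10/10 carve-up of the 1-in-50 sample:

```python
# Enumerate hash values divisible by 50 and bucket them with the mod-500
# test used in the queries above: 8 residues train, 1 valid, 1 test.
counts = {"train": 0, "valid": 0, "test": 0}
for h in range(0, 500_000, 50):           # every value with h % 50 == 0
    r = h % 500                           # one of 0, 50, ..., 450
    if r < 400:
        counts["train"] += 1
    elif r < 450:
        counts["valid"] += 1
    else:
        counts["test"] += 1
print(counts)
```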

Sequential split

In the case of time series models, a very common approach is to use sequential splits of data. The idea is to assign blocks or intervals of series data to various splits preserving the correlation among those examples in the individual split.

A sequential split of data is also necessary in fast-moving environments such as fraud detection or spam detection, even if the goal is not to predict the future value of a time series. The goal instead is to quickly adapt to new data and predict behavior in the near future.

Another instance where a sequential split of data is needed is when there are high correlations between successive times and we need to take seasonality into account. Take weather forecasting, for example: a day's weather depends heavily on the previous days' weather and is affected by the time of year.

To do a sequential split in this case, we can take the first 20 days of every month as the training dataset, the next 5 days as the validation dataset, and the last 5 days as the testing dataset.
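That day-of-month rule can be written as a tiny plain-Python function (a sketch; a real pipeline would apply it per row):

```python
# Sequential split by day of month: first 20 days train, next 5 validation,
# remaining days test.
from datetime import date

def sequential_split(d):
    if d.day <= 20:
        return "train"
    elif d.day <= 25:
        return "valid"
    return "test"

print(sequential_split(date(2022, 9, 3)))    # train
print(sequential_split(date(2022, 9, 24)))   # valid
print(sequential_split(date(2022, 9, 28)))   # test
```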

Stratified split

In the above example, the splitting needed to happen after the dataset was stratified. This means we needed to ensure the distribution of each category/type of example remains the same in every split, matching the distribution of the complete, unsplit dataset.

The larger the dataset, the less concerned we have to be with stratification. Therefore, in large-scale machine learning, the need to stratify isn't very common, except in the case of skewed datasets.

Unstructured data

Performing repeatable splitting in the case of structured data is quite straightforward. In the case of unstructured data, we can perform the same by using metadata information.

It is worth noting that many problems with poor ML performance can be addressed by designing the data split (and data collection) with potential correlations in mind.

Hope you learned something new.

This is Anurag Dhadse, signing off.

]]>
<![CDATA[Build in Public]]>https://adhadse.com/build-in-public/6312a9ba85e4100f9fa19678Sat, 03 Sep 2022 16:08:33 GMT

This is the last year of my Bachelor's degree. And soon after that, I'll be joining the industry to cater to customers' needs.

While in turn earning some money. But that's not the end, is it?

The last three years made me reevaluate things I do, things I wake up for, and my motivation. And this reevaluation schedule isn't something I want to end anytime soon.

People out there work for money. I want to work for myself. Sure money does play a role.

While I work for myself, I want to sustain myself that way. Motivation comes when you take action and put effort into something. One part of that motivation came as I learned to build things in public. Building in Public.

And of course, we are talking about sidekick software projects, but this can apply to a variety of different things: Content creation, learning a new skill, anything.

So why should you build in public?

First off, you are putting yourself in front of everyone

What that means is that not only will you put forward the best of yourself, but you'll also grow upon that as you progress.

You'll learn the ability to showcase your identity.

Let me tell you through my example. One of the first websites I created was a portfolio website for myself, written in pure HTML/CSS. And it sucked. This was probably in the first year of my bachelor's. After that, I moved on to building a much better blog site (this site) and even my own personal wiki.

You learn from your mistakes

As I moved on to learning various other tech stacks (Python, Django), I learned I could make websites faster by using template CSS libraries like Bootstrap or Tailwind.

The result was Shopiva. An eCommerce website with a Backend, RESTful API, and fluent design.

And I subsequently created this blogging site and a wiki. Learning from my mistakes, I focused more on the important parts, like functionality, adaptability, and usability, and less on design; learning what is most important to complete the project, and leaving the rest for my future self.

My wiki is basically a plain site with HTML/CSS just like my first web project, but unlike that one, I used a static site generator called Hugo to generate the webpages right from markdown files, halving my work while I focus on the theme.

You teach others

As you embark on learning something new and exciting, you always want to keep a check on your progress. In software development, ideally you'd want to document; for side projects, even good comments across the source code work fine.

But like any task, you are very likely to forget what you learned. Building in public can help you here. As you build things up, you expect others to read your code, and ideally your future self too. In that case, you tend to make sure the code is readable, maintainable, and properly commented, so others can follow it without diving into the nitty-gritty of the code.

On the other hand, if you'd been building in private, by yourself, it's very likely that you'll code as your heart desires, and not for anyone to appreciate your "hard work".

Or you can even go on and write an article, a blog post, or a YouTube video documenting the things you tried, how you did them, and what needed fixing. These are just a few of the methods of active recall, a technique to avoid falling down the forgetting curve.

If you don't actively recall what you've learned and how you've learned it, you'll forget.

Making sure others see your work, and learn from you as you teach them, brings the benefit of retaining the skill for a far longer period.

Others teach you

The open source community is ever so welcoming.

You'll hardly meet a jerk who'd come and say your code is shi*t. If you do, just ignore them.

There will always be someone better than you, writing better code, better in skill set. Opening up to the public can get you the attention of like-minded people. People who share the same interest and like doing things as their profession as well as a hobby.

Public and open source projects help both them and you learn about each other's way of doing things. People who are better than you will find the issues with your work and give valuable feedback.

If you are one of the most amazing developers, for sure you'll get your work used by many, probably even gaining more users than those pesky little proprietary apps.

And after all that, you'll get appreciation and recognition from the community. You don't always need to keep your work behind a paywall. If you can build trust, the community will even support you financially as you keep giving them better work.

I hope that inspires you to build something cool and usable for the public.

This is Anurag Dhadse, signing off.

]]>
<![CDATA[Demystifying Transfer Learning]]>https://adhadse.com/demystifying-transfer-learning/6309a4d385e4100f9fa19070Sun, 28 Aug 2022 06:31:31 GMT

Did you ever wonder why newborns pick up so much and so fast of behavior, language, and expressions in just a matter of few years?

Well, the reason is the genetic transfer of DNA and characteristics from parents to offspring. This means that the offspring isn't dumb at all at the time of birth (unlike our machines and ahem..hem AI). It already knows how to catch attention when hungry or smile on seeing familiar faces.

This transfer of intelligence is driven by years of evolution and is hard to match with the current state of Artificial Intelligence.

This explains the difference from human intelligence: a toddler can learn to differentiate between a cat and a dog after glancing at just a few images and can keep learning more, whereas an ML model, once trained to identify a certain concept, can't absorb a different concept without forgetting the previous one.

But we can learn something from nature: the transfer of knowledge. Albeit in a different fashion.

Introducing Transfer Learning.

In Transfer Learning, we take part of a previously trained model, freeze the weights, and incorporate these non-trainable layers into a new model that solves a similar problem, but on a smaller dataset.

To go along with the analogy with humans, it's kinda like taking an intellectual's brain and fusing it with a toddler's so that the learning can begin where the intellectual's ended. Ah, that sounds evil.

But in ML space, it's quite common and not so evil.


Let's take an example. Suppose you are tasked to build an ML model to classify between cats and dogs. If you read through my previous blog post on Checkpoints, we know that the model goes through 3 different phases during training:

  1. In the first phase, training focuses on learning high-level organization of data.
  2. In the second phase, the focus shifts to learning the details.
  3. Finally, in the third phase, the model begins overfitting.

So, even before the model can begin to grasp the concept of cats and dogs, it has to go through the first phase of absorbing the high-level organization of data. That corresponds to making sense of the pixels, their color values, edges, and shapes in the images. This is why we need a huge corpus of data to generalize the high-level concepts.

Large image and text datasets like ImageNet (with over 14 million labeled examples) and GLUE help many ML tasks reach high accuracy due to their immense size. But most organizations with specialized prediction problems don't have nearly as much data available for their domain, or the data is expensive to gather, as in the medical domain where experts are required for an accurate labeling process.

We need a solution that allows us to build a custom model using only the data we have available and with the labels that we care about.


Understanding Transfer Learning

With Transfer Learning, we can take a model that has been trained on the same type of data for a similar task and apply it to a specialized task using our own custom data.

By the same type of data, we mean the same data modality: images, text, and so forth. It is also ideal to use a model that has been pretrained on the same kind of images. For example, if the end model gets input of cats/dogs from a smartphone camera, use a model pretrained on images gathered from smartphone cameras.

By similar task, we're referring to the problem being solved. To do transfer learning for image classification, for example, it is better to start with a model that has been trained for image classification, rather than object detection.

Let's say we are trying to determine if a given x-ray contains a broken bone or not. As this is a medical dataset, it is small: merely 500 images for each label, broken and not broken. This obviously isn't enough to train a model from scratch, but we can use transfer learning to help a bit. We'll need to find a model that has already been trained on a large dataset to do image classification. We'll then remove the last layer from that model, freeze the weights of the remaining model, and continue training using our 1,000 x-ray images.

Ideally, we want the base model to be trained on a dataset with similar images to x-rays. However, we can still utilize transfer learning if the datasets are different, so long as the prediction task is the same. Which in this case is Image Classification.

The idea is to utilize the weights and layers from a model trained in the same domain as your prediction task. In most deep learning models, the final layer contains the classification label or output specific to its original prediction task. So, we remove that layer and introduce our own final layer with the output for our specialized prediction task, then continue training.

The penultimate layer of the model, the layer before the output layer, is chosen as the bottleneck layer.

Demystifying Transfer Learning
The "top" of the model, typically just the output layer, is removed and the remaining weights are frozen. The last layer of the remaining model is called the bottleneck layer.

Bottleneck layer

The bottleneck layer represents the inputs in the lowest-dimensionality space.

Let's try implementing it in TensorFlow and Keras for X-ray images to detect viral Pneumonia. We are going to use VGG19 pretrained model available in Keras applications module with pretrained weights of imagenet dataset.

import tensorflow as tf

vgg_model_withtop = tf.keras.applications.VGG19(
    include_top=True,
    weights='imagenet',
)
Model: "vgg19"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_2 (InputLayer)        [(None, 224, 224, 3)]     0         
                                                                 
 block1_conv1 (Conv2D)       (None, 224, 224, 64)      1792      
                                                                 
 ... more layer ...
                                                                 
 block4_pool (MaxPooling2D)  (None, 14, 14, 512)       0         
                                                                 
 block5_conv1 (Conv2D)       (None, 14, 14, 512)       2359808   
                                                                 
 block5_conv2 (Conv2D)       (None, 14, 14, 512)       2359808   
                                                                 
 block5_conv3 (Conv2D)       (None, 14, 14, 512)       2359808   
                                                                 
 block5_conv4 (Conv2D)       (None, 14, 14, 512)       2359808   
                                                                 
 block5_pool (MaxPooling2D)  (None, 7, 7, 512)         0         
                                                                 
 flatten (Flatten)           (None, 25088)             0         
                                                                 
 fc1 (Dense)                 (None, 4096)              102764544 
                                                                 
 fc2 (Dense)                 (None, 4096)              16781312  
                                                                 
 predictions (Dense)         (None, 1000)              4097000   
                                                                 
=================================================================
Total params: 143,667,240
Trainable params: 0
Non-trainable params: 143,667,240
_________________________________________________________________
Output of vgg_model_withtop.summary()

In this example, we choose the block5_pool layer as the bottleneck layer when we adapt this model to be trained on our Chest X-Ray Images dataset. The bottleneck layer produces a 7x7x512 dimensional array, which is a low-dimensional representation of the input image.

We hope that the information distillation will be sufficient to successfully carry out classification on our dataset.

Demystifying Transfer Learning
Transfer Learning using VGG19 pretrained model

Since the model we are going to work with accepts images as a 224x224x3 array, we need to either resize our images to match this model input or change the model's input shape. Here we'll just go with resizing the input images.
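As a sketch of that resizing step, recent versions of Keras provide a Resizing preprocessing layer; the 300x300 source size below is just an assumption for illustration:

```python
import tensorflow as tf

# Resizing rescales any incoming image to the VGG19 input size
resize = tf.keras.layers.Resizing(224, 224)

# hypothetical batch of 300x300 RGB images
images = tf.zeros([8, 300, 300, 3])
resized = resize(images)
print(resized.shape)  # (8, 224, 224, 3)
```

The layer can also be placed at the front of the model itself, so the resizing travels with the exported model.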

vgg_model = tf.keras.applications.VGG19(
    include_top=False,
    weights='imagenet',
    input_shape=(224, 224, 3)
)

vgg_model.trainable = False

By setting include_top=False we're specifying that the last layer of the VGG we want to load is the bottleneck layer.

Note that setting include_top=False hardcodes block5_pool as the bottleneck layer; if we wanted to customize this, we could have loaded the full model, like we did previously, and deleted additional layers ourselves.
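A sketch of that customization: load the full model and cut it at any layer we like, say block4_pool. Here weights=None is used only to skip the weight download; in practice you'd keep weights='imagenet'.

```python
import tensorflow as tf

# load the full VGG19 architecture (weights=None to skip the download;
# use weights='imagenet' in practice)
full_model = tf.keras.applications.VGG19(include_top=True, weights=None)

# pick a custom bottleneck layer, e.g. block4_pool instead of block5_pool
bottleneck = full_model.get_layer("block4_pool").output
custom_base = tf.keras.Model(inputs=full_model.input, outputs=bottleneck)
custom_base.trainable = False

print(custom_base.output_shape)  # (None, 14, 14, 512)
```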

 block5_conv2 (Conv2D)       (None, 14, 14, 512)       2359808   
                                                                 
 block5_conv3 (Conv2D)       (None, 14, 14, 512)       2359808   
                                                                 
 block5_conv4 (Conv2D)       (None, 14, 14, 512)       2359808   
                                                                 
 block5_pool (MaxPooling2D)  (None, 7, 7, 512)         0         
                                                                 
=================================================================
Total params: 20,024,384
Trainable params: 0
Non-trainable params: 20,024,384
_________________________________________________________________
Updated model with no "top"

With the keras.applications module, setting the input_shape parameter changes the layers' dimensions to accommodate the new input dimension.

Well, do consider that, as a general rule of thumb, the bottleneck layer is typically the last, lowest-dimensionality layer before a flattening operation.

It is also worth noting that pre-trained embeddings can be used in transfer learning too. With embeddings, however, the purpose is to represent an input more concisely, whereas with transfer learning the purpose is to reuse the layers of a model trained on a similar task.

Implementing transfer learning

We can implement transfer learning in Keras either by:

  • Loading a pre-trained model, removing the layers after the bottleneck layer, and adding a new final layer trained with our own data and labels.
  • Using a pre-trained TensorFlow Hub (https://tfhub.dev) module as the base for your transfer learning task.

Transfer Learning with pre-trained model

We have already set up our VGG model with a bottleneck layer. Let's add a few more layers to make our final model.

from tensorflow import keras

model = tf.keras.Sequential([
    vgg_model,
    keras.layers.GlobalAveragePooling2D(),
    keras.layers.Dense(2, activation="sigmoid")
])
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 vgg19 (Functional)          (None, 7, 7, 512)         20024384  
                                                                 
 global_average_pooling2d (G  (None, 512)              0         
 lobalAveragePooling2D)                                          
                                                                 
 dense (Dense)               (None, 2)                 1026      
                                                                 
=================================================================
Total params: 20,025,410
Trainable params: 1,026
Non-trainable params: 20,024,384
_________________________________________________________________
Our new model summary

As you can see, the only trainable parameters are from the last layer (after the bottleneck layer).

Had we wanted to use our own custom pre-trained model aside from what is offered in keras.applications, we would have done something like this:

model_A = keras.models.load_model("my_model_A.h5")
model_B_ontop_of_A = keras.models.Sequential(model_A.layers[:-1])
model_B_ontop_of_A.add(keras.layers.Dense(1, activation="sigmoid"))
model_B_ontop_of_A uses all layers except the last one of model_A

Although, this method means model_B_ontop_of_A and model_A share some weights, and hence training model_B_ontop_of_A will also affect model_A.

To avoid that, we need to clone model_A's architecture with clone_model(), then copy its weights (since clone_model() does not clone the weights), and finally freeze the layers:

model_A_clone = keras.models.clone_model(model_A)
model_A_clone.set_weights(model_A.get_weights())

# build `model_B_ontop_of_A` from the clone, not the original
model_B_ontop_of_A = keras.models.Sequential(model_A_clone.layers[:-1])

# freeze the copied layers
for layer in model_B_ontop_of_A.layers:
    layer.trainable = False

# add the new output layer for the new task
model_B_ontop_of_A.add(keras.layers.Dense(1, activation="sigmoid"))

Pre-trained embeddings with TF Hub

With TF Hub, we can very easily load a much larger variety of pre-trained models (called modules) as a layer, and then add our own classification layer on top.

import tensorflow as tf
import tensorflow_hub as hub

hub_layer = hub.KerasLayer(
    "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1",
    input_shape=[], dtype=tf.string, trainable=False
)

And, add additional layers on top:

model = keras.Sequential([
    hub_layer,
    keras.layers.Dense(32, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])

Trade-Offs and Alternatives

Let's discuss the methods of modifying the weights of our original model when implementing transfer learning:

  • Feature Extraction
  • Fine-tuning

Fine-tuning vs Feature Extraction

Feature extraction describes an approach to transfer learning where you freeze the weights of all layers before the bottleneck layer and train the following layers on your own data and labels.

In contrast, with fine-tuning we can either update the weights of each layer in the pre-trained model, or just a few of the layers right before the bottleneck.

One recommended approach to determining how many layers to freeze is known as progressive fine-tuning. This involves iteratively unfreezing layers after every training run to find the ideal number of layers to fine-tune. Also, it is recommended to lower the learning rate as you begin unfreezing layers.
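Progressive fine-tuning can be sketched with a small helper that unfreezes only the last n layers of the base model before each run; the commented training loop below assumes the vgg_model, model, train_ds, and val_ds objects from earlier and is only illustrative:

```python
import tensorflow as tf

def unfreeze_last_n(base_model, n):
    """Freeze every layer, then unfreeze only the last `n`."""
    for layer in base_model.layers:
        layer.trainable = False
    if n > 0:
        for layer in base_model.layers[-n:]:
            layer.trainable = True

# after each run, unfreeze a few more layers and lower the learning rate:
# for n, lr in [(0, 1e-3), (2, 1e-4), (4, 1e-5)]:
#     unfreeze_last_n(vgg_model, n)
#     model.compile(optimizer=tf.keras.optimizers.Adam(lr),
#                   loss="binary_crossentropy")
#     model.fit(train_ds, validation_data=val_ds, epochs=2)
```

Remember to re-compile the model after changing trainable flags, otherwise the change has no effect on training.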

Typically, when you've got a small dataset, it's best to use the pre-trained model as a feature extractor rather than fine-tuning it.

  • How large is the dataset? Feature extraction: small. Fine-tuning: large.
  • Is your prediction task the same as that of the pre-trained model? Feature extraction: different tasks. Fine-tuning: the same task, or a similar task with the same class distribution of labels.
  • What is the budget for training time and computational cost? Feature extraction: low. Fine-tuning: high.

Is Transfer Learning possible with tabular data?

Tabular data, however, covers a potentially infinite number of possible prediction tasks and data types, so transfer learning is currently not common with tabular data.

That's all for today.

This is Anurag Dhadse, signing off.

]]>
<![CDATA[Checkpoints. Not every ML model trains in minutes.]]>https://adhadse.com/checkpoints-not-every-ml-model-trains-in-minutes/63007f3485e4100f9fa18c21Sat, 20 Aug 2022 15:46:53 GMT

Yeah, not every ML model trains in minutes, at least not in Deep Learning space.

The more complex the model, the larger the dataset required to train it because of the corresponding increase in parameters. This leads to longer times to fit a batch, and hence longer training time.

In that case, it is good to think about measures to withstand the chances of machine failure during the training process. We don't want to begin from scratch when we have already done half of the work.


Checkpoints allow us to store the full state of the partially trained model (the architecture and weights) along with the hyperparameters/parameters required to resume training from that point, periodically during the training process.

We can use this partially trained model as:

  • Final models (in case of early stopping, discussed later)
  • Starting point to continue training (machine failure and fine-tuning)

Checkpoints save the intermediate model state, whereas exporting saves only the final model parameters (weights and biases) and architecture. To resume training, more information is required than just those two: the optimizer that was used, what parameters it was running with, its state, how many epochs were set, how many were completed, and so on.

In Keras, we can create checkpoints using the ModelCheckpoint callback passed to the fit() method.

import time

import tensorflow as tf

model_name = "my_model"
run_id = time.strftime(f"{model_name}-run_%d_%m_%Y-%H_%M_%S")
checkpoint_path = f"./checkpoint/{run_id}.h5"
cp_callback = tf.keras.callbacks.ModelCheckpoint(
    checkpoint_path,
    save_weights_only=False,
    verbose=1)
  
history = model.fit(
    x_train, y_train,
    batch_size=64,
    epochs=3,
    validation_data=(x_val, y_val),
    verbose=2,
    callbacks=[cp_callback])

ModelCheckpoint allows us to save checkpoints at the end of each epoch. We could checkpoint at the end of each batch instead, but the checkpoint size and I/O would add too much overhead.
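If batch-level resilience is worth the overhead, ModelCheckpoint's save_freq parameter accepts an integer number of batches instead of the default 'epoch'; the path and frequency below are just illustrative:

```python
import tensorflow as tf

# save weights every 500 batches rather than once per epoch
cp_every_n = tf.keras.callbacks.ModelCheckpoint(
    "./checkpoint/latest.weights.h5",  # hypothetical path
    save_weights_only=True,
    save_freq=500)
```

Saving only the weights keeps these frequent checkpoints small compared to full-model saves.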


Why it works

Partially trained models offer more options than just continued training. This is because they are usually more generalizable than the models created in later iterations.

We can break the training into three phases:

  1. In the first phase, training focuses on learning high-level organization of data.
  2. In the second phase, the focus shifts to learning the details.
  3. Finally in the third phase, the model begins overfitting.

A partially trained model from the end of phase 1 or from phase 2 becomes more advantageous because it has learned the high-level organization but still hasn't dived into the details.


Trade-Offs and Alternatives

Early Stopping

Usually, the longer the training continues, the lower the loss goes on the training dataset. However, at some point, the error on the validation dataset might stop decreasing. This is where overfitting begins to take place. This phenomenon is evident with the increase in the validation error.

Checkpoints. Not every ML model trains in minutes.
Once overfitting begins, the validation error starts climbing up

It can be helpful to look at the validation error at the end of every epoch and stop training when the validation error is more than that of the previous epoch.
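In Keras this is available out of the box as the EarlyStopping callback; patience controls how many epochs of no improvement to tolerate, and the value 3 here is just an example:

```python
import tensorflow as tf

# stop when val_loss hasn't improved for 3 consecutive epochs,
# and roll the model back to its best weights
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=3,
    restore_best_weights=True)

# passed alongside other callbacks:
# model.fit(..., callbacks=[early_stop, cp_callback])
```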

Checkpoint selection

It is not uncommon for the validation error to increase for a bit and then start to drop again. This usually happens because training initially focuses on more common scenarios (phase 1), then on rare samples (phase 2). Because rare situations may be imperfectly sampled between the training and validation datasets, occasional increases in the validation error during the training run are to be expected in phase 2.

So, we should train for longer and choose the optimal checkpoint as a post-processing step.

In our example above, we'll continue training for longer, load the fourth checkpoint, and export that as the final model. This is called checkpoint selection, and in TensorFlow it can be achieved using BestExporter.
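In Keras (as opposed to the Estimator-based BestExporter), a comparable effect comes from ModelCheckpoint with save_best_only=True, which keeps only the checkpoint with the best validation metric; the file path here is a placeholder:

```python
import tensorflow as tf

# keep only the checkpoint with the lowest validation loss
best_cb = tf.keras.callbacks.ModelCheckpoint(
    "./checkpoint/best_model.h5",  # hypothetical path
    monitor="val_loss",
    mode="min",
    save_best_only=True)
```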

Regularizations

We can try to make both the validation error and the training loss plateau by adding L2 regularization to the model instead of using the above two techniques.

Such a training loop is termed as a well-behaved training loop.

Checkpoints. Not every ML model trains in minutes.
In an ideal situation, the validation error and training loss should plateau.

However, recent studies suggest that double descent happens in a variety of machine learning problems, and therefore it is better to train longer rather than risk a suboptimal solution by stopping early.

In the experimentation phase (when we are exploring different model architectures, hypertuning, etc.), it's recommended that you turn off early stopping and train with larger models. This will ensure that the model has enough capacity to learn the predictive patterns. At the end of experimentation, you can use the evaluation dataset to diagnose how well your model does on data it has not encountered during training.

When training the model to deploy in production, turn on early stopping or checkpoint selection and monitor the error metric on the evaluation dataset.

When you need to control cost, choose early stopping, and when you want to prioritize model accuracy choose checkpoint selection.

Fine-tuning

Fine-tuning is a process that takes a model already trained for one task and tweaks it to perform a second, similar task. Since we have checkpoints, we can start from an already well-performing model and train it on a small amount of fresh data.

Checkpoints. Not every ML model trains in minutes.
Resume from a checkpoint from before the training loss starts to plateau. Then train only on fresh data for subsequent iterations.

Starting from an earlier checkpoint tends to provide better generalizations as compared to final models/checkpoints.

Redefining an epoch

Epochs are easy to understand: the number of times the model has gone over the entire dataset during training. But using epochs can lead to bad effects in real-world ML models.

Let's take an example where we are going to train an ML model for 15 epochs using a TensorFlow Dataset with one million examples.

cp_callback = tf.keras.callbacks.ModelCheckpoint(...)
history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=15,
    batch_size=128,
    callbacks=[cp_callback])

The problems with this are:

  • If the model converges after having seen 14.3 million examples (i.e., after 14.3 epochs), we might want to exit and not waste any more computational resources.
  • ModelCheckpoint creates a checkpoint at the end of each epoch. For resilience, we might want to checkpoint more often instead of waiting to process 1 million examples.
  • Datasets grow over time. If we get 100,000 more examples, train the model, and get a higher error, is it because we need to stop early or because the new data is corrupt? We can't tell, because the prior training was on 15 million examples and the new one is on 16.5 million examples (15 million + 100,000 new examples × 15 epochs).
  • In distributed, parameter-server training the concept of an epoch is not clear. Because of potentially straggling workers, you can only instruct the system to train on some number of mini-batches.

Steps per epoch

Instead of training for 15 epochs, we might decide to train for 143,000 steps with a batch_size of 100:

NUM_STEPS = 143_000
BATCH_SIZE = 100
NUM_CHECKPOINTS = 15
cp_callback = tf.keras.callbacks.ModelCheckpoint(...)

history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=NUM_CHECKPOINTS,
    steps_per_epoch=NUM_STEPS // NUM_CHECKPOINTS,
    batch_size=BATCH_SIZE,
    callbacks=[cp_callback])

It works as long as we make sure to repeat the train_ds infinitely:

train_ds = train_ds.repeat()

Although this gives us much more granularity, we now have to define an "epoch" as 1/15th of the total number of steps:

steps_per_epoch=NUM_STEPS // NUM_CHECKPOINTS

Retraining with more data

Let's talk about the scenario where we add 100,000 more examples. Our code remains the same and processes 143,000 steps, except that about 10% of the examples it sees are newer.

If the model converges, great. If it doesn't, we know the new data points are the issue, because we are training exactly as we were before.

Once we have trained for 143,000 steps, we restart the training and run it a bit longer, as long as the model continues to converge. Then we update the number 143,000 in the code above (in reality, it will be a parameter to the code) to reflect the new number of steps.

This works fine until you begin hyperparameter tuning. Let's say you change the batch size to 50; then you'll only be training for half the time, because the number of steps is constant (143,000) and each step now sees only half as many examples as before.

Introducing Virtual epochs

The solution is to keep the total number of training examples shown to the model constant and not the number of steps.

NUM_TRAINING_EXAMPLES = 1000 * 1000
STOP_POINT = 14.3
TOTAL_TRAINING_EXAMPLES = int(STOP_POINT * NUM_TRAINING_EXAMPLES)
BATCH_SIZE = 100
NUM_CHECKPOINTS = 15
steps_per_epoch = (
    TOTAL_TRAINING_EXAMPLES // (BATCH_SIZE * NUM_CHECKPOINTS)
)
cp_callback = tf.keras.callbacks.ModelCheckpoint(...)

history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=NUM_CHECKPOINTS,
    steps_per_epoch=steps_per_epoch,
    batch_size=BATCH_SIZE,
    callbacks=[cp_callback]
)

When we get more data, first train with the old settings, then increase NUM_TRAINING_EXAMPLES to reflect the new data, and finally change STOP_POINT to reflect the number of times you have to traverse the data to attain convergence.

This will work even when we are doing hyperparameter tuning while retaining all the advantages of keeping the number of steps constant.
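We can sanity-check this arithmetic in plain Python: whatever batch size we pick, the number of examples shown to the model stays approximately constant, differing from the 14.3 million target only by integer-division rounding.

```python
NUM_TRAINING_EXAMPLES = 1000 * 1000
STOP_POINT = 14.3
TOTAL_TRAINING_EXAMPLES = int(STOP_POINT * NUM_TRAINING_EXAMPLES)
NUM_CHECKPOINTS = 15

def steps_per_epoch(batch_size):
    # same formula as the training code above
    return TOTAL_TRAINING_EXAMPLES // (batch_size * NUM_CHECKPOINTS)

# examples seen stays close to 14,300,000 for any batch size
for batch_size in (50, 100, 200):
    seen = steps_per_epoch(batch_size) * batch_size * NUM_CHECKPOINTS
    print(batch_size, seen)
```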

Hope you learned something wonderful.

This is Anurag Dhadse, Signing off.

]]>
<![CDATA[Diving Into REST APIs]]>https://adhadse.com/diving-into-rest-apis/629ec8c18413e049cdd480d8Sun, 14 Aug 2022 02:30:00 GMT

APIs or Application Programming Interfaces are magical creatures, I mean really; or at least to us, programmers.

Before I tell you more about APIs, it's important to briefly go through "interfaces" first.

Interfaces are everywhere: in your smartphone as the GUI (Graphical User Interface) a music player app offers to end users, or the Bash shell offering a CLI (Command Line Interface) to end users and programmers alike.

So what does an interface do? An interface adds a layer of abstraction, hiding away the intricate details of what works underneath and how. Users of an interface don't need to know how the music player app plays music, or how a command gets executed when they press Enter on the command line.

APIs bring the same abstraction but to a different kind of user: not end users, as in the music player example, but developers or programmers (the 'P' in API).

For example, if we were the programmers of the music player app, we might not need to implement the code for gestures or the play/pause feature ourselves.

This can be provided by an API of an SDK (Software Development Kit) provided by the platform.

You'll also find libraries offering their APIs to aid your work; for example TensorFlow, a mathematical computation framework written in C++, offers its API in Python and Java as well.

But for the most part, when we talk about APIs we are talking about Web APIs, not the API of a library or an SDK, unless stated otherwise.

So, in this article we'll go through all about APIs: what they do, how they do it, and what we need to keep in mind as developers when building our own Web APIs.

What exactly is an API?

An API, or Web API, is a software interface accessible via the internet that offers a connection between your software and software running on a server, as a type of service.

Diving Into REST APIs

This allows us to perform some operations on the server, get some data from the server, or both: perform an operation and get the result back from the server.

This can be especially useful when:

  1. The client application needs some kind of service but doesn't have the capabilities/resources to provide it. For example, Shazam (a popular music identification app) might need machine learning/deep learning services that the client side doesn't have the resources to run because of performance constraints.
  2. or the service resides on a server and a database needs to be accessed for CRUD operations (Create, Read, Update, Delete), like storing records of a user.

The API developer exposes endpoints, relative to the domain on which the API is hosted, where passing a request to the endpoint gets us a response.

So for example, an endpoint might look something like this:

  • <domain-name>/reply
  • <domain-name>/posts/reply
  • from Spotify https://api.spotify.com/v1/albums/{id}/tracks
  • from Twitter https://api.twitter.com/2/users/{id}/liked_tweets

Making an HTTP request with a suitable HTTP method to an endpoint gets the process started.

A "Hello World" example of an API

What internally gets done depends upon your imagination. But let's create our own simple API using the FastAPI framework in Python.

from fastapi import FastAPI

app = FastAPI()

@app.get("/")
def check_reply():
    return {"reply": "hello world"}

Running this app locally, when we visit localhost (http://127.0.0.1/), we'll be greeted with a JSON reply.

Diving Into REST APIs

A couple of points to note here:

  • The @app.get("/") decorator makes sure that the method only runs when a request with the GET method is made to the / root URL.
  • JSON stands for JavaScript Object Notation, i.e., JavaScript's way of denoting an object. It's very similar to a Python dictionary, and that's why we returned a dictionary with a key-value pair, which FastAPI makes sure to convert to JSON before sending the reply. This object can be anything you like that can be JSON-ified.
  • There is another format in which a reply could be sent: XML (eXtensible Markup Language), which used to be popular for sending raw text data over the internet before JSON came along.
Video: "Discovering JavaScript Object Notation with Douglas Crockford", in which Charles Severance interviews Douglas Crockford on how JSON was discovered.

When we click on a link on a website, the HTTP request our browser creates uses the GET method.

And that's why we were able to get a response back.

Now what if instead of just getting some data, we want to upload some data too?

Well, that requires us to change the HTTP method from GET to something like POST. Let's first discuss HTTP methods and what they mean to our API.

HTTP Methods

Our API relies completely on the HTTP protocol for communication. So which request needs to be dealt with in which way is all defined by HTTP.

HTTP request methods indicate the desired action to be performed for a given resource identified by an endpoint.

There are a variety of request methods, but these five are the most important to us when creating our own APIs.

  • GET method is used to read (or retrieve) a representation of a resource. Requests using GET should only retrieve data. A successful request returns a JSON/XML response with an HTTP status code of 200 (OK), or, in case of an error, 404 (Not Found) or 400 (Bad Request).
  • POST method is most often utilized to create new resources, by submitting an entity to the specified resource, often causing a change in state or side effects on the server, like creating a new tuple in a database table. On successful creation, return HTTP status 201 (Created) with a Location header linking to the newly created resource in a JSON/XML response.
  • PUT method is used for updating/replacing the current representation of the target resource with the request payload. On a successful PUT request, return 200 (or 204 if not returning any content in the body).
  • PATCH method applies partial modifications to a resource. This might look similar to PUT, but in a PATCH request the body contains a set of instructions describing how a resource currently residing on the server should be modified to produce a new version, instead of just the modified part of the resource.
  • DELETE method deletes the specified resource identified by the URI.
A specific request should only lead to what the request method implies, nothing more, nothing less. For example, a GET request shouldn't update a resource, only fetch it.
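To make these semantics concrete, here is a minimal, framework-free sketch of the five methods acting on a hypothetical in-memory "users" store. The store, function names, and payloads are illustrative stand-ins, not part of any real framework:

```python
# In-memory stand-in for a database table of users.
users = {}
next_id = 1


def get_user(user_id):
    """GET: read a resource; 200 on success, 404 if missing."""
    if user_id in users:
        return 200, users[user_id]
    return 404, {"error": "Resource not found"}


def post_user(payload):
    """POST: create a new resource; 201 plus its Location."""
    global next_id
    user_id = next_id
    next_id += 1
    users[user_id] = payload
    return 201, {"Location": f"/users/{user_id}"}


def put_user(user_id, payload):
    """PUT: replace the entire representation; 200 on success."""
    users[user_id] = payload
    return 200, users[user_id]


def patch_user(user_id, changes):
    """PATCH: apply partial modifications; 200 on success."""
    users[user_id].update(changes)
    return 200, users[user_id]


def delete_user(user_id):
    """DELETE: remove the resource; 204 with no body."""
    del users[user_id]
    return 204, None
```

A real API would let the framework route each HTTP method to the matching handler; the point here is only which operation and status code each method implies.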

REST – Representational State Transfer

Today most APIs are designed to conform to REST design principles, the most prominent architectural style used to design APIs today. REST is not a protocol, but rather a design philosophy that builds upon the principles of HTTP.

The reason for this dominance is the flexibility and freedom it provides for developers over other options such as SOAP or XML-RPC.

That's why APIs designed with REST in mind are often called RESTful APIs.

The only requirement is that the API be written while abiding by these six architectural constraints:

  1. Client-Server Architecture – The client and server applications must be completely independent of each other. The client is only supposed to know the URI where the resource is located and can't interact with the server application in any other form. The server application, too, can't perform any kind of interaction other than providing the requested resource.
  2. Statelessness – The designed API should be stateless, meaning that each request needs to include all the information necessary for processing it; no server-side sessions.
  3. Layered System – In REST APIs, the calls and responses go through different layers. So the API needs to be designed so that neither the client nor the server can tell whether it communicates with the end application or an intermediary.
  4. Cacheability – When possible, resources should be cacheable on the client or server side. This improves performance on the client side while increasing scalability on the server side.
  5. Uniform Interface – All API requests for the same resource should look the same, no matter where the request comes from. So a REST API needs to ensure that the same piece of data, such as the name or email address of a user, belongs to only one Uniform Resource Identifier (URI).
  6. Code on Demand (optional) – REST APIs usually send static resources, but responses can sometimes also contain executable code, in which case the code should only run on demand.

SOAP v/s REST

SOAP, or Simple Object Access Protocol, in contrast to REST, is an XML-based protocol for making network API requests. Although it is most commonly used over HTTP, it aims to be independent of HTTP and avoids using most HTTP features (like HTTP methods).

SOAP comes with a lot of rules: a sprawling and complex multitude of related standards that add various features. A SOAP web service is described using an XML-based language called the Web Services Description Language, or WSDL.

WSDL is not designed to be human readable, and as SOAP messages are often too complex to construct manually, users of SOAP rely heavily on tool support, code generation and IDEs.

That means that even though SOAP and its various extensions offered standardization, interoperability between different vendors' implementations often caused problems. For these reasons, SOAP fell out of favor and REST rose to prominence.

REST's main idea was that for each piece of data the URL should stay the same, but the operation would change depending on which method was used. For example, a GET request to "https://website.com/cart" returns all cart items, but a POST request to the same URL would add an item to the cart.
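That idea can be sketched as a toy dispatch table, standing in for a real web framework's router: the same URL maps to different operations depending on the method. The cart store and handler names here are hypothetical:

```python
# In-memory cart; one URL, two operations selected by HTTP method.
cart = []


def get_cart():
    """GET /cart: return all cart items."""
    return cart


def add_to_cart(item):
    """POST /cart: add an item to the cart."""
    cart.append(item)
    return item


# (method, path) -> handler, the essence of REST-style routing.
routes = {
    ("GET", "/cart"): get_cart,
    ("POST", "/cart"): add_to_cart,
}


def handle(method, path, *args):
    return routes[(method, path)](*args)
```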

REST also offers a greater variety of data formats rather than sticking to just XML; most APIs default to JSON, which offers better support for browser clients and faster parsing. This means superior performance, particularly through caching of information that's not altered and not dynamic.

REST services are often described using a definition format such as OpenAPI. Let's save that topic for some other day.

Hope you learned something.

This is Anurag Dhadse, signing off.

]]>
<![CDATA[Deliberately Overfitting your model]]>https://adhadse.com/deliberately-overfitting-your-model/62e8b48a85e4100f9fa18809Sun, 07 Aug 2022 14:24:14 GMT

Remember those days of your life as an amateur ML enthusiast, celebrating when you trained your model on a toy dataset and received 100% accuracy!

Then you were introduced to the concept of Overfitting.

The problem occurs when a model starts to memorize the training data instead of learning patterns that generalize to new data. What you wanted was a generalized concept within the model, but you got a rote-learned model.

But it's not always that bad. Sometimes you do intentionally want your model to overfit.

Let's learn when you want to forget about the concept of generalization and accept the fate of rote learning.


The goal of almost all use case scenarios of machine learning is to generalize and learn the overall correlation of features with the label. If our model overfits the training data (the training loss keeps decreasing but the validation loss has started to increase) then the model's ability to generalize suffers and we don't get an effective model.

[Figure: Random points and a regression line]

However, consider simulating the behavior of physical or dynamical systems, like those found in climate science, computational biology, or computational finance. These systems are often described by a mathematical function or a set of partial differential equations (PDEs). Although the equations that govern these systems can be formally expressed, they don't have a closed-form solution. An equation is said to have a closed-form solution if it solves a given problem in terms of functions and mathematical operations from a given generally accepted set.

Or in other terms, a closed-form expression is a mathematical expression that uses a finite number of standard operations. It may contain constants, variables, certain well-known operations (e.g., + − × ÷), and functions (e.g., nth root, exponent, logarithm, trigonometric functions, and inverse hyperbolic functions), but usually no limit, differentiation, or integration.

For example, the quadratic equation,

$$ax^2 + bx + c = 0$$

is tractable since its solutions can be expressed as a closed-form expression, i.e. in terms of elementary functions (no limit, differentiation, or integration):

$$x=\frac{-b\pm\sqrt{b^2-4ac}}{2a}$$

So, these dynamical systems instead use classical numerical methods to approximate solutions. Unfortunately, for many real-world applications, these methods can be too slow to be used in practice.
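The contrast can be sketched in a few lines of Python: the quadratic is solved exactly by plugging into the closed-form formula above, while a problem without a closed form must fall back on a numerical method such as bisection (the specific functions here are purely illustrative):

```python
import math


def quadratic_roots(a, b, c):
    """Closed-form solution: plug straight into the quadratic formula."""
    d = math.sqrt(b * b - 4 * a * c)
    return (-b + d) / (2 * a), (-b - d) / (2 * a)


def bisect(f, lo, hi, tol=1e-9):
    """Numerical approximation: halve an interval with a sign change
    until the root is bracketed to within `tol`."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2
```

For x² − 5x + 6 = 0, the closed form gives the roots 3 and 2 instantly; bisection only ever approximates one root per bracketing interval, one halving at a time.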

One such example of useful overfitting is when the entire domain of input data points and solutions is already tabulated and a physical model capable of computing the precise solution is available.

In such situations, ML models need to learn the precisely calculated, non-overlapping lookup table of inputs and outputs. Splitting such a dataset into the usual training-testing-validation split is also unnecessary, since we aren't looking for generalization.


In this scenario, there is no "unseen" data that needs to be generalized, since all possible inputs have been tabulated.

Here, there is some physical phenomenon that you are trying to learn that is governed by an underlying PDE or system of PDEs. Machine Learning merely provides a data-driven approach to approximate the precise solution.

The dynamical systems we are talking about here are sets of equations governed by established laws: there is no unobserved variable, no noise, and no statistical variability. For a given set of inputs, there is only one precisely calculated output. Also, unlike ML problems that are probabilistic in nature (like predicting the amount of rainwater), there are no overlapping examples in the training dataset. For these reasons, we don't worry about overfitting our model.

You might ask, why not use an actual lookup table instead of using an ml model in these kinds of situations?

The problem is that the training dataset can be too large (terabytes or even petabytes in size). Using an actual lookup table is just not possible in production settings. An ML model will be able to infer the approximate solution in a fraction of the time compared to a lookup table or an actual physics model.

Why does it work?

The usual ML modeling involves training on data points sampled from the population. This sample represents the actual distribution of the data that we want to conceptualize.

When the observation space represents all possible data points, clearly we don't need the model to generalize. We would ideally want the model to learn as many data points as possible with no training error.

Deep learning approaches to solving differential equations or complex dynamical systems aim to represent a function defined implicitly by a differential equation, or system of equations, using a neural network.

Overfitting becomes useful when these two conditions are met:

  • There is no noise, so the labels are accurate for all instances.
  • You have the complete dataset at your disposal; overfitting then amounts to interpolating the dataset.
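A tiny sketch of what "overfitting as interpolation" means under those two conditions: given a noise-free, fully tabulated function, a fit that passes through every tabulated point has zero training error by construction. The table here uses y = x² purely for illustration:

```python
# A noise-free, complete "lookup table" of inputs and labels.
table_x = [0.0, 1.0, 2.0, 3.0, 4.0]
table_y = [x * x for x in table_x]  # exact labels: y = x^2


def interpolate(x):
    """Piecewise-linear fit that passes through every tabulated point,
    i.e. a model that 'overfits' the table perfectly."""
    for i in range(len(table_x) - 1):
        if table_x[i] <= x <= table_x[i + 1]:
            t = (x - table_x[i]) / (table_x[i + 1] - table_x[i])
            return (1 - t) * table_y[i] + t * table_y[i + 1]
    raise ValueError("x outside tabulated domain")
```

At every tabulated input the fit reproduces the label exactly; between tabulated points it returns a fast approximation, which is exactly the role the ML model plays for the physics lookup table.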

Alternatives and Use cases

Interpolation and chaos theory

The machine learning model we are trying to build here is essentially an approximation of a lookup table of inputs and outputs via interpolation of the given dataset. If the lookup table is small, just use the lookup table; there is no need to approximate it with a machine learning model.

Such interpolation works only if the underlying system is not chaotic. In chaotic systems, even though the system is deterministic, small differences in initial conditions can lead to drastically different outcomes.

In practice, however, each specific chaotic phenomenon has a specific resolution threshold beyond which it is possible for models to forecast it over a short period of time.

So, as long as the lookup table is fine-grained enough and the limits of resolvability are understood, useful approximations via ml techniques are possible.

Distilling the knowledge of a neural network

Another use case where overfitting comes in useful is knowledge distillation: a large machine learning model's computational complexity and learning capacity might not be fully utilized, while smaller models that have enough capacity to represent the knowledge may lack the capacity to learn it efficiently from the raw data.

In such cases, the solution is to train the smaller model on a large amount of generated data that is labeled by the larger model. The smaller model learns the soft output of the larger model, instead of actual hard labels on real data. This is similar to the above discussion, where we are trying to approximate the numerical function of the larger model to match the predictions.

The second training step of training the smaller model can employ useful overfitting.

Overfitting a batch

In the deep learning area, it is often preached to start with a model complex enough to learn the dataset, one which has the ability to overfit. To make such a large model generalize, we then employ regularization techniques such as data augmentation, dropout, etc., to avoid overfitting.

A complex enough model should be able to overfit on a small enough batch of data, assuming everything is set up correctly. If you are not able to overfit a small batch with any model, it's worth rechecking model code, input and preprocessing pipeline, and loss function for any errors or bugs. This serves as a little checkbox when starting the modeling experimentation.

In Keras, you can use an instance of tf.data.Dataset to pull a single batch of data and try overfitting it:

BATCH_SIZE = 256

# pull a single batch out of the training dataset
single_batch = train_ds.batch(BATCH_SIZE).take(1)

# repeat() the single batch indefinitely; set steps_per_epoch
# (elided here) so that each epoch still terminates
model.fit(single_batch.repeat(),
          validation_data=valid_ds,
          ...)

Note that we apply repeat() so that we won't run out of data when training on that single batch.

That's all for today.

This is Anurag Dhadse, signing off.

]]>
<![CDATA[Active Learning. An alternative to Lengthy Data Labeling process.]]>https://adhadse.com/active-learning-an-alternative-to-lengthy-data-labeling/62de1d9685e4100f9fa18584Sat, 30 Jul 2022 01:15:00 GMT

Suppose you are a data scientist who has solved many data problems. Many of them were probably solved by creating models from readily, freely available data, or maybe data paid for to the organization that owned it.

What if you come across a problem where the data (and labels) are not available? Depending on the specific business requirement, you will have a talk with the data engineering team and set up data acquisition and data labeling.

And often this process becomes expensive depending on the domain (medical/industrial) and the amount of data that is required to be labeled.

For a few suitable scenarios, you can escape labeling your entire dataset and get away with labeling a few specific examples and propagating them to the entire dataset. This saves money and time.

That's what Active Learning is.

Active learning is the process of prioritising the data which needs to be labelled in order to have the highest impact on training a supervised model.


But you may ask, why specific examples? Why not choose to label a random sample from the acquired data?

The problem lies with the quality of the labeled data.

Machine learning programs are decidedly effective at spotting patterns, associations, and rare occurrences in a pool of data. With a randomly sampled labeled set, the quality suffers and it becomes much harder for ML models to learn these complex patterns.

What we want the model to do is grab the essential complex properties of the dataset, just enough that the performance becomes modest. Our goal should be to create a dataset that includes the variations in each of our classes.

We then get predictions from this modest model to label a much larger dataset, training on which will give us an even more reliable and performant model.

So, for example, if we have 10,000 new data points containing examples of 10 different classes, and we can label only 1,000 of them, that's the budget (assuming that number is enough to create a modestly performing model). We'll create a labeled dataset of 1,000 data points with the same number of examples for each class, representing each variation possible in that class. We'll then label additional data after evaluating the resulting model.

Let's go over a few common techniques of Active Learning.

All active learning techniques rely on us leveraging some number of examples with ground-truth, accurate labels. Where they differ is in the way they use these accurately labeled data points to identify and label unknown data points.


Pool-Based Sampling

Pool-based sampling is probably the most common technique in active learning, despite being memory intensive.

In pool-based sampling, we identify the "information usefulness" of all given training examples and select the top N examples for training our model.

So, for example, if we already have 1,000 perfectly labelled data points, we can train on 800 labeled examples and validate on the remaining 200. The resulting model will enable us to identify which of the remaining 9,000 unlabeled examples are going to be most helpful in improving performance. These will be the examples with the lowest prediction confidence.

We'll select the top N of these lowest-confidence examples and label them.

These N new "important" data points, along with our previous 1,000, will pave the path to a much more effective model.

[Figure: Pool-based sampling]
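The selection step can be sketched in a few lines, assuming we already have some scoring function returning the model's confidence for each example (`predict_confidence` here is a hypothetical stand-in for, e.g., the model's most-confident-class probability):

```python
def select_for_labeling(unlabeled_pool, predict_confidence, n):
    """Pool-based sampling: rank the whole unlabeled pool by the
    model's confidence and pick the n least confident examples,
    i.e. the ones labeling will help the most."""
    scored = sorted(unlabeled_pool, key=predict_confidence)
    return scored[:n]
```

Note that this is what makes the technique memory intensive: the entire pool has to be scored and held for ranking before any example is sent to the labeler.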

Stream-Based Selective Sampling

This is another active learning technique, and one that is intuitive to understand.

In this technique, as the model is training, the active learning system determines whether to query for the perfect ground-truth label or to assign the model-predicted label, based on some threshold set by us.

Unlike pool-based sampling, it's not memory intensive, but it is an exhaustive search, since each example needs to be examined one by one. This can easily exhaust our 1,000-label budget if the model doesn't encounter enough "important" examples soon enough and keeps querying for true labels.

Let's again take an example. We have a moderately performing model trained using 1,000 perfectly labeled data points, and we want to improve on it. For that, we consider labeling 1,000 more examples using stream-based selective sampling. We go through the remaining 9,000 examples in our dataset one by one and ask the model to evaluate each. If the confidence is lower than the set threshold, we ask the labeller to assign a label to it; otherwise we keep the prediction output as the generated label.

[Figure: Stream-based selective sampling]

Evidently, we'll not go through all 9,000 examples, since our labeling budget is capped at 1,000. So the resulting dataset might miss relatively important data points compared to pool-based sampling.
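The stream-based loop can be sketched as follows, with `predict` and `ask_human` as hypothetical stand-ins for the model and the human labeller:

```python
def stream_label(stream, predict, threshold, budget, ask_human):
    """Stream-based selective sampling: walk the stream once, querying
    a human label only when the model's confidence falls below the
    threshold, and stop once the labeling budget is spent."""
    labeled = []
    for x in stream:
        label, confidence = predict(x)
        if confidence < threshold:
            if budget == 0:
                break  # budget exhausted; stop examining the stream
            label = ask_human(x)
            budget -= 1
        labeled.append((x, label))
    return labeled
```

The early `break` is exactly the weakness described above: examples arriving after the budget runs out are never examined, however important they may be.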

Membership Query Synthesis

This is an active learning technique wherein we create new training examples based on already available data points. This might sound spurious, but it is actually plausible.

Analyzing the trends in the data and then carefully using regression or GANs can expand our starting training dataset. Data augmentation, another technique often used in balancing datasets and regularizing, can also be used for generating new data points.

This technique is less limiting compared to the above two methods, but it requires careful analysis of the dataset and the variations possible in the examples. If some variations in a subset of examples are missing, then an effective data augmentation technique will be required to fill that gap.

[Figure: Augly, an image augmentation library]

For example, if we are building an image classification model for sea creatures, it is possible that lighting conditions are not so good and images are often captured in low light. We can use brightness augmentation for this purpose. Or, if images are sourced from screenshots in production, we can augment an image to be part of a fake screenshot, and so on.
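As a toy illustration of brightness augmentation (a real pipeline would use a library such as Augly, but the idea is the same), scale each pixel and clip back into the valid range:

```python
def adjust_brightness(image, factor):
    """Scale every pixel of a grayscale image (a list of rows of
    0-255 intensities) by `factor`, clipping back into 0-255."""
    return [[min(255, max(0, int(px * factor))) for px in row]
            for row in image]
```

A factor above 1 brightens (simulating good lighting), below 1 darkens (simulating the low-light conditions we want the model to handle).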

Augly, for example, is a popular library for these kinds of augmentations.

The 1,000-label budget becomes less of a concern with this technique, as no actual human labeler is required. But you can think of the budget being spent whenever we synthesize a new example. And for sure, you are free to exhaust your labeling budget with this technique.


That's all for today.

This is Anurag Dhadse, signing off.

]]>