<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://nish-19.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://nish-19.github.io/" rel="alternate" type="text/html" /><updated>2026-01-19T12:54:13-08:00</updated><id>https://nish-19.github.io/feed.xml</id><title type="html">Nischal Ashok Kumar</title><subtitle>Personal Blog</subtitle><author><name>Nischal Ashok Kumar</name></author><entry><title type="html">Stable Baselines 3 Tutorial (Computerized Adaptive Testing)</title><link href="https://nish-19.github.io/posts/2023/12/blog-post-6/" rel="alternate" type="text/html" title="Stable Baselines 3 Tutorial (Computerized Adaptive Testing)" /><published>2023-12-26T00:00:00-08:00</published><updated>2023-12-26T00:00:00-08:00</updated><id>https://nish-19.github.io/posts/2023/12/blog-post-6</id><content type="html" xml:base="https://nish-19.github.io/posts/2023/12/blog-post-6/"><![CDATA[<p><img src="https://github.com/Nish-19/SB3-tutorial/assets/41947720/cc050964-5e21-43a2-8a14-a13654b68d6d" alt="pic" /></p>

<figcaption style="text-align: center;">Figure 1: Overview of the MDP</figcaption>

<p>The goal of this blog is to present a tutorial on <a href="https://stable-baselines3.readthedocs.io/en/master/">Stable Baselines 3</a>, a popular Reinforcement Learning library, with a focus on implementing a custom environment and a custom policy. We will first describe our problem statement, then discuss the MDP (Markov Decision Process), and finally discuss three algorithms: standard <a href="https://openai.com/research/openai-baselines-ppo">PPO</a>, PPO with a custom feature extractor, and PPO with a custom policy (an LSTM bilinear policy).</p>

<h2 id="problem-statement">Problem Statement</h2>

<p>In the realm of computer science education, evaluating coding homework poses unique challenges that traditional methods struggle to address. Auto-grading systems, which automatically assess correctness and functionality using test cases, offer a valuable solution. However, as the complexity of coding assignments grows, the sheer volume of test cases can hamper system efficiency. Inspired by Computerized Adaptive Testing (CAT), which optimizes assessments based on prior responses, we aim to develop a policy that selects a minimal yet effective set of test cases, ensuring swift and accurate evaluations for both students and instructors.</p>

<h2 id="related-work">Related Work</h2>

<p>In Computerized Adaptive Testing (CAT), there are four essential components:</p>

<ol>
  <li><strong>Knowledge Level Estimator:</strong> Assesses the student’s current knowledge based on their responses to previously selected items.</li>
  <li><strong>Response Model:</strong> Estimates the likelihood of a student answering a specific item correctly using the current knowledge level and item features.</li>
  <li><strong>Pool of Available Items:</strong> A collection of test items from which the adaptive system selects questions.</li>
  <li><strong>Item Selection Algorithm:</strong> Chooses the next most informative item based on the response model output.</li>
</ol>

<p>The widely used Item Response Theory (IRT) model, specifically its simplest form (1PL), is often employed. In its basic form, the 1PL model is defined as:</p>

\[P(Y_{i, j} = 1) = \sigma(\theta_j - b_i)\]

<p>Here, \(\theta_j\) represents the student’s knowledge level, and \(b_i\)  is the difficulty of item \(i\).</p>
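<p>As a quick sketch, the 1PL model above can be written in a few lines of Python (the function name is ours, for illustration only):</p>

```python
import math

def p_correct(theta_j: float, b_i: float) -> float:
    """1PL (Rasch) model: probability that student j answers item i correctly,
    sigma(theta_j - b_i) where sigma is the logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-(theta_j - b_i)))

# A student whose knowledge level equals the item difficulty has a 50% chance;
# the probability rises as theta_j exceeds b_i.
print(p_correct(0.0, 0.0))   # 0.5
print(p_correct(2.0, 0.0))   # higher than 0.5
```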

<h2 id="mdp-formulation">MDP Formulation</h2>

<h3 id="notation">Notation</h3>
<p>We denote the programming question statement as <strong>q</strong>; the (correct) solution code to the question as <strong>a</strong>; student <em>j</em>’s code as <strong>c<sub>j</sub></strong>; student <em>j</em>’s ground truth code quality (knowledge level) as <strong>θ<sub>j</sub></strong>; student <em>j</em>’s code quality estimate after selecting the <em>k</em>-th test case as <strong>θ̂<sub>j, k</sub></strong>; the <em>K</em> test cases as <strong>t<sub>1, 2,…, K</sub></strong> and each test case <em>k</em>’s features as <strong>d<sub>k</sub></strong>; and the ground truth of whether student <em>j</em>’s code passes test case <em>k</em> as <strong>Y<sub>k, j</sub></strong>.</p>

<h3 id="mdp-setting">MDP Setting</h3>
<p>In our MDP setting, the environment contains two models: 1. a Large Language Model (LLM<sub>e</sub>) and 2. an IRT-based model (IRT). The policy also contains two models: 1. a Large Language Model (LLM<sub>p</sub>) and 2. a time-distributed Long Short-Term Memory model (LSTM). During training, we use the same frozen LLM for LLM<sub>e</sub> and LLM<sub>p</sub> for simplicity. Figure 1 shows our MDP.</p>

<h3 id="state-definition">State Definition</h3>
<p>We represent our state as the LLM embeddings of the question statement, the solution code, and the test cases recommended so far minus the embeddings of the question statement, the student code, and the test cases recommended so far, <strong>S<sub>k</sub> = LLM<sub>e</sub>(q, a, t<sub>1..k</sub>) - LLM<sub>e</sub>(q, c<sub>j</sub>, t<sub>1..k</sub>)</strong>.</p>

<h3 id="action-definition">Action Definition</h3>
<p>Initially, we obtain a matrix <strong>Q</strong> containing the embeddings of all the test cases <strong>t<sub>1, 2,…,K</sub></strong> with LLM<sub>p</sub>, which is used for selecting the action. At every time step <em>k</em>, we first calculate the current hidden state <strong>h<sub>k</sub></strong> using the LSTM with the state <strong>S<sub>k</sub></strong> as the input. Next, we project <strong>h<sub>k</sub></strong> to the space of test cases <strong>Q</strong> using a bilinear projection matrix <strong>W</strong> to obtain the logits <strong>L’</strong>. Finally, we apply the softmax function to the logits <strong>L’</strong> to obtain a distribution over the test cases and choose the test case with the highest probability, which is <strong>A<sub>k</sub></strong>.</p>

<ul>
  <li><strong>Q</strong> = {LLM<sub>p</sub>(t<sub>k</sub>)}<sub>k ∈ {1..K}</sub></li>
  <li><strong>h<sub>k</sub></strong> = LSTM(<strong>S<sub>k</sub></strong>)</li>
  <li><strong>L’</strong> = <strong>h<sub>k</sub>WQ<sup>T</sup></strong></li>
  <li><strong>A<sub>k</sub></strong> = Argmax(<strong>L’</strong>)</li>
</ul>
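<p>The steps above can be sketched in PyTorch as follows. All dimensions and tensors here are illustrative stand-ins (random data in place of the real LLM embeddings), not our actual trained models:</p>

```python
import torch
import torch.nn as nn

emb_dim, hid_dim, K = 64, 32, 10          # assumed sizes: LLM embedding, LSTM hidden, number of test cases

lstm = nn.LSTM(input_size=emb_dim, hidden_size=hid_dim, batch_first=True)
W = nn.Parameter(torch.randn(hid_dim, emb_dim))   # learnable bilinear projection matrix

Q = torch.randn(K, emb_dim)               # stand-in for Q = {LLM_p(t_k)}, one row per test case
S = torch.randn(1, 5, emb_dim)            # stand-in for states S_1..S_5 of one episode

out, _ = lstm(S)                          # hidden states h_1..h_5
h_k = out[:, -1, :]                       # current hidden state h_k, shape (1, hid_dim)

logits = h_k @ W @ Q.T                    # L' = h_k W Q^T, shape (1, K)
probs = torch.softmax(logits, dim=-1)     # distribution over the K test cases
A_k = probs.argmax(dim=-1)                # A_k = argmax over test cases
```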

<h3 id="reward-definition">Reward Definition</h3>
<p>Initially, we apply the IRT model to learn the features of all test cases and all students’ ground truth code qualities by maximizing <strong>P(Y<sub>k, j</sub> | θ<sub>j</sub>, d<sub>k</sub>)</strong>, which are used to calculate the reward.</p>

<p>If the selected test case <em>k</em> has not been selected before, we use all the selected test cases to update student <em>j</em>’s current code quality estimate <strong>θ̂<sub>j, k</sub></strong> by maximizing <strong>P(Y<sub>k, j</sub> | θ̂<sub>j, k</sub>, d<sub>k</sub>)</strong>. We then use <strong>1 / |θ<sub>j</sub> - θ̂<sub>j, k</sub>|</strong> as the reward <strong>R<sub>k</sub></strong>, which measures how close student <em>j</em>’s current code quality estimate is to the corresponding ground truth code quality: the smaller the difference, the bigger the reward. Otherwise, <strong>R<sub>k</sub></strong> = -10000; this large negative reward discourages the agent from selecting the same test case multiple times. The reward satisfies the Markov property because it depends only on the current state and action, since the current state already encodes all the previously selected test cases.</p>
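<p>A minimal sketch of this reward in Python (the <code class="language-plaintext highlighter-rouge">eps</code> term is our addition, to avoid division by zero when the estimate happens to be exact):</p>

```python
def reward(theta_j, theta_hat_jk, selected, k, eps=1e-8):
    """Reward R_k: penalize repeated selections, otherwise reward
    closeness of the ability estimate to the ground truth."""
    if k in selected:                      # test case k was already selected
        return -10000.0
    # eps guards against division by zero (our addition for numerical safety)
    return 1.0 / (abs(theta_j - theta_hat_jk) + eps)

print(reward(1.0, 0.5, set(), 3))   # close to 1 / 0.5 = 2.0
print(reward(1.0, 0.5, {3}, 3))     # -10000.0, repeat penalty
```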

<h3 id="mdp-definition">MDP definition</h3>
<p>In summary, our MDP is defined as follows:</p>

<ul>
  <li>S: LLM<sub>e</sub>(<strong>q, a, t<sub>1..k</sub></strong>) - LLM<sub>e</sub>(<strong>q, c<sub>j</sub>, t<sub>1..k</sub></strong>)</li>
  <li>A: the set of all test cases - {t<sub>1</sub>..t<sub>K</sub>}</li>
  <li>P: <strong>P(LLM<sub>e</sub>(q, a, t<sub>1..k+1</sub>) | LLM<sub>e</sub>(q, a, t<sub>1..k</sub>), A<sub>k</sub>)</strong> = 1.0</li>
  <li>R:
    <ul>
      <li>-10000 if t<sub>k</sub> in {t<sub>1</sub>..t<sub>k-1</sub>}</li>
      <li><strong>1 / |θ<sub>j</sub> - θ̂<sub>j, k</sub>|</strong> otherwise</li>
    </ul>
  </li>
  <li><strong>d<sub>0</sub></strong>: LLM<sub>e</sub>(<strong>q, a</strong>) - LLM<sub>e</sub>(<strong>q, c<sub>j</sub></strong>)</li>
  <li>γ ∈ [0, 1), i.e., any nonnegative value smaller than 1.</li>
</ul>

<h2 id="implementation">Implementation</h2>

<h3 id="mdp">MDP</h3>
<p>We implement our custom environment using Gymnasium (the maintained successor of OpenAI Gym) and Stable Baselines 3. We need to override methods like <code class="language-plaintext highlighter-rouge">reset</code> and <code class="language-plaintext highlighter-rouge">step</code>.</p>

<h3 id="rl-algorithms">RL Algorithms</h3>

<p>We implement three algorithms: standard <a href="https://openai.com/research/openai-baselines-ppo">PPO</a>, PPO with a custom feature extractor, and PPO with a custom policy (an LSTM bilinear policy).</p>

<ol>
  <li>PPO - We use the standard implementation of PPO using Stable Baselines 3.</li>
  <li>Custom Feature Extractor - Instead of using the standard feature extractor, we use a custom LSTM network to embed the state representations.</li>
  <li>Custom Policy (LSTM Bilinear) - We modify the policy to include the test case embeddings and a learnable bilinear layer that projects from the feature extraction (LSTM) space to the test case embedding space.</li>
</ol>

<p>The code and its explanation can be found in my <a href="https://github.com/Nish-19/SB3-tutorial/tree/main">Github Repository</a>. Thanks to <a href="https://overbridge-wanyong.github.io/">Wanyong Feng</a> for collaborating with me on this project.</p>]]></content><author><name>Nischal Ashok Kumar</name></author><category term="Reinforcement Learning" /><category term="Stable Baselines 3" /><category term="Computerized Adaptive Testing" /><category term="Educational Data Mining" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Applying to CS Ph.D. Programs</title><link href="https://nish-19.github.io/posts/2022/08/blog-post-5/" rel="alternate" type="text/html" title="Applying to CS Ph.D. Programs" /><published>2022-08-10T00:00:00-07:00</published><updated>2022-08-10T00:00:00-07:00</updated><id>https://nish-19.github.io/posts/2022/08/blog-post-5</id><content type="html" xml:base="https://nish-19.github.io/posts/2022/08/blog-post-5/"><![CDATA[<p><img src="https://user-images.githubusercontent.com/41947720/183966446-34c9bba2-9302-480f-9841-62ff6ead4c95.png" alt="graduate_image" /></p>

<p>Computer Science is one of the fastest-growing fields of study. Everyone has a different motivation for applying to graduate school, from progressing in their careers, learning advanced topics, and conducting research to exploring new places and opportunities. If you have decided to apply for CS Ph.D. programs, then you are in the right place. In this blog, we will talk about the process of application, tips to ace your application, and finally useful resources.</p>

<p><em>Disclaimer: The content of this blog reflects my opinion and may not be directly applicable to everyone. This is because people from diverse backgrounds and different stages of life apply to Ph.D. programs. I applied as a final year undergraduate CS student from India. My research area is ML and NLP and I applied to schools particularly in the United States.</em></p>

<h2 id="basic-pre-requisites">Basic Pre-requisites</h2>

<p>Here we will discuss the basic pre-requisites that one is expected to have accomplished for being eligible to apply for a Ph.D.</p>

<ol>
  <li><strong>Standardized tests</strong>: Different programs across the world may prescribe a different set of tests as an eligibility criterion for applying. However, from the perspective of the US, the GRE and TOEFL are considered the most popular.
    <ul>
      <li><strong>GRE</strong>: The GRE typically contains three parts: verbal, quantitative, and AWA (the essay writing section). The verbal and quantitative parts are each scored out of 170 points, and the AWA section is scored out of 6 points. Most schools in the US do not require the GRE for Ph.D. programs; however, it is mostly mandatory for MS programs.</li>
      <li><strong>English Proficiency Test (TOEFL/IELTS)</strong>: For most international students, an English proficiency test is required to prove that they can sustain academic life in the US. I took the TOEFL, and I believe that it is more popular than the IELTS. The TOEFL has four sections (reading, listening, speaking, and writing), each scored out of 30 points. Most universities set a cut-off, typically around 100, on the combined score to be eligible for applying. Several universities also set threshold scores for the sub-sections (for example, UMass prescribes 26 in the speaking section).</li>
      <li>The standardized tests are merely used as a filtering criterion for Ph.D. applicants and do not play a major role in the application unless a candidate performs very poorly on them.</li>
    </ul>
  </li>
  <li>
    <p><strong>Undergrad Academic Records</strong>: Most universities take the undergraduate CGPA seriously while evaluating a candidate. Universities typically ask students to provide both their overall CGPA and their major CGPA, along with their class standings. Having a strong academic record (for example, top 5/10 in the class) can increase the chances of getting into the program, especially for students applying directly after their bachelor’s. For students applying for Ph.D. positions after their master’s, the Master’s CGPA is given more weight than the undergrad CGPA. In all, if you are reading this blog in the early stages of your undergraduate studies, it is recommended to maintain a great academic record.</p>
  </li>
  <li><strong>Research Experience</strong>: Research experience is the most important part of evaluating Ph.D. candidates. A Ph.D. is a research degree, and CS being a very competitive field, universities expect you to have a good amount of research exposure before applying. Research experience can be gained either by working in academic settings under professors or by working in industrial research laboratories. Your experience will shape the kind of work you do in your graduate studies and help the university match your profile with the professors working there. Having first-author publications in reputed conferences/journals is a huge plus point in the application. Strong research experience can also offset deficiencies elsewhere in the application.</li>
</ol>

<h2 id="process">Process</h2>

<p>Having known the basic pre-requisites for applying, let us look into the process of application.</p>

<ol>
  <li><strong>Networking</strong>: Contacting professors and Ph.D. students - You must be wondering why a separate section is needed on this. Academia is a small community, and most researchers working in a domain know each other. So, how is this going to help you? Having an idea of where to apply and under which professor is very important. Basically, as a Ph.D. applicant, you are applying to a school to work under a particular set of 3-4 professors in your area of interest. It is very important to know whether the professors you are aiming to work with are hiring students this year. If yes, what has their hiring pattern been over the past few years: do they prefer students straight out of undergrad or more experienced candidates out of residency programs? All of this can be learned by emailing either the professor or their Ph.D. students. Most professors who are looking for graduate students will respond to your email indicating your chances of getting into their program and may even set up an interview to get to know you better. So, what are the ways of networking?
    <ul>
      <li>Networking through conferences: Have you recently published a paper at a conference that is also attended by other members of the community? If so, make it a point to introduce yourself to your target professors and their students. Professors give huge importance to students who have already published in the venues where they generally publish. If you haven’t yet published at that conference but still have good research experience, you can work as a volunteer in the conference and get to talk to the grad students/ professors.</li>
      <li>Networking through mutual connections in the academia: If your current guide/ manager already knows professors in your area, then you can request them to introduce you to their connections who are hiring Ph.D. students. Again, this method is going to be very effective as direct recommendations within academia are considered very seriously.</li>
      <li>Networking on social media sites: Lastly, if none of the above points apply to you, you can still reach out to grad students on social media platforms like LinkedIn and Twitter with your questions. Most people would be happy to help.</li>
      <li>It is very important to know about the professor under whom you will be applying. Sometimes the professor of your choice may not be hiring graduate students for that year and you mentioning their name in the application may not help much.</li>
    </ul>
  </li>
  <li>
    <p><strong>SOP</strong>: The Statement of Purpose is another important part of the application. Typically, universities allow students to write an SOP of up to two pages with a font size of 12pt. The SOP is your chance to talk to the admissions committee. You must elaborate on your motivation for your area of research. You must talk in detail about your past research experiences and how they have shaped your decision to pursue a Ph.D. It is fine for a Ph.D. SOP to have details and technical jargon - research-dense SOPs are generally given preference. It is also very important to describe how your current research and plans align with the professors working at the university and with the academic community in general.</p>
  </li>
  <li>
    <p><strong>LOR</strong>: Letters of Recommendation are one of the most important parts of the application. This is the part that you have the least control over. Most universities ask for a minimum of three LORs. For PhDs, it is recommended to have at least two LORs from academia. The third can be from the industry (preferably from someone in a research position). The LOR writers vouch for your ability as a researcher, mention the challenges you overcame while working with them and your soft skills. Strong LORs from your advisors highly increase your chances of consideration. In general, the academic standings of the LOR writers are considered seriously while evaluating their recommendations. Getting a strong letter from someone well-known in the community gives a huge boost to the application. However, it is generally recommended to go with recommenders who know you well and can give you a strong recommendation as compared to asking for a LOR from a well-known person who does not know you much.</p>
  </li>
  <li><strong>Interviews</strong>: Most professors interview potential students after reviewing their applications. The interviews typically happen after the application deadline has passed, i.e., during January. Based on my experience, interviews are mostly an informal conversation about your research. They allow professors to gauge your research aptitude and understand whether you are a genuine candidate. It is important to do background study on the professor and know about their current projects and publications. This will help you connect your projects with theirs, which shows greater research fit and hence higher chances of acceptance.</li>
</ol>

<h2 id="general-tips">General Tips</h2>

<p>Now, we will see a few tips for acing the application process.</p>

<ol>
  <li>
    <p><strong>Start early</strong>: University portals start accepting applications as early as September, while the deadline is typically mid-December. Do not leave anything for the last moment. Start networking about six months before the application season, i.e., at the beginning of the year. Utilize the summer to finish any mandatory tests (GRE/TOEFL) and prepare the first draft of your SOP. Also, let your letter writers know in advance (during the summer) of your plan to apply for a Ph.D. This will have them mentally prepared to give you recommendations on time.</p>
  </li>
  <li>
    <p><strong>Be organized</strong>: Maintain an Excel sheet containing the details of the programs you are applying to, along with other information like the deadlines and the requirements. Also, keep a note of two-three professors from each school with whom you have already networked/with whom you would like to work. This will help you in writing the “fit” part of your SOP, where you describe why you are the right fit for the school.</p>
  </li>
  <li>
    <p><strong>Get SOPs reviewed by seniors</strong>: Graduate students with whom you have contact/ your advisors will be happy to review your SOP and provide suggestions. This can help you correct mistakes that went unnoticed.</p>
  </li>
  <li>
    <p><strong>Help your recommenders</strong>: Some letter writers may ask for a short draft or a bullet-wise summary of your work with them. They do this generally to save time and to make sure they include every good point about you. Apart from this, send your letter writers an Excel sheet listing the programs you are applying to along with the deadlines. Also, make sure to remind your recommenders to submit the recommendation as the deadline approaches.</p>
  </li>
  <li>
    <p><strong>Participate in the Pre-application review program</strong>: Some universities like UMass, Brown, and Columbia have a separate program which pairs applicants with current Ph.D. students. The current students provide valuable advice about different parts of the application and also the school in general. Do check them out!</p>
  </li>
</ol>

<h2 id="general-resources">General Resources</h2>

<p>Here are a few resources (blogs/ videos) that I found personally useful while applying:</p>

<ol>
  <li>Repository of advice:
    <ul>
      <li><a href="https://github.com/shaily99/advice">Exhaustive set of advice procured from various sources by Shaily Bhatt</a></li>
      <li><a href="https://martiansideofthemoon.github.io/2018/05/29/grad-resources.html">Kalpesh Krishna’s blog</a></li>
      <li><a href="https://martiansideofthemoon.github.io/2019/10/24/nlp-phd-survey.html">Student perspectives on applying to NLP programs</a></li>
      <li><a href="https://pg.ucsd.edu/PhD-application-tips.htm">5-minute Guide to Ph.D. applications (Philip Guo)</a></li>
      <li><a href="https://krrish94.github.io/blog/">Krishna Murthy’s set of blogs</a></li>
    </ul>
  </li>
  <li>Standardized Tests:
    <ul>
      <li><a href="https://martiansideofthemoon.github.io/2017/12/07/gre-toefl-preparation.html">Kalpesh Krishna’s GRE Guide</a></li>
      <li><a href="https://martiansideofthemoon.github.io/2017/12/07/gre-toefl-preparation-2.html">Kalpesh Krishna’s TOEFL Guide</a></li>
      <li><a href="https://www.gregmat.com/">GregMAT service</a></li>
    </ul>
  </li>
  <li>SOP:
    <ul>
      <li><a href="https://krrish94.github.io/blog/2020/gradschool-sop/">Krishna Murthy’s advice</a></li>
      <li><a href="https://blog.nelsonliu.me/2020/11/11/phd-personal-statement/">Nelson Liu’s Ph.D. SOP</a></li>
      <li><a href="https://nishanthjkumar.com/misc_files/MIT_SoP.pdf">Nishant J. Kumar’s SOP</a></li>
    </ul>
  </li>
  <li>LOR:
    <ul>
      <li><a href="https://krrish94.github.io/blog/2020/gradschool-letters/">Krishna Murthy’s advice</a></li>
      <li><a href="https://cs.brown.edu/~sk/Memos/Grad-School-Recos/">Prof. Shriram’s advice</a></li>
    </ul>
  </li>
  <li>General Videos/ Podcasts:
    <ul>
      <li><a href="https://www.youtube.com/watch?v=o8gnk0pWLhA">UMass Ph.D. application support program</a></li>
      <li><a href="https://soundcloud.com/nlp-highlights/133-phd-application-series-preparing-application-materials-with-nathan-schneider-and-roma-patel">Podcast - Roma Patel and Nathan Schneider</a></li>
      <li><a href="https://twitter.com/jbhuang0604/status/1446981455683407873">Pre-Application Review Programs List</a></li>
    </ul>
  </li>
  <li>After Getting Admits:
    <ul>
      <li><a href="https://www.cs.columbia.edu/wp-content/uploads/2019/03/Get-Advisor.pdf">Questions to ask your advisor on visit days</a></li>
    </ul>
  </li>
</ol>

<h2 id="summary">Summary</h2>

<p>Applying for a Ph.D. program can be a daunting task. From deciding which schools to apply for to managing the SOPs and LORs is surely a stressful process. Having said that, approaching systematically by preparing well before the application can help ease the tension and guide you in the right direction. Hope the blog’s content and the resources linked here are useful to you. All the best for your application!</p>]]></content><author><name>Nischal Ashok Kumar</name></author><category term="Graduate School" /><category term="Ph.D. Application Advice" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">BERTology Transfer Learning in Natural Language Processing</title><link href="https://nish-19.github.io/posts/2021/02/blog-post-4/" rel="alternate" type="text/html" title="BERTology Transfer Learning in Natural Language Processing" /><published>2021-02-14T00:00:00-08:00</published><updated>2021-02-14T00:00:00-08:00</updated><id>https://nish-19.github.io/posts/2021/02/blog-post-4</id><content type="html" xml:base="https://nish-19.github.io/posts/2021/02/blog-post-4/"><![CDATA[<table>
  <thead>
    <tr>
      <th style="text-align: center"><img src="https://user-images.githubusercontent.com/41947720/107882848-9c291f80-6f11-11eb-8dc1-f33ea0e2f428.png" alt="initial_googleai" /></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: center"><em>BERT: Google AI</em></td>
    </tr>
  </tbody>
</table>

<p>Representation Learning is of prime importance in machine learning tasks; it is ultimately the input data representations that play a significant role in determining the model’s performance. There are scenarios where learning representations of the input data requires transferring knowledge from related but different tasks. This is where transfer learning comes in. Transfer Learning helps us utilize a more generalizable set of features and fine-tune them for the downstream task. To know more about transfer learning, please go through my blog <a href="https://nish-19.github.io/Transfer-Learning/">linked here</a>.</p>

<p>In my recent NLP work, I have been using BERT and related models for downstream tasks like text classification on domain-specific data. I am writing this blog to introduce you to the recent boom in transfer learning in natural language processing with the advent of architectures like BERT.</p>

<p>I will focus on the description of BERT and talk about using it in your work with Hugging Face.</p>

<h2 id="motivation">Motivation</h2>

<table>
  <thead>
    <tr>
      <th style="text-align: center"><img src="https://user-images.githubusercontent.com/41947720/107882859-aba86880-6f11-11eb-9bc2-182fcda289f3.png" alt="transformers" /></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: center"><em>Transformers: Image from Jay Ammar</em></td>
    </tr>
  </tbody>
</table>

<p>The 2017 Google paper <a href="https://arxiv.org/abs/1706.03762">“Attention Is All You Need”</a> brought about a revolution in the field of deep learning for natural language processing by performing better than traditional recurrent neural network architectures like LSTMs and GRUs. In general, transformers handle long-term dependencies in text better than LSTMs, which led to them performing significantly better than LSTM-based architectures on tasks like Machine Translation.</p>

<p>Transformers are the reason for the birth of BERT and the series of other architectures that follow. Motivated by the developments in Computer Vision, NLP researchers were eyeing to create a model capable of generating better input text representations that could be used directly for downstream tasks.</p>

<p>Using the decoder layers of the transformer, we can train it for language modeling, i.e., the next-word prediction task. We can train the stack of transformer decoders on a massive corpus of data and allow it to learn the “context” of the language. This trained transformer decoder stack can then be used to obtain better representations of the input data for downstream tasks like text classification.</p>

<h2 id="entry-of-bert">Entry of BERT</h2>

<table>
  <thead>
    <tr>
      <th style="text-align: center"><img src="https://user-images.githubusercontent.com/41947720/107882867-ba8f1b00-6f11-11eb-977e-d70bf217df0f.png" alt="bert_pretraining" /></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: center"><em>BERT Pre-Training and Fine-Tuning Procedure</em></td>
    </tr>
  </tbody>
</table>

<p>Well, everything is fine until now. But there is a catch. The transformer decoder architecture mentioned above doesn’t learn “bi-directional” contextualized embeddings, making it less sophisticated than ELMo (a bi-directional LSTM-based architecture for generating dynamic contextualized embeddings).</p>

<p>Well, BERT said why not combine the idea of both ELMo and Transformers?</p>

<p>BERT stands for Bidirectional Encoder Representations from Transformers.
It uses the transformer encoder and trains a “masked language model”, randomly masking 15% of its input tokens. The task is to predict these masked tokens correctly.</p>

<p>BERT introduces another task in pre-training – “next sentence prediction” – for incorporating sentence-level knowledge.
The task here is to determine whether the second sentence follows the first one. For doing so, BERT introduces its special way of tokenization. BERT’s first token is the [CLS] token, which contains information for the next sentence classification task. Apart from this, the two sentences are separated by a special [SEP] token.</p>

<p>In addition to the single-sentence classification task and single sentence tagging task, this additional pre-training mechanism for next sentence prediction allows BERT to solve various problems. These include sentence pair classification tasks like Natural Language Inference tasks and question answering tasks.</p>

<table>
  <thead>
    <tr>
      <th style="text-align: center"><img width="690" alt="bert_types" src="https://user-images.githubusercontent.com/41947720/107882874-c7137380-6f11-11eb-992e-07464561673f.png" /></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: center"><em>BERT Paper showing different tasks</em></td>
    </tr>
  </tbody>
</table>

<h2 id="using-bert">USING BERT</h2>

<p>If you have been wondering when you would get to learn how to use BERT in your own project, then thanks for hanging on until this stage. Here we will talk about two popular ways of using BERT:</p>

<ol>
  <li>For Generating Contextualized Embeddings</li>
  <li>For Fine-Tuning.</li>
</ol>

<p>Before diving into it, let us set up our environment for running BERT.</p>

<p>We will use both PyTorch and TensorFlow 2.0 with Hugging Face for running BERT.
Hugging Face is an NLP startup that releases open-source, state-of-the-art NLP models that can be used by anyone.</p>

<table>
  <thead>
    <tr>
      <th style="text-align: center"><img src="https://user-images.githubusercontent.com/41947720/107882891-d692bc80-6f11-11eb-807d-ebc50d6d8bd4.png" alt="huggingface" /></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: center"><em>Hugging Face</em></td>
    </tr>
  </tbody>
</table>

<p>These models run with the support of deep learning libraries like PyTorch or TensorFlow 2.0.</p>

<p>Having covered this, let us install the transformers module, which contains the actual implementation of BERT and related models: <code>pip install transformers</code></p>

<p>For installing PyTorch -&gt; <code>pip install torch</code> (note that the package name is <code>torch</code>, not <code>pytorch</code>)</p>

<p>For installing TensorFlow 2.0 -&gt; <code>pip install tensorflow</code></p>

<p>(You could use <code>conda install</code> as well, depending on your virtual environment. Note: using Anaconda is highly recommended.)</p>

<p>Here, I will highlight the main steps involved in running BERT; the full code is made available on GitHub, along with a dataset to experiment on, here.</p>

<h3 id="tokenization">TOKENIZATION:</h3>

<table>
  <thead>
    <tr>
      <th style="text-align: center"><img src="https://user-images.githubusercontent.com/41947720/107882900-e3171500-6f11-11eb-8d87-909addd2da98.png" alt="tokenization" /></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: center"><em>Tokenization Procedure in BERT</em></td>
    </tr>
  </tbody>
</table>

<p>After data pre-processing, tokenization is the first step of all natural language processing tasks. We break the sentences into words/sub-word chunks and use these to generate features that can be used to train the model.</p>

<p>The Hugging Face Transformers library comes with its own tokenizer, which varies according to the BERT model you are using.</p>

<p>The entire process of tokenization can be further divided into three parts: -</p>

<ol>
  <li>Converting the sentence/ sentence pair into numerical tokens.</li>
  <li>Padding the sequences to a maximum length.</li>
  <li>Defining the attention masks and the token_type_ids</li>
</ol>

<p><strong>Attention masks</strong> – These indicate whether the token corresponds to a word or refers to padding.</p>

<p>For normal words, 1 is used as the masking number; for padding tokens, 0 is used.</p>

<p><strong>Token_Type_Ids</strong> – This is a special id which marks the difference between the two sentences in a sentence-pair input. The first sentence, from the [CLS] token up to the first [SEP], is represented by 0, and the second sentence, up to the second [SEP], is represented by 1.</p>

<table>
  <thead>
    <tr>
      <th style="text-align: center"><img width="533" alt="code" src="https://user-images.githubusercontent.com/41947720/107883661-34290800-6f16-11eb-9d61-efe69b3e8ae2.png" /></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: center"><em>Code Snippet for automatic Tokenization. For manual tokenization check my GitHub link</em></td>
    </tr>
  </tbody>
</table>
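<p>The three tokenization outputs can also be illustrated with a minimal, self-contained sketch. This is a hypothetical hand-rolled encoder for illustration only (in practice the Hugging Face tokenizer produces these for you); the ids 101 and 102 stand in for [CLS] and [SEP], and the other token ids are made up:</p>

```python
# Hypothetical sketch of BERT-style sentence-pair encoding (not the real tokenizer).
def encode_pair(tokens_a, tokens_b, max_length=12, pad_id=0):
    # Layout: [CLS] sentence A [SEP] sentence B [SEP]
    input_ids = [101] + tokens_a + [102] + tokens_b + [102]
    # 0 for the first sentence (incl. [CLS] and the first [SEP]), 1 for the second
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # 1 for real tokens; padding positions (added below) get 0
    attention_mask = [1] * len(input_ids)
    while len(input_ids) < max_length:  # pad all three lists to a fixed length
        input_ids.append(pad_id)
        token_type_ids.append(0)
        attention_mask.append(0)
    return input_ids, token_type_ids, attention_mask
```

<p>Calling <code>encode_pair([7, 8, 9], [4, 5])</code> shows how the attention mask marks the eight real tokens with 1 and the four padding slots with 0, while the token type ids flip from 0 to 1 after the first [SEP].</p>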

<p>Having finished the tokenization phase, we can now either train a classifier on top of the embeddings obtained from BERT, or fine-tune the entire BERT architecture for about 2-4 epochs. In general, it is observed that fine-tuning gives better performance than using the embeddings directly.</p>

<p><a href="https://github.com/Nish-19/BERT_Tutorial">The entire code for this can be found on my GitHub repository here.</a> Please do check it out.</p>

<p>The repository contains two codes -</p>
<ol>
  <li>bert_embeddings.py (Making use of embeddings for classification) (Coded in Hugging Face with Tensorflow 2.0)</li>
  <li>bert_fine_tune.py (Fine tuning BERT on downstream task) (Coded in Hugging Face with Pytorch)</li>
</ol>

<p>References and Acknowledgements:</p>
<ol>
  <li>Jay Ammar’s Blogs on Transformers and BERT
    <ul>
      <li><a href="http://jalammar.github.io/illustrated-transformer/">Illustrated Transformer</a></li>
      <li><a href="http://jalammar.github.io/illustrated-bert/">Illustrated BERT</a></li>
      <li><a href="http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/">Using BERT for The First Time</a></li>
    </ul>
  </li>
  <li><a href="https://arxiv.org/abs/1706.03762">Transformers: Attention is All you need</a></li>
  <li><a href="https://arxiv.org/abs/1810.04805">BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</a></li>
</ol>]]></content><author><name>Nischal Ashok Kumar</name></author><category term="BERT/ Transformers" /><category term="NLP" /><summary type="html"><![CDATA[BERT: Google AI]]></summary></entry><entry><title type="html">IIT-Patna Academic Alexa</title><link href="https://nish-19.github.io/posts/2020/12/blog-post-3/" rel="alternate" type="text/html" title="IIT-Patna Academic Alexa" /><published>2020-12-22T00:00:00-08:00</published><updated>2020-12-22T00:00:00-08:00</updated><id>https://nish-19.github.io/posts/2020/12/blog-post-3</id><content type="html" xml:base="https://nish-19.github.io/posts/2020/12/blog-post-3/"><![CDATA[<p><img width="547" alt="basic" src="https://user-images.githubusercontent.com/41947720/102855734-2e013880-444b-11eb-82ba-7c5b845dbdb1.png" /></p>

<p>In this post, I will take you through the IIT-Patna Academic Alexa chatbot, a sentiment-aware, intelligent, information-retrieval-based system for the academic portal of IIT-P.</p>

<p>Chatbots are one of the best applications of Natural Language Processing for improving human-computer interaction. The idea of the first chatbot was conceived in 1964, and since then, the development and usage of chatbots has sky-rocketed, owing to advancements in computing, deep learning and deployable machine learning. Inspired by many such applications, we build a novel chatbot for information retrieval over IIT-Patna’s academic data.</p>

<h2 id="motivation">Motivation</h2>

<p>Imagine you have a huge database consisting of the academic and personal details of the students enrolled in your college/ organization. Now, we want to retrieve certain data, like the marks of a student in a particular subject, the credits of a subject, the number of subjects taught in a particular year, etc. How do we generally do it? Simple! Just open a csv file and search for the parameters manually.
As you may have guessed, this is a tedious and unfeasible task.</p>

<p>Now, what if we store the database in the form of MySQL relational tables and then write appropriate queries to perform the task?
Sounds easy, right? But is everyone computer literate? Can everyone write MySQL queries to retrieve information from the data?</p>

<p>To address this task, we set out to build a chatbot that takes the user’s natural language as input, converts it into the appropriate SQL query, and automatically fetches and displays the output. We design an entire chatbot pipeline for performing the given task.</p>

<h2 id="chatbot-pipeline">Chatbot Pipeline</h2>

<p><img src="https://user-images.githubusercontent.com/41947720/102854443-8f73d800-4448-11eb-9648-302559da675c.png" alt="pipeline_diagram" /></p>

<p>Components of the Custom-Designed Pipeline of IIT-P Academic Alexa: -</p>
<ol>
  <li>Input Module</li>
  <li>Sentence Classifier</li>
  <li>NL to SQL Engine</li>
  <li>NLTK Regular Chat</li>
  <li>Sentence Similarity Module</li>
  <li>Feedback Module</li>
  <li>Continual Learning System</li>
  <li>Sentiment Analyzer</li>
</ol>

<p><img style="float: right;" width="382" alt="pipeline_classifier" src="https://user-images.githubusercontent.com/41947720/102853387-43279880-4446-11eb-9979-862c5c632847.png" /></p>

<h3 id="sentence-classifier-module">Sentence Classifier Module</h3>

<ul>
  <li>Each query to the chatbot belongs to either database query or non-database query category.</li>
  <li>Database queries are those which are addressed for information retrieval from the database.</li>
  <li>Non-Database queries are those which conduct normal conversation, like “Hi”, “How are you?”, “Thank you”, etc.</li>
  <li>The Sentence Classifier Module is a deep-learning-based binary classifier for labelling a query as “Database” or “Non-Database”</li>
  <li>Multi-Channel Convolutional Neural Network Model coded in Tensorflow</li>
  <li>Self-Prepared and annotated the train and the test set</li>
  <li>Model used for real-time inferencing (lightweight for deployability)</li>
  <li>Capable of adapting to Continual Learning (or Online Learning Environment)</li>
</ul>

<h3 id="model-specifications">Model Specifications</h3>

<p><img width="895" alt="multicnn_diagram" src="https://user-images.githubusercontent.com/41947720/102853797-0dcf7a80-4447-11eb-9ce2-3a4fbef285a4.png" /></p>

<h4 id="motivation-of-using-multi-cnn">Motivation of using Multi-CNN</h4>

<p><img width="178" alt="multicnn_demonstration" src="https://user-images.githubusercontent.com/41947720/102853852-28095880-4447-11eb-90de-04e59b54deb5.png" /></p>

<ul>
  <li>CNNs capture local features of the input</li>
  <li>Input to the sentence classifier in our case depends on local feature mapping</li>
  <li>Local features in textual context can be visualized as n-gram based features</li>
  <li>The inherent n-gram feature modelling of the multi-CNN is hence useful.</li>
</ul>
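<p>The local n-gram features that the parallel CNN kernel widths correspond to can be sketched in plain Python. This is a toy illustration of the idea (the example sentence is made up), not the model code:</p>

```python
def ngrams(tokens, n):
    # Contiguous windows of n tokens -- roughly what a CNN kernel of width n "sees".
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "list the marks of all students".split()
# A multi-channel CNN with kernel sizes 2, 3 and 4 implicitly models these windows:
features = {n: ngrams(sentence, n) for n in (2, 3, 4)}
```

<p>Each kernel size captures a different span of local context, which is why running several in parallel helps the classifier.</p>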

<p>The sentence classifier mentioned here is trained on the self-procured and annotated train set and evaluated on the test set. It achieves about 93% accuracy.</p>

<p><img style="float: right;" width="356" alt="pipeline_nlsql" src="https://user-images.githubusercontent.com/41947720/102853984-7e769700-4447-11eb-97f5-8035c09ad29c.png" /></p>

<h3 id="nl-to-sql-engine">NL to SQL Engine</h3>

<p>Our module: -</p>
<ul>
  <li>Converts Natural Language Data into SQL (Structured-Query-Language) for information retrieval from the database</li>
  <li>Works on self-developed algorithms based on the “Dependency-Tree-Parser” of the natural language query</li>
  <li>Works on questions of type: -
    <ul>
      <li>List</li>
      <li>Which</li>
      <li>What</li>
      <li>How Many</li>
      <li>Who</li>
    </ul>
  </li>
  <li>Algorithm proposed here is extendable</li>
</ul>

<p>For coming up with the idea, the following steps were performed: -</p>

<ul>
  <li>Carried out literature search on traditional methods for NL to SQL conversion</li>
  <li>NL to SQL is an open-ended research problem in the field of lexical semantics and semantic parsing</li>
  <li>Decided to use the “Syntax” and “Lexical Semantics” of the natural language data</li>
  <li>Hypothesized about using Part-of-Speech-Tagging</li>
  <li>Investigated the problem by using “Dependency Tree Parsers”</li>
  <li>Developed a basic tree-parsing algorithm for the questions.</li>
  <li>Implemented a DFS (Depth-First-Search) approach for certain type of questions</li>
</ul>

<p>The NL-SQL part of the project was done by me and my classmate Vaibhav.</p>

<p>Sample pseudocode for “list question” conversion is given below</p>

<p><img width="910" alt="list_question" src="https://user-images.githubusercontent.com/41947720/102854180-02308380-4448-11eb-9c76-4bf842c465f8.png" /></p>

<p>Implementation Details</p>

<ul>
  <li>Explored different options for libraries like nltk and spaCy</li>
  <li>Decided to use spaCy library for obtaining the Dependency Tree of the natural language query</li>
  <li>spaCy is
    <ul>
      <li>A Free Open Source Natural Language Processing Library in Python</li>
      <li>Offers modules for NER, POS, Sentence Similarity and Text Classification</li>
    </ul>
  </li>
  <li>Reasons for using spaCy
    <ul>
      <li>Lightweight and easy to use</li>
      <li>Easily integratable with python</li>
      <li>Provides robust results as compared to traditional text processing libraries.</li>
    </ul>
  </li>
  <li>spaCy offers the fastest syntactic parser in the world, with accuracy within 1% of the best available <a href="https://www.aclweb.org/anthology/P15-1038.pdf">(Choi et al., 2015)</a></li>
</ul>
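<p>The DFS idea over the dependency tree can be sketched on a hand-built tree. Everything below is hypothetical for illustration (the tree is hand-written, not spaCy output, and the SELECT assembly is only hinted at in a comment):</p>

```python
# Toy dependency tree for "List the marks of Rahul" (hand-built, not parser output):
# each head word maps to its dependent children.
tree = {
    "List": ["marks"],
    "marks": ["the", "of"],
    "of": ["Rahul"],
    "the": [],
    "Rahul": [],
}

def dfs(node, children, order):
    # Depth-first traversal, recording the visiting order of words.
    order.append(node)
    for child in children.get(node, []):
        dfs(child, children, order)

order = []
dfs("List", tree, order)
# A (hypothetical) lookup could then map words met during the traversal to
# table columns, assembling something like: SELECT marks FROM ... WHERE name='Rahul'
```

<p>The traversal visits the question word first and then its arguments, which is the order the conversion algorithm needs to decide the SELECT target and the WHERE condition.</p>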

<h3 id="python-sql-connector">Python SQL Connector</h3>

<ul>
  <li>Part of the NL-SQL converter engine</li>
  <li>Takes as input the SQL query generated from the natural language data</li>
  <li>Establishes connection with the MySQL server in the local machine (where the database is stored)</li>
  <li>Fetches and Processes the returned tuple structure for forming the appropriate output</li>
</ul>

<p>Implementation Details:</p>
<ul>
  <li>Use mysql connector: a standard database driver provided by MySQL</li>
  <li>Step 1: Authorization</li>
  <li>Step 2: Creating a cursor pointing to the database</li>
  <li>Step 3: Provide the query in form of string and process the returned output</li>
</ul>
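<p>The three connector steps can be sketched end to end. Here <code>sqlite3</code> from the Python standard library stands in for <code>mysql-connector</code>, and the table and data are made up, but the authorize/cursor/execute flow is the same:</p>

```python
import sqlite3

# Step 1: "authorization" -- with mysql-connector this would be
# mysql.connector.connect(user=..., password=..., database=...)
conn = sqlite3.connect(":memory:")

# Step 2: create a cursor pointing to the database
cur = conn.cursor()
cur.execute("CREATE TABLE marks (student TEXT, subject TEXT, score INTEGER)")
cur.execute("INSERT INTO marks VALUES ('Asha', 'NLP', 92)")

# Step 3: run the generated SQL string and process the returned tuples
rows = cur.execute("SELECT score FROM marks WHERE student = 'Asha'").fetchall()
```

<p>The returned tuple structure (here <code>[(92,)]</code>) is then unpacked to form the natural-language answer shown to the user.</p>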

<p><img style="float: right;" width="369" alt="pipeline_nltk" src="https://user-images.githubusercontent.com/41947720/102854729-204ab380-4449-11eb-8504-fffc4742c87f.png" /></p>

<h3 id="nltk-chat">NLTK Chat</h3>

<ul>
  <li>Based on NLTK (Natural Language Toolkit) library</li>
  <li>Chat Provision for Regular (Non-Database) queries like Hello, Thank You etc.</li>
  <li>Based on simple regex (Regular Expression) matching of queries</li>
  <li>The pairs variable, coded by the programmer, contains predefined regex query -&gt; output mappings</li>
  <li>The reflections variable contains pronoun mappings from question to answer</li>
  <li>Lightweight and easy to use</li>
  <li>Helps make the chatbot more interesting by incorporating daily life conversations</li>
</ul>
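<p>A minimal version of this regex matching can be sketched in plain Python. The pairs below are illustrative (the real module, <code>nltk.chat.util.Chat</code>, additionally applies the reflections mapping to the response):</p>

```python
import re

# Illustrative (query pattern -> response) pairs, in the spirit of nltk.chat.util
pairs = [
    (r"hi|hello", "Hello! How can I help you?"),
    (r"thank(s| you).*", "You're welcome!"),
    (r"how are you\??", "I'm doing great, thanks for asking!"),
]

def respond(query):
    # Return the first response whose pattern matches the whole normalized query.
    for pattern, answer in pairs:
        if re.fullmatch(pattern, query.lower().strip()):
            return answer
    return None  # no match -> hand off (e.g. record for the feedback module)
```

<p>Queries that fall through every pattern return <code>None</code>, which is exactly the case the feedback module records for the code maintainer.</p>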

<p><img style="float: right;" width="374" alt="pipeline_feedback" src="https://user-images.githubusercontent.com/41947720/102854801-4b350780-4449-11eb-99dd-4d41adb30bd1.png" /></p>

<h3 id="feedback-module">Feedback Module</h3>

<ul>
  <li>Practical software systems cannot be infallible. Feedback is the means of striving towards perfection.</li>
  <li>Two steps:-
    <ul>
      <li>Sentence Similarity</li>
      <li>Data Collection</li>
    </ul>
  </li>
</ul>

<h4 id="sentence-similarity-module">Sentence Similarity Module</h4>

<p><img width="902" alt="sentence_similarity" src="https://user-images.githubusercontent.com/41947720/102854893-846d7780-4449-11eb-9008-1902fb0a8d82.png" /></p>

<h4 id="feedback-data-collection">Feedback Data Collection</h4>

<p>Continuous data is collected from the users/ testers</p>
<ol>
  <li>Sentence Similarity module: -
    <ul>
      <li>Misclassification of input query is tackled</li>
      <li>The specific text followed by the feedback label is recorded</li>
      <li>Feedback label is either database or non-database tag labelled by the user</li>
    </ul>
  </li>
  <li>NL-SQL Module:-
    <ul>
      <li>NL-SQL misconversion is tackled</li>
      <li>The input query is recorded in a csv file and the file is forwarded to the code maintainer</li>
    </ul>
  </li>
  <li>NLTK Chat Module: -
    <ul>
      <li>Tackles Regular chat which is not present in the pre-defined chat pair</li>
      <li>The input query is recorded in a csv file and the file is forwarded to the code maintainer</li>
    </ul>
  </li>
</ol>

<h4 id="continual-learning-online-training">Continual Learning/ Online Training</h4>

<ul>
  <li>Ability of a Machine Learning model to learn continually from a stream of data</li>
  <li>Is of prime importance in production environments</li>
  <li>User feedback is stored in a separate csv file</li>
  <li>Sentence Classifier module learns from the collected feedback data at regular intervals of time</li>
  <li>The classifier learns with time and hence becomes better</li>
</ul>

<p><img width="372" alt="pipeline_sentiment" src="https://user-images.githubusercontent.com/41947720/102855074-e9c16880-4449-11eb-93cb-69bbfc111eb6.png" /></p>

<h3 id="sentiment-analysis-module">Sentiment Analysis Module</h3>

<ul>
  <li>We carry out sentiment analysis of all users using the chatbot platform</li>
  <li>Sentiment of students when chatting with the chatbot can give various insights
    <ul>
      <li>Mental state of the student</li>
      <li>Overall satisfaction of group of students with the results/marks obtained</li>
    </ul>
  </li>
</ul>

<p>Implementation done using Vader-Sentiment-Analysis</p>

<ul>
  <li>VADER is a rule-based sentiment analysis tool that does lexicon-based scoring</li>
  <li>Sentiment scores are between -1 and 1</li>
  <li>Sentiment score of:
    <ul>
      <li>Between -0.05 and 0.05 is neutral</li>
      <li>Less than -0.05 is negative</li>
      <li>Greater than 0.05 is positive</li>
    </ul>
  </li>
  <li>The sentiment scores of all sentences are averaged to determine the overall sentiment of the user</li>
</ul>
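<p>The averaging and thresholding can be sketched as follows. The scores in the example are made-up stand-ins for VADER’s per-sentence compound scores:</p>

```python
def overall_sentiment(compound_scores, pos=0.05, neg=-0.05):
    # Average per-sentence compound scores, then apply the usual VADER cutoffs.
    avg = sum(compound_scores) / len(compound_scores)
    if avg > pos:
        return "positive"
    if avg < neg:
        return "negative"
    return "neutral"

# e.g. scores a user's chat session might produce
session = [0.6, 0.1, -0.2, 0.4]
```

<p>For this session the average is 0.225, so the user’s overall sentiment is labelled positive.</p>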

<p>Hence this completes the summary of all the modules used in the chatbot pipeline.</p>

<p><img width="545" alt="db_query" src="https://user-images.githubusercontent.com/41947720/102855769-3e191800-444b-11eb-8b4e-61ce38d36591.png" /></p>

<p>This was a team project and here in this blog I have taken you through my part of the project. There is much more to this like the final interface, data analysis etc. which was done by my classmates at IIT-Patna.</p>

<p><a href="https://github.com/Nish-19/IIT_Patna_Academic_Chatbot">You can find the code for this project on GitHub here.</a></p>

<p>Stay tuned for more ML and DL content!</p>]]></content><author><name>Nischal Ashok Kumar</name></author><category term="Chatbot" /><category term="NLP" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Unlabeled Data for Adversarial Robustness</title><link href="https://nish-19.github.io/posts/2020/07/blog-post-2/" rel="alternate" type="text/html" title="Unlabeled Data for Adversarial Robustness" /><published>2020-07-20T00:00:00-07:00</published><updated>2020-07-20T00:00:00-07:00</updated><id>https://nish-19.github.io/posts/2020/07/blog-post-2</id><content type="html" xml:base="https://nish-19.github.io/posts/2020/07/blog-post-2/"><![CDATA[<p><img src="https://user-images.githubusercontent.com/41947720/87966557-0a704580-cadb-11ea-965a-4b7d1046fc2f.png" alt="image" />
<em>Effect of adversarial perturbations on natural input - Misclassification of an image (which hasn’t changed in human’s perspective)</em></p>

<p>This blog post talks about using Unlabeled data for improving Adversarial Robustness in deep neural networks. This post is a summary of the following works: -</p>
<ol>
  <li><a href="https://arxiv.org/pdf/1901.09960.pdf">Using Pre-training can improve model robustness</a> (ICML 2019)</li>
  <li><a href="https://arxiv.org/abs/1905.13736">Unlabeled data improves adversarial robustness</a> (NeurIPS 2019)</li>
  <li><a href="https://arxiv.org/abs/1905.13725">Are labels required for improving adversarial robustness?</a> (NeurIPS 2019)</li>
</ol>

<h2 id="brief-precursor-to-using-unlabeled-data-for-adversarial-robustness">Brief Precursor to using unlabeled data for adversarial robustness</h2>

<p>Adversarial Robustness, in general, deals with the question of whether we can develop classifiers that are robust to (test-time) perturbations of their input by an adversary intending to fool the classifier. This is of prime importance in critical applications of deep learning, like cancer-recognition systems and self-driving cars, where the scope for error is almost nil. In recent years there have been significant developments in creating both adversaries and defenses against them.</p>

<p><a href="https://papers.nips.cc/paper/7749-adversarially-robust-generalization-requires-more-data">Schmidt et al.</a> showed in their work that there is a sample-complexity gap in achieving the same robust accuracy as clean accuracy for a classification task using CIFAR-10 as the dataset.</p>

<p><img src="https://user-images.githubusercontent.com/41947720/87968037-6c31af00-cadd-11ea-9a95-90e5f8da4ca0.png" alt="image" /></p>

<p>Hence, from their work, we can conclude that there is a need for additional data to improve adversarial robustness. This work served as a motivation for the papers we will be discussing in this blog. The following researchers started wondering how to make up for this additional data to bridge the sample-complexity gap.</p>

<h2 id="using-pre-training-improves-adversarial-robustness">Using Pre-Training Improves Adversarial robustness</h2>

<p>This paper introduces the concept of Adversarial Pre-training. This is based on the following concepts: -</p>
<ol>
  <li>The problem of requirement of more task specific data can be solved using pre-training (a typical transfer learning scenario)</li>
  <li>Data from a different distribution can be beneficial for a different task (Huh et al)</li>
</ol>

<h3 id="method-used">Method used</h3>

<ul>
  <li>Adversarial pre-training on downsampled (to 32x32) ImageNet (1000 classes) with 10-step PGD, eps = 8/255 (l-infinity).</li>
  <li>Fine-tuning on the CIFAR-10 dataset using PGD-10 with eps = 8/255 for 5 epochs</li>
  <li>Finally, evaluating on CIFAR-10 using PGD-20 with eps = 8/255 (l-infinity)</li>
</ul>

<p>Note - They use WRN-28-10 for all their experiments.</p>

<h3 id="results">Results</h3>

<p><img src="https://user-images.githubusercontent.com/41947720/87968988-eca4df80-cade-11ea-985c-1a585f6210dc.png" alt="image" />
From their results, we can observe that the clean accuracy remains almost the same, whereas the adversarial accuracy increases significantly (by 12%). Hence, from this work, we can conclude that adversarial features can transfer robustly across data distributions.</p>

<h2 id="unlabeled-data-improves-adversarial-robustness">Unlabeled Data improves Adversarial robustness</h2>

<p>This paper addresses the following questions -</p>
<ul>
  <li>How can we account for additional data for improving robustness?</li>
  <li>How to get additional labelled data?
<em>Labelling may be an expensive process</em></li>
</ul>

<p>The solution to the questions is a Semi-Supervised Adversarial Training Algorithm</p>

<p>Here they propose an algorithm Robust Self Training (RST) which is based on : -</p>
<ol>
  <li>Taking unlabeled data and generating pseudo labels from them (using a network pre-trained with the labeled data)</li>
  <li>Mixing the unlabeled data and the labeled data in a definite proportion.</li>
  <li>Performing adversarial training (<a href="https://arxiv.org/pdf/1901.08573.pdf">TRADES</a>) on this dataset.</li>
</ol>

<p>The reasoning for using unlabeled data for bridging the sample complexity gap is: -</p>
<ul>
  <li>Labeling Data is generally an expensive and tedious process.</li>
  <li>Adversarial Robustness requires the predictions to be stable around naturally occurring inputs. Achieving this doesn’t really require labels.</li>
</ul>

<h3 id="rst-algorithm-pseudocode">RST Algorithm pseudocode</h3>

<p><img src="https://user-images.githubusercontent.com/41947720/87969963-aea8bb00-cae0-11ea-8c9a-40ac2c8b2b0d.png" alt="image" /></p>

<p>Here, Lstandard is the cross-entropy loss and Lrobust is the <a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">KL</a> loss (as used in TRADES)</p>

<p>Note - Here cifar-10 is used as the labeled dataset containing 50K training samples. The unlabeled data is procured from the 80M Tiny Images dataset following a definite procedure. Total 500K unlabeled data is procured.</p>
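<p>The data-preparation half of RST (steps 1 and 2 above) can be sketched in plain Python. The model below is a toy stand-in for a network pre-trained on the labeled set, and the actual paper then runs TRADES-style adversarial training on the resulting mix:</p>

```python
def robust_self_training_data(labeled, unlabeled, model, unlabeled_fraction=0.5):
    # Step 1: pseudo-label the unlabeled pool with the pre-trained model
    pseudo_labeled = [(x, model(x)) for x in unlabeled]
    # Step 2: mix labeled and pseudo-labeled data so the pseudo-labeled part
    # makes up the requested fraction of the combined set
    n_unlabeled = int(len(labeled) * unlabeled_fraction / (1 - unlabeled_fraction))
    return labeled + pseudo_labeled[:n_unlabeled]

# Toy example: the "model" labels an input positive if its feature sum is positive.
model = lambda x: int(sum(x) > 0)
labeled = [((1.0, 2.0), 1), ((-1.0, -2.0), 0)]
unlabeled = [(0.5, 0.5), (-0.5, -0.5), (3.0, 0.0)]
mixed = robust_self_training_data(labeled, unlabeled, model)
```

<p>With <code>unlabeled_fraction = 0.5</code>, each epoch then sees half labeled and half pseudo-labeled data, matching the setting reported in the results below.</p>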

<h3 id="results-1">Results</h3>

<p><img src="https://user-images.githubusercontent.com/41947720/87970323-31ca1100-cae1-11ea-849a-3261f399a9b9.png" alt="image" /></p>

<p><em>The results are reported using wrn-28-10 with learning_scheduler cosine and unsupervised_fraction - 0.5 (i.e. each epoch contains half labeled and half unlabeled data)</em></p>

<p>We can observe that the RST model performs better than the other models on all the attacks, also it gets the highest accuracy on clean samples as well (89.7%)</p>

<h2 id="are-labels-required-for-improving-adversarial-robustness">Are Labels Required for improving Adversarial Robustness</h2>

<p>Like the previous work, this work also focuses on using unlabeled data for improving adversarial robustness.</p>

<p>The main contributions of this work are: -</p>
<ol>
  <li>Proposed UAT (Unsupervised Adversarial Training) method to make use of unlabeled data.</li>
  <li>Proved that unlabeled data can be competitive to labelled data for bridging the sample complexity gap in Schmidt et al.</li>
  <li>New state-of-art on CIFAR-10 using uncurated unlabeled data.</li>
</ol>

<p>The motivation behind this work is: -</p>
<ol>
  <li>Labelled data is expensive</li>
  <li>Adversarial robustness depends on the smoothness of the classifier which can be determined by unlabeled data.</li>
  <li>Only a small amount of labelled data is needed for standard generalization.</li>
</ol>

<p>Here, they propose three variants of the UAT algorithm and experiment with all three of them.</p>

<p><img src="https://user-images.githubusercontent.com/41947720/87971125-786c3b00-cae2-11ea-9c6f-5837d8f5c48e.png" alt="image" /></p>

<p>These loss notations are useful for understanding the pseudocodes below. As seen from the losses, both make use of the adversarially perturbed input (by considering the l-infinity norm ball).</p>

<h3 id="algorithm-uat-ot">Algorithm UAT-OT</h3>

<p><img src="https://user-images.githubusercontent.com/41947720/87971483-0ba57080-cae3-11ea-9e30-8da4956e9009.png" alt="image" /></p>

<h3 id="algorithm-uat-ft">Algorithm UAT-FT</h3>

<p><img src="https://user-images.githubusercontent.com/41947720/87971531-1b24b980-cae3-11ea-9c16-4c0937b12aa1.png" alt="image" /></p>

<h3 id="algorithm-uat">Algorithm UAT++</h3>

<p><img src="https://user-images.githubusercontent.com/41947720/87971577-2c6dc600-cae3-11ea-940e-c88c017d03fa.png" alt="image" /></p>

<p>Note that the UAT++ algorithm is very similar to the RST algorithm of the previous paper, except that in UAT++ both losses (cross-entropy and KL) make use of the adversarially perturbed input (by considering the l-infinity norm ball), whereas in RST only the KL loss is minimized on adversarial input.</p>

<p>Also, here the unlabeled dataset is taken from the same labeled dataset but without considering the labels.</p>

<p><img src="https://user-images.githubusercontent.com/41947720/87971803-88384f00-cae3-11ea-8284-dadf0f41a486.png" alt="image" /></p>

<p>Here, we can see that the UAT++ algorithm performs as well as a supervised oracle (a model trained with labels for the unlabeled data as well). This proves that unlabeled data is competitive with labeled data for adversarial robustness.</p>

<h2 id="conclusion">Conclusion</h2>

<p>Hence we have seen how unlabeled data can be used to improve adversarial robustness by studying the above three papers. For more details regarding this it is recommended to read the original papers. For a general idea about adversarial robustness one can refer <a href="https://adversarial-ml-tutorial.org/introduction/">here</a>.</p>

<p>Stay tuned for more ML and DL content!</p>]]></content><author><name>Nischal Ashok Kumar</name></author><category term="Deep Learning" /><category term="Adversarial Robustness" /><summary type="html"><![CDATA[Effect of adversarial perturbations on natural input - Misclassification of an image (which hasn’t changed in human’s perspective)]]></summary></entry><entry><title type="html">Mathematics for ML and DL</title><link href="https://nish-19.github.io/posts/2020/05/blog-post-1/" rel="alternate" type="text/html" title="Mathematics for ML and DL" /><published>2020-05-26T00:00:00-07:00</published><updated>2020-05-26T00:00:00-07:00</updated><id>https://nish-19.github.io/posts/2020/05/blog-post-1</id><content type="html" xml:base="https://nish-19.github.io/posts/2020/05/blog-post-1/"><![CDATA[<p><img src="https://user-images.githubusercontent.com/41947720/82878340-000d5180-9f59-11ea-8293-8d653df784e7.png" alt="image" /></p>

<p>Ever wondered what goes on in neural networks? Ever tried to calculate the gradients of the loss function after each pass? If yes, then you must have realized the role played by mathematical concepts here. If no, then you are at the right place.</p>

<p>In this blog post we will see the topics of Mathematics that are the most crucial for understanding and appreciating machine learning algorithms. Machine Learning has become very popular nowadays with developers and undergraduate college students, given the plethora of courses out there and the ease of building and running neural networks (thanks to google colab!). But, how many of these “machine-learners” actually understand the maths behind what is going on? Is it important? Well, not so much if you just want to play around with it, but an emphatic yes if you want to pursue research projects or come up with your new ideas.</p>

<p>Having said this, let’s dive into our topics.</p>

<h2 id="linear-algebra">Linear Algebra</h2>

<p><img src="https://user-images.githubusercontent.com/41947720/82879763-0bfa1300-9f5b-11ea-9657-fa40c4225fcc.png" alt="image" /></p>

<p>Linear algebra is the study of vectors (n-dimensional in general) and the operations associated with them. In the context of Machine Learning, we use vectors, matrices and tensors to concisely represent data. By doing so, we eliminate confusion and can write the equations occurring in ML models in a succinct, short-hand form.</p>

<p><img src="https://user-images.githubusercontent.com/41947720/82880622-516b1000-9f5c-11ea-9d17-fd289a03d65c.png" alt="image" /></p>

<p>The above image represents the computations taking place in a single neuron. We are calculating the weighted sum of the inputs and passing it through a non-linearity (function f() here).</p>

<p><img src="https://user-images.githubusercontent.com/41947720/82881001-bf173c00-9f5c-11ea-9d51-c28f6098c7ee.png" alt="image" /></p>

<p>We can observe the compact mathematical representation of the equation in the above image. Here, b + x.w is obtained via matrix multiplication and addition, which are topics under linear algebra.</p>
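<p>The single-neuron computation above can be written out directly. Here is a minimal sketch with a sigmoid as the non-linearity f (the inputs and weights in the usage line are made up):</p>

```python
import math

def neuron(x, w, b):
    # weighted sum b + x . w, passed through a sigmoid non-linearity f
    z = b + sum(xi * wi for xi, wi in zip(x, w))
    return 1 / (1 + math.exp(-z))

# e.g. a neuron with two inputs
activation = neuron([1.0, 2.0], [0.5, -0.25], 0.1)
```

<p>Whatever the weighted sum is, the sigmoid squashes the activation into (0, 1), which is what makes it usable as the output of a binary classifier.</p>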

<p>Now, let’s look at a bigger picture of the entire neural network from LA’s perspective (LA - Linear Algebra and not Los Angeles 😉).</p>

<p>Consider the below NN architecture with 3 units in the input layer, 4 each in 2 hidden layers and 1 unit in the final layer (looks familiar 🤔- hmmm maybe of a binary classification task)</p>

<p><img src="https://user-images.githubusercontent.com/41947720/82881643-93e11c80-9f5d-11ea-9f0e-abe0d039ab66.png" alt="image" /></p>

<p>Can we represent the final output in a single line as a function of the input? (Hint - the diagram has weight matrices). The answer is yes: at each layer we apply the formula shown above, and composing all three gives the result. Seems easy, doesn’t it? But how is this going to help? Well, this representation comes in handy when we are calculating gradients of the loss function with respect to the network parameters during backpropagation, which we will look at next.</p>

<p><img src="https://user-images.githubusercontent.com/41947720/82882732-1b7b5b00-9f5f-11ea-9873-a1df4a649dce.png" alt="image" /></p>

<p>Also, the concepts of basis, eigen-values, eigen-vectors, singular-value-decomposition are very crucial (Ex: they form the pillars of PCA (principal component analysis) which is a very important concept in Machine Learning (we won’t be discussing it here)). Having seen this let’s move on to the next topic i.e. Multivariate Calculus.</p>

<h2 id="multivariate-calculus">Multivariate Calculus</h2>

<p>Wikipedia defines Calculus as the mathematical study of continuous change, and so it is. Most of us are familiar with the concepts of integration and differentiation in calculus, but what role do they play here? As said, we can view ML models as a computational box in which the input undergoes changes in stages and finally turns into the output (a complicated way of saying the output is some function of the input 😉). Now, in the training process, our goal is to reduce the loss (or to optimize the model on the training data). This is where the theory of continuous change comes in.</p>

<p>For reducing the loss function we have to tweak the weights and biases (collectively the parameters of the model), more specifically we have to move “opposite to the direction of the gradient”.</p>

<p><img src="https://user-images.githubusercontent.com/41947720/82884842-f20ffe80-9f61-11ea-85b5-59f4fcabe909.png" alt="image" /></p>

<p>Consider this graph (here an error surface): let us assume the z-axis represents the error and the x and y axes represent the parameters of the model. Now, to reduce the error (or loss), we have to move in the x-y plane in the direction opposite to the gradient of the loss function w.r.t. the parameters x and y (note - I’m not discussing the proof here). This commonly involves <em>vectorial differentiation</em> and the <em>chain rule</em> applied through the loss function down to some kth layer. Thankfully, we don’t have to perform it ourselves (we have TensorFlow and PyTorch!). However, it is important to know about it.</p>
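<p>For a concrete picture, here is gradient descent on the simple bowl-shaped surface f(x, y) = x² + y², whose gradient is (2x, 2y); at every step we move opposite to the gradient, so the parameters slide down to the minimum at (0, 0):</p>

```python
# Gradient descent on f(x, y) = x**2 + y**2; the minimum is at (0, 0).
def grad(x, y):
    return 2 * x, 2 * y

x, y, lr = 3.0, 4.0, 0.1   # starting point and learning rate (made-up values)
for _ in range(100):
    gx, gy = grad(x, y)
    x, y = x - lr * gx, y - lr * gy  # step opposite to the gradient
```

<p>Each step multiplies both coordinates by (1 - 2·lr) = 0.8, so after 100 steps the point is essentially at the minimum.</p>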

<p>The image below shows the chain rule of differentiation.</p>

<p><img src="https://user-images.githubusercontent.com/41947720/82885427-cd685680-9f62-11ea-9dc6-eb388720d19e.png" alt="image" /></p>
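<p>The chain rule is also easy to verify numerically. Here is a small sketch (the composite function is made up for illustration) comparing the chain-rule derivative with a finite-difference estimate:</p>

```python
import math

# Composite function: h(x) = f(g(x)) with f(u) = sin(u) and g(x) = x^2
g = lambda x: x**2
f = lambda u: math.sin(u)
h = lambda x: f(g(x))

# Chain rule: h'(x) = f'(g(x)) * g'(x) = cos(x^2) * 2x
def h_prime(x):
    return math.cos(g(x)) * 2 * x

x = 1.3
eps = 1e-6
numeric = (h(x + eps) - h(x - eps)) / (2 * eps)  # central-difference estimate

print(abs(h_prime(x) - numeric))  # the two derivatives agree to high precision
```

<p>Autodiff frameworks like TensorFlow and PyTorch apply exactly this rule, composed across every layer of the network.</p>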

<p>Having a general idea of Calculus helps while carrying out research and building new models from scratch. Having said this, let’s move on to the final topic Probability and Statistics.</p>

<h2 id="probability-and-statistics">Probability and Statistics</h2>

<p><em>the bedrock of ML</em></p>

<p><img src="https://user-images.githubusercontent.com/41947720/82890331-9f3a4500-9f69-11ea-84ed-5d7ce9eab0d3.png" alt="image" /></p>

<p>Probability is the study of uncertainty. It is a science that quantifies the likelihood of events of interest occurring in a given setting. But how is it related to ML? Well, isn’t ML about developing predictive models from uncertain data? This is where Probability comes in. Probability is used almost everywhere in ML: from defining <a href="https://machinelearningmastery.com/cross-entropy-for-machine-learning/">cross-entropy</a> as the loss function of a classification task, to <a href="https://en.wikipedia.org/wiki/Normal_distribution">sampling data from specific Gaussian distributions</a>, to algorithms like <a href="https://en.wikipedia.org/wiki/Naive_Bayes_classifier">Naive Bayes</a>, to the study of <a href="https://camdavidsonpilon.github.io/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/">Bayesian Machine Learning</a> (a separate branch of ML based on Bayesian inference). Having a clear understanding of Probability is a <em>must</em> if you want to excel in ML.</p>
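<p>Take cross-entropy as a concrete case: for a single classification example it reduces to the negative log-probability the model assigns to the true class. A minimal NumPy sketch (the predicted probabilities below are made up):</p>

```python
import numpy as np

def cross_entropy(probs, true_class):
    """Negative log-probability assigned to the correct class."""
    return -np.log(probs[true_class])

probs = np.array([0.1, 0.7, 0.2])  # model's predicted class probabilities

print(cross_entropy(probs, 1))  # true class gets 0.7 -> low loss (~0.357)
print(cross_entropy(probs, 0))  # true class gets 0.1 -> high loss (~2.303)
```

<p>The loss is small when the model is confident and correct, and blows up when it assigns low probability to the true class, which is exactly the behavior we want from a classification objective.</p>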

<p><img src="https://user-images.githubusercontent.com/41947720/82890578-09eb8080-9f6a-11ea-92db-fad856eed3eb.png" alt="image" /></p>

<p><em>The formula above shows the Kullback-Leibler divergence, which measures the dissimilarity between two probability distributions</em></p>
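<p>A quick sketch of the formula in NumPy (the two discrete distributions below are made up for illustration):</p>

```python
import numpy as np

def kl_divergence(p, q):
    """KL(P || Q) = sum_i p_i * log(p_i / q_i) for discrete distributions."""
    p, q = np.asarray(p), np.asarray(q)
    return np.sum(p * np.log(p / q))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

print(kl_divergence(p, p))  # 0 -- identical distributions
print(kl_divergence(p, q))  # positive, and grows as the distributions diverge
```

<p>Note that KL divergence is not symmetric: KL(P&nbsp;||&nbsp;Q) generally differs from KL(Q&nbsp;||&nbsp;P), which is why it is a divergence rather than a distance.</p>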

<p>Why is Statistics used alongside Probability? Statistics is the study of data, including its collection, organization, and analysis. Statistical quantities such as the mean, median, quartiles, variance, and standard deviation arise naturally as expectation values (and related properties) of probability distribution functions.</p>

<p>Since ML models deal with huge amounts of data, the task of organizing and preparing it to be fed to the model is handled by statistics (Ex: we apply batch normalization to image data before feeding it to deep CNNs (convolutional neural networks)).</p>
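<p>As a minimal sketch, standardization (a simplified, per-feature version of what batch normalization does at its core) subtracts the mean and divides by the standard deviation, so every feature ends up with zero mean and unit variance. The batch below is randomly generated for illustration:</p>

```python
import numpy as np

def standardize(batch):
    """Zero-mean, unit-variance scaling per feature (column)."""
    mean = batch.mean(axis=0)
    std = batch.std(axis=0)
    return (batch - mean) / std

rng = np.random.default_rng(0)
batch = rng.normal(loc=5.0, scale=3.0, size=(64, 4))  # made-up feature batch

normalized = standardize(batch)
print(normalized.mean(axis=0))  # ~0 for every feature
print(normalized.std(axis=0))   # ~1 for every feature
```

<p>Real batch normalization layers additionally learn a scale and shift per feature and track running statistics, but the centering-and-scaling step above is the statistical heart of it.</p>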

<p><img src="https://user-images.githubusercontent.com/41947720/82890725-40290000-9f6a-11ea-8176-33e522c3abe4.png" alt="image" /></p>

<p><em>The figure above shows a common data distribution considered in ML (the Gaussian distribution)</em></p>

<p>Probability and Statistics is arguably the most important area of mathematics used in ML.</p>

<h3 id="resources-for-learning">Resources for Learning</h3>

<ul>
  <li>Linear Algebra -
    <ul>
      <li><a href="https://www.khanacademy.org/math/linear-algebra">Khan Academy</a></li>
      <li><a href="https://www.youtube.com/watch?v=kjBOesZCoqc&amp;list=PL0-GT3co4r2y2YErbmuJw2L5tW4Ew2O5B">3Blue1Brown</a></li>
    </ul>
  </li>
  <li>
    <p><a href="https://www.coursera.org/learn/multivariate-calculus-machine-learning">Calculus</a></p>
  </li>
  <li>Probability and Statistics
    <ul>
      <li><a href="https://www.khanacademy.org/math/statistics-probability">Khan Academy</a></li>
      <li><a href="https://machinelearningmastery.com/probability-for-machine-learning-7-day-mini-course/">Machine Learning Mastery Blog</a></li>
    </ul>
  </li>
</ul>

<h4 id="bonus">Bonus</h4>

<p><a href="https://mml-book.github.io/book/mml-book.pdf">For book lovers</a></p>

<p><a href="https://www.coursera.org/specializations/mathematics-machine-learning">A complete course</a></p>

<h2 id="conclusion">Conclusion</h2>

<p><img src="https://user-images.githubusercontent.com/41947720/82891460-67340180-9f6b-11ea-82bb-9960d7538913.png" alt="image" /></p>

<p>This post explains why Mathematics is used in ML and how. The explanations are meant as motivation for further learning, and I have also provided resources for each topic discussed.</p>

<p>For more ML and DL content stay tuned!</p>]]></content><author><name>Nischal Ashok Kumar</name></author><category term="Machine Learning" /><category term="Mathematics" /><summary type="html"><![CDATA[]]></summary></entry></feed>