Machine Learning, Puzzle, Python, Statistics

Puzzle of the Day Wednesday

1)

A decision tree model was trained and achieved 100% accuracy on the training set, but only 30% accuracy on the test set. What can be done to rescue the model from this situation?

1 : Pre-pruning

2 : Post-pruning

3 : Decrease the tree depth by cutting from the lowest node

  • 1 and 2
  • 2 and 3
  • All
  • None.

2)

If a variable A is caused by a variable B, what can you say about the correlation between them?

  • A and B will be highly correlated
  • A and B will be negatively correlated
  • A and B will be mildly correlated
  • A and B will not be correlated
  • A and B may or may not be correlated

3)

Which of the following is true for dropout rate?

  • All layers carry the same dropout value.
  • Dropout rate of any layer depends on previous layer
  • Different layers can have different dropout rate
  • None of the above

4)

If I have to assess a student based on his marks and the type of subject, and check the dependency of the marks scored with respect to the subject, what will I use?

Choose the correct option

  • ANOVA
  • Chi-square test
  • T-test
  • Paired T-test

5)

If a positively skewed distribution has a median of 50, which of the following statements is true?

A) Mean is greater than 50
B) Mean is less than 50
C) Mode is less than 50
D) Mode is greater than 50
E) Both A and C
F) Both B and D

POST YOUR ANSWER IN COMMENTS

Machine Learning, Python, Statistics

Puzzle of the Day

1) Given the following code:

from sklearn.metrics import roc_curve

y_true = [0, 1, 0, 0, 1, 1]
y_pred = [1, 1, 0, 0, 1, 0]

roc_curve(y_true, y_pred, pos_label=0)

Which of the following is true regarding the result given by the above code?

  • Decreasing False Positive rate
  • Increasing True Positive rate
  • Decreasing True Positive rate
  • Increasing False positive rate

2)

Regularization is used to reduce the complexity of the model. Algorithms like lasso and ridge regression are used to perform regularization.

Adam and John are data scientist trainees at XYZ company. Adam told John that one of the above algorithms also performs feature selection.

John said no, there is no such algorithm. Choose the correct statement.

  • John is right; these algorithms are used for regularization only
  • Adam is right, ridge regression performs feature selection internally
  • Adam is right, lasso regression performs feature selection as well
  • None of the above

3)

Which of the following parameters of SVM tells us about the tradeoff between misclassification and simplicity of the model?

  • Cost parameter
  • Gamma
  • Coefficients of kernel functions
  • None of the above

4)

Which of the following is a correct way to sharpen an image?

  • Option A:
    1. Convolve the image with the identity matrix
    2. Subtract this resulting image from the original
    3. Add this subtracted result back to the original image
  • Option B:
    1. Smooth the image
    2. Subtract this smoothed image from the original
    3. Add this subtracted result back to the original image
  • Option C:
    1. Smooth the image
    2. Add this smoothed image back to the original image
  • None of the above

5)

Which of the following is true for an SVM classifier with a very high ‘C’ (cost) value?

  • It will have low variance
  • It will have high variance
  • It will have high bias
  • None of the above

POST YOUR ANSWERS IN COMMENT

Machine Learning, Pandas, Python

Encoding Categorical Variables

A machine learning model unfortunately cannot deal with categorical variables (except for some models such as LightGBM). Therefore, we have to find a way to encode (represent) these variables as numbers before handing them off to the model. There are two main ways to carry out this process:

  • Label encoding: assign each unique category in a categorical variable with an integer. No new columns are created. An example is shown below

[Image: label encoding example]

  • One-hot encoding: create a new column for each unique category in a categorical variable. Each observation receives a 1 in the column for its corresponding category and a 0 in all other new columns.

[Image: one-hot encoding example]

The problem with label encoding is that it gives the categories an arbitrary ordering. The value assigned to each category is random and does not reflect any inherent aspect of the category. In the example above, programmer receives a 4 and data scientist a 1, but if we repeated the process, the labels could be reversed or completely different. Therefore, when we perform label encoding, the model might use the relative value of the feature (for example programmer = 4 and data scientist = 1) to assign weights, which is not what we want. If we only have two unique values for a categorical variable (such as Male/Female), then label encoding is fine, but for more than 2 unique categories, one-hot encoding is the safe option.

There is some debate about the relative merits of these approaches, and some models can deal with label encoded categorical variables with no issues. Here is a good Stack Overflow discussion. I think (and this is just a personal opinion) for categorical variables with many classes, one-hot encoding is the safest approach because it does not impose arbitrary values on categories. The only downside to one-hot encoding is that the number of features (dimensions of the data) can explode with categorical variables with many categories. To deal with this, we can perform one-hot encoding followed by PCA or other dimensionality reduction methods to reduce the number of dimensions (while still trying to preserve information).

Label Encoding and One-Hot Encoding

Let’s implement the policy described above: for any categorical variable (dtype == object) with 2 unique categories, we will use label encoding, and for any categorical variable with more than 2 unique categories, we will use one-hot encoding.

For label encoding, we use the Scikit-Learn LabelEncoder and for one-hot encoding, the pandas get_dummies(df) function.
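
As a sketch of this policy (the column names and data below are made up for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame; column names and values are invented for illustration.
df = pd.DataFrame({
    'sex':   ['Male', 'Female', 'Male', 'Female'],
    'job':   ['programmer', 'data scientist', 'teacher', 'programmer'],
    'score': [80, 92, 75, 88],
})

le = LabelEncoder()
for col in df.select_dtypes(include='object').columns:
    if df[col].nunique() <= 2:
        # Two unique categories: label encode in place, no new columns.
        df[col] = le.fit_transform(df[col])

# Remaining object columns (more than 2 categories) are one-hot encoded.
df = pd.get_dummies(df)
print(list(df.columns))
```

Here `sex` (2 categories) becomes a single integer column, while `job` (3 categories) expands into one new column per category.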

 

Machine Learning, Python, Statistics

Puzzle-Day 4

  1. Let A and B be events on the same sample space, with P (A) = 0.6 and P (B) = 0.7. Can these two events be disjoint?

    • Yes
    • No
  2. Suppose we have a 1D image with values as 

    [2, 5, 8, 5, 2]

    Now we apply an average filter of size 3 on this image. What would be the value of the second-to-last pixel?

    • The value would remain the same
    • The value would increase by 2
    • The value would decrease by 2
    • None of the above
  3. What does the assumption of homoscedasticity say?

    • The known value of the variance is the same for all the residuals, but its value is unknown
    • The variance is the same for all the residuals, but its value is unknown.
    • It is assumed that the residual terms are independent of each other
    • None of these
  4. A data scientist is working on a binary classification problem, to classify a person as rich or poor. This dataset consists of only one feature, which is the wages of people in a country. The dataset consists of a large number of datapoints. Which of the following can be used for this classification problem?
    • Bernoulli Naïve Bayes
    • Multinomial Naïve Bayes
    • Power law naïve Bayes
    • None of the above
  5. Suppose I take a random variable X which represents heights of some people.

    X ~ N(μ, σ), with mean = 168 cm, standard deviation = 10 cm, and 50 observations.

    What will be the range of heights for a 90% confidence interval?

    • (165,180)
    • (159,177)
    • (155,181)
    • (162,174)

Please post your answer in comments.

Feature Extraction, Machine Learning, Python

SMOTE

SMOTE (Synthetic Minority Over-sampling Technique)

 

SMOTE is an over-sampling method: it creates synthetic (not duplicate) samples of the minority class, making the minority class equal in size to the majority class. SMOTE does this by selecting similar records and altering each record one column at a time by a random amount within the difference to the neighboring records.

[Image: SMOTE over-sampling illustration]

Original paper: SMOTE (Chawla et al., 2002)
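
The interpolation idea can be sketched in plain NumPy; this is a simplified illustration of the technique, not the imblearn implementation:

```python
import numpy as np

def smote_sample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples: pick a minority point,
    pick one of its k nearest minority neighbors, and interpolate
    between them at a random fraction of the gap."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]   # skip the point itself
        j = rng.choice(neighbors)
        gap = rng.random()                   # random fraction in [0, 1)
        out.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(out)

X_min = np.random.default_rng(1).normal(size=(10, 2))  # toy minority class
synthetic = smote_sample(X_min, n_new=20)
print(synthetic.shape)   # (20, 2)
```

Each synthetic point lies on the line segment between a real minority sample and one of its nearest minority neighbors, which is exactly the "within the difference to the neighboring records" idea described above.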

NearMiss (technique to do undersampling)

NearMiss is an under-sampling technique. Instead of re-sampling the minority class, it uses a distance criterion to reduce the majority class so that it is equal in size to the minority class.

from imblearn.under_sampling import NearMiss
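
The idea behind NearMiss-1 (keep the majority samples whose average distance to their nearest minority neighbors is smallest) can be sketched in NumPy; this is a rough illustration, not the imblearn implementation:

```python
import numpy as np

def nearmiss1(X_maj, X_min, n_keep, k=3):
    """NearMiss-1 idea: keep the n_keep majority samples whose mean
    distance to their k nearest minority neighbors is smallest."""
    # Pairwise Euclidean distances, shape (n_majority, n_minority).
    d = np.linalg.norm(X_maj[:, None, :] - X_min[None, :, :], axis=2)
    mean_k = np.sort(d, axis=1)[:, :k].mean(axis=1)
    return X_maj[np.argsort(mean_k)[:n_keep]]

rng = np.random.default_rng(0)
X_maj = rng.normal(0, 1, size=(100, 2))   # toy majority class
X_min = rng.normal(2, 1, size=(10, 2))    # toy minority class
X_down = nearmiss1(X_maj, X_min, n_keep=len(X_min))
print(X_down.shape)   # (10, 2): majority reduced to minority size
```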

Data Set:

I’ll be using Bank Marketing dataset from UCI. Here the column for prediction is “y” which says either yes or no for client subscription to term deposit. The full code is available on GitHub.

From 45,211 records, we are left with 43,193 records. For the next step, I’ve mapped yes and no to “1” and “0” respectively.

bank.y.value_counts()

0 38172
1 5021
Name: y, dtype: int64

The dataset contains 38172 records of clients without term deposit subscription and only 5021 records of clients with term deposit subscription. Clearly an imbalanced dataset.

If we split the dataset and fit a Logistic Regression and check the accuracy score:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
accuracy_score(y_test, y_pred)
Out[1]: 0.8930456523752199


confusion_matrix(y_test, y_pred)
array([[9371,  173],
       [ 982,  273]], dtype=int64)

This is a bad model: despite roughly 89% accuracy, it correctly identifies only 273 of the 1,255 actual positives.

recall_score(y_test, y_pred)
Out[1]: 0.21752988047808766

Applying SMOTE:

 

pip install imblearn

from imblearn.over_sampling import SMOTE

y_train.value_counts()
0    28628
1     3766
Name: y, dtype: int64

Let us fit SMOTE: (You can check out all the parameters from here)

smt = SMOTE()
X_train, y_train = smt.fit_resample(X_train, y_train)

np.bincount(y_train)
Out[48]: array([28628, 28628], dtype=int64)

Refitting the model on the re-sampled training data:

lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

accuracy_score(y_test, y_pred)
Out[1]: 0.8025743124363367

confusion_matrix(y_test, y_pred)
array([[7652, 1892],
       [ 240, 1015]], dtype=int64)

recall_score(y_test, y_pred)
Out[1]: 0.8087649402390438

Machine Learning, Statistics

Puzzle-Day 3

  1. A statistics student, James, was studying discrete distributions. He downloaded a data set that follows a U(2,6) distribution. Based on this information, what is the mean of the data?
    • 0
    • 6
    • 4
    • 2
  2. GPA scores of two samples of students are given below. Sample A contains the scores of students who sleep around 8 hours a day and Sample B contains the scores of students who sleep around 6 hours a day. Sample A: 5,7,5,3,5,3,3,9. Sample B: 8,1,4,6,6,4,1,2. The school wants to study the impact of sleep on scores. What are the degrees of freedom for this hypothesis test?

    • 7
    • 8
    • 14
    • 15
  3. Assume that a person wants to find out the probability of an accident occurring given that the vehicle involved is a scooter.

    What will be the likelihood in this case?

    • The probability of a vehicle being a scooter
    • The probability of occurrence of an accident
    • The probability of a vehicle being scooter given that an accident has happened
    • None of the above
  4. In a Naive Bayes based recommendation system, if the ratings of a particular target user are predicted by calculating the conditional probability of a rating based on the ratings of other users on the same item, then which of the following is true?

    • The model is User-Based bayes model
    • The model is Item-Based bayes model
    • The model is a combination of A and B
    • Insufficient Information
  5. How can we find the goodness of fit in logistic regression?

    • z-value
    • P-value
    • Confusion matrix
    • All of the above

Post your answers in comment.

Machine Learning, Python

Naive Bayes Algorithm

Naive Bayes is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.

For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or upon the existence of the other features, all of these properties independently contribute to the probability that this fruit is an apple and that is why it is known as ‘Naive’.

A Naive Bayes model is easy to build and particularly useful for very large data sets. Along with simplicity, Naive Bayes can outperform even highly sophisticated classification methods.

Bayes theorem provides a way of calculating posterior probability P(c|x) from P(c), P(x) and P(x|c). Look at the equation below:

P(c|x) = P(x|c) · P(c) / P(x)

  • P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
  • P(c) is the prior probability of class.
  • P(x|c) is the likelihood which is the probability of predictor given class.
  • P(x) is the prior probability of predictor.
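
Assuming a normal distribution for the predictor, the posterior above can be computed directly. A toy sketch with made-up numbers:

```python
import numpy as np

# Toy 1-D data (made up): class 0 clusters near 1, class 1 near 5.
X = np.array([1.0, 1.2, 0.9, 5.0, 5.2, 4.8])
y = np.array([0, 0, 0, 1, 1, 1])

def gaussian_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def posterior(x):
    """P(c|x) = P(x|c) * P(c) / P(x) for each class c."""
    probs = []
    for c in (0, 1):
        xc = X[y == c]
        prior = np.mean(y == c)                            # P(c)
        likelihood = gaussian_pdf(x, xc.mean(), xc.std())  # P(x|c), normal assumption
        probs.append(likelihood * prior)
    probs = np.array(probs)
    return probs / probs.sum()                             # dividing by P(x)

print(posterior(1.1).argmax())   # → 0 (closer to the class-0 cluster)
print(posterior(5.1).argmax())   # → 1
```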

Pros:

  • It is easy and fast to predict the class of a test data set. It also performs well in multi-class prediction.
  • When the assumption of independence holds, a Naive Bayes classifier performs better compared to other models like logistic regression, and you need less training data.
  • It performs well with categorical input variables compared to numerical variables. For numerical variables, a normal distribution is assumed (bell curve, which is a strong assumption).

Cons:

  • If a categorical variable has a category in the test data set which was not observed in the training data set, the model will assign it zero probability and will be unable to make a prediction. This is often known as “Zero Frequency”. To solve this, we can use a smoothing technique; one of the simplest is Laplace estimation.
  • On the other hand, naive Bayes is also known to be a bad estimator, so the probability outputs from predict_proba should not be taken too seriously.
  • Another limitation of Naive Bayes is the assumption of independent predictors. In real life, it is almost impossible that we get a set of predictors which are completely independent.
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
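
A minimal end-to-end sketch with GaussianNB; the dataset and numbers below are invented for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Two well-separated toy classes (made-up data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GaussianNB()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))   # accuracy on the held-out split
```
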
Statistics

Puzzle-Day2

  1. Bran was using batch stochastic gradient descent (batch-SGD) on a dataset of 10 samples to implement an algorithm using the dropout technique, provided the algorithm runs x times with batch size = 2.
    • 10/X
    • X
    • 2*X
    • 5
  2. The mode of a negatively skewed distribution is 50. Which of the options below could be the mean of the same negatively skewed distribution?
    • 25
    • 50
    • 75
    • 100
  3. Five of my friends were arguing about who would bat first in a cricket game, so we settled it by having everyone vote for each other, with the rule that no one would discuss or reveal who they voted for. Which of the following ensemble methods works similarly to this?
    • Bagging
    • Boosting
    • Both
    • None
  4. If we increase the value of the parameter C (cost) in SVM, what will happen?
    • It will tend to over fit
    • It will tend to under fit
    • No Effect
  5. Let x be the weight of a random population, x is normally distributed with an unknown mean and standard deviation=3. Take n=25, α=0.05.

    The critical z-value, as we know, is 1.64.

    H(null): μ=150

    H(a): μ>150

    What is the sample mean?

    • 150.8
    • 151.4
    • 152.9
    • 155

Post your answers in comment.

Statistics

Distribution

Distribution of data based on Skewness:

Broadly, data distributions are classified into three types based on skewness:

  • Normal Distribution
  • Left Skew (Negatively skewed) Distribution
  • Right Skew (Positively skewed) Distribution

[Image: normal, negatively skewed, and positively skewed distributions]

  • Normal Distribution → Mean = Median = Mode
  • Negatively skewed Distribution → Mean < Median < Mode
  • Positively skewed Distribution → Mean > Median > Mode
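
A quick numeric check of the positively skewed case, using a made-up sample:

```python
import numpy as np

# A small right-skewed sample: most values are low, one long tail value.
x = np.array([1, 2, 2, 2, 3, 3, 4, 10])

mean = x.mean()                       # 3.375
median = np.median(x)                 # 2.5
vals, counts = np.unique(x, return_counts=True)
mode = vals[counts.argmax()]          # 2 (most frequent value)

print(mean, median, mode)             # mean > median > mode
```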

 

Python, Statistics

T-Test

“You can’t prove a hypothesis; you can only improve or disprove it.” – Christopher Monckton

A t-test is a type of inferential statistic used to determine if there is a significant difference between the means of two groups, which may be related in certain features. It frames the problem statement by assuming a null hypothesis that the two means are equal.

KEY TAKEAWAYS

  • A t-test is a type of inferential statistic used to determine if there is a significant difference between the means of two groups, which may be related in certain features.
  • The t-test is one of many tests used for the purpose of hypothesis testing in statistics.
  • Calculating a t-test requires three key data values. They include the difference between the mean values from each data set (called the mean difference), the standard deviation of each group, and the number of data values of each group.
  • There are several different types of t-test that can be performed depending on the data and type of analysis required.

Correlated (or Paired) T-Test

The correlated t-test is performed when the samples typically consist of matched pairs of similar units, or when there are cases of repeated measures. For example, there may be instances of the same patients being tested repeatedly—before and after receiving a particular treatment. In such cases, each patient is being used as a control sample against themselves. This method also applies to cases where the samples are related in some manner or have matching characteristics, like a comparative analysis involving children, parents or siblings. Correlated or paired t-tests are of a dependent type, as these involve cases where the two sets of samples are related.

 

Equal Variance (or Pooled) T-Test

The equal variance t-test is used when the number of samples in each group is the same, or the variance of the two data sets is similar. The t-value is t = (x̄1 − x̄2) / (sp · √(1/n1 + 1/n2)), where sp² = [(n1 − 1)s1² + (n2 − 1)s2²] / (n1 + n2 − 2) is the pooled variance, and the degrees of freedom are n1 + n2 − 2.

Unequal Variance T-Test

The unequal variance t-test is used when the number of samples in each group is different, and the variance of the two data sets is also different. This test is also called Welch’s t-test. The t-value is t = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2), and the degrees of freedom are given by the Welch–Satterthwaite equation.
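
Both quantities for the unequal variance case can be sketched in NumPy (the sample values below are made up for illustration):

```python
import numpy as np

def welch_t(a, b):
    """Welch's t statistic and Welch–Satterthwaite degrees of freedom."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Toy samples (made-up numbers for illustration).
a = [5, 7, 5, 3, 5, 3, 3, 9]
b = [8, 1, 4, 6, 6, 4, 1, 2]
t, df = welch_t(a, b)
print(round(t, 3), round(df, 2))   # 0.847 13.56
```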

 
