Python

Group the People by Group Size

There are n people whose IDs range from 0 to n - 1, and each person belongs to exactly one group. Given an array groupSizes of length n, where groupSizes[i] is the size of the group person i belongs to, return the groups and the people’s IDs each group includes.

You can return the groups in any order, and the same applies to the IDs within each group. It is guaranteed that at least one solution exists.

Example 1:

Input: groupSizes = [3,3,3,3,3,1,3]
Output: [[5],[0,1,2],[3,4,6]]
Explanation: 
Other possible solutions are [[2,1,6],[5],[0,4,3]] and [[5],[0,6,2],[4,3,1]].

Example 2:

Input: groupSizes = [2,1,3,3,3,2]
Output: [[1],[0,5],[2,3,4]]

Solution link: GitHub

Solution:

from typing import List

class Solution:
    def groupThePeople(self, groupSizes: List[int]) -> List[List[int]]:
        check = {}  # group size -> partially filled group of IDs
        res = []
        for i in range(len(groupSizes)):
            s = groupSizes[i]
            if s not in check:
                check[s] = []
            check[s].append(i)
            if len(check[s]) == s:
                # Group is full: emit it and start a new bucket for this size.
                res.append(check[s])
                check[s] = []
        return res
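As a quick sanity check, the same bucketing idea can be run end-to-end on Example 1 (a self-contained sketch; the variable names here are my own):

```python
from typing import List

class Solution:
    def groupThePeople(self, groupSizes: List[int]) -> List[List[int]]:
        buckets = {}  # size -> currently filling group
        res = []
        for i, s in enumerate(groupSizes):
            buckets.setdefault(s, []).append(i)
            if len(buckets[s]) == s:
                res.append(buckets[s])
                buckets[s] = []
        return res

groups = Solution().groupThePeople([3, 3, 3, 3, 3, 1, 3])
print(groups)  # [[0, 1, 2], [5], [3, 4, 6]]
```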

 

An alternative approach: pair each ID with its group size, sort the pairs so that people with the same size become adjacent, and emit a group whenever it fills up.

class Solution:
    def groupThePeople(self, groupSizes: List[int]) -> List[List[int]]:
        # Sort (id, size) pairs by size so equal sizes are adjacent.
        pairs = sorted(enumerate(groupSizes), key=lambda p: p[1])
        res, group = [], []
        for idx, size in pairs:
            group.append(idx)
            if len(group) == size:
                res.append(group)
                group = []
        return res

Machine Learning, Python, Structure Thinking

Graph Algorithms

We have likely already used the graph algorithms listed below in our own projects, in social media (LinkedIn, Facebook), or in real-time applications like Google Maps.

 

  1. Connected Components: in layman’s terms, a sort of hard clustering algorithm that finds clusters/islands of related/connected data.
  2. Shortest Path: Dijkstra’s algorithm, used extensively in Google Maps to find the shortest routes.
  3. Minimum Spanning Tree: suppose we work for a water-pipe or internet-fiber company and need to connect all the cities in our graph using the minimum amount of wire/pipe.
  4. PageRank: has been used to find the most influential papers via citations, and by Google to rank pages.
  5. Centrality Measures: betweenness centrality quantifies how many times a particular node appears on the shortest path between two other nodes.
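To make the “Shortest Path” item concrete, here is a minimal Dijkstra sketch over a hypothetical toy road graph (the graph and node names are illustrative, not from any real map):

```python
import heapq

def dijkstra(graph, source):
    # graph: adjacency list mapping node -> list of (neighbor, weight)
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry, already found a shorter path
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

roads = {"A": [("B", 4), ("C", 1)], "C": [("B", 2)], "B": [("D", 5)]}
print(dijkstra(roads, "A"))  # A->C->B beats A->B directly: B costs 3, not 4
```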

Link: https://mlwhiz.com/blog/2019/09/02/graph_algs/

Machine Learning, Python, Regularization

L1 and L2 Regularization Methods

Regularization matters in supervised learning models. In this post, let’s go over some widely used regularization techniques and the key differences between them.

To create a less complex model when you have a large number of features in your data set, two regularization techniques commonly used to address over-fitting and feature selection are:

1. L1 Regularization ( Lasso Regression)

2. L2 Regularization ( Ridge Regression) 

A regression model that uses the L1 regularization technique is called Lasso Regression, and a model which uses L2 is called Ridge Regression.

The key difference between these two is the penalty term.

Ridge regression adds the “squared magnitude” of the coefficients as a penalty term to the loss function:

Cost function: Σᵢ (yᵢ − Σⱼ xᵢⱼ βⱼ)² + λ Σⱼ βⱼ²

The λ Σⱼ βⱼ² term is the L2 regularization element. If lambda is zero, we get back ordinary least squares (OLS); if lambda is very large, the penalty dominates and the model under-fits. That said, lambda must be chosen carefully, and this technique works very well to avoid over-fitting.

Lasso Regression (Least Absolute Shrinkage and Selection Operator) adds the “absolute value of magnitude” of the coefficients as a penalty term to the loss function:

Cost function: Σᵢ (yᵢ − Σⱼ xᵢⱼ βⱼ)² + λ Σⱼ |βⱼ|

Again, lambda = 0 recovers OLS, whereas a very large value drives the coefficients to zero and under-fits.

The key difference between these techniques is that Lasso shrinks the less important features’ coefficients all the way to zero, removing some features altogether. This works well for feature selection when we have a huge number of features.
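This difference is easy to see numerically. Below is a small sketch on synthetic data where only the first two of five features matter: ridge is solved in closed form, lasso via simple coordinate descent with soft-thresholding. Both implementations are illustrative, not library code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 5 features, but only the first two actually drive y.
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

def ridge(X, y, lam):
    # Closed-form L2 solution: (X^T X + lam * I)^-1 X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def lasso(X, y, lam, n_iter=200):
    # Coordinate descent with soft-thresholding for the L1 penalty.
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        for j in range(len(w)):
            r = y - X @ w + X[:, j] * w[j]  # residual excluding feature j
            rho = X[:, j] @ r
            z = X[:, j] @ X[:, j]
            w[j] = np.sign(rho) * max(abs(rho) - lam, 0) / z
    return w

w_ridge = ridge(X, y, lam=10.0)
w_lasso = lasso(X, y, lam=50.0)
print("ridge:", np.round(w_ridge, 3))  # all coefficients shrunk, none exactly zero
print("lasso:", np.round(w_lasso, 3))  # irrelevant coefficients driven to exactly 0
```

Ridge merely shrinks every coefficient toward zero, while lasso zeroes out the three irrelevant ones, which is exactly the feature-selection behavior described above.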

Pandas, Python

Split the JSON – Python

Some columns are captured or transferred into table format as JSON tags.

Decoding that JSON by hand takes huge effort! The code below helps split it into new features with Python.

import json
import pandas as pd

# Parse each JSON string, then normalize the result into flat columns.
# (pd.io.json.json_normalize is deprecated; pd.json_normalize replaces it.)
extracted_event_data = pd.json_normalize(train.event_data.apply(json.loads))

 

def flatten_json(y):
    out = {}

    def flatten(x, name=''):
        if type(x) is dict:
            for a in x:
                flatten(x[a], name + a + '_')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            out[name[:-1]] = x

    flatten(y)
    return out

from pandas import json_normalize
flat = flatten_json(che)  # `che` is the nested JSON object to flatten
pd.set_option('display.max_colwidth', None)
json_normalize(flat)

For a sample of 100K rows, this code runs in ~12 sec in a Kaggle Kernel (resulting in a DataFrame with 136 columns). That means processing all of train_df will require ~20 min.
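Here is a tiny self-contained run of the flattening idea on a made-up nested dict (the sample data is hypothetical):

```python
def flatten_json(y):
    # Recursively flatten dicts and lists into a single-level dict
    # whose keys record the path, e.g. {"user": {"id": 7}} -> {"user_id": 7}.
    out = {}

    def flatten(x, name=''):
        if isinstance(x, dict):
            for a in x:
                flatten(x[a], name + a + '_')
        elif isinstance(x, list):
            for i, a in enumerate(x):
                flatten(a, name + str(i) + '_')
        else:
            out[name[:-1]] = x

    flatten(y)
    return out

sample = {"user": {"id": 7, "tags": ["a", "b"]}, "score": 0.5}
flat = flatten_json(sample)
print(flat)  # {'user_id': 7, 'user_tags_0': 'a', 'user_tags_1': 'b', 'score': 0.5}
```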

 

Feature Extraction, Pandas, Python

New Feature Addition

The code below helps you derive new features from existing numerical and categorical features; the extra features may help increase the score of both classification and regression models.

Give it a try in your next model build.

cat_agg = ['count', 'nunique']
num_agg = ['min', 'mean', 'max', 'sum']
agg_col = {
    'device_type': cat_agg, 'session_id': cat_agg, 'item_id': cat_agg, 'item_price': num_agg,
    'category_3': ['count', 'nunique', 'mean'], 'product_type': ['count', 'nunique', 'mean']
}

# Extend the aggregation map for dynamically named columns.
for k in view_item.columns:
    if k.startswith('category_1') or k.startswith('category_2'):
        agg_col[k] = ['sum', 'mean']
    elif k.startswith('server'):
        agg_col[k] = cat_agg
    elif k.startswith('cumcount'):
        agg_col[k] = num_agg
agg_col

view_item1 = view_item.groupby('user_id').agg(agg_col)

# Flatten the (column, stat) MultiIndex into prefixed single-level names.
view_item1.columns = ['J_' + '_'.join(col).strip() for col in view_item1.columns.values]
view_item1.reset_index(inplace=True)
view_item1.head()

We are simply aggregating the numerical and categorical features by a unique key column and renaming the resulting columns with new feature names.
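The same groupby/agg plus column-flattening pattern, shown on a small hypothetical frame (`user_id`, `item_id`, and `item_price` are stand-ins for your own columns):

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "item_id": ["a", "b", "a", "a", "c"],
    "item_price": [10.0, 20.0, 5.0, 7.0, 3.0],
})
agg_col = {"item_id": ["count", "nunique"], "item_price": ["min", "mean", "max", "sum"]}

out = df.groupby("user_id").agg(agg_col)
# Flatten the (column, stat) MultiIndex into single prefixed names.
out.columns = ["J_" + "_".join(col).strip() for col in out.columns.values]
out.reset_index(inplace=True)
print(out)
```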

 

 

Machine Learning, Pandas, Python

Feature Engineering

  1. Converting the date-time features.
comp['start_date'] = pd.to_datetime(comp['start_date'], format='%d/%m/%y', dayfirst=True)
comp['end_date'] = pd.to_datetime(comp['end_date'], format='%d/%m/%y', dayfirst=True)
comp['diff_d'] = (comp['end_date'] - comp['start_date']) / np.timedelta64(1, 'D')
comp['diff_m'] = (comp['end_date'] - comp['start_date']) / np.timedelta64(1, 'M')
comp['diff_w'] = (comp['end_date'] - comp['start_date']) / np.timedelta64(1, 'W')
tran['date'] = pd.to_datetime(tran['date'], format='%Y-%m-%d')
tran['date_d'] = tran['date'].dt.day.astype('category')
tran['date_m'] = tran['date'].dt.month.astype('category')
# Series.dt.week is removed in recent pandas; isocalendar().week replaces it.
tran['date_w'] = tran['date'].dt.isocalendar().week.astype('category')

2. New features derived from existing features.

tran['discount_bin'] = tran['coupon_discount'].apply(lambda x: 0 if x >= 0 else 1)
tran['marked_price'] = tran['selling_price'] - tran['other_discount'] - tran['coupon_discount']
tran['disc_percent'] = (tran['marked_price'] - tran['selling_price']) / tran['selling_price']
tran['price_per_quan'] = tran['marked_price'] / tran['quantity']
tran['marked_by_sale'] = tran['marked_price'] / tran['selling_price']

 

3. Merging the two data frames with aggregation.

daily_agg = (tran.groupby(['customer_id', 'date'])
                 .agg({'coupon_id': 'count', 'item_id': 'count', 'disc_percent': 'sum'})
                 .reset_index()
                 .rename(columns={'coupon_id': 'coupon_acquired',
                                  'item_id': 'item_bought',
                                  'disc_percent': 'tot_disc'}))
tran = tran.merge(daily_agg, on=['customer_id', 'date'], how='left')
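The merge-back-aggregates pattern from step 3, shown on a minimal hypothetical frame:

```python
import pandas as pd

tran = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "date": ["2020-01-01", "2020-01-01", "2020-01-02"],
    "coupon_id": [10, 11, 12],
})
# Count coupons per (customer, day), then attach that count to every row.
agg = (tran.groupby(["customer_id", "date"])
           .agg({"coupon_id": "count"})
           .reset_index()
           .rename(columns={"coupon_id": "coupon_acquired"}))
tran = tran.merge(agg, on=["customer_id", "date"], how="left")
print(tran)  # each row now carries its group's coupon count
```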

 

 

 

 

 

Machine Learning, Puzzle, Python, Statistics

Puzzle of the Day Wednesday

1)

A decision tree model was trained on training data and gives 100% accuracy on the training set, but only 30% on the test set. What can be done to rescue the model from this situation?

1 : Pre-pruning

2 : Post-pruning

3 : Decrease the tree depth by cutting from the lowest node

  • 1 and 2
  • 2 and 3
  • All
  • None.

2)

If a variable A is caused by a variable B, what can you say about the correlation between them?

  • A and B will be highly correlated
  • A and B will be negatively correlated
  • A and B will be mildly correlated
  • A and B will not be correlated
  • A and B may or may not be correlated

3)

Which of the following is true for dropout rate?

  • All layers carry same dropout value.
  • Dropout rate of any layer depends on previous layer
  • Different layers can have different dropout rate
  • None of the above

4)

If I have to assess a student based on marks and the type of subject, and check the dependency of the marks scored with respect to the subject, which test should I use?

Choose the correct option

  • ANOVA
  • Chi-square test
  • T-test
  • Paired T-test

5)

If a positively skewed distribution has a median of 50, which of the following statement is true?

A) Mean is greater than 50
B) Mean is less than 50
C) Mode is less than 50
D) Mode is greater than 50
E) Both A and C
F) Both B and D

POST YOUR ANSWER IN COMMENTS

Machine Learning, Python, Statistics

Puzzle of the Day

1) Given the following code:

from sklearn.metrics import roc_curve

y_true = [0, 1, 0, 0, 1, 1]

y_pred = [1, 1, 0, 0, 1, 0]

roc_curve(y_true, y_pred, pos_label=0)

Which of the following is true regarding the result returned by the above code?

  • Decreasing False Positive rate
  • Increasing True Positive rate
  • Decreasing True Positive rate
  • Increasing False positive rate

2)

Regularization is used to reduce the complexity of the model. Algorithms like lasso and ridge regression are used to perform regularization.

Adam and John are data scientist trainees at xyz company. Adam told John that one of the above algorithms also performs feature selection.

John said no, there is no such algorithm. Choose the correct statement.

  • John is right, these algorithms are used for regularization only
  • Adam is right, ridge regression performs feature selection internally
  • Adam is right, lasso regression performs feature selection as well
  • None of the above

3)

Which of the following parameters of SVM tells us about the tradeoff between misclassification and simplicity of the model?

  • Cost parameter
  • Gamma
  • Coefficients of kernel functions
  • None of the above

4)

Which of the following is a correct way to sharpen an image?

  • Convolve the image with the identity matrix, subtract this resulting image from the original, then add the subtracted result back to the original image
  • Smooth the image, subtract the smoothed image from the original, then add the subtracted result back to the original image
  • Smooth the image, then add the smoothed image back to the original image
  • None of the above

5)

Which of the following is true for an SVM classifier with a very high ‘C’ (cost) value?

  • It will have low variance
  • It will have high variance
  • It will have high bias
  • None of the above

POST YOUR ANSWERS IN COMMENT