Python

Group the people by group size

There are n people whose IDs go from 0 to n - 1, and each person belongs to exactly one group. Given the array groupSizes of length n, where groupSizes[i] is the size of the group that person i belongs to, return the list of groups, each given as the IDs of the people it contains.

You can return any valid solution: the groups, and the IDs within each group, may be in any order. It is guaranteed that at least one solution exists.

Example 1:

Input: groupSizes = [3,3,3,3,3,1,3]
Output: [[5],[0,1,2],[3,4,6]]
Explanation: Other possible solutions are [[2,1,6],[5],[0,4,3]] and [[5],[0,6,2],[4,3,1]].

Example 2:

Input: groupSizes = [2,1,3,3,3,2]
Output: [[1],[0,5],[2,3,4]]


Solution:

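from typing import List

class Solution:
    def groupThePeople(self, groupSizes: List[int]) -> List[List[int]]:
        # check maps a group size to the IDs collected so far for that size
        check = {}

        res = []
        for i in range(len(groupSizes)):
            s = groupSizes[i]
            if s not in check:
                check[s] = []
            check[s].append(i)
            # once s people have been collected, the group is complete
            if len(check[s]) == s:
                res.append(check[s])
                check[s] = []
        return res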
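
As a quick check of the dictionary-based approach above, here is a minimal standalone sketch. The names `group_the_people` and `is_valid_grouping` are illustrative and not part of the original solution; the checker only verifies that every person appears exactly once and sits in a group of the size they expect, which is all the problem asks for since any valid grouping is accepted.

```python
from typing import Dict, List


def group_the_people(group_sizes: List[int]) -> List[List[int]]:
    """Bucket people by required group size; emit a bucket once it is full."""
    buckets: Dict[int, List[int]] = {}
    result: List[List[int]] = []
    for person, size in enumerate(group_sizes):
        bucket = buckets.setdefault(size, [])
        bucket.append(person)
        if len(bucket) == size:
            result.append(bucket)
            buckets[size] = []  # start a fresh bucket for this size
    return result


def is_valid_grouping(group_sizes: List[int], groups: List[List[int]]) -> bool:
    """Every person appears exactly once, in a group of the size they expect."""
    seen = sorted(person for group in groups for person in group)
    if seen != list(range(len(group_sizes))):
        return False
    return all(len(group) == group_sizes[person]
               for group in groups for person in group)


sizes = [3, 3, 3, 3, 3, 1, 3]
groups = group_the_people(sizes)
print(groups)
print(is_valid_grouping(sizes, groups))
```

Because groups are emitted in the order they fill up, the result for Example 1 differs from the listed output in ordering only.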

 

class Solution:
    def groupThePeople(self, groupSizes: List[int]) -> List[List[int]]:
        # Alternative: pair each ID with its group size and sort by size,
        # so that people with the same group size become adjacent.
        listind = sorted(enumerate(groupSizes), key=lambda pair: pair[1])

        res = []
        i = 0
        while i < len(listind):
            size = listind[i][1]
            # the next `size` entries all share this group size
            res.append([person for person, _ in listind[i:i + size]])
            i += size
        return res

Machine Learning, Python, Structure Thinking

Graph Algorithm

You may already have come across the graph algorithms listed below, whether in your own projects, on social media platforms (LinkedIn, Facebook), or in real-time applications such as Google Maps.

 

  1. Connected Components: In layman’s terms, a kind of hard clustering algorithm that finds clusters/islands in related/connected data.
  2. Shortest Path: Dijkstra’s algorithm finds the shortest route between nodes and is used extensively in Google Maps.
  3. Minimum Spanning Tree: Suppose we work for a water pipe laying company or an internet fiber company. We need to connect all the cities in our graph using the minimum amount of wire/pipe.
  4. PageRank: It has been used to find the most influential papers using citations, and by Google to rank pages.
  5. Centrality Measures: Betweenness centrality quantifies how many times a particular node lies on the shortest path between two other nodes.
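
To make the first two items concrete, here is a minimal sketch using only the standard library. The toy graphs (`friends`, `roads`) and the function names are hypothetical and chosen for illustration; this is not how any particular product implements these algorithms.

```python
import heapq
from collections import deque
from typing import Dict, List, Set, Tuple


def connected_components(graph: Dict[str, List[str]]) -> List[Set[str]]:
    """Find the 'islands' of an undirected graph with breadth-first search."""
    seen: Set[str] = set()
    components: List[Set[str]] = []
    for start in graph:
        if start in seen:
            continue
        component = {start}
        queue = deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components


def dijkstra(graph: Dict[str, List[Tuple[str, float]]], source: str) -> Dict[str, float]:
    """Shortest distance from source to each reachable node (non-negative weights)."""
    dist: Dict[str, float] = {source: 0.0}
    heap: List[Tuple[float, str]] = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry, a shorter path was already found
        for neighbour, weight in graph.get(node, []):
            new_d = d + weight
            if new_d < dist.get(neighbour, float("inf")):
                dist[neighbour] = new_d
                heapq.heappush(heap, (new_d, neighbour))
    return dist


# Hypothetical toy data: two islands of friends, and a small road network.
friends = {"A": ["B"], "B": ["A", "C"], "C": ["B"], "D": ["E"], "E": ["D"]}
roads = {"home": [("mall", 4), ("park", 1)], "park": [("mall", 2)], "mall": []}

print(connected_components(friends))  # two components: A-B-C and D-E
print(dijkstra(roads, "home"))        # going via the park beats the direct road
```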

Link: https://mlwhiz.com/blog/2019/09/02/graph_algs/?utm_campaign=data-scientists-the-5-graph-algorithms-that-you-should-know&utm_medium=social_link&utm_source=missinglettr

Machine Learning, Statistics

Central Limit Theorem

The Central Limit Theorem states that the sampling distribution of the sample mean approaches a normal distribution as the sample size gets larger, no matter what the shape of the population distribution.

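$\mu_{\bar{X}} = \mu_x$ and $\sigma_{\bar{X}} = \sigma_x / \sqrt{N}$, where $\mu_x$ and $\sigma_x$ are the population mean and standard deviation and $N$ is the sample size.

The mean of the sampling distribution of the sample mean is equal to the population mean.

Consider rolling a die 5 times, with outcomes {2, 4, 2, 3, 6}.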

Mean of population = 3.4

Sampling Distribution of sample mean

Random sample size: 4

Randomly select values from the population (with replacement).

X1 = { 2,2,3,2} => Mean =2.25

X2 = {2,3,4,4} => Mean = 3.25

X3 = {3,4,6,2} => Mean = 3.75

Increasing the sample size reduces the standard error of the sample mean.
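
The effect of sample size on the standard error can be checked with a small simulation. The sketch below is an illustration under stated assumptions: it rolls a fair six-sided die (rather than reusing the five outcomes above), the helper names `sample_mean` and `standard_error` are made up for this example, and the seed is fixed only so that the run is repeatable.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable


def sample_mean(n: int) -> float:
    """Mean of n rolls of a fair six-sided die."""
    return statistics.mean(random.randint(1, 6) for _ in range(n))


def standard_error(n: int, trials: int = 5000) -> float:
    """Empirical standard deviation of the sample mean across many samples."""
    return statistics.pstdev(sample_mean(n) for _ in range(trials))


population_sd = statistics.pstdev(range(1, 7))  # sigma of a fair die

for n in (4, 16, 64):
    observed = standard_error(n)
    predicted = population_sd / n ** 0.5
    print(f"N={n:>2}  observed SE={observed:.3f}  sigma/sqrt(N)={predicted:.3f}")
```

The observed spread of the sample means should track sigma/sqrt(N) closely, so quadrupling the sample size roughly halves the standard error.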

Machine Learning, Python, Regularization

L1 and L2 Regularization Methods

Regularization is widely used in supervised learning models. In this post, let’s go over some of the most common regularization techniques and the key difference between them.

When a data set has a large number of features, regularization helps produce a less complex model. Two techniques used to address over-fitting and feature selection are:

1. L1 Regularization ( Lasso Regression)

2. L2 Regularization ( Ridge Regression) 

A regression model that uses L1 regularization technique is called Lasso Regression and model which uses L2 is called Ridge Regression.

The key difference between these two is the penalty term.

Ridge regression adds the “squared magnitude” of the coefficients as a penalty term to the loss function. In the cost function below, the last term is the L2 regularization element.

Cost function: $\sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{p} x_{ij} \beta_j \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$

Here, if lambda is zero, we get back OLS. However, if lambda is very large, the penalty carries too much weight and leads to under-fitting. It is therefore important how lambda is chosen. With a well-chosen lambda, this technique works very well to avoid over-fitting.

Lasso Regression (Least Absolute Shrinkage and Selection Operator) adds the “absolute value of magnitude” of the coefficients as a penalty term to the loss function.

Cost function: $\sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{p} x_{ij} \beta_j \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$

Again, if lambda is zero we get back OLS, whereas a very large value drives the coefficients to zero, so the model under-fits.

The key difference between these techniques is that Lasso shrinks the coefficients of the less important features to exactly zero, removing those features altogether. So it works well for feature selection when we have a huge number of features.
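
The difference between the two penalties can be seen in a deliberately tiny case. The sketch below assumes a single feature with no intercept, scaled so that the sum of x_i^2 equals 1; under that assumption the minimisers have simple closed forms. All names and the toy data are hypothetical, and real models would normally be fitted with a library rather than by hand.

```python
from typing import List


def rss(x: List[float], y: List[float], beta: float) -> float:
    """Residual sum of squares for a one-feature model y ~ beta * x."""
    return sum((yi - beta * xi) ** 2 for xi, yi in zip(x, y))


def ridge_cost(x: List[float], y: List[float], beta: float, lam: float) -> float:
    return rss(x, y, beta) + lam * beta ** 2   # squared-magnitude penalty


def lasso_cost(x: List[float], y: List[float], beta: float, lam: float) -> float:
    return rss(x, y, beta) + lam * abs(beta)   # absolute-value penalty


def ridge_beta(beta_ols: float, lam: float) -> float:
    """Minimiser of ridge_cost when sum(x_i^2) == 1: shrinks, but not to zero."""
    return beta_ols / (1.0 + lam)


def lasso_beta(beta_ols: float, lam: float) -> float:
    """Minimiser of lasso_cost when sum(x_i^2) == 1: soft-thresholding at lam/2."""
    shrunk = max(abs(beta_ols) - lam / 2.0, 0.0)
    return shrunk if beta_ols >= 0 else -shrunk


# Toy data (assumption): chosen so that sum(x_i^2) == 1.
x = [0.6, 0.8]
y = [0.3, 0.4]
beta_ols = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

for lam in (0.0, 0.5, 2.0):
    print(f"lambda={lam}: ridge={ridge_beta(beta_ols, lam):.3f}  "
          f"lasso={lasso_beta(beta_ols, lam):.3f}")
```

With lambda at zero both estimates equal the OLS coefficient; as lambda grows, the ridge estimate keeps shrinking towards zero while the lasso estimate reaches exactly zero.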

Pandas, Python

Split the JSON – Python

Sometimes a column in a table holds its values as JSON strings.

Decoding that JSON by hand takes a lot of effort. The code below helps split it into new features with Python.

import json
import pandas as pd

# Parse each JSON string, then expand the keys into columns (pandas >= 1.0)
extracted_event_data = pd.json_normalize(train.event_data.apply(json.loads).tolist())

def flatten_json(y):
    out = {}

    def flatten(x, name=''):
        if isinstance(x, dict):
            for a in x:
                flatten(x[a], name + a + '_')
        elif isinstance(x, list):
            for i, a in enumerate(x):
                flatten(a, name + str(i) + '_')
        else:
            # leaf value: drop the trailing '_' from the key
            out[name[:-1]] = x

    flatten(y)
    return out

# che is an already-parsed JSON object (nested dicts/lists)
flat = flatten_json(che)
pd.set_option('display.max_colwidth', None)
pd.json_normalize(flat)

For a sample of 100K rows, this code runs in ~12 sec in a Kaggle Kernel (resulting in a DataFrame with 136 columns). That means processing all of train_df will take ~20 min.
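
To see what the flattening step produces, here is a small standalone sketch that needs only the standard library. The event record in `raw` is made up for illustration; the function mirrors the recursive idea of `flatten_json` above, joining nested keys and list positions with underscores.

```python
import json


def flatten_json(nested):
    """Flatten nested dicts/lists into a single dict with '_'-joined keys."""
    out = {}

    def flatten(value, prefix=""):
        if isinstance(value, dict):
            for key, item in value.items():
                flatten(item, prefix + key + "_")
        elif isinstance(value, list):
            for index, item in enumerate(value):
                flatten(item, prefix + str(index) + "_")
        else:
            out[prefix[:-1]] = value  # drop the trailing '_'

    flatten(nested)
    return out


# Hypothetical event record, stored as a JSON string in one column.
raw = '{"event_code": 2000, "coordinates": {"x": 10, "y": 20}, "stumps": [1, 2]}'
flat = flatten_json(json.loads(raw))
print(flat)
# {'event_code': 2000, 'coordinates_x': 10, 'coordinates_y': 20, 'stumps_0': 1, 'stumps_1': 2}
```

Each key of the flattened dict can then become one column of the resulting table.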

 

Feature Extraction, Pandas, Python

New Feature Addition

The code below derives new features by aggregating the existing numerical and categorical features. These aggregates may give the model useful new signals and help improve the score for both classification and regression models.

Give it a try when building your next model.

cat_agg=['count','nunique']
num_agg=['min','mean','max','sum']
agg_col={
    'device_type':cat_agg, 'session_id':cat_agg, 'item_id':cat_agg, 'item_price':num_agg,
    'category_3':['count','nunique','mean'], 'product_type':['count','nunique','mean']
}

for k in view_item.columns:
    if k.startswith('category_1') or k.startswith('category_2'):
        agg_col[k]=['sum','mean']
    elif k.startswith('server'):
        agg_col[k]=cat_agg
    elif k.startswith('cumcount'):
        agg_col[k]=num_agg
agg_col

view_item1=view_item.groupby('user_id').agg(agg_col)

view_item1.columns=['J_' + '_'.join(col).strip() for col in view_item1.columns.values]
view_item1.reset_index(inplace=True)
view_item1.head()

In short, the code aggregates the numerical and categorical features per unique user_id and renames the resulting columns with new feature names.
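
The renaming step is easy to miss, so here is a tiny sketch of what it does. Aggregating with a dict of lists yields two-level column labels of the form (column, aggregation); the pairs in `agg_columns` below are hypothetical stand-ins for those labels, so the example runs without pandas.

```python
# Hypothetical (column, aggregation) label pairs, standing in for the
# two-level column labels produced by groupby(...).agg({...}).
agg_columns = [
    ("item_price", "min"),
    ("item_price", "mean"),
    ("device_type", "nunique"),
]

# Same expression as in the code above: join the two levels and add a prefix.
flat_names = ["J_" + "_".join(col).strip() for col in agg_columns]
print(flat_names)
# ['J_item_price_min', 'J_item_price_mean', 'J_device_type_nunique']
```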

 

 

Machine Learning, Pandas, Python

Feature Engineering

  1. Converting the date time feature.
import numpy as np
import pandas as pd

comp['start_date']=pd.to_datetime(comp['start_date'],format='%d/%m/%y',dayfirst=True)
comp['end_date']=pd.to_datetime(comp['end_date'],format='%d/%m/%y',dayfirst=True)
# duration of each row in days, months and weeks
comp['diff_d']=(comp['end_date']-comp['start_date'])/np.timedelta64(1,'D')
# months are not a fixed-length unit, so use the average month length in days (365.2425/12)
comp['diff_m']=comp['diff_d']/30.436875
comp['diff_w']=(comp['end_date']-comp['start_date'])/np.timedelta64(1,'W')
tran['date']=pd.to_datetime(tran['date'],format='%Y-%m-%d')
tran['date_d']=tran['date'].dt.day.astype('category')
tran['date_m']=tran['date'].dt.month.astype('category')
# .dt.week was removed in pandas 2.0; isocalendar().week gives the ISO week number
tran['date_w']=tran['date'].dt.isocalendar().week.astype('category')
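
For a single pair of dates, the same duration and calendar features can be worked out with the standard library alone. This is a sketch with made-up dates; the month length of 30.436875 days is the average Gregorian month (365.2425 / 12), used here as an assumption because months do not have a fixed length.

```python
from datetime import datetime

# Hypothetical dates in the same day/month/two-digit-year format.
start = datetime.strptime("21/10/12", "%d/%m/%y")
end = datetime.strptime("20/12/12", "%d/%m/%y")

diff_d = (end - start).days      # duration in days
diff_w = diff_d / 7              # duration in weeks
diff_m = diff_d / 30.436875      # duration in average-length months

print(diff_d, round(diff_w, 2), round(diff_m, 2))
print(end.day, end.month, end.isocalendar()[1])  # day, month, ISO week number
```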

2. Deriving new features from existing features.

tran['discount_bin']=tran['coupon_discount'].apply(lambda x: 0 if x>=0 else 1)
tran['marked_price']=tran['selling_price']-tran['other_discount']-tran['coupon_discount']
tran['disc_percent']=(tran['marked_price']-tran['selling_price'])/tran['selling_price']
tran['price_per_quan']=tran['marked_price']/tran['quantity']
tran['marked_by_sale']=tran['marked_price']/tran['selling_price']
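
A worked example for one row helps check the arithmetic. The numbers below are made up, and the sketch assumes discounts are stored as non-positive amounts, which is what the `0 if x>=0 else 1` flag above suggests; if your data stores discounts as positive numbers, the signs need to be flipped.

```python
# Hypothetical transaction row; discounts are assumed to be stored as
# non-positive amounts (e.g. -10.0 means 10 off).
selling_price = 100.0
other_discount = -20.0
coupon_discount = -10.0
quantity = 2

discount_bin = 0 if coupon_discount >= 0 else 1           # 1 when a coupon was used
marked_price = selling_price - other_discount - coupon_discount
disc_percent = (marked_price - selling_price) / selling_price
price_per_quan = marked_price / quantity
marked_by_sale = marked_price / selling_price

print(discount_bin, marked_price, disc_percent, price_per_quan, marked_by_sale)
```

Here the marked price comes out at 130, that is, the selling price plus the 30 of total discount that was applied.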

 

3. Merging the data frame with its own per-customer, per-date aggregates.

agg=(tran.groupby(['customer_id','date'])
         .agg({'coupon_id':'count','item_id':'count','disc_percent':'sum'})
         .reset_index()
         .rename(columns={'coupon_id':'coupon_aquired','item_id':'item_bought','disc_percent':'tot_disc'}))
tran=tran.merge(agg,on=['customer_id','date'],how='left')