Pandas, Python

Split the JSON – Python

Some columns do not transfer cleanly into tabular format because they are stored as JSON strings.

Decoding that JSON can take real effort; the code below helps split it into new features with Python.

import json
import pandas as pd

# Parse each JSON string, then expand the parsed objects into separate columns
# (pd.io.json.json_normalize was removed in pandas 2.0; use pd.json_normalize)
extracted_event_data = pd.json_normalize(train.event_data.apply(json.loads))

 

def flatten_json(y):
    """Flatten a nested JSON object into a single-level dict,
    joining nested keys (and list positions) with underscores."""
    out = {}

    def flatten(x, name=''):
        if type(x) is dict:
            for a in x:
                flatten(x[a], name + a + '_')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            out[name[:-1]] = x

    flatten(y)
    return out

# pandas.io.json.json_normalize is deprecated; pd.json_normalize is the current API
flat = flatten_json(che)  # che: a nested JSON object (dict) loaded earlier
pd.set_option('display.max_colwidth', None)  # -1 is deprecated; None means no truncation
pd.json_normalize(flat)
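As a quick sanity check, here is a minimal sketch (the nested record is made up) of what flatten_json produces:

sample = {'user': {'id': 7, 'tags': ['new', 'mobile']}, 'score': 3}
flatten_json(sample)
# {'user_id': 7, 'user_tags_0': 'new', 'user_tags_1': 'mobile', 'score': 3}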

For a sample of 100K rows, this code runs in ~12 sec in a Kaggle Kernel (resulting in a DataFrame with 136 columns). At that rate, processing the full train_df will take ~20 min.

 

Feature Extraction, Pandas, Python

New Feature Addition

The code below derives new features from the existing numerical and categorical features; these may give the model extra signal and help improve the score for both classification and regression models.

Give this code a try when building your next model.

cat_agg = ['count', 'nunique']
num_agg = ['min', 'mean', 'max', 'sum']

# Map each column to the aggregations it should receive
agg_col = {
    'device_type': cat_agg, 'session_id': cat_agg, 'item_id': cat_agg, 'item_price': num_agg,
    'category_3': ['count', 'nunique', 'mean'], 'product_type': ['count', 'nunique', 'mean'],
}

# Extend the map based on column-name prefixes
for k in view_item.columns:
    if k.startswith('category_1') or k.startswith('category_2'):
        agg_col[k] = ['sum', 'mean']
    elif k.startswith('server'):
        agg_col[k] = cat_agg
    elif k.startswith('cumcount'):
        agg_col[k] = num_agg
agg_col

view_item1 = view_item.groupby('user_id').agg(agg_col)

# Flatten the resulting MultiIndex columns into single names with a 'J_' prefix
view_item1.columns = ['J_' + '_'.join(col).strip() for col in view_item1.columns.values]
view_item1.reset_index(inplace=True)
view_item1.head()

This simply aggregates the numerical and categorical features per unique key column and renames the results with new feature names.
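As a minimal, self-contained sketch of the same pattern (the tiny frame and its values here are made up), this is how the groupby/agg/rename pipeline behaves:

import pandas as pd

# Hypothetical toy data standing in for view_item
toy = pd.DataFrame({
    'user_id': [1, 1, 2],
    'item_id': ['a', 'b', 'a'],
    'item_price': [10.0, 20.0, 5.0],
})

agg = toy.groupby('user_id').agg({'item_id': ['count', 'nunique'],
                                  'item_price': ['min', 'mean', 'max', 'sum']})
agg.columns = ['J_' + '_'.join(col) for col in agg.columns.values]
agg.reset_index(inplace=True)
print(agg)  # columns: user_id, J_item_id_count, J_item_id_nunique, J_item_price_min, ...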

 

 

Machine Learning, Pandas, Python

Feature Engineering

1. Converting the date-time features.

import numpy as np
import pandas as pd

comp['start_date'] = pd.to_datetime(comp['start_date'], format='%d/%m/%y', dayfirst=True)
comp['end_date'] = pd.to_datetime(comp['end_date'], format='%d/%m/%y', dayfirst=True)

# Duration between the two dates in days, months, and weeks
comp['diff_d'] = (comp['end_date'] - comp['start_date']) / np.timedelta64(1, 'D')
comp['diff_m'] = (comp['end_date'] - comp['start_date']) / np.timedelta64(1, 'M')
comp['diff_w'] = (comp['end_date'] - comp['start_date']) / np.timedelta64(1, 'W')

tran['date'] = pd.to_datetime(tran['date'], format='%Y-%m-%d')
tran['date_d'] = tran['date'].dt.day.astype('category')
tran['date_m'] = tran['date'].dt.month.astype('category')
# Series.dt.week was removed in pandas 2.0; isocalendar().week is the replacement
tran['date_w'] = tran['date'].dt.isocalendar().week.astype('category')

2. Creating new features from existing ones.

# Flag rows where a coupon discount was applied (discounts are stored as negative values)
tran['discount_bin'] = tran['coupon_discount'].apply(lambda x: 0 if x >= 0 else 1)
# Reconstruct the marked (pre-discount) price, then derive ratios from it
tran['marked_price'] = tran['selling_price'] - tran['other_discount'] - tran['coupon_discount']
tran['disc_percent'] = (tran['marked_price'] - tran['selling_price']) / tran['selling_price']
tran['price_per_quan'] = tran['marked_price'] / tran['quantity']
tran['marked_by_sale'] = tran['marked_price'] / tran['selling_price']

 

3. Merging the aggregated features back onto the data frame.

daily = (tran.groupby(['customer_id', 'date'])
             .agg({'coupon_id': 'count', 'item_id': 'count', 'disc_percent': 'sum'})
             .reset_index()
             .rename(columns={'coupon_id': 'coupon_aquired',
                              'item_id': 'item_bought',
                              'disc_percent': 'tot_disc'}))
tran = tran.merge(daily, on=['customer_id', 'date'], how='left')
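The same per-group counts can also be attached without a merge via groupby().transform, which returns a column already aligned to the original rows; a one-line sketch for the item_bought feature:

tran['item_bought'] = tran.groupby(['customer_id', 'date'])['item_id'].transform('count')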

 

 

 

 

 

Machine Learning, Pandas, Python

Encoding Categorical Variables

A machine learning model unfortunately cannot deal with categorical variables (except for some models such as LightGBM). Therefore, we have to find a way to encode (represent) these variables as numbers before handing them off to the model. There are two main ways to carry out this process:

  • Label encoding: assign each unique category in a categorical variable an integer. No new columns are created. An example is shown below.

[image: label encoding example]

  • One-hot encoding: create a new column for each unique category in a categorical variable. Each observation receives a 1 in the column for its corresponding category and a 0 in all other new columns.

[image: one-hot encoding example]

The problem with label encoding is that it gives the categories an arbitrary ordering. The value assigned to each of the categories is random and does not reflect any inherent aspect of the category. In the example above, programmer receives a 4 and data scientist a 1, but if we did the same process again, the labels could be reversed or completely different. The actual assignment of the integers is arbitrary. Therefore, when we perform label encoding, the model might use the relative value of the feature (for example programmer = 4 and data scientist = 1) to assign weights, which is not what we want. If we only have two unique values for a categorical variable (such as Male/Female), then label encoding is fine, but for more than 2 unique categories, one-hot encoding is the safe option.

There is some debate about the relative merits of these approaches, and some models can deal with label encoded categorical variables with no issues. Here is a good Stack Overflow discussion. I think (and this is just a personal opinion) that for categorical variables with many classes, one-hot encoding is the safest approach because it does not impose arbitrary values onto categories. The only downside to one-hot encoding is that the number of features (dimensions of the data) can explode for categorical variables with many categories. To deal with this, we can perform one-hot encoding followed by PCA or other dimensionality reduction methods to reduce the number of dimensions (while still trying to preserve information).
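As a sketch of that mitigation (the frame df and the component count are hypothetical; scikit-learn's PCA does the reduction):

from sklearn.decomposition import PCA
import pandas as pd

encoded = pd.get_dummies(df)   # one-hot encode a hypothetical frame with many categories
pca = PCA(n_components=50)     # assumes the encoded frame has at least 50 columns
reduced = pca.fit_transform(encoded)
print(reduced.shape)           # (n_rows, 50)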

Label Encoding and One-Hot Encoding

Let’s implement the policy described above: for any categorical variable (dtype == object) with 2 unique categories, we will use label encoding, and for any categorical variable with more than 2 unique categories, we will use one-hot encoding.

For label encoding, we use the Scikit-Learn LabelEncoder and for one-hot encoding, the pandas get_dummies(df) function.
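A minimal sketch of that policy (assuming an app_test frame alongside app_train, and no missing values in the two-category columns) might look like this:

from sklearn.preprocessing import LabelEncoder
import pandas as pd

le_count = 0

# Label encode any object column with exactly 2 unique categories
for col in app_train.select_dtypes('object').columns:
    if app_train[col].nunique() <= 2:
        le = LabelEncoder()
        le.fit(app_train[col])
        app_train[col] = le.transform(app_train[col])
        app_test[col] = le.transform(app_test[col])
        le_count += 1
print('%d columns were label encoded.' % le_count)

# One-hot encode the remaining categorical (object) columns
app_train = pd.get_dummies(app_train)
app_test = pd.get_dummies(app_test)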

 

Feature Extraction, Machine Learning, Pandas, Python

Feature Extraction

From the item data, build the new features with this simple Python code:

cat_agg = ['count', 'nunique']
num_agg = ['min', 'mean', 'max', 'sum']

agg_col = {
    'device_type': cat_agg, 'session_id': cat_agg, 'item_id': cat_agg, 'item_price': num_agg,
    'category_3': ['count', 'nunique', 'mean'], 'product_type': ['count', 'nunique', 'mean'],
}

for k in view_item.columns:
    if k.startswith('category_1') or k.startswith('category_2'):
        agg_col[k] = ['sum', 'mean']
    elif k.startswith('server'):
        agg_col[k] = cat_agg
    elif k.startswith('cumcount'):
        agg_col[k] = num_agg
agg_col

Grouping the item data set by user:
view_item1=view_item.groupby('user_id').agg(agg_col)

Renaming columns:

view_item1.columns=['J_' + '_'.join(col).strip() for col in view_item1.columns.values]
view_item1.reset_index(inplace=True)
view_item1.head()

 

Machine Learning, Pandas, Python, Statistics

Correlations

Now that we have dealt with the categorical variables and the outliers, let’s continue with the EDA. One way to try and understand the data is by looking for correlations between the features and the target. We can calculate the Pearson correlation coefficient between every variable and the target using the .corr dataframe method.

The correlation coefficient is not the greatest method to represent “relevance” of a feature, but it does give us an idea of possible relationships within the data. Some general interpretations of the absolute value of the correlation coefficient are:

  • .00-.19 “very weak”
  • .20-.39 “weak”
  • .40-.59 “moderate”
  • .60-.79 “strong”
  • .80-1.0 “very strong”
# Find correlations with the target and sort
correlations = app_train.corr()['TARGET'].sort_values()

# Display correlations
print('Most Positive Correlations:\n', correlations.tail(15))
print('\nMost Negative Correlations:\n', correlations.head(15))

Most Positive Correlations:
 OCCUPATION_TYPE_Laborers                             0.043019
FLAG_DOCUMENT_3                                      0.044346
REG_CITY_NOT_LIVE_CITY                               0.044395
FLAG_EMP_PHONE                                       0.045982
NAME_EDUCATION_TYPE_Secondary / secondary special    0.049824
REG_CITY_NOT_WORK_CITY                               0.050994
DAYS_ID_PUBLISH                                      0.051457
CODE_GENDER_M                                        0.054713
DAYS_LAST_PHONE_CHANGE                               0.055218
NAME_INCOME_TYPE_Working                             0.057481
REGION_RATING_CLIENT                                 0.058899
REGION_RATING_CLIENT_W_CITY                          0.060893
DAYS_EMPLOYED                                        0.074958
DAYS_BIRTH                                           0.078239
TARGET                                               1.000000
Name: TARGET, dtype: float64

Most Negative Correlations:
 EXT_SOURCE_3                           -0.178919
EXT_SOURCE_2                           -0.160472
EXT_SOURCE_1                           -0.155317
NAME_EDUCATION_TYPE_Higher education   -0.056593
CODE_GENDER_F                          -0.054704
NAME_INCOME_TYPE_Pensioner             -0.046209
DAYS_EMPLOYED_ANOM                     -0.045987
ORGANIZATION_TYPE_XNA                  -0.045987
FLOORSMAX_AVG                          -0.044003
FLOORSMAX_MEDI                         -0.043768
FLOORSMAX_MODE                         -0.043226
EMERGENCYSTATE_MODE_No                 -0.042201
HOUSETYPE_MODE_block of flats          -0.040594
AMT_GOODS_PRICE                        -0.039645
REGION_POPULATION_RELATIVE             -0.037227
Name: TARGET, dtype: float64
# Find the correlation of the positive days since birth and target
app_train['DAYS_BIRTH'] = abs(app_train['DAYS_BIRTH'])
app_train['DAYS_BIRTH'].corr(app_train['TARGET'])
-0.07823930830982694

There is a negative linear relationship between age and the target: as clients get older, they tend to repay their loans on time more often.

Let’s start looking at this variable. First, we can make a histogram of the age. We will put the x axis in years to make the plot a little more understandable.

# Set the style of plots
plt.style.use('fivethirtyeight')

# Plot the distribution of ages in years
plt.hist(app_train['DAYS_BIRTH'] / 365, edgecolor = 'k', bins = 25)
plt.title('Age of Client'); plt.xlabel('Age (years)'); plt.ylabel('Count');


By itself, the distribution of age does not tell us much other than that there are no outliers as all the ages are reasonable. To visualize the effect of the age on the target, we will next make a kernel density estimation plot (KDE) colored by the value of the target. A kernel density estimate plot shows the distribution of a single variable and can be thought of as a smoothed histogram (it is created by computing a kernel, usually a Gaussian, at each data point and then averaging all the individual kernels to develop a single smooth curve). We will use the seaborn kdeplot for this graph.

plt.figure(figsize = (10, 8))

# KDE plot of loans that were repaid on time
sns.kdeplot(app_train.loc[app_train['TARGET'] == 0, 'DAYS_BIRTH'] / 365, label = 'target == 0')

# KDE plot of loans which were not repaid on time
sns.kdeplot(app_train.loc[app_train['TARGET'] == 1, 'DAYS_BIRTH'] / 365, label = 'target == 1')

# Labeling of plot
plt.xlabel('Age (years)'); plt.ylabel('Density'); plt.title('Distribution of Ages');
 
Machine Learning, Pandas, Python

Exploratory Data Analysis

 

Exploratory Data Analysis (EDA) is an open-ended process where we calculate statistics and make figures to find trends, anomalies, patterns, or relationships within the data. The goal of EDA is to learn what our data can tell us. It generally starts out with a high level overview, then narrows in to specific areas as we find intriguing areas of the data. The findings may be interesting in their own right, or they can be used to inform our modeling choices, such as by helping us decide which features to use.

Examine the Distribution of the Target Column

The target is what we are asked to predict: either a 0 for the loan was repaid on time, or a 1 indicating the client had payment difficulties. We can first examine the number of loans falling into each category.

app_train['TARGET'].value_counts()
app_train['TARGET'].astype(int).plot.hist();

Examine Missing Values

Next we can look at the number and percentage of missing values in each column.

# Function to calculate missing values by column
def missing_values_table(df):
        # Total missing values
        mis_val = df.isnull().sum()
        
        # Percentage of missing values
        mis_val_percent = 100 * df.isnull().sum() / len(df)
        
        # Make a table with the results
        mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
        
        # Rename the columns
        mis_val_table_ren_columns = mis_val_table.rename(
        columns = {0 : 'Missing Values', 1 : '% of Total Values'})
        
        # Sort the table by percentage of missing descending
        mis_val_table_ren_columns = mis_val_table_ren_columns[
            mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
        
        # Print some summary information
        print ("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"      
            "There are " + str(mis_val_table_ren_columns.shape[0]) +
              " columns that have missing values.")
        
        # Return the dataframe with missing information
        return mis_val_table_ren_columns

 

# Missing values statistics
missing_values = missing_values_table(app_train)
missing_values.head(20)

 

Column Types

Let’s look at the number of columns of each data type. int64 and float64 are numeric variables (which can be either discrete or continuous). object columns contain strings and are categorical features.

# Number of each type of column
app_train.dtypes.value_counts()
# Number of unique classes in each object column
app_train.select_dtypes('object').apply(pd.Series.nunique, axis = 0)

 


Pandas, Python, Statistics

Python Outliers – Hands-on

Outliers

# Boxplot to eyeball outliers in a single feature
import seaborn as sns

sns.boxplot(x=df1['DAYS_EMPLOYED'])

# Scatter plot of two features to spot bivariate outliers
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(16, 8))
ax.scatter(df1['AMT_INCOME_TOTAL'], df1['AMT_CREDIT'])
ax.set_xlabel('Income Total')
ax.set_ylabel('Credit')
plt.show()

# Finding outliers with the z-score (numerical_columns holds the numeric columns of the frame)
from scipy import stats
import numpy as np

z = np.abs(stats.zscore(numerical_columns))
print(z)

threshold = 3
# The first array lists the row indices, the second the corresponding column numbers
print(np.where(z > threshold))

# Inspect one flagged value
print(z[0][93])

# IQR (interquartile range) per numeric column
Q1 = numerical_columns.quantile(0.25)
Q3 = numerical_columns.quantile(0.75)
IQR = Q3 - Q1
print(IQR)
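The IQR by itself only measures spread; a common follow-up (a sketch, using the conventional 1.5 × IQR fences) is to drop rows that fall outside the fences in any column:

# Keep only rows inside the 1.5 * IQR fences for every numeric column
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
mask = ~((numerical_columns < lower) | (numerical_columns > upper)).any(axis=1)
filtered = numerical_columns[mask]
print(filtered.shape)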