Feature Selection – HuntDataScience

RFE: Recursive Feature Elimination

Model building before feature elimination:

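# Building logistic regression on all features (before feature elimination)
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression().fit(X_train, y_train)
print("Training set accuracy: {:.3f}".format(logreg.score(X_train, y_train)))
print("Test set accuracy: {:.3f}".format(logreg.score(X_test, y_test)))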

Score:

Training set accuracy: 0.790
Test set accuracy: 0.745

Feature columns (Outcome is the target):

	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome

# Feature selection by the RFE approach
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
rfe = RFE(logreg, n_features_to_select=4)  # keep the 4 top-ranked features
rfe = rfe.fit(X, y)
print(rfe.support_)  # boolean mask: True for selected features
print(rfe.ranking_)  # ranking: 1 = selected, higher = eliminated earlier

[ True True False False False True True False]
[1 1 2 4 5 1 1 3]

# Variables selected by RFE
col = ['Pregnancies', 'Glucose', 'BMI', 'DiabetesPedigreeFunction']
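Typing the selected names by hand is easy to get wrong. As a minimal sketch, the mask and ranking printed above can be paired with the eight predictor columns (everything except Outcome) to recover the same list. The values are hard-coded here from the output above; with a fitted selector one would presumably pass rfe.support_ and rfe.ranking_ instead. The variable names are illustrative only.

```python
# Predictor names in the order shown under "Feature columns" (target "Outcome" excluded).
feature_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
                 "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]

# Values copied from the printed rfe.support_ and rfe.ranking_ output.
support = [True, True, False, False, False, True, True, False]
ranking = [1, 1, 2, 4, 5, 1, 1, 3]

# Keep the names whose mask entry is True.
selected = [name for name, keep in zip(feature_names, support) if keep]

# List the dropped features, best-ranked first (rank 2 was eliminated last).
eliminated = sorted((rank, name) for name, rank in zip(feature_names, ranking) if rank > 1)

print(selected)
print(eliminated)
```

The resulting list matches col above, and the sorted ranking shows that Insulin was dropped first and BloodPressure last.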

Building the Model after feature elimination:

# Refit logistic regression on the four RFE-selected columns only
logreg = LogisticRegression().fit(X_train[col], y_train)
print("Training set accuracy: {:.3f}".format(logreg.score(X_train[col], y_train)))
print("Test set accuracy: {:.3f}".format(logreg.score(X_test[col], y_test)))

Result:

Training set accuracy: 0.780
Test set accuracy: 0.758

Logistic regression summary with all eight features (statsmodels-style GLM output):

Dep. Variable:	Outcome	No. Observations:	537
Model:	GLM	Df Residuals:	528
Model Family:	Binomial	Df Model:	8
Link Function:	logit	Scale:	1.0000
Method:	IRLS	Log-Likelihood:	-245.19
Date:	Thu, 08 Aug 2019	Deviance:	490.37
Time:	16:19:56	Pearson chi2:	667.
No. Iterations:	5	Covariance Type:	nonrobust
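A few of the header figures can be cross-checked by hand. The sketch below assumes the usual conventions for an ungrouped binary logistic model: residual degrees of freedom equal the number of observations minus the number of estimated parameters (predictors plus the intercept), and the deviance equals -2 times the log-likelihood. The variable names are illustrative only.

```python
# Values copied from the summary header above.
n_obs = 537               # No. Observations
df_model = 8              # Df Model (predictors, intercept not counted)
log_likelihood = -245.19  # Log-Likelihood

# Residual degrees of freedom: observations minus (predictors + intercept).
df_resid = n_obs - (df_model + 1)

# Deviance for ungrouped binary data: -2 * log-likelihood.
deviance = -2 * log_likelihood

print(df_resid)
print(round(deviance, 2))
```

The first value reproduces the reported Df Residuals of 528, and the second agrees with the reported Deviance of 490.37 up to rounding of the printed log-likelihood.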

	coef	std err	z	P>|z|	[0.025	0.975]
const	-9.3762	0.908	-10.328	0.000	-11.155	-7.597
Pregnancies	0.1084	0.039	2.803	0.005	0.033	0.184
Glucose	0.0373	0.005	7.973	0.000	0.028	0.046
BloodPressure	-0.0096	0.006	-1.566	0.117	-0.022	0.002
SkinThickness	-0.0004	0.008	-0.048	0.962	-0.017	0.016
Insulin	-0.0012	0.001	-1.103	0.270	-0.003	0.001
BMI	0.0952	0.018	5.197	0.000	0.059	0.131
DiabetesPedigreeFunction	1.3783	0.367	3.758	0.000	0.659	2.097
Age	0.0202	0.011	1.809	0.070	-0.002	0.042
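One way to read the coefficient table is to pick out the predictors whose p-value falls below a cutoff and convert their coefficients to odds ratios with exp(coef), that is, the multiplicative change in the odds of a positive Outcome per one-unit increase in the predictor with the others held fixed. The sketch below hard-codes the coef and P>|z| columns from the table; the 0.05 cutoff is an assumption for illustration, not something the analysis above states.

```python
import math

# (coef, p-value) pairs copied from the summary table above; const is omitted.
table = {
    "Pregnancies": (0.1084, 0.005),
    "Glucose": (0.0373, 0.000),
    "BloodPressure": (-0.0096, 0.117),
    "SkinThickness": (-0.0004, 0.962),
    "Insulin": (-0.0012, 0.270),
    "BMI": (0.0952, 0.000),
    "DiabetesPedigreeFunction": (1.3783, 0.000),
    "Age": (0.0202, 0.070),
}

ALPHA = 0.05  # assumed significance cutoff

# Predictors whose reported p-value is below the cutoff.
significant = [name for name, (coef, p) in table.items() if p < ALPHA]

# Odds ratio per one-unit increase in each significant predictor.
odds_ratios = {name: round(math.exp(table[name][0]), 3) for name in significant}

print(significant)
print(odds_ratios)
```

With these numbers, the four predictors below the cutoff are the same four that RFE kept.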
