Journey with Linear Regression

Whenever I work on my blog, I think back to my early days with data science. Anyone who is new to the field and struggling to learn each step of the coding may find it helpful to have one handy document that offers some guidance. That is my main inspiration for writing. I also believe in sharing knowledge, because that is how knowledge grows.

Before starting this blog, you may want to review some key points about Linear Regression:

https://medium.com/@arpita.mukh1/quick-reference-points-about-linear-regression-5b6659e1bce3

In this blog, I walk through the code step by step in Python.

Importing Packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Jupyter notebook magic; remove this line when running as a plain script
%matplotlib inline
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from statsmodels.stats.outliers_influence import variance_inflation_factor

Data Loading, Processing and Visualization

# Load the data
df = pd.read_csv('kc_house_data.csv')

# Get information about the data (column types, non-null counts)
df.info()

# Get summary statistics for the data
df.describe()

# View a sample of rows from the top
df.head()

# View a sample of rows from the bottom
df.tail()

# Check each column for missing values
print(df.isnull().any())
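
Note that any() only tells you whether a column has at least one missing value. Here is a small sketch of how you could count the missing values per column instead. The frame `toy` and its values are made up for illustration; they are not taken from kc_house_data.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two gaps, for illustration only
toy = pd.DataFrame({
    "price": [221900.0, np.nan, 180000.0],
    "bedrooms": [3, 2, 4],
    "sqft_living": [1180.0, 2570.0, np.nan],
})

has_missing = toy.isnull().any()     # True/False per column
missing_counts = toy.isnull().sum()  # number of missing cells per column

print(has_missing)
print(missing_counts)
```

The counts are useful when you have to decide whether to drop a column or fill in its gaps.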

# The data is pretty clean. Now drop two columns that are not going to be used in prediction: id and date
df = df.drop(['id', 'date'], axis=1)

# Pairwise scatter plots for a few key columns
sns.pairplot(df[['sqft_lot', 'sqft_above', 'price', 'sqft_living', 'bedrooms']], palette='afmhot', height=1.6)

# Rescaling the features
# Define a mean-normalisation function: centre on the mean, divide by the range
def normalize(x):
    return (x - np.mean(x)) / (x.max() - x.min())

# Apply normalize() to all columns
df = df.apply(normalize)
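
To see what this rescaling does, here is a minimal sketch on a tiny made-up frame. The column values are hypothetical and chosen only so the arithmetic is easy to follow. After normalisation every column should have a mean of about 0 and a range (max minus min) of 1.

```python
import pandas as pd

def normalize(x):
    """Mean-normalise a column: centre on the mean, divide by the range."""
    return (x - x.mean()) / (x.max() - x.min())

# Hypothetical toy values, not taken from kc_house_data.csv
toy = pd.DataFrame({
    "sqft_living": [1000.0, 1500.0, 2000.0, 3500.0],
    "bedrooms": [2.0, 3.0, 3.0, 4.0],
})

toy_scaled = toy.apply(normalize)

print(toy_scaled)
print(toy_scaled.mean())                    # close to 0 for every column
print(toy_scaled.max() - toy_scaled.min())  # 1 for every column
```

Because the function is applied to every column, the target column is rescaled too, so predictions and errors later on are in these normalised units.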

Splitting Data into Train and Test Sets

df.columns

# Putting the feature variables into X
X = df[['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
        'waterfront', 'view', 'condition', 'grade', 'sqft_above',
        'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long',
        'sqft_living15', 'sqft_lot15']]
# Putting the response variable into y
y = df['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7 ,test_size = 0.3, random_state=100)
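
If you want to convince yourself of what the split returns, here is a small sketch on 10 made-up rows. The names `X_toy` and `y_toy` are hypothetical. With a 70/30 split you should get 7 training rows and 3 test rows, and reusing the same random_state should reproduce the same rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, one feature, one target
X_toy = pd.DataFrame({"sqft_living": np.arange(10, dtype=float)})
y_toy = pd.Series(np.arange(10, dtype=float) * 2.0, name="price")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.7, test_size=0.3, random_state=100
)
print(X_tr.shape, X_te.shape)

# Splitting again with the same random_state selects the same rows
X_tr2, X_te2, _, _ = train_test_split(
    X_toy, y_toy, train_size=0.7, test_size=0.3, random_state=100
)
print(X_tr.index.equals(X_tr2.index))
```

Fixing random_state is what makes the results in a walkthrough like this one repeatable from run to run.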

Running RFE

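# Running RFE, keeping the 9 highest-ranked features
lm = LinearRegression()
rfe = RFE(lm, n_features_to_select=9)  # the feature count must be passed by keyword in recent scikit-learn
rfe = rfe.fit(X_train, y_train)
print(rfe.support_)   # Boolean mask: True for selected features
print(rfe.ranking_)   # Rank 1 marks a selected feature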

col = X_train.columns[rfe.support_]

col
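
To make support_ and ranking_ easier to read, here is a minimal sketch of RFE on synthetic data where the answer is known in advance. Everything in it is an assumption for illustration: the columns x0 to x3 are random numbers, and only x0 and x1 are used to build the target, so RFE should keep those two.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical synthetic data: only x0 and x1 drive the target
X_syn = pd.DataFrame(rng.normal(size=(200, 4)), columns=["x0", "x1", "x2", "x3"])
y_syn = 3.0 * X_syn["x0"] + 2.0 * X_syn["x1"] + rng.normal(scale=0.1, size=200)

# Recursively drop the weakest feature until 2 remain
selector = RFE(LinearRegression(), n_features_to_select=2).fit(X_syn, y_syn)

selected = X_syn.columns[selector.support_]
print(list(selected))      # names of the retained features
print(selector.ranking_)   # 1 = retained; larger numbers were eliminated earlier
```

With a linear model, RFE judges features by the size of their coefficients, which is one reason it helps to have the features on a comparable scale before running it.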

Building Model

# Creating the X_train dataframe with the RFE-selected variables
X_train_rfe = X_train[col]

X_train_rfe.columns

# Adding a constant variable
X_train_rfe = sm.add_constant(X_train_rfe)

lm = sm.OLS(y_train,X_train_rfe).fit() # Running the linear model

# Let's see the summary of our linear model
print(lm.summary())

# Variance inflation factor (VIF) for each column of the design matrix
vif = pd.DataFrame()
vif['VIF'] = [variance_inflation_factor(X_train_rfe.values, i) for i in range(X_train_rfe.shape[1])]
vif['variables'] = X_train_rfe.columns

vif

Dropping the Variable and Updating the Model

# Dropping highly correlated variables and insignificant variables
X_train_new = X_train_rfe.drop(columns=['sqft_living'])

# Create a second fitted model
lm_2 = sm.OLS(y_train,X_train_new).fit()

# Let's see the summary of our second linear model
print(lm_2.summary())

vif = pd.DataFrame()
vif['VIF'] = [variance_inflation_factor(X_train_new.values, i) for i in range(X_train_new.shape[1])]
vif['variables'] = X_train_new.columns
vif

Prediction using Model 2

# Adding constant variable to test dataframe
X_test_m2 = sm.add_constant(X_test)

# Creating the X_test_m2 dataframe by dropping the variables that are not in model 2
X_test_m2 = X_test_m2.drop(['sqft_lot', 'floors', 'view', 'condition', 'yr_renovated', 'zipcode', 'long', 'sqft_living15', 'sqft_lot15', 'sqft_living'], axis=1)

# Making predictions
y_pred_m2 = lm_2.predict(X_test_m2)

Model Evaluation

fig, ax = plt.subplots()
ax.scatter(y_test, y_pred_m2)
ax.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=4)  # dashed reference line where predicted equals measured
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
plt.show()

# Plotting the error terms to understand the distribution.
fig = plt.figure()
sns.histplot(y_test - y_pred_m2, bins=50, kde=True)  # distplot is deprecated in recent seaborn releases
fig.suptitle('Error Terms', fontsize=20)  # Plot heading
plt.xlabel('Residual', fontsize=18)  # X-label
plt.ylabel('Count', fontsize=16)  # Y-label

from sklearn import metrics
print('RMSE :', np.sqrt(metrics.mean_squared_error(y_test, y_pred_m2)))
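
RMSE is the square root of the average squared error. Because price was rescaled along with the features, the RMSE above is in normalised units, not in currency. Here is a minimal sketch of the calculation by hand and of converting it back. All numbers are made up, and `price_range` is a hypothetical stand-in for the max minus min of the original price column, which you would need to save before normalising.

```python
import numpy as np

# Hypothetical normalised targets and predictions, for illustration only
y_true = np.array([0.10, -0.05, 0.20, 0.00])
y_pred = np.array([0.12, -0.01, 0.15, 0.02])

# RMSE by hand: square the errors, average them, take the square root
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(rmse)

# The normalisation divided by the range, so multiplying by the range undoes it for errors
price_range = 1_000_000.0  # hypothetical max(price) - min(price) from the raw data
rmse_in_price_units = rmse * price_range
print(rmse_in_price_units)
```

The mean that was subtracted during normalisation cancels out in the difference between measured and predicted values, so only the range is needed for this conversion.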

If you want to see the output of this code, please click on the link below.

Data Analyst, Blogger, https://arpitatechcorner.com/
