Journey with Linear Regression

Whenever I sit down to write a blog post, I find myself going back to my early days in data science. Anyone who is new to the field and struggling with each step of the coding will benefit from one handy document that offers some guidance. That is my only inspiration for writing. I also believe in sharing knowledge, because that is how knowledge grows.

Before starting, you may want to review some key points about linear regression.

In this post, I walk through the code step by step in Python.

Importing Packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from statsmodels.stats.outliers_influence import variance_inflation_factor

Data Loading, Processing and Visualization

#loading data

#Get the information about data

#Find out the statistics about data

#Get the sample view from top

#Get the sample view from bottom

# Checking for missing values
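The inspection steps listed in the comments above could look like the following minimal sketch. The inline CSV text is a made-up stand-in for the real data file, which is not named in this post; in practice you would pass your own file path to `pd.read_csv`.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real data file (values are made up).
csv_text = """id,date,price,bedrooms,sqft_living
1,20140502,221900,3,1180
2,20140503,538000,3,2570
3,20140504,180000,2,770
4,20140505,604000,4,1960
"""
df = pd.read_csv(io.StringIO(csv_text))   # loading data

df.info()                     # column types and non-null counts
print(df.describe())          # summary statistics
print(df.head(2))             # sample view from the top
print(df.tail(2))             # sample view from the bottom
missing = df.isnull().sum()   # number of missing values per column
print(missing)
```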

#Data is pretty clean. Now dropping two columns which are not going to be used in prediction: id and date
df = df.drop(['id', 'date'], axis=1)

sns.pairplot(df[['sqft_lot', 'sqft_above', 'price', 'sqft_living', 'bedrooms']], palette='afmhot', height=1.6)

#Rescaling the features
#defining a normalisation function
def normalize(x):
    return (x - np.mean(x)) / (max(x) - min(x))
# applying normalize() to all columns
df = df.apply(normalize)
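To see what this rescaling does, here is a small sketch on a toy DataFrame (the column values are made up for illustration). Each column ends up centred on zero with a total range of one.

```python
import numpy as np
import pandas as pd

def normalize(x):
    # Mean normalisation: centre on the mean, then scale by the range.
    return (x - np.mean(x)) / (x.max() - x.min())

# Toy data, made up for illustration only.
toy = pd.DataFrame({"sqft_living": [1000.0, 2000.0, 3000.0],
                    "bedrooms": [2.0, 3.0, 4.0]})
scaled = toy.apply(normalize)
print(scaled)
```

Note that this scales the target column too if it is part of the DataFrame, so predictions and errors later on are in normalised units rather than in the original price units.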

Splitting Data into Train and Test Sets


# Putting feature variables to X
X = df[['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
        'waterfront', 'view', 'condition', 'grade', 'sqft_above',
        'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long',
        'sqft_living15', 'sqft_lot15']]
# Putting response variable to y
y = df['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7 ,test_size = 0.3, random_state=100)
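As a quick illustration of what the split returns, here is a sketch on toy arrays (made up for this example). It passes only `test_size`, so the training share is simply the remainder; fixing `random_state` makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 rows, 2 features (values are arbitrary).
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=100
)
print(X_tr.shape, X_te.shape)
```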

Running RFE

# Running RFE with the number of variables to keep
lm = LinearRegression()
rfe = RFE(lm, n_features_to_select=9) # running RFE
rfe = rfe.fit(X_train, y_train)
print(rfe.support_) # Printing the boolean results

col = X_train.columns[rfe.support_]
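Recursive feature elimination repeatedly fits the estimator and discards the weakest feature until the requested number remains. The sketch below shows this on synthetic data (made up for illustration), where only two of five columns actually drive the target.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
# Only columns 0 and 3 influence the target; the others are pure noise.
y_toy = 4.0 * X_toy[:, 0] - 3.0 * X_toy[:, 3] + rng.normal(scale=0.1, size=200)

rfe_toy = RFE(LinearRegression(), n_features_to_select=2)
rfe_toy = rfe_toy.fit(X_toy, y_toy)
print(rfe_toy.support_)   # boolean mask of the features that were kept
print(rfe_toy.ranking_)   # rank 1 marks a selected feature
```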


Building Model

# Creating X_train dataframe with RFE selected variables
X_train_rfe = X_train[col]


# Adding a constant variable
X_train_rfe = sm.add_constant(X_train_rfe)

lm = sm.OLS(y_train,X_train_rfe).fit() # Running the linear model

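#Let's see the summary of our linear model
print(lm.summary())

# Calculating the variance inflation factor (VIF) for each variable
vif = pd.DataFrame()
vif['VIF'] = [variance_inflation_factor(X_train_rfe.values, i) for i in range(X_train_rfe.shape[1])]
vif['variables'] = X_train_rfe.columns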
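For intuition about what a VIF measures, here is a sketch of the textbook definition: regress each feature on all the others and compute 1 / (1 - R²). The helper `vif_by_hand` is a hypothetical name written for this example, and it is not guaranteed to match the statsmodels function value for value (for instance, in how the constant column is treated).

```python
import numpy as np

def vif_by_hand(X):
    """Textbook VIF per column: 1 / (1 - R^2) from regressing it on the rest."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for i in range(k):
        target = X[:, i]
        # Regress column i on an intercept plus all remaining columns.
        others = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        ss_res = resid @ resid
        ss_tot = np.sum((target - target.mean()) ** 2)
        r2 = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic data: c is nearly a copy of a, b is independent of both.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)
vifs = vif_by_hand(np.column_stack([a, b, c]))
print(vifs)
```

The two nearly duplicate columns get large values while the independent one stays close to 1, which is why a high VIF is a signal to consider dropping a variable.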


Dropping the Variable and Updating the Model

# Dropping highly correlated variables and insignificant variables
X_train_new = X_train_rfe.drop('sqft_living', axis=1)

# Create a second fitted model
lm_2 = sm.OLS(y_train,X_train_new).fit()

#Let's see the summary of our second linear model
print(lm_2.summary())

# Recalculating the VIF for the remaining variables
vif = pd.DataFrame()
vif['VIF'] = [variance_inflation_factor(X_train_new.values, i) for i in range(X_train_new.shape[1])]
vif['variables'] = X_train_new.columns

Prediction using Model 2

# Adding constant variable to test dataframe
X_test_m2 = sm.add_constant(X_test)

# Creating X_test_m2 dataframe by dropping variables from X_test_m2
X_test_m2 = X_test_m2.drop(['sqft_lot', 'floors', 'view', 'condition', 'yr_renovated', 'zipcode', 'long', 'sqft_living15', 'sqft_lot15', 'sqft_living'], axis=1)

# Making predictions
y_pred_m2 = lm_2.predict(X_test_m2)

Model Evaluation

fig, ax = plt.subplots()
ax.scatter(y_test, y_pred_m2)
ax.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=4)

# Plotting the error terms to understand the distribution.
fig = plt.figure()
fig.suptitle('Error Terms', fontsize=20) # Plot heading
plt.xlabel('Residual', fontsize=18) # X-label
plt.ylabel('Index', fontsize=16) # Y-label

from sklearn import metrics
print('RMSE :', np.sqrt(metrics.mean_squared_error(y_test, y_pred_m2)))
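RMSE is the square root of the mean squared difference between actual and predicted values. A minimal NumPy sketch of the same calculation, using a hypothetical helper `rmse` and made-up numbers:

```python
import numpy as np

def rmse(y_true, y_pred):
    # Root of the mean of squared errors.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Errors are 1, 0 and -2, so the mean squared error is 5/3.
value = rmse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
print(value)
```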

If you want to see the output of this code, please follow the link below.

