Ever since childhood, it has been my habit to write short notes on whatever topic I study. Later on, I carried the same practice into the office, and it has always helped me when I need to refer back to something. Whenever I am preparing for an interview, I check my notes for reference.
In this blog I am writing down important reference points about the Linear Regression machine learning algorithm. They can be referred to at any time to refresh your understanding of this algorithm.
In general, there are two types of linear regression: simple and multiple. Let’s find out the key points for each one of them.
SIMPLE LINEAR REGRESSION
1. Models the relationship between a dependent variable and one independent variable using a straight line.
2. The standard equation of the regression line is given by the following expression:
Y = β₀ + β₁X, where β₀ = Intercept and β₁ = Slope
3. The best-fit line is found by minimizing the RSS (Residual Sum of Squares).
4. RSS is equal to the sum of the squares of the residuals for all data points in the plot.
5. The residual for any data point is found by subtracting the predicted value of the dependent variable from its actual value.
6. The strength of a linear regression model can be assessed by R², the Coefficient of Determination.
7. R² or Coefficient of Determination:
o R² is a number that explains what portion of the variation in the given data is explained by the developed model.
o It always takes a value between 0 and 1.
o In general terms, it provides a measure of how well actual outcomes are replicated by the model, based on the proportion of the total variation in outcomes that the model explains.
o Overall, the higher the R², the better the model fits your data.
o Mathematically, it is represented as: R² = 1 − (RSS / TSS)
8. RSS (Residual Sum of Squares):
It is defined as the total sum of squared errors across the whole sample.
It is a measure of the difference between the expected and the actual output.
A small RSS indicates a tight fit of the model to the data.
9. TSS (Total Sum of Squares):
It is the sum of the squared deviations of the data points from the mean of the response variable.
TSS gives us the total deviation of all the points from the mean line.
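The definitions above can be tied together in a small sketch. This is a minimal illustration using NumPy with made-up data: it computes the least-squares slope and intercept, then the residuals, RSS, TSS, and R² exactly as defined in the notes.

```python
import numpy as np

# Toy data (made-up values, purely for illustration)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

# Least-squares estimates of slope (beta1) and intercept (beta0)
beta1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
beta0 = Y.mean() - beta1 * X.mean()

# Residual = actual value minus predicted value
Y_pred = beta0 + beta1 * X
residuals = Y - Y_pred

RSS = np.sum(residuals ** 2)         # Residual Sum of Squares
TSS = np.sum((Y - Y.mean()) ** 2)    # Total Sum of Squares
r_squared = 1 - RSS / TSS            # Coefficient of Determination

print(beta0, beta1, r_squared)  # → β₀ ≈ 0.30, β₁ ≈ 1.94
```

Since this toy data lies almost exactly on a line, R² comes out very close to 1, which matches the "higher R², better fit" point above.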
MULTIPLE LINEAR REGRESSION
- Models the relationship between one dependent variable and several independent variables (explanatory variables).
- Steps to follow
o Prepare data for analysis and build a model containing all variables
o Check the VIF and the model summary. Remove variables with a high VIF (> 2, generally) that are also insignificant (p > 0.05), one by one.
o If the model has variables that have a high VIF but are significant, check for and remove the other insignificant variables first.
o At this point, all remaining variables should be significant.
o If the number of variables is still high, remove them in order of insignificance until you arrive at a limited number of variables that explain the model well.
- The p-value is the probability of observing a result at least as extreme as the one obtained, assuming the null hypothesis is true (not the probability that the null hypothesis itself is true).
- Independent variables that have a low p-value are likely to be a meaningful addition to the model.
- Dummy Variables: The categorical variables need to be converted to numeric form to be used in regression modeling.
- R-squared: Always increases (or at least stays the same) when the number of variables increases.
- Adjusted R-squared: Imposes a penalty when the increase in R-squared from adding a variable is small. Mathematically, Adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of observations and p the number of predictors. Adjusted R-squared is a better metric than R-squared to assess how well the model fits the data.
- Multicollinearity is a phenomenon where two or more independent variables in a multiple regression model are correlated with each other. It makes it difficult to assess the effect of individual predictors. A good way to assess multicollinearity is to compute the variance inflation factor (VIF).
- To summarize: the higher the VIF, the higher the multicollinearity.
- Model Validation: It is desired that the R-squared between the predicted value and the actual value in the test set should be high.
- Variable Selection Methods:
o Backward selection
o Forward selection
o Stepwise selection
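The multiple-regression workflow above can be sketched end to end. The snippet below is a minimal illustration using only NumPy, with made-up data; the VIF is computed from its standard definition VIF_j = 1 / (1 − R²_j), where R²_j comes from regressing predictor j on the remaining predictors.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Toy data: x2 is deliberately built to be almost a copy of x1,
# so x1 and x2 are multicollinear; x3 is independent of both.
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 3.0 + 2.0 * x1 + 0.5 * x3 + rng.normal(scale=0.1, size=n)

# Build a model containing all variables (intercept + predictors)
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def vif(X, j):
    """VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing
    column j on the remaining predictors (plus an intercept)."""
    others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
    beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
    resid = X[:, j] - others @ beta
    rss = np.sum(resid ** 2)
    tss = np.sum((X[:, j] - X[:, j].mean()) ** 2)
    r2_j = 1.0 - rss / tss
    return 1.0 / (1.0 - r2_j)

vifs = [vif(X, j) for j in range(X.shape[1])]
print(coef, vifs)
```

Here x1 and x2 both show a large VIF, so one of them would be a candidate for removal in the step-by-step procedure above, while x3 stays close to 1. The collinearity also makes the individual coefficients of x1 and x2 unstable, which is exactly why their effects are hard to assess.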