Regression is a method of modelling a target value based on independent predictors. This method is mostly used for forecasting and for finding cause-and-effect relationships between variables. Regression techniques differ mainly in the number of independent variables and the type of relationship between the independent and dependent variables. Regression problems usually have one continuous and unbounded dependent variable. The inputs, however, can be continuous, discrete, or even categorical data such as gender, nationality, brand, and so on.
Linear Regression #
Linear regression with multiple variables is also known as “multivariate linear regression”. The multivariable form of the hypothesis function accommodating these multiple features is as follows:

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$$
Using the definition of matrix multiplication, and defining $x_0 = 1$ for convenience, our multivariable hypothesis function can be concisely represented as:

$$h_\theta(x) = \theta^T x$$
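As a quick illustration (the data and parameter values below are made up), this vectorized form is a single matrix product in NumPy:

```python
import numpy as np

# Made-up data: m = 4 examples, n = 2 features (e.g. size, number of rooms)
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 3.0],
              [1416.0, 2.0]])

# Prepend x0 = 1 to every example so theta_0 acts as the intercept
X_b = np.c_[np.ones((X.shape[0], 1)), X]

theta = np.array([50.0, 0.1, 20.0])   # [theta_0, theta_1, theta_2], made-up values

# h_theta(x) = theta^T x, computed for every example at once
predictions = X_b @ theta
print(predictions)
```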
Gradient Descent for Linear Regression #
Gradient descent is a generic optimization algorithm used in many machine learning algorithms. It iteratively tweaks the parameters of the model in order to minimize the cost function. We do this by taking the derivative (the tangential line to a function) of our cost function. The slope of the tangent is the derivative at that point, and it gives us a direction to move towards. We take steps down the cost function in the direction of steepest descent. The size of each step is determined by the parameter $\alpha$, which is called the learning rate. The gradient descent algorithm can be represented as:

$$\text{repeat until convergence: } \quad \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$

For linear regression, substituting the derivative of the cost function gives the update rule

$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

where all $\theta_j$ are updated simultaneously.
To demonstrate the gradient descent algorithm, we initialize the model parameters to 0, so the hypothesis starts as $h_\theta(x) = 0$. The gradient descent algorithm then updates the values of the parameters so that we arrive at the best-fit line.
We should adjust our parameter $\alpha$ to ensure that the gradient descent algorithm converges in a reasonable time. If $\alpha$ is too small, gradient descent can be slow. If $\alpha$ is too large, gradient descent can overshoot the minimum; it may fail to converge, or even diverge. Failure to converge, or taking too long to reach the minimum, implies that our step size is wrong.
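The following is a minimal sketch of batch gradient descent for linear regression in NumPy. The function name, the default learning rate, and the iteration count are illustrative choices, not fixed prescriptions:

```python
import numpy as np

def gradient_descent(X_b, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for linear regression.

    X_b : (m, n+1) design matrix with a leading column of ones
    y   : (m,) vector of observed target values
    """
    m, n = X_b.shape
    theta = np.zeros(n)                    # initialize all parameters to 0
    for _ in range(num_iters):
        errors = X_b @ theta - y           # h_theta(x) - y for every example
        gradient = (X_b.T @ errors) / m    # partial derivatives of the cost
        theta -= alpha * gradient          # update all theta_j simultaneously
    return theta
```

With raw, unscaled features even a small $\alpha$ may diverge, which is exactly the motivation for the feature scaling discussed next.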
We can speed up gradient descent by having each of our input values in roughly the same range. This is because $\theta$ will descend quickly on small ranges and slowly on large ranges, and so will oscillate inefficiently down to the optimum when the variables are very uneven.
Two techniques to help with this are feature scaling and mean normalization. Feature scaling involves dividing the input values by the range (i.e. the maximum value minus the minimum value) of the input variable, resulting in a new range of just 1. Mean normalization involves subtracting the average value of an input variable from the values of that input variable, resulting in a new average value of just zero. To implement both of these techniques, adjust your input values as shown in this formula:

$$x_i := \frac{x_i - \mu_i}{s_i}$$
Where $\mu_i$ is the average of all the values for feature $i$ and $s_i$ is the range of values (max − min), or $s_i$ is the standard deviation. Note that dividing by the range and dividing by the standard deviation give different results.
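A small sketch of feature scaling with mean normalization, assuming a NumPy feature matrix `X` with one column per feature (the function name is illustrative):

```python
import numpy as np

def scale_features(X):
    """Mean-normalize and scale each feature column: (x - mu) / range.

    Dividing by the standard deviation (X.std(axis=0)) instead of the
    range is an equally common choice; the two give different scales.
    """
    mu = X.mean(axis=0)
    s = X.max(axis=0) - X.min(axis=0)
    return (X - mu) / s, mu, s
```

Keeping `mu` and `s` around is deliberate: any new example must be scaled with the same values before making a prediction.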
Normal Equation for Linear Regression #
Gradient descent gives one way of minimizing $J$. In the “Normal Equation” method, we minimize $J$ by explicitly taking its derivatives with respect to the $\theta_j$'s and setting them to zero. This allows us to find the optimum theta without iteration. The normal equation formula is given below:

$$\theta = (X^T X)^{-1} X^T y$$
The following is a comparison of gradient descent and the normal equation:
| Gradient Descent | Normal Equation |
|---|---|
| Need to choose alpha | No need to choose alpha |
| Needs many iterations | No need to iterate |
| $O(kn^2)$ | $O(n^3)$, need to calculate inverse of $X^TX$ |
| Works well when $n$ is large | Slow if $n$ is very large |
With the normal equation, computing the inversion has complexity $O(n^3)$. So if we have a very large number of features, the normal equation will be slow. In practice, when $n$ exceeds 10,000 it might be a good time to go from the closed-form solution to an iterative process.
If $X^TX$ is noninvertible, the common causes might be:
- Redundant features, where two features are very closely related (i.e. they are linearly dependent)
- Too many features (e.g. $m \le n$). In this case, delete some features or use “regularization”.
Solutions to the above problems include deleting a feature that is linearly dependent on another, or deleting one or more features when there are too many features.
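The sketch below solves the normal equation with NumPy. Using the pseudoinverse (`np.linalg.pinv`) rather than a plain inverse is a common workaround, since it still produces a usable $\theta$ when $X^TX$ is noninvertible; the function name is just illustrative:

```python
import numpy as np

def normal_equation(X_b, y):
    """Closed-form solution: theta = (X^T X)^{-1} X^T y.

    np.linalg.pinv (the pseudoinverse) is used instead of np.linalg.inv,
    so the computation still yields a result when X^T X is noninvertible
    (e.g. redundant features or m <= n).
    """
    return np.linalg.pinv(X_b.T @ X_b) @ X_b.T @ y
```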
Evaluating the Performance of the Linear Regression Model #
We will be using Root Mean Squared Error (RMSE) and the Coefficient of Determination ($R^2$ score) to evaluate our model. RMSE is the square root of the mean of the squared residuals. RMSE is defined by:

$$\text{RMSE} = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2}$$
The $R^2$ score, or the coefficient of determination, explains how much of the total variance of the dependent variable can be reduced by using the least-squares regression. It can be defined by:

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$
Where $SS_{tot}$ is the total sum of squared errors if we take the mean of the observed values as the predicted value, and $SS_{res}$ is the sum of the squared residuals.
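Both metrics are straightforward to compute directly. Here is a small sketch (function names are illustrative) assuming NumPy arrays of observed and predicted values:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Square root of the mean of the squared residuals."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)              # sum of squared residuals
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)     # errors of the mean-only model
    return 1.0 - ss_res / ss_tot
```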
References #
[1] Agarwal, A. (2018, November 14). Linear Regression using Python. Medium. https://towardsdatascience.com/linear-regression-using-python-b136c91bf0a2.
[2] Ng, A. (n.d.). Machine Learning. Coursera. https://www.coursera.org/learn/machine-learning.