Linear Regression

So far we have done some simple data modelling using linear and quadratic equations. In a perfect world, a linear relationship between one variable \(x\) and another \(y\) can be described as \(y = mx + b\), i.e. using the equation of a line. Instead of \(m\) and \(b\) we will use \(\alpha\) to denote the intercept and \(\beta\) to denote the slope, i.e. \(y = \alpha + \beta x\). Once we solve for \(\alpha\) and \(\beta\), we can estimate the value of \(y\) that corresponds to any given \(x\). All the points will lie on this line. To distinguish the different data points, we will use the subscript \(i\). Thus our equation becomes:

\[ y_i = \alpha + \beta x_i \]

where \((x_i, y_i)\) represents the \(i^{th}\) point. We can therefore have \((x_1,y_1), (x_2,y_2), \cdots, (x_n,y_n)\) to represent \(n\) data points. Example:

\(i\)    \(x_i\)    \(y_i = 2x_i + 1\)    \((x_i, y_i)\)
1        1.0        3.0                   (1.0, 3.0)
2        1.5        4.0                   (1.5, 4.0)
3        2.0        5.0                   (2.0, 5.0)
4        2.5        6.0                   (2.5, 6.0)
5        3.0        7.0                   (3.0, 7.0)

In this perfect world, we can use any two of the points above to find the slope \(\beta\) and the intercept \(\alpha\), and we will obtain the equation \(y = 2x + 1\).
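
As a small check (a sketch in R using the perfect-world table above), we can generate the five points from \(y = 2x + 1\) and recover the slope and intercept from any two of them:

x <- seq(1, 3, by = 0.5)                 # x_1, ..., x_5
y <- 2*x + 1                             # perfect-world responses
beta  <- (y[2] - y[1]) / (x[2] - x[1])   # slope from any two points
alpha <- y[1] - beta*x[1]                # intercept
cat('alpha =', alpha, '\tbeta =', beta)  # alpha = 1, beta = 2, i.e. y = 2x + 1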

But we are not in a perfect world. In the real world, there are errors involved in measuring/reading/recording the \(x_i\) and \(y_i\), along with many other associated errors. We therefore end up with points that resemble the true values but are not themselves true. Because of these errors, not all the points will fall on the line, and the equation becomes \(y_i = \alpha + \beta x_i + \epsilon_i\). We need to find a line that is as close as possible to the true line. If we use any two points, we will get different results depending on which two we pick, so we need to use all the points rather than any two. The approximate equation is obtained by forcing the error terms to be zero or as close to zero as possible, i.e. the closer the \(\epsilon_i\) are to 0, the better our equation is.

How do we set up the problem? Having the equation \(y_i = \alpha + \beta x_i + \epsilon_i\), we can rewrite it as \(\epsilon_i = y_i - (\alpha + \beta x_i)\). This becomes a root finding problem, i.e. we find \(\alpha\) and \(\beta\) such that \(\epsilon_i = 0\). Previously we noted that a root finding problem can be written as a minimization problem. In this case, we will use squared errors, i.e.

\[ L = \sum_i\epsilon_i^2 = \sum_i (y_i - \hat y_i)^2 \quad\text{where}\quad \hat y_i = \alpha + \beta x_i \]

The objective is to minimize the loss function \(L\). The values of \(\alpha\) and \(\beta\) that give the minimal loss denote the best linear model.
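
To make the loss concrete, here is a small sketch in R that evaluates \(L\) on the perfect-world data above for a couple of hypothetical candidate values of \(\alpha\) and \(\beta\); the true pair \((\alpha, \beta) = (1, 2)\) gives a loss of zero, while other guesses do not:

x <- seq(1, 3, by = 0.5)
y <- 2*x + 1
loss <- function(alpha, beta) sum((y - (alpha + beta*x))^2)  # L = sum of squared errors
loss(1, 2)    # 0   : the true parameters
loss(0.5, 2)  # 1.25: a worse guess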

Analytically, we can solve this minimization problem using calculus: take the partial derivatives, set them equal to zero, and solve for \(\alpha\) and \(\beta\) simultaneously:

\[ \begin{aligned} \frac{\partial L}{\partial \alpha} : &-2\sum_{i}(y_i - \alpha - \beta x_i) = 0\\ \implies &\sum_iy_i - \beta\sum_ix_i - n\alpha = 0 \implies \alpha = \frac{\sum_iy_i - \beta\sum_ix_i}{n}\\ \therefore\quad &\boxed{\hat \alpha = \bar y - \hat \beta\bar x}\\ \end{aligned} \]

\[ \begin{aligned} \frac{\partial L}{\partial \beta} : \ &-2\sum_{i}(y_i - \hat y_i)x_i = -2\sum_{i}\Big[y_i - (\alpha+\beta x_i)\Big]x_i = 0\\ \implies&\sum_{i}\Big[y_i - (\bar y - \beta \bar x + \beta x_i)\Big]x_i = \sum_{i}\Big[(y_i - \bar y) - \beta(x_i - \bar x)\Big]x_i = 0\\ \implies 0&= \sum_{i}(y_i - \bar y)x_i - \beta\sum_i(x_i - \bar x)x_i\\ \implies &\boxed{\hat \beta = \frac{\sum_{i}(y_i - \bar y)x_i}{\sum_i(x_i - \bar x)x_i} = \frac{\sum_i(y_i - \bar y)(x_i - \bar x)}{\sum_i(x_i - \bar x)^2}} \end{aligned} \]

These two equations give us the closed form solution for a simple linear model.

Example:

Suppose we have the following data. Find the equation of the line that best describes it.

i    x     y
1    5.1   3.5
2    4.9   3.0
3    4.7   3.2
4    4.6   3.1
5    5.0   3.6
6    5.4   3.9
x <- c(5.1, 4.9, 4.7, 4.6, 5, 5.4)
y <- c(3.5, 3, 3.2, 3.1, 3.6, 3.9)
y_mean <- mean(y)
x_mean <- mean(x)
b <- sum((y - y_mean)*x) / sum((x - x_mean)*x)  # closed-form slope
a <- y_mean - b*x_mean                          # closed-form intercept
cat('a =', a, '\tb =', b)
a = -1.685944   b = 1.024096
import numpy as np
x = np.array([5.1, 4.9, 4.7, 4.6, 5, 5.4])
y = np.array([3.5, 3, 3.2, 3.1, 3.6, 3.9])
y_mean = np.mean(y)
x_mean = np.mean(x)
b = np.sum((y - y_mean)*x) / np.sum((x - x_mean)*x)  # closed-form slope
a = y_mean - b*x_mean                                # closed-form intercept
print('a = ', a, '\tb = ', b)
a =  -1.6859437751003128    b =  1.0240963855421508
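
As a quick sanity check (not derived in the text, but algebraically equivalent to the boxed slope formula, since the \(n-1\) factors in cor() and sd() cancel), the slope can also be computed in R as \(\hat\beta = r\,s_y/s_x\):

b_alt <- cor(x, y) * sd(y) / sd(x)   # equivalent form of the slope formula
a_alt <- mean(y) - b_alt * mean(x)
c(a = a_alt, b = b_alt)              # matches a = -1.685944, b = 1.024096 above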

The formulation of \(\alpha\) and \(\beta\) above is limited to only one predictor variable \(x\). What if we had many predictors?

Multiple Linear Regression

In the simple linear regression above, we expressed every quantity in a scalar notation. We could rewrite the same using matrix notations:

\[ \begin{aligned} \begin{array}{c} y_1 = \alpha + \beta x_1 + \epsilon_1\\ y_2 = \alpha + \beta x_2 + \epsilon_2\\ \vdots\\ y_n = \alpha + \beta x_n + \epsilon_n\\ \end{array}\implies& \begin{pmatrix}y_1\\y_2\\\vdots\\y_n\end{pmatrix} = \alpha\begin{pmatrix}1\\1\\\vdots\\1\end{pmatrix} + \beta\begin{pmatrix}x_1\\x_2\\\vdots\\x_n\end{pmatrix} +\begin{pmatrix}\epsilon_1\\\epsilon_2\\\vdots\\\epsilon_n\end{pmatrix}\\ \\ \implies&\underbrace{\begin{pmatrix}y_1\\y_2\\\vdots\\y_n\end{pmatrix}}_{\mathbf{Y}} = \underbrace{\begin{pmatrix}1&x_1\\1&x_2\\\vdots&\vdots\\1&x_n\end{pmatrix}}_{\mathbf{X}}\underbrace{\begin{pmatrix}\alpha\\\beta\end{pmatrix}}_{\large\boldsymbol{\beta}} + \underbrace{\begin{pmatrix}\epsilon_1\\\epsilon_2\\\vdots\\\epsilon_n\end{pmatrix}}_{\large\boldsymbol{\epsilon}}\\ \implies& \boxed{\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}} \end{aligned} \]

Using this matrix notation, the loss/cost function that we hope to minimize is expressed as:

\[ \mathcal{L}(\beta;\mathbf{X}, \mathbf{y}) = {\Large \epsilon}^\top{\Large \epsilon} = \left(\mathbf{y} - \mathbf{X}\beta\right)^\top\left(\mathbf{y}-\mathbf{X}\beta\right) = \big|\big|\mathbf{y} - \mathbf{X}\beta\big|\big|^2 \]

We can solve for \(\beta\), the vector of parameters, using matrix differentiation of the loss function:

\[ \newcommand{\X}{\mathbf{X}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \begin{aligned} \frac{\partial}{\partial \beta}\mathcal{L}(\beta;\mathbf{X,y}) =& -2\mathbf{X}^\top(\mathbf{y-X\beta}) = 0\implies \mathbf{X}^\top\mathbf{y}=\X^\top\X\beta\\ \implies&\boxed{\hat\beta = (\X^\top\X)^{-1}\X^\top\y} \end{aligned} \]

Let's use this formula to solve the previous problem:

x <- c(5.1, 4.9, 4.7, 4.6, 5, 5.4)
y <- c(3.5, 3, 3.2, 3.1, 3.6, 3.9)
X <- cbind(a = 1, b = x)       # design matrix with an intercept column
solve(t(X)%*%X, t(X)%*%y)      # solve the normal equations X'X b = X'y
       [,1]
a -1.685944
b  1.024096
solve(crossprod(X), crossprod(X, y))  # equivalent: crossprod(X) is t(X) %*% X
       [,1]
a -1.685944
b  1.024096
import numpy as np
x = np.array([5.1, 4.9, 4.7, 4.6, 5, 5.4])
y = np.array([3.5, 3, 3.2, 3.1, 3.6, 3.9])
X = np.c_[np.ones(x.size), x]      # design matrix with an intercept column
np.linalg.solve(X.T @ X, X.T @ y)  # solve the normal equations
array([-1.68594378,  1.02409639])

Note that we still get the same results as before.

Writing the problem in matrix notation simplifies the extension to multiple predictor variables. The formulation and computation remain as shown above:

\[ \begin{aligned} \mathbf{X} =& \begin{array}{c|c|c|c}\operatorname{intercept}&\operatorname{variable}_1&\dots& \operatorname{variable}_p\\\hline 1&x_{11}&\dots&x_{1p}\\ 1&x_{21}&\dots&x_{2p}\\ \vdots&\vdots&\ddots&\vdots\\ 1&x_{n1}&\dots&x_{np}\\ \end{array}, \quad \mathbf{y} = \begin{bmatrix}y_1\\y_2\\\vdots\\y_n\end{bmatrix}\\ \text{The }i^{th}&\text{ observation: } \mathbf{x}_i^\top = \begin{bmatrix} 1&x_{i1}&\dots&x_{ip}\end{bmatrix}\\ \text{The cost function that} &\text{ we hope to minimize: }\\ \mathcal{L}(\beta;\mathbf{X},\mathbf{y}) = &\sum_{i=1}^n (y_i - \mathbf{x}_i^\top\beta)^2 = (\mathbf{y-X\beta})^\top(\mathbf{y-X\beta}) = ||\mathbf{y-X\beta}||^2\\ \text{And the minimum}& \text{ occurs at:}\\ \hat\beta = & (\X^\top\X)^{-1}\X^\top\y\quad\text{as previously shown}. \end{aligned} \]

Example 2:

Fit a linear model to the iris data set taking Sepal.Length as the response variable and Sepal.Width, Petal.Length and Petal.Width as the predictor variables. Include the intercept term.

y <- iris$Sepal.Length # y <- iris[,1]
x <- iris[c("Sepal.Width", "Petal.Length", "Petal.Width")] # x <- iris[2:4]  
X <- cbind(intercept = 1, as.matrix(x)) # column bind 1 as intercept
solve(t(X)%*%X, t(X)%*%y) # solve(crossprod(X), crossprod(X,y))
                   [,1]
intercept     1.8559975
Sepal.Width   0.6508372
Petal.Length  0.7091320
Petal.Width  -0.5564827
import numpy as np 
from sklearn.datasets import load_iris 
x, Species = load_iris(return_X_y = True)  
y = x[:,0] #the first column as y 
X = np.c_[np.ones(y.size), x[:,1:]] # column bind 1 as intercept
np.linalg.solve(X.T @ X, X.T @ y) # @ is matrix multiplication
array([ 1.85599749,  0.65083716,  0.70913196, -0.55648266])

From the above results, we could write our equation as:

\[ \begin{array}{}\hat y_i = 1.856 + 0.651X_{1i} + 0.709X_{2i} -0.556X_{3i}\end{array}\quad \text{where } \quad\begin{array}{ccc}\hat y =\text{Predicted Sepal Length}\\X_1 = \text{Sepal Width}\\ X_2 = \text{Petal Length}\\X_{3} =\text{Petal Width} \end{array} \]
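
To use this fitted equation for prediction, we simply plug values into \(\hat y = \mathbf{X}\hat\beta\). A minimal sketch in R, reusing X and y from above and a hypothetical new flower (sepal width 3.0, petal length 4.5, petal width 1.5):

beta_hat <- solve(t(X) %*% X, t(X) %*% y)  # coefficients from the normal equations
fitted   <- X %*% beta_hat                 # fitted values for the observed data
x_new    <- c(1, 3.0, 4.5, 1.5)            # new observation, with the intercept term first
sum(x_new * beta_hat)                      # predicted Sepal.Length, roughly 6.16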

Dealing with categorical variables/factors

For factors, we have to carry out a one hot encoding on each factor variable. Depending on the contrast chosen (most commonly the treatment contrast), we convert each level to an indicator variable. If an intercept is included in the model, we have to drop one level, i.e. the base level. This is because keeping all the levels together with the intercept leads to linearly dependent columns, so \(\mathbf{X}^\top\mathbf{X}\) becomes rank deficient.

Example: Include Species in the regression model of the iris data set.

y <- iris[,1]  # first column of iris (Sepal.Length)

# one hot encode Species, dropping the first level (the base level, setosa)
species <- with(iris, diag(nlevels(Species))[Species, -1])
colnames(species) <- levels(iris$Species)[-1]

# design matrix: intercept, the numeric predictors and the encoded Species
X <- cbind(Intercept = 1, as.matrix(iris[,2:4]), species)

solve(t(X) %*% X, t(X) %*% y)
                   [,1]
Intercept     2.1712663
Sepal.Width   0.4958889
Petal.Length  0.8292439
Petal.Width  -0.3151552
versicolor   -0.7235620
virginica    -1.0234978
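
For reference, R can build such an encoded design matrix automatically: model.matrix() applies the chosen contrasts (treatment contrasts by default) and drops the base level, so the manual one hot encoding above is mainly for illustration. A brief sketch:

# model.matrix applies the treatment contrasts and drops the base level (setosa)
X2 <- model.matrix(~ Sepal.Width + Petal.Length + Petal.Width + Species, data = iris)
solve(t(X2) %*% X2, t(X2) %*% y)  # same coefficients as the manual encoding above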

Note that the data/packages had already been loaded earlier; the imports below are repeated only for completeness.

import numpy as np
from sklearn.datasets import load_iris
iris, Species = load_iris(return_X_y = True) 

y = iris[:,0]  # first column of iris (sepal length)

# one hot encode Species (integer codes 0, 1, 2), dropping the first level
species = np.identity(Species.max() + 1)[Species][:,1:]

# design matrix: intercept, the numeric predictors and the encoded Species
X = np.c_[np.ones(y.size), iris[:,1:], species]
np.linalg.solve(X.T @ X, X.T @ y)
array([ 2.17126629,  0.49588894,  0.82924391, -0.31515517, -0.72356196,
       -1.02349781])

Fitting a linear model using packages

There are three main functions to use: lm, lm.fit and lsfit.

lsfit and lm.fit: lsfit adds an intercept column by default, so since our X matrix above already includes one, we either drop that column or simply pass intercept = FALSE as shown below; lm.fit does not add an intercept, so X can be passed as is. Notice that X must be a matrix, and y can be a vector, or a matrix if there are multiple dependent variables.

# we have already included an intercept column in our matrix X
model = lsfit(X,y, intercept = FALSE) 
model$coefficients
   Intercept  Sepal.Width Petal.Length  Petal.Width   versicolor    virginica 
   2.1712663    0.4958889    0.8292439   -0.3151552   -0.7235620   -1.0234978 

There is other information computed that we have not covered here; you can explore how each component is calculated.

lm.fit(X,y)$coefficients
   Intercept  Sepal.Width Petal.Length  Petal.Width   versicolor    virginica 
   2.1712663    0.4958889    0.8292439   -0.3151552   -0.7235620   -1.0234978 

Lastly, the most commonly used function is lm. Notice that the two functions above only take numeric values: you would have to one hot encode all the factors/characters before you could run your model. We can make R do this for us automatically:

formula <- Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species
model <- lm(formula, iris)
summary(model)

Call:
lm(formula = formula, data = iris)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.79424 -0.21874  0.00899  0.20255  0.73103 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        2.17127    0.27979   7.760 1.43e-12 ***
Sepal.Width        0.49589    0.08607   5.761 4.87e-08 ***
Petal.Length       0.82924    0.06853  12.101  < 2e-16 ***
Petal.Width       -0.31516    0.15120  -2.084  0.03889 *  
Speciesversicolor -0.72356    0.24017  -3.013  0.00306 ** 
Speciesvirginica  -1.02350    0.33373  -3.067  0.00258 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3068 on 144 degrees of freedom
Multiple R-squared:  0.8673,    Adjusted R-squared:  0.8627 
F-statistic: 188.3 on 5 and 144 DF,  p-value: < 2.2e-16

Take note of how we ran the model. We were able not only to use the data frame directly but also to specify a formula that captured the essence of the model being fitted. What if we had 100 variables? What if we had to fit interaction terms, quadratic terms, change contrasts, etc.? These are things to be learned. For example, there is a shortcut to run a linear model on all the variables at once, i.e. lm(y ~ ., data), e.g. lm(Sepal.Length ~ ., iris); a few more such shortcuts are sketched below.
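
As a sketch (these are standard R formula operators; the particular models are only illustrative):

lm(Sepal.Length ~ ., iris)                                  # all remaining columns as predictors
lm(Sepal.Length ~ Sepal.Width * Species, iris)              # main effects plus their interaction
lm(Sepal.Length ~ Petal.Length + I(Petal.Length^2), iris)   # quadratic term via I()
lm(Sepal.Length ~ Species, iris,
   contrasts = list(Species = contr.sum))                   # change the contrast for a factor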

In Python, we can solve the same problem in several ways, e.g. sklearn.linear_model.LinearRegression or statsmodels' ols:

from sklearn.linear_model import LinearRegression as lm
reg = lm().fit(X[:,1:], y)  # drop our intercept column; LinearRegression adds its own
reg.intercept_
2.1712662921550714
reg.coef_
array([ 0.49588894,  0.82924391, -0.31515517, -0.72356196, -1.02349781])
import statsmodels.api as sm
from statsmodels.formula.api import ols as lm

iris = sm.datasets.get_rdataset('iris').data
iris.columns = iris.columns.str.replace('.', '_', regex = False)
formula = "Sepal_Length ~ Sepal_Width + Petal_Length + Petal_Width + Species"
model = lm(formula, iris).fit()
model.summary()
OLS Regression Results

Dep. Variable:     Sepal_Length      R-squared:            0.867
Model:             OLS               Adj. R-squared:       0.863
Method:            Least Squares     F-statistic:          188.3
Date:              Wed, 06 Sep 2023  Prob (F-statistic):   2.67e-61
Time:              13:27:05          Log-Likelihood:       -32.558
No. Observations:  150               AIC:                  77.12
Df Residuals:      144               BIC:                  95.18
Df Model:          5
Covariance Type:   nonrobust

                          coef   std err        t    P>|t|   [0.025   0.975]
Intercept               2.1713     0.280    7.760    0.000    1.618    2.724
Species[T.versicolor]  -0.7236     0.240   -3.013    0.003   -1.198   -0.249
Species[T.virginica]   -1.0235     0.334   -3.067    0.003   -1.683   -0.364
Sepal_Width             0.4959     0.086    5.761    0.000    0.326    0.666
Petal_Length            0.8292     0.069   12.101    0.000    0.694    0.965
Petal_Width            -0.3152     0.151   -2.084    0.039   -0.614   -0.016

Omnibus:          0.418    Durbin-Watson:      1.966
Prob(Omnibus):    0.811    Jarque-Bera (JB):   0.572
Skew:            -0.060    Prob(JB):           0.751
Kurtosis:         2.722    Cond. No.           94.0


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

One could also use sm.formula.ols(formula, data).

Another important point is that there are various other values associated with the linear regression output. All of these can be calculated directly using basic formulas; they are mostly used for model evaluation and hypothesis testing.
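
As a brief sketch of that point, some of these values can be computed directly from the matrices (reusing the encoded X and y from the iris example above); they should agree with the lm summary shown earlier:

beta_hat <- solve(t(X) %*% X, t(X) %*% y)  # coefficients
e   <- y - X %*% beta_hat                  # residuals
n   <- nrow(X); p <- ncol(X)
rss <- sum(e^2)                            # residual sum of squares
tss <- sum((y - mean(y))^2)                # total sum of squares
sqrt(rss / (n - p))                        # residual standard error (about 0.3068 on 144 df)
1 - rss / tss                              # multiple R-squared (about 0.8673)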

Gradient Descent for Linear Regression

Computing and inverting \(\mathbf{X}^\top\mathbf{X}\) is very expensive, especially when the data set is very large. Of course there are methods that have been developed to address this, e.g. the QR and SVD decompositions, which can be used to obtain the parameter estimates.
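
As a sketch of that idea (assuming the same X and y as above), both the QR and the SVD decompositions of X give \(\hat\beta\) without ever forming \(\mathbf{X}^\top\mathbf{X}\):

qr.coef(qr(X), y)                  # least squares via QR (essentially what lm does internally)

s <- svd(X)                        # X = U D V', so beta = V D^{-1} U' y
s$v %*% (crossprod(s$u, y) / s$d)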

Sometimes we have data so large that it cannot fit into memory. In that case we cannot solve the linear equations directly to obtain the parameters \(\beta\). A numerical method, gradient descent, is therefore used to obtain the solution. Note that with this method, for a large dataset, you can read a chunk of data into memory, update your results, and delete the chunk from memory before reading in the next one (see the sketch after the examples below).
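
Concretely, gradient descent starts from an initial guess and repeatedly steps \(\beta\) against the gradient of the (mean) squared-error loss. With learning rate \(\eta\) (the constant 2 from differentiation is absorbed into \(\eta\), as in the code below), the update is:

\[ \beta^{(t+1)} = \beta^{(t)} - \eta\,\nabla_\beta \mathcal{L} = \beta^{(t)} + \frac{\eta}{n}\,\mathbf{X}^\top\left(\mathbf{y} - \mathbf{X}\beta^{(t)}\right) \]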

A quick gradient descent for the iris dataset:

x <- cbind(intercept = 1, as.matrix(iris[,2:4]))  # design matrix
y <- iris[,1]
b <- rep(1, 4)        # initial guess for the parameters
n <- nrow(x)
for (i in 1:100000){  # gradient descent updates
  b  <- b - 0.02* t(x) %*% -(y - x %*% b)/n
}
b
                   [,1]
intercept     1.8559975
Sepal.Width   0.6508372
Petal.Length  0.7091320
Petal.Width  -0.5564827
n = iris.shape[0]                                # iris here is the pandas data frame loaded earlier via statsmodels
x = np.c_[np.ones(n), iris.iloc[:,1:-1].values]  # intercept + the three numeric predictors
y = iris.iloc[:,0].values
b = np.ones(x.shape[1])                          # initial guess for the parameters
for i in range(100000):                          # gradient descent updates
  b -= -0.02 * x.T @ (y - x @ b)/n
b
array([ 1.85599749,  0.65083716,  0.70913196, -0.55648266])
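
And a minimal sketch in R of the chunk-wise idea mentioned above: the same gradient descent, but with the gradient accumulated one chunk of rows at a time (in practice each chunk would be read from disk rather than subset from an in-memory x):

b <- rep(1, 4)
n <- nrow(x)
for (i in 1:100000){
  grad <- numeric(4)
  for (start in seq(1, n, by = 30)){            # process 30 rows at a time
    idx  <- start:min(start + 29, n)
    xc   <- x[idx, , drop = FALSE]              # in practice: read this chunk from disk
    grad <- grad + crossprod(xc, y[idx] - xc %*% b)
  }
  b <- b + 0.02 * grad / n                      # one full-gradient update per pass
}
b   # same coefficients as the full-batch loop above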

Weighted Linear Regression

Regularized Regression
