Table of Content

Block

Remarks

I’ve recently sign up for machine learning this semester, I don’t really know what to expect but I’ll try my best anyway. While I had initial reservations about diving into this course so early, I’ve decided to embrace the challenge, and there were really no other choice honestly. I believe that true mastery in machine learning comes from a deep understanding of the underlying concepts, not just from using pre-built libraries.

In this post, I’ll try my best to provide a comprehensive overview of Linear Regression, covering its mathematical foundations, common solution approaches, and real-world applications. I’ll delve into the theory and concepts, while practical examples of Linear Regression will be covered in another post which will also be my Linear Regression Assignment for the course.

I, What is Regression?

Regression is a statistical technique used for modeling and analyzing relationships between a dependent variable (target) and one or more independent variables (predictors or features). In essence, regression is used to predict the value of the dependent variable based on the values of the independent variables. It plays a key role in both prediction and forecasting, as well as understanding the relationships within data. In its simplest form and for the context of machine learning, if you remember one of those problems back in highschool math where you have to find an equation to fit certain points in a graph, then thats a start! Some more specific examples involve using tax rate, inflation, etc… to predict house prices, or market trend. The general participants in a regression problem is as follow:

y: The target variable - which in this case would always be continuousx=(x1,x2,,xd)TRd: The predictive variables - could be continuous or discrete\begin{gather*} y : \text{ The target variable - which in this case would always be continuous} \quad \\ \mathbf{x} = (x_1, x_2, \ldots, x_d)^T \in \mathbb{R}^d : \text{ The predictive variables - could be continuous or discrete} \quad \end{gather*}
Block

Example: Evaluation of Predicted vs. Actual House Prices

Example: Evaluation of Predicted vs. Actual House Prices

This is a fundamental technique in statistical modeling, to understand modern and machine learning approaches that leveraged on this concept, we must first grasp Classic Regression. Which often refers to a few main types of regression:

  • Linear Regression: A regression technique that models the relationship between independent variables and a dependent variable as a linear function.
    • Assumes linearity in data, meaning the relationship between predictors and the target variable is modeled as a straight line.
    • Uses Ordinary Least Squares (OLS) to estimate the coefficients by minimizing the sum of squared residuals.
    • General form:
y=β0+β1x1+β2x2++βdxd+ϵ\begin{equation} y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_d x_d + \epsilon \quad \end{equation}
Block
  • Non-linear Regression: A regression technique used to model relationships between variables that do not follow a straight line, but rather involve more complex relationships.
    • The model can involve exponential, logarithmic, polynomial, or other forms of non-linear relationships.
    • It uses iterative algorithms to fit the parameters since the relationship is not a simple linear function (eg. Newton Raphson).
    • General form:
y=β0+β1x1+β2x22++βdxdd+ϵ\begin{equation} y = \beta_0 + \beta_1 x_1 + \beta_2 x^2_2 + \ldots + \beta_d x^d_d + \epsilon \quad \end{equation}
Block
  • Single Regression: Regression with a single dependent variable and one independent variable.
xR and yRx \in \mathbb{R} \text{ and } y \in \mathbb{R} \quad
Block
  • Multiple Regression: Regression analysis where there is one dependent variable and multiple independent variables.
xRd and yRx \in \mathbb{R}^d \text{ and } y \in \mathbb{R} \quad
Block
  • Multivariate Regression: Regression analysis with multiple dependent variables and multiple independent variables.
xRd and yRcc>1x \in \mathbb{R}^d \text{ and } y \in \mathbb{R}^c \quad | \quad c > 1 \quad
Block

II, Linear Regression - General Concept

Prediction Function

In linear regression, the prediction function is typically written as:

hθ(x)=θ0+θ1x1++θdxd\begin{equation} h_\theta(x) = \theta_0 + \theta_1 x_1 + \ldots + \theta_d x_d \quad \end{equation}
Block

Here, θ0\theta_0 represents the intercept term.

Intercept Term Inclusion

In practice, it’s common to “add a column of ones to the data”. This means we set x01x_0 \equiv 1 for all data points, allowing us to rewrite the prediction function hθ(x)h_\theta(x) as:

hθ(x)=i=0dθixi=θTx\begin{equation} h_\theta(x) = \sum_{i=0}^{d} \theta_i x_i = \boldsymbol{\theta}^T \mathbf{x} \quad \end{equation}
Block
Where θ and x are vectors: θ=[θ0,θ1,...,θd]T and x=[1,x1,...,xd]T\begin{gather*} \text{Where } \boldsymbol{\theta} \text{ and } \mathbf{x} \text{ are vectors: } \quad \\ \boldsymbol{\theta} = [\theta_0, \theta_1, ..., \theta_d]^T \text{ and } \mathbf{x} = [1, x_1, ..., x_d]^T \quad \end{gather*}
Block

This practice means a few things for us:

  • Mathematical convenience: By adding a column of 1s, we can represent the intercept term as a coefficient in the linear regression equation. This simplifies the calculations and makes the model more consistent.
  • Flexibility: Including an intercept term allows the model to account for a baseline level of the dependent variable, even when the independent variables are zero. This can be useful in many real-world scenarios.

Ordinary Least Squares (OLS)

The Ordinary Least Squares (OLS) method is used to determine the best-fitting line by minimizing the difference between the predicted values and the actual values in the training data. The training data with NN data points is given in the format below:

(x1,y1),(x2,y2),,(xN,yN)\begin{equation} (x_1,y_1),(x_2,y_2), \ldots ,(x_N,y_N) \quad \end{equation}
Block

To estimate the coefficients θRd+1\boldsymbol{\theta} \in \mathbb{R}^{d+1}, we need to minimize the error of the model given the training data. The error is captured using a cost function J(θ)J(\theta).

J(θ)=12Nn=1N(hθ(xn)yn)2\begin{equation} J(\theta) = \frac{1}{2N} \sum_{n=1}^{N} (h_\theta(x_n) - y_n)^2 \quad \end{equation}
Block

Note that we use the constant term 12N\frac{1}{2N} here to simplifies the calculation which will involve derivatives. In essence, you can use 12,1N and 12N\frac{1}{2}, \frac{1}{N} \text{ and } \frac{1}{2N}, which should all be valid in context of MSE.

The objective of this practice is to minimize the cost function (aka Mean Squared Error - MSE) This means finding the parameter vector θ\boldsymbol{\theta} that results in the smallest possible value of J(θ)J(θ). In other words, we are trying to find the best coefficients that lead to the least amount of error between the predicted and actual values.:

J(θ)minθ\begin{equation} J(\theta) \rightarrow \min_{\theta} \quad \end{equation}
Block

Extra: Approach with Maximum Likelihood Estimation (MLE)

To have a clear path towards finding the best performing coefficients, we can use the Maximum Likelihood Estimation (MLE) approach. In practice, to predict the target yy for each data point xx, assume we have prediction function hθ(x)h_\theta(x) with error term of e\mathbf{e} (Meaning that we assume this prediction function gives output that is e\mathbf{e} off from the actual target yy):

y=hθ(x)+e\begin{equation} y = h_\theta(x) + \mathbf{e} \quad \end{equation}
Block

So assume e\mathbf{e} follows a one-dimensional Normal Distribution:

eN(0,σ2)\begin{equation} \mathbf{e} \sim \mathcal{N}(0, \sigma^2) \quad \end{equation}
Block

To find the MLE, we define the Log-Likelihood function given the data as:

(θ)=n=1Nlog[12πσ2exp((ynhθ(xn))22σ2)]=Nlog(12πσ2)12σ2n=1N(ynhθ(xn))2=:T1T2\begin{aligned} \ell(\theta) & =\sum_{n=1}^N \log \left[\frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left(-\frac{\left(y_n-h_\theta\left(x_n\right)\right)^2}{2 \sigma^2}\right)\right] \\ & =N \log \left(\frac{1}{\sqrt{2 \pi \sigma^2}}\right)-\frac{1}{2 \sigma^2} \sum_{n=1}^N\left(y_n-h_\theta\left(x_n\right)\right)^2=: T_1-T_2 \quad \end{aligned}
Block

Given that T1T_1 is a constant term, maximizing (θ)\ell(\theta) equates to minimizing:

T2=12σ2n=1N(ynhθ(xn))2=J(θ)T_2=\frac{1}{2 \sigma^2} \sum_{n=1}^N\left(y_n-h_\theta\left(x_n\right)\right)^2=J(\theta) \quad
Block

Given that J(θ)J(\theta) is the cost function in OLS:

  • Assuming our error term e\mathbf{e} actually follows a Normal Distribution then the MLE approach will help us to achieve OLS.
  • We see that in no way is θ\theta is dependant on the variance σ2\sigma^2. Therefore even in cases of ambiguous variance, we can still reasonably estimate θ\theta.

III, Linear Regression - Common Approaches

Single Regression - Sum of Squares (Normal Equation)

Given dataset with NN data points, with signature xn,ynRx_n,y_n \in \mathbb{R}. This is a classic example of Single Regression, a single independent variable xx and a single dependent variable yy. With that knowledge, we calculate the Expected Value and variance as:

E(yX=x)=hθ(x)=θ0+θ1x;var(yX=x)=σ2\mathbb{E}(y \mid X=x)=h_\theta(x)=\theta_0+\theta_1 x ; \quad \operatorname{var}(y \mid X=x)=\sigma^2 \quad
Block

In which, θ0\theta_0 is called the Intercept Term, θ1\theta_1 is called the Slope. We continue by calculating the Sample Variance and Sample Covariance:

xˉ=1Nn=1Nxnyˉ=1Nn=1NynSXX=n=1N(xnxˉ)2 - Sample Variance SXY=n=1N(xnxˉ)(ynyˉ) - Sample Covariance \begin{aligned} \bar{x} & =\frac{1}{N} \sum_{n=1}^N x_n \quad \bar{y}=\frac{1}{N} \sum_{n=1}^N y_n \\ S X X & =\sum_{n=1}^N\left(x_n-\bar{x}\right)^2 \quad \text { - Sample Variance } \\ S X Y & =\sum_{n=1}^N\left(x_n-\bar{x}\right)\left(y_n-\bar{y}\right) \quad \text { - Sample Covariance }\quad \end{aligned}
Block

So, when:

X=(1x11x21xN)y=(y1y2yN)\mathbf{X}=\left(\begin{array}{cc} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_N \end{array}\right) \quad \mathbf{y}=\left(\begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_N \end{array}\right) \quad
Block

Then:

(XTX)=(Ni=1Nxin=1Nxnn=1N(xn)2)XTy=(i=nNyni=nNyn2)\left(\mathbf{X}^T \mathbf{X}\right)=\left(\begin{array}{cc} N & \sum_{i=1}^N x_i \\ \sum_{n=1}^N x_n & \sum_{n=1}^N\left(x_n\right)^2 \end{array}\right) \quad \mathbf{X}^T \mathbf{y}=\binom{\sum_{i=n}^N y_n}{\sum_{i=n}^N y_n^2}\quad
Block

By direct computation:

(XTX)1=1SXX(1Nn=1N(xn)2xˉXˉ1)\left(\mathbf{X}^T \mathbf{X}\right)^{-1}=\frac{1}{S X X}\left(\begin{array}{cc} \frac{1}{N} \sum_{n=1}^N\left(x_n\right)^2 & -\bar{x} \\ -\bar{X} & 1 \end{array}\right) \quad
Block

Therefore:

θ^=(θ^0θ^1)=(XTX)1XTy=(yˉxˉθ^1SXY/SXX)\hat{\theta}=\binom{\hat{\theta}_0}{\hat{\theta}_1}=\left(\mathbf{X}^T \mathbf{X}\right)^{-1} \mathbf{X}^T \mathbf{y}=\binom{\bar{y}-\bar{x} \hat{\theta}_1}{S X Y / S X X} \quad
Block

Example: Single Regression

Example: Single Regression - Predicting Pressure (y) from Temperature (x)

Note: Sum of Squares is the simplest case of Normal Equation.

Multiple Regression

When applied to Single Regression, OLS usually solves the problem using direct calculation with the Sum of Squares as demonstrated above. However when moving to Multiple Regression, we must be able to generalize the concept of OLS to effectively solve such problems, below I will go in depth to some of the most common methods to achieve OLS in context of Multiple Regression:

Normal Equation

The Normal Equation is a powerful method for finding the optimal parameters (weights) in a linear regression model. It provides a systematic way to minimize the difference between the predicted values and the actual observed values in the dataset.

Say we are given dataset with dimensionality of X×yRd×RnX \times y \subset \mathbb{R}^d \times \mathbb{R}^n where:

X=((x1)T(x2)T(xN)T)=(1x11x21xd21x12x22xd21x1Nx2NxdN) and y=(y1y2yN)\mathbf{X}=\left(\begin{array}{c}\left(x_1\right)^T \\ \left(x_2\right)^T \\ \vdots \\ \left(x_N\right)^T\end{array}\right)=\left(\begin{array}{ccccc}1 & x_{11} & x_{21} & \cdots & x_{d2} \\ 1 & x_{12} & x_{22} & \cdots & x_{d2} \\ \vdots & \vdots & \ddots & \vdots & \\ 1 & x_{1N} & x_{2N} & \cdots & x_{dN}\end{array}\right) \\ \text{ and } \mathbf{y}=\left(\begin{array}{c}y_1 \\ y_2 \\ \vdots \\ y_N\end{array}\right) \quad
Block

Note that we have already implemented the inclusion of the Intercept Term, hence the column of ones. Therefore Matrix XX here can be considered an Design Matrix XRN×(d+1)X \in \mathbb{R}^{N \times (d+1)}.

Given the context, our cost function J(θ)J(\theta) is re-written as:

J(θ)=12N(Xθy)T(Xθy)J(\theta)=\frac{1}{2N}(\mathbf{X} \theta-\mathbf{y})^T(\mathbf{X} \theta-\mathbf{y}) \quad
Block

Let the gradient of the cost function J(θ) w.r.t θJ(\theta) \text{ w.r.t } \theta:

J(θ)=1N(XTXθXTy)\nabla J(\theta)=\frac{1}{N}\left( \mathbf{X}^T \mathbf{X} \theta-\mathbf{X}^T \mathbf{y} \right) \quad
Block

The gradient tells us the direction in which the cost function is increasing. When the gradient equals zero, we are at a minimum (or maximum), and thus, we can derive the optimal parameters θ^\hat{\theta}. To minimize the cost function, we solve for J(θ)=0\nabla J(\theta) = 0 and derive equation:

θ^=(XTX)1XTy\hat{\theta}=\left(\mathbf{X}^T \mathbf{X}\right)^{-1} \mathbf{X}^T \mathbf{y} \quad
Block

The equation above is called the Normal Equation. We notice how in the process of solving the Normal Equation, we encounter the inverse operation (XTX)1(\mathbf{X}^T \mathbf{X})^{-1}.

  • If this inverse exists, only then can we solve for θ^\hat{\theta} by direct computation (The best case scenario).
  • If this inverse does not exist, meaning that (XTX)(\mathbf{X}^T \mathbf{X}) is not Full Rank, this leads to several implications:
    • There exist a linear dependency between the columns of XX (ie. multicollinearity). This means that some features are linearly dependent on others, which can cause issues in estimating the coefficients reliably.
    • We could find a way to reduce the dimensionality of XX by removing dependent features so that the Design Matrix XX is Full Rank, enabling the computation of the inverse.

Without delving into the specifics of dimensionality reduction and feature selection, we approach the problem by calculating Expectation. Let E(θ^X)\mathbb{E}(\hat{\theta} \mid \mathbf{X}) be the Expectation given θ^\hat{\theta} depending on XX

E(θ^X)=E((XTX)1XTyX)=((XTX)1XT)E(yX)=(X(XTX)1)xθ=θ\begin{aligned} \mathbb{E}(\hat{\theta} \mid \mathbf{X}) & =\mathbb{E}\left(\left(\mathbf{X}^T \mathbf{X}\right)^{-1} \mathbf{X}^T \mathbf{y} \mid \mathbf{X}\right) \\ & =\left(\left(\mathbf{X}^T \mathbf{X}\right)^{-1} \mathbf{X}^T\right) \mathbb{E}(\mathbf{y} \mid \mathbf{X}) \quad \\ & =\left(\mathbf{X}\left(\mathbf{X}^T \mathbf{X}\right)^{-1}\right) \mathbf{x} \theta \\ & =\theta\end{aligned}
Block

The equation shows how we can compute the Expected Value of θ^\hat{\theta} conditioned on XX. By ensuring that E(θ^X)=θ\mathbb{E}(\hat{\theta} \mid \mathbf{X}) = \theta, we’re verifying that our estimation θ^\hat{\theta} is unbiased. This means that, on average, it accurately reflects the true parameter values.

By using formula: var(a+Ay)=Avar(y)AT\operatorname{var}(\mathbf{a}+\mathbf{A} \mathbf{y})=\mathbf{A} \operatorname{var}(\mathbf{y}) \mathbf{A}^T, with a,y\mathbf{a},\mathbf{y} as vectors and A\mathbf{A} as a constant matrix, we have:

var(θ^X)=var((XTX)1XTyX)=[(XTX)1XT]var(yX)[X(XTX)1]=[(XTX)1XT]σ2X[X(XTX)1]=σ2[(XTX)1XTX(XTX)1]=σ2(XTX)1\begin{aligned} \operatorname{var}(\hat{\theta} \mid \mathbf{X}) & =\operatorname{var}\left(\left(\mathbf{X}^T \mathbf{X}\right)^{-1} \mathbf{X}^T \mathbf{y} \mid \mathbf{X}\right) \\ & =\left[\left(\mathbf{X}^T \mathbf{X}\right)^{-1} \mathbf{X}^T\right] \operatorname{var}(\mathbf{y} \mid \mathbf{X})\left[\mathbf{X}\left(\mathbf{X}^T \mathbf{X}\right)^{-1}\right] \quad \\ & =\left[\left(\mathbf{X}^T \mathbf{X}\right)^{-1} \mathbf{X}^T\right] \sigma^2 \mathbf{X}\left[\mathbf{X}\left(\mathbf{X}^T \mathbf{X}\right)^{-1}\right] \\ & =\sigma^2\left[\left(\mathbf{X}^T \mathbf{X}\right)^{-1} \mathbf{X}^T \mathbf{X}\left(\mathbf{X}^T \mathbf{X}\right)^{-1}\right] \\ & =\sigma^2\left(\mathbf{X}^T \mathbf{X}\right)^{-1}\end{aligned}
Block

In which we see how the variance of θ^\hat{\theta} only depends on XX, not yy. The variance formula derived in the context of the estimation demonstrates that the uncertainty in our estimates only depends on the Design Matrix and not on the actual observed values. This characteristic is crucial for understanding how the choice of features influences the reliability of our estimates.

In summary, the Normal Equation provides a comprehensive method to derive the best-fitting parameters in linear regression by using matrix operations. It emphasizes the relationships among the independent variables and showcases how these relationships affect the predictions and their associated uncertainties. Understanding the Normal Equation not only enhances our comprehension of OLS but also equips us to handle practical scenarios, such as multicollinearity, that may arise when dealing with multiple regression problems.

QR Decomposition

In practice, the nature of the present of inverse operations in Normal Equation often leads to Numerical Instability, especially in larger datasets. This effect can result in inaccurate estimates of the regression coefficients, compromising the reliability of the model. To mitigate these issues, QR Decomposition is often employed in linear regression analysis.

QR Decomposition is a concept in Linear Algebra. “QR Decomposition - Wikipedia”. Some well-known QR Decomposition algorithm includes: Householder, Gram-Schmidt, etc…

Going back to our Design Matrix XRN×(d+1)X \in \mathbb{R}^{N \times (d+1)}. Assume we can find matrix QRN×(d+1)Q \in \mathbb{R}^{N \times (d+1)} and matrix RR(d+1)×(d+1)R \in \mathbb{R}^{(d+1) \times (d+1)} such that: X=QRX = QR, where:

  • QQ is an orthonormal column matrix, meaning QTQ=IQ^TQ = I.
  • RR is an upper triangular matrix, meaning that all elements above the main diagonal equals 0.

With matrices QQ and RR (Given RR is Full Rank) satisfied, we can denote:

X=QRXTX=(QR)T(QR)=RTQTQR=RTR(XTX)1=(RTR)1=R1(RT)1θ^=(XTX)1XTy=R1(RT)1(QR)Ty=R1(RT)1RTQTy=R1QTy.\begin{aligned} \mathbf{X} & =\mathbf{Q R} \\ \mathbf{X}^T \mathbf{X} & =(\mathbf{Q R})^T(\mathbf{Q R})=\mathbf{R}^T \mathbf{Q}^T \mathbf{Q} \mathbf{R}=\mathbf{R}^T \mathbf{R} \quad \\ \left(\mathbf{X}^T \mathbf{X}\right)^{-1} & =\left(\mathbf{R}^T \mathbf{R}\right)^{-1}=\mathbf{R}^{-1}\left(\mathbf{R}^T\right)^{-1} \\ \hat{\theta} & =\left(\mathbf{X}^T \mathbf{X}\right)^{-1} \mathbf{X}^T \mathbf{y} \\ & =\mathbf{R}^{-1}\left(\mathbf{R}^T\right)^{-1}(\mathbf{Q R})^T \mathbf{y} \\ & =\mathbf{R}^{-1}\left(\mathbf{R}^T\right)^{-1} \mathbf{R}^T \mathbf{Q}^T \mathbf{y} \\ & =\mathbf{R}^{-1} \mathbf{Q}^T \mathbf{y} .\end{aligned}
Block

From the last equation, we can derive:

Rθ^=RR1QTy=QTy.\mathbf{R} \hat{\theta}=\mathbf{R} \mathbf{R}^{-1} \mathbf{Q}^{T} \mathbf{y}=\mathbf{Q}^{T} \mathbf{y}. \quad
Block

Given RR is an upper triangular matrix, we can easily solve for θ^\hat{\theta} with Gaussian Elimination, for example:

undefined not callable (property 'fill' of [object Array])
Block

Extra: Householder

The Householder algorithm transforms a given matrix into an upper triangular form by applying a sequence of reflections. Each reflection is designed to zero out specific elements below the diagonal, thereby simplifying the matrix. Process:

  • For each column of the matrix AA, a Householder vector is constructed. This vector reflects the column onto a multiple of the standard basis vector (essentially, it “flattens” the column).
  • The transformation is defined using a Householder matrix QjQ_j, which is an orthogonal matrix that represents the reflection.
  • The original matrix RR updated by applying the Householder transformation, while the orthogonal matrix QQ is accumulated by multiplying it with each Householder matrix’s transpose.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
import numpy as np

def householder_qr(A):
    m, n = A.shape
    Q = np.eye(m)
    R = A.copy()

    for j in range(n):
        x = R[j:, j]
        e = np.zeros_like(x)
        e[0] = 1
        u = x + np.sign(x[0]) * np.linalg.norm(x) * e
        v = u / np.linalg.norm(u)
        Q_j = np.eye(m)
        Q_j[j:, j:] -= 2.0 * np.outer(v, v)
        R = Q_j @ R
        Q = Q @ Q_j.T

    return Q, R
Python
Python implementation of the Householder algorithm

Think of the Householder transformation as using mirrors to reflect vectors. Each reflection aligns a column of the matrix with the direction of the coordinate axes, progressively creating zeros below the diagonal.

Extra: Gram-Schmidt

The Gram-Schmidt algorithm constructs an orthogonal basis from the columns of the matrix by iteratively orthogonalizing each column with respect to the previously computed columns. Process:

  • For each column of matrix AA, the algorithm subtracts projections of that column onto all previously computed orthogonal vectors.
  • The resulting vector is then normalized to create a unit vector, which forms part of the orthogonal matrix QQ.
  • The coefficients of these projections are stored in matrix RR, which contains the inner products of the original columns with the orthogonalized columns.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
import numpy as np

def gram_schmidt_qr(A):
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))

    for j in range(n):
        v = A[:, j]
        for i in range(j):
            R[i, j] = np.dot(Q[:, i], A[:, j])
            v = v - R[i, j] * Q[:, i]
        R[j, j] = np.linalg.norm(v)
        Q[:, j] = v / R[j, j]

    return Q, R
Python
Python implementation of the Gram-Schmidt algorithm

Imagine you are trying to create a set of orthogonal vectors from a set of linearly independent vectors. The Gram-Schmidt process helps by systematically removing components that align with existing vectors, thus ensuring that each new vector is orthogonal to the ones before it.

For more a more comprehensive view on QR Decomposition, you can refer to “Walter Gander - Algorithms for the QR-Decomposition”

Conclusion

In conclusion, Linear Regression is a powerful yet lightweight tool in the realm of machine learning and statistics. Its simplicity allows for quick implementations and interpretations, making it an excellent choice for understanding relationships between variables and providing a solid baseline for predictive modeling. Despite its effectiveness, it is essential to recognize the limitations inherent in Linear Regression, such as its assumptions of linearity and sensitivity to outliers, which can impact the model’s performance.

I will be covering some examples of Linear Regression as part of my Assignment post for my course, which you can find here MAT3533 - Linear Regression Assignment.

Furthermore I will also implement Linear Regression from scratch in Python, which you can learn more about here Linear Regression - Python implementation from scratch.

Thank you for reading to the end, if there are any mistakes or suggestions to the material, feel free to contact me at Contact.

Extras:

  • L1, L2 Regularization for Linear Regression (Lasso, Ridge).
  • Visualization techniques for Linear Regression.
  • Data Preprocessing for Linear Regression.

References: