Multinomial Logistic Regression

Given a dataset $X = \{ (x_i, y_i) \}_{i=1}^N$ where $x_i \in \mathbb{R}^n$ and $y_i \in \{ 1, 2, \ldots, C \}$, the Multinomial Logistic Regression model states that the probability of the input $x$ belonging to class $y = c$ is:

$$P(y = c \mid x, \Theta, \mathcal{B}) = \frac{\exp(z_c)}{\sum_{\kappa=1}^C \exp(z_\kappa)}; \quad z_\kappa = x^T \theta_\kappa + \beta_\kappa$$
  • Where $\Theta = \{ \theta_1, \theta_2, \ldots, \theta_C \}$ are the weights and $\mathcal{B} = \{ \beta_1, \beta_2, \ldots, \beta_C \}$ are the biases; together they make up the model's parameters.
  • The set $Z = \{ z_1, z_2, \ldots, z_C \}$ contains the logits for each of the $C$ classes: the linear combinations of the input $x$ with the weights and biases.
  • The expression $\frac{\exp(z_c)}{\sum_{\kappa=1}^C \exp(z_\kappa)}$ is the $c$-th element of $\mathcal{P} = \{ P(y=1 \mid x, \Theta, \mathcal{B}), \ldots, P(y=C \mid x, \Theta, \mathcal{B}) \}$ produced by the Softmax Function. In essence, it maps the $C$ linear combinations $Z$ of the input $x$ with the weights $\Theta$ and biases $\mathcal{B}$ to a probability distribution that sums to 1. In other words, the softmax function takes $Z \in \mathbb{R}^C$ and produces $\mathcal{P} \in \mathbb{R}^C$ such that $\sum_{c=1}^C P(y=c \mid x, \Theta, \mathcal{B}) = 1$.
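The softmax mapping can be sketched in a few lines of NumPy (the function name and the example logits here are illustrative, not from the text above):

```python
import numpy as np

def softmax(z):
    # Shift by the max logit before exponentiating; this is numerically
    # stable and leaves the resulting probabilities unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Three example logits z_1, z_2, z_3 mapped to a probability distribution.
probs = softmax(np.array([2.0, 1.0, 0.1]))
```

Note the max-shift trick: without it, large logits can overflow `np.exp`.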

Here is a visualization of the multinomial logistic regression model in the style of a neural network for queried input $x_q$:

[Figure: multinomial logistic regression drawn as a softmax neural network]

Our goal is to optimize the parameters $\Theta$ and $\mathcal{B}$ via Maximum Likelihood Estimation (MLE). The likelihood to maximize is:

$$L(\Theta, \mathcal{B}) = \prod^N_{i=1} \prod^C_{c=1} \left( \frac{\exp(x_i^T \theta_c + \beta_c)}{\sum_{\kappa=1}^C \exp(x_i^T \theta_\kappa + \beta_\kappa)} \right)^{1[y_i=c]}$$
  • Where $1[y_i=c]$ is an indicator function that equals 1 if $y_i = c$ and 0 otherwise.

We take the Log Likelihood to simplify the MLE calculation:

$$\begin{aligned} \log L(\Theta, \mathcal{B}) &= \sum^N_{i=1} \sum_{c=1}^C 1[y_i=c] \log \left( \frac{\exp(x_i^T \theta_c + \beta_c)}{\sum_{\kappa=1}^C \exp(x_i^T \theta_\kappa + \beta_\kappa)} \right) \\[10pt] &= \sum^N_{i=1} \sum_{c=1}^C 1[y_i=c] \left( x_i^T \theta_c + \beta_c - \log \sum_{\kappa=1}^C \exp(x_i^T \theta_\kappa + \beta_\kappa) \right) \end{aligned}$$

To find the maximum likelihood estimate, we maximize the Log Likelihood, which is equivalent to minimizing the negative Log Likelihood; thus we define the cost function:

$$J(\Theta, \mathcal{B}) = -\frac{1}{N} \log L(\Theta, \mathcal{B})$$
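This cost can be computed directly in NumPy using the log-sum-exp form of the Log Likelihood derived above. A minimal sketch, assuming 0-indexed labels $y_i \in \{0, \ldots, C-1\}$ and weights stored column-wise (function and variable names are illustrative):

```python
import numpy as np

def nll_cost(X, y, Theta, B):
    # X: (N, n) inputs; y: (N,) labels in {0, ..., C-1};
    # Theta: (n, C) with column c holding theta_c; B: (C,) biases.
    Z = X @ Theta + B                          # logits, shape (N, C)
    Z = Z - Z.max(axis=1, keepdims=True)       # stability shift
    # log softmax: z_c - log sum_k exp(z_k)
    log_probs = Z - np.log(np.exp(Z).sum(axis=1, keepdims=True))
    # Average negative log probability of the true class of each sample.
    return -log_probs[np.arange(len(y)), y].mean()

X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([0, 1])
cost = nll_cost(X, y, np.zeros((2, 3)), np.zeros(3))
# With all-zero parameters every class gets probability 1/3, so cost = log 3.
```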

We use Gradient Descent to solve the optimization problem iteratively:

  • Step 1: Initialize the parameters $\Theta, \mathcal{B}$ randomly.
  • Step 2: Calculate the gradient of $J(\Theta, \mathcal{B})$ with respect to each class's weights $\theta_c$ and bias $\beta_c$:
$$\begin{aligned} \frac{\partial J}{\partial \theta_c} &= -\frac{1}{N} \sum_{i=1}^N \left(1[y_i=c] - \frac{\exp(x_i^T \theta_c + \beta_c)}{\sum_{\kappa=1}^C \exp(x_i^T \theta_\kappa + \beta_\kappa)}\right)x_i \\[10pt] \frac{\partial J}{\partial \beta_c} &= -\frac{1}{N} \sum_{i=1}^N \left(1[y_i=c] - \frac{\exp(x_i^T \theta_c + \beta_c)}{\sum_{\kappa=1}^C \exp(x_i^T \theta_\kappa + \beta_\kappa)}\right) \end{aligned}$$
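In vectorized form, $1[y_i=c]$ becomes a one-hot matrix and both gradients reduce to operations on the probability matrix. A minimal NumPy sketch under the same 0-indexed-label convention (names are illustrative):

```python
import numpy as np

def gradients(X, y, Theta, B):
    # Returns dJ/dTheta of shape (n, C) and dJ/dB of shape (C,).
    N, C = len(y), B.shape[0]
    Z = X @ Theta + B
    Z = Z - Z.max(axis=1, keepdims=True)
    P = np.exp(Z)
    P /= P.sum(axis=1, keepdims=True)   # softmax probabilities, (N, C)
    Y = np.eye(C)[y]                    # one-hot rows: 1[y_i = c]
    # -(1/N) sum_i (1[y_i=c] - p_ic) x_i  ==  (1/N) X^T (P - Y)
    return X.T @ (P - Y) / N, (P - Y).mean(axis=0)

X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([0, 1])
dTheta, dB = gradients(X, y, np.zeros((2, 2)), np.zeros(2))
```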

The negative of these gradients gives the direction in which updating the weights and biases produces the steepest descent towards a local minimum.

  • Step 3: Update parameters:
$$\begin{aligned} \theta_c^{\text{new}} &= \theta_c - \alpha \frac{\partial J}{\partial \theta_c} \\[10pt] \beta_c^{\text{new}} &= \beta_c - \alpha \frac{\partial J}{\partial \beta_c} \end{aligned}$$

Where $\alpha$ denotes the learning rate. We repeat steps 2 and 3 for every class $c$ from $1$ to $C$ until convergence.
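The three steps can be combined into a short NumPy training loop. This is a toy sketch: the hyperparameters, function name, and two-point dataset are illustrative, and labels are 0-indexed.

```python
import numpy as np

def train_softmax(X, y, C, alpha=0.5, iters=500, seed=0):
    # Step 1: random initialization of the parameters.
    rng = np.random.default_rng(seed)
    Theta = rng.normal(scale=0.01, size=(X.shape[1], C))
    B = np.zeros(C)
    N = len(y)
    Y = np.eye(C)[y]                       # one-hot labels
    for _ in range(iters):
        # Step 2: gradients via softmax probabilities.
        Z = X @ Theta + B
        Z = Z - Z.max(axis=1, keepdims=True)
        P = np.exp(Z)
        P /= P.sum(axis=1, keepdims=True)
        # Step 3: descend along the negative gradient.
        Theta -= alpha * (X.T @ (P - Y) / N)
        B -= alpha * (P - Y).mean(axis=0)
    return Theta, B

# Two linearly separable points, one per class.
X = np.array([[0.0, 1.0], [1.0, 0.0]])
y = np.array([0, 1])
Theta, B = train_softmax(X, y, C=2)
pred = (X @ Theta + B).argmax(axis=1)
```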

Oftentimes, regularization is introduced to reduce overfitting. We will focus on L2 regularization, implemented by simply adding a regularization term to the cost function:

$$J_{\text{reg}}(\Theta, \mathcal{B}) = -\frac{1}{N} \log L(\Theta, \mathcal{B}) + \frac{\lambda}{2N} \sum_{c=1}^C \| \theta_c \|^2_2$$

Where $\lambda$ is a constant that represents the regularization strength. It affects the weights of the model: a large $\lambda$ pushes the weights closer to 0, so it needs to be tuned strategically to properly reduce overfitting. While the bias updates remain unchanged, the weight updates gain an extra term:

$$\theta_c^{\text{new}} = \theta_c - \alpha \left( \frac{\partial J}{\partial \theta_c} + \frac{\lambda}{N} \theta_c \right)$$
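A single regularized update step, written out in NumPy with hypothetical values for the unregularized gradient and current weights (all numbers here are illustrative):

```python
import numpy as np

lam, alpha, N = 0.1, 0.5, 100          # illustrative hyperparameters
theta_c = np.array([1.0, -2.0])        # current weights of class c
grad_theta_c = np.array([0.2, -0.4])   # hypothetical dJ/dtheta_c

# theta_c <- theta_c - alpha * (dJ/dtheta_c + (lambda / N) * theta_c)
# The extra (lambda / N) * theta_c term shrinks the weights toward 0.
theta_c = theta_c - alpha * (grad_theta_c + (lam / N) * theta_c)
```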

Binary Logistic Regression

When there are $C = 2$ classes, the softmax function reduces to a special case called the Sigmoid Function. For simplicity we let $y_i \in \{ 0, 1 \}$:

$$\frac{\exp(x_i^T \theta_2 + \beta_2)}{\exp(x_i^T \theta_1 + \beta_1) + \exp(x_i^T \theta_2 + \beta_2)} = \frac{1}{1 + \exp\left(-\left(x_i^T (\theta_2 - \theta_1) + (\beta_2 - \beta_1)\right)\right)}$$

Let $\theta^{'} = \theta_2 - \theta_1$ and $\beta^{'} = \beta_2 - \beta_1$; we have:

$$\frac{1}{1+\exp(-(x_i^T \theta^{'} + \beta^{'}))}; \quad 1 - \frac{1}{1+\exp(-(x_i^T \theta^{'} + \beta^{'}))}$$

Which denote the probabilities of class 1 and class 0 respectively; notice that the two expressions above still sum to 1. Therefore, when $C = 2$, we only have to optimize one set of weights $\theta^{'}$ and a single bias $\beta^{'}$. Given a dataset $X = \{ (x_i, y_i) \}_{i=1}^N$ where $x_i \in \mathbb{R}^n$ and $y_i \in \{ 0, 1\}$, the Binary Logistic Regression model, or simply the Logistic Regression model, is represented as:

$$P(y \mid x, \theta^{'}, \beta^{'}) = (\sigma(z))^y(1-\sigma(z))^{1-y}; \quad z = x^T\theta^{'} + \beta^{'}$$
  • Where the weight vector $\theta^{'} \in \mathbb{R}^n$ and bias $\beta^{'} \in \mathbb{R}$ are the model's parameters.
  • The expression $\sigma(z) = \frac{1}{1+\exp(-z)}$ denotes the sigmoid function, and $z = x^T\theta^{'} + \beta^{'}$ is a linear combination of the input $x$ with the weights and bias. The sigmoid tends to $0$ as $z \rightarrow -\infty$, tends to $1$ as $z \rightarrow +\infty$, and satisfies $\sigma(z) + (1-\sigma(z)) = 1$.
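The reduction from softmax to sigmoid can be checked numerically in NumPy; the example logits $z_1, z_2$ below are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# For two classes, softmax on logits (z1, z2) equals sigmoid(z2 - z1).
z1, z2 = 0.3, 1.0
softmax_p2 = np.exp(z2) / (np.exp(z1) + np.exp(z2))
p1 = sigmoid(z2 - z1)    # same probability via the sigmoid
p0 = 1.0 - p1            # the complementary class
```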

Here is a visualization of the binary logistic regression model in the style of a neural network for queried input $x_q$:

[Figure: binary logistic regression drawn as a sigmoid neural network]

Our calculations will look a bit different:

  • MLE:
$$L(\theta^{'}, \beta^{'}) = \prod_{i=1}^N P(y_i \mid x_i, \theta^{'}, \beta^{'})$$
  • Log Likelihood:
$$\log L(\theta^{'}, \beta^{'}) = \sum_{i=1}^N [y_i \log \sigma (z_i) + (1 - y_i) \log (1 - \sigma(z_i))]$$
  • Cost function:
$$J(\theta^{'}, \beta^{'}) = - \frac{1}{N} \log L(\theta^{'}, \beta^{'})$$
  • Gradient Descent:
$$\begin{aligned} \frac{\partial J}{\partial \theta^{'}} &= \frac{1}{N} \sum_{i=1}^N (\sigma (z_i)- y_i)x_i \\[10pt] \frac{\partial J}{\partial \beta^{'}} &= \frac{1}{N} \sum_{i=1}^N (\sigma (z_i)- y_i) \\[10pt] \theta^{'\text{ new}} &= \theta^{'} - \alpha \frac{\partial J}{\partial \theta^{'}} \\[10pt] \beta^{'\text{ new}} &= \beta^{'} - \alpha \frac{\partial J}{\partial \beta^{'}} \end{aligned}$$
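The binary case makes for a very compact NumPy training loop; the dataset, hyperparameters, and function name below are illustrative:

```python
import numpy as np

def train_logreg(X, y, alpha=0.5, iters=500, seed=0):
    # Random init, then gradient descent on the binary cross-entropy cost.
    rng = np.random.default_rng(seed)
    theta = rng.normal(scale=0.01, size=X.shape[1])
    beta = 0.0
    N = len(y)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ theta + beta)))  # sigma(z_i)
        err = p - y                                    # sigma(z_i) - y_i
        theta -= alpha * (X.T @ err) / N
        beta -= alpha * err.mean()
    return theta, beta

# 1-D toy data separable at x = 1.5.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
theta, beta = train_logreg(X, y)
pred = ((X @ theta + beta) > 0).astype(int)   # sigma(z) > 0.5  <=>  z > 0
```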
  • L2 Regularization:
$$\begin{aligned} J_{\text{reg}} (\theta^{'}, \beta^{'}) &= -\frac{1}{N} \log L(\theta^{'}, \beta^{'}) + \frac{\lambda}{2N} \| \theta^{'} \|^2_2 \\[10pt] \theta^{'\text{ new}} &= \theta^{'} - \alpha \left( \frac{\partial J}{\partial \theta^{'}} + \frac{\lambda}{N} \theta^{'} \right) \end{aligned}$$