Mathematical Building Blocks of Neural Networks#

Machine Learning Methods#

Module 7: Neural Networks#

Part 1: Mathematical Building Blocks of Neural Networks#

Instructor: Farhad Pourkamali#

Overview#


  • Video: https://www.youtube.com/watch?v=3pFDgR245HU

  • So far, we have seen two linear models for regression and classification problems

    • Linear regression: \(p(y|\mathbf{x},\mathbf{w})=\mathcal{N}(y|\mathbf{w}^T\mathbf{x}, \epsilon)\), where \(\epsilon\) is a fixed noise variance shared across all inputs

    • Logistic regression: \(p(y|\mathbf{x},\mathbf{w})=\text{Ber}(y|\sigma(\mathbf{w}^T\mathbf{x}))\), where \(\sigma\) is the sigmoid or logistic function

  • These models make the strong assumption that the input-output mapping is linear

    • A better idea: we can equip the feature extractor with its own “trainable” parameters

    \[\mathbf{w}^T\phi(\mathbf{x};\boldsymbol{\theta})\]
    • Illustrative example in the next slide!

Example, part 1: feature extractor#


  • Consider a classification problem with 3 input features \(x_1,x_2,x_3\)
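  • The slide’s diagram is not reproduced here; as a stand-in, the minimal NumPy sketch below maps the three input features through a trainable feature extractor \(\phi(\mathbf{x};\boldsymbol{\theta})=\text{ReLU}(\boldsymbol{\Theta}\mathbf{x}+\mathbf{b})\) followed by a linear readout \(\mathbf{w}^T\phi(\mathbf{x};\boldsymbol{\theta})\); the parameter values and hidden width are placeholder choices, not the lecture’s

import numpy as np

rng = np.random.default_rng(0)

D, H = 3, 4                        # 3 input features, 4 learned features (hypothetical width)
Theta = rng.normal(size=(H, D))    # trainable parameters theta of the feature extractor
b = np.zeros(H)
w = rng.normal(size=H)             # linear readout weights in w^T phi(x; theta)

def phi(x, Theta, b):
    # Trainable feature extractor: phi(x; theta) = ReLU(Theta x + b)
    return np.maximum(0, Theta @ x + b)

x = np.array([0.5, -1.0, 2.0])     # an arbitrary input (x1, x2, x3)
print(w @ phi(x, Theta, b))        # the score passed on to the classifier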

Example, part 2: classifier#


  • This is the key idea behind the multilayer perceptron (MLP) for “structured” or “tabular” data \(\mathbf{x}\in\mathbb{R}^D\)
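  • Putting the two parts together, the sketch below fits a small MLP to synthetic tabular data with three features using scikit-learn’s MLPClassifier; the dataset, hidden width, and other settings are illustrative assumptions rather than the lecture’s example

from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic tabular data with D = 3 features (placeholder dataset)
X, y = make_classification(n_samples=200, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

# One hidden layer plays the role of the trainable feature extractor;
# the output layer acts as the (logistic) classifier
clf = MLPClassifier(hidden_layer_sizes=(4,), activation="relu",
                    max_iter=2000, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))  # training accuracy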

1. Feedforward networks#


  • The information flows through the function being evaluated from \(\mathbf{x}\), through the intermediate computations, and finally to the output

  • For example, we might have \(f^{(1)}, f^{(2)}, f^{(3)}\) connected in a chain to form

\[f(\mathbf{x})=f^{(3)}\Big(f^{(2)}\big(f^{(1)}(\mathbf{x})\big)\Big)\]
  • \(f^{(1)}\): the first layer, \(f^{(2)}\): the second layer of the network

    • Because the training data does not show the desired output for these layers, they are called hidden layers

  • The overall length of the chain gives the depth, and the dimensionality of hidden layers determines the width of the model
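  • A minimal sketch of such a chain, with hypothetical layer widths and random (untrained) weights: the network below has depth 3, and its two hidden layers have widths 4 and 3

import numpy as np

rng = np.random.default_rng(1)

# Random, untrained weight matrices; the widths are illustrative only
W1, b1 = rng.normal(size=(4, 2)), np.zeros(4)   # f1: R^2 -> R^4 (hidden width 4)
W2, b2 = rng.normal(size=(3, 4)), np.zeros(3)   # f2: R^4 -> R^3 (hidden width 3)
W3, b3 = rng.normal(size=(1, 3)), np.zeros(1)   # f3: R^3 -> R   (output layer)

f1 = lambda x: np.maximum(0, W1 @ x + b1)
f2 = lambda h: np.maximum(0, W2 @ h + b2)
f3 = lambda h: W3 @ h + b3

x = np.array([1.0, -2.0])
print(f3(f2(f1(x))))   # f(x) = f3(f2(f1(x))), a chain of depth 3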

Classical example: The XOR problem#


  • Learn a function that computes the exclusive OR of its two binary inputs (4 data points in \(\mathbb{R}^2\))

    • Activation function for the hidden layer: \(g(z)=\text{ReLU}(z)=\max\{0,z\}\); no activation function for the output layer

Classical example: The XOR problem#


  • This network correctly predicts all four class labels
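  • The weight values from the slide are not reproduced here, but one well-known hand-crafted solution (the one given in Goodfellow et al.’s Deep Learning textbook, Ch. 6) can be checked directly, using a ReLU hidden layer and a linear output layer as described above

import numpy as np

# All four XOR inputs and their labels
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# Hand-crafted parameters (as in Goodfellow et al., Deep Learning, Ch. 6)
W = np.array([[1, 1], [1, 1]])   # hidden-layer weights
c = np.array([0, -1])            # hidden-layer biases
w = np.array([1, -2])            # output-layer weights (no output activation)

h = np.maximum(0, X @ W + c)     # ReLU hidden layer
f = h @ w                        # linear output layer
print(f)                         # [0 1 1 0] -> matches y on all four points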

Why nonlinear activation functions?#


  • If we use only a linear (identity) activation function, then, ignoring bias terms, the whole model reduces to a regular linear model (a quick numerical check follows this list)

\[f(\mathbf{x})=\mathbf{W}^{(L)}\ldots\mathbf{W}^{(2)}\mathbf{W}^{(1)}\mathbf{x}=\mathbf{W}\mathbf{x}\]
  • Therefore, it is important to use nonlinear activation functions

    • Sigmoid (logistic) function: \(\sigma(a)=\frac{1}{1+\exp(-a)}\)

    • Rectified linear unit (ReLU): \(\text{ReLU}(a)=\max\{0, a\}\)
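  • A quick numerical check of this collapse, using arbitrary random weight matrices and ignoring biases: composing several purely linear layers gives exactly the same map as the single matrix \(\mathbf{W}=\mathbf{W}^{(3)}\mathbf{W}^{(2)}\mathbf{W}^{(1)}\)

import numpy as np

rng = np.random.default_rng(2)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(5, 4))
W3 = rng.normal(size=(2, 5))
x = rng.normal(size=3)

layered = W3 @ (W2 @ (W1 @ x))       # three "layers" with identity activations
single = (W3 @ W2 @ W1) @ x          # one equivalent linear map W = W3 W2 W1
print(np.allclose(layered, single))  # True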

2. A tour of activation functions#


  • ReLU (Rectified Linear Unit)

    • ReLU is known for its simplicity and has been widely used in deep learning due to its ability to introduce non-linearity into the network

    \[\text{ReLU}(x)=\max\{0, x\}\]
  • ELU (Exponential Linear Unit)

    • ELU addresses the “dying ReLU” problem by allowing negative values, which helps to keep gradients flowing during training

    • \(\alpha\) is a hyperparameter that controls the output of the function for negative values of \(x\)

    \[\begin{split}\text{ELU}(x)=\begin{cases}x & \text{if } x \geq 0 \\ \alpha \cdot \big(\exp(x) - 1\big) & \text{if } x < 0 \end{cases}\end{split}\]
  • Softplus

    • Softplus is a smooth approximation of the ReLU function

    • It has the advantage of being differentiable everywhere, which makes it suitable for gradient-based optimization algorithms

    \[\text{Softplus}(x)=\log\big(1+\exp(x)\big)\]
import numpy as np
import matplotlib.pyplot as plt

# Define a range of x values
x = np.linspace(-5, 5, 200)  # Adjust the range as needed

# Define the ReLU function
def relu(x):
    return np.maximum(0, x)

# Define the ELU function
def elu(x, alpha=2.0):
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1))

# Define the Softplus function
def softplus(x):
    return np.log(1 + np.exp(x))


# Plot the activation functions
plt.figure(figsize=(8, 3))

plt.subplot(1, 3, 1)
plt.title("ReLU")
plt.plot(x, relu(x), 'r-')
plt.grid()

plt.subplot(1, 3, 2)
plt.title(r"ELU ($\alpha=2$)")
plt.plot(x, elu(x))
plt.plot(x, relu(x), 'r:')
plt.grid()

plt.subplot(1, 3, 3)
plt.title("Softplus")
plt.plot(x, softplus(x))
plt.plot(x, relu(x), 'r:')
plt.grid()

plt.tight_layout()
plt.show()
Output figure: ReLU, ELU (\(\alpha=2\)), and Softplus plotted over \([-5, 5]\), with the ReLU curve overlaid as a dotted line in the ELU and Softplus panels for comparison.