Advanced

Machine Learning — Deep & Modern Methods

⏱ ~20 hours📚 16 modules

1. Neural Networks from First Principles

A neural network stacks layers of neurons. Each neuron computes a weighted sum followed by a non-linear activation \(\sigma\):

\[ a^{[l]} = \sigma\left(W^{[l]} a^{[l-1]} + b^{[l]}\right) \]
Deep neural network diagram

Common activations: ReLU \(\max(0, z)\), sigmoid, tanh, softmax (output layer for multi-class).

import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, input_dim, hidden, num_classes):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )
    def forward(self, x):
        return self.net(x)

Visualize architectures in our Neural Network Simulator.

2. Backpropagation & Gradient Descent

Training minimizes loss \(L\) via gradient descent. Backpropagation applies the chain rule to compute \(\partial L / \partial W^{[l]}\) efficiently:

\[ W := W - \alpha \frac{\partial L}{\partial W} \]

Modern optimizers (Adam, AdamW) adapt learning rates per parameter. Use learning rate schedules and gradient clipping for stability.

3. Convolutional Neural Networks

CNNs exploit spatial locality in images via convolution filters, pooling, and hierarchical feature learning. Architectures: LeNet, AlexNet, ResNet, EfficientNet.

model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(64, 10),
)

4. Recurrent Networks & Sequences

RNNs, LSTMs, and GRUs process sequential data (text, time series). Hidden state \(h_t\) depends on previous step: \(h_t = f(W x_t + U h_{t-1} + b)\). Transformers have largely replaced RNNs for NLP due to parallelization and long-range dependencies.

5. Transformers & Attention

Self-attention computes relationships between all token pairs. Scaled dot-product attention:

\[ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V \]

Multi-head attention runs parallel attention heads. Encoder-decoder transformers power GPT, BERT, and modern vision models (ViT).

6. Production Machine Learning

  • Data versioning — DVC, lakehouse patterns
  • Experiment tracking — MLflow, Weights & Biases
  • Model serving — ONNX, TorchServe, Triton
  • Monitoring — data drift, concept drift, latency SLAs

Responsible Deployment

Evaluate fairness across demographic groups, document limitations, and maintain human oversight for high-stakes decisions.