Beginner

Machine Learning — Foundations

⏱ ~12 hours · 📚 12 modules · 🐍 Python

1. What is Machine Learning?

Machine Learning (ML) is a subset of artificial intelligence where computers learn patterns from data rather than being explicitly programmed with rules for every scenario. Instead of writing if temperature > 30 then... for thousands of cases, we show the algorithm many examples and let it discover the relationship.

ML powers recommendation systems (Netflix, Spotify), spam filters, medical diagnosis aids, autonomous vehicles, and language models. The unifying idea: generalize from experience.

Three Types of Learning

Supervised: labeled data (input → known output). Unsupervised: find structure without labels. Reinforcement: learn through rewards in an environment.
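
To make the first two categories concrete, here is a minimal sketch using scikit-learn. The toy points and the variable names (X_toy, labels_toy, clf_toy, km) are invented for illustration, and KMeans is just one possible unsupervised algorithm. Reinforcement learning is left out because it needs an environment to interact with.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Toy data (made up for illustration): two well-separated groups of 2-D points.
X_toy = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
                  [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]])
labels_toy = np.array([0, 0, 0, 1, 1, 1])  # known outputs, used only by the supervised model

# Supervised: learn a mapping from inputs to the known labels.
clf_toy = LogisticRegression().fit(X_toy, labels_toy)
print("Supervised prediction:", clf_toy.predict([[1.0, 1.0], [5.0, 5.0]]))

# Unsupervised: no labels are given; the algorithm groups similar points itself.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_toy)
print("Cluster assignments:", km.labels_)
```

The cluster numbers themselves are arbitrary: KMeans may call the first group 0 or 1, but points from the same group should share a number.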

2. Python Environment Setup

Python is the lingua franca of ML thanks to libraries like NumPy, pandas, and scikit-learn. Install Python 3.10+ and create a virtual environment:

python -m venv ml-env
# Windows:
ml-env\Scripts\activate
# macOS/Linux:
source ml-env/bin/activate
pip install numpy pandas scikit-learn matplotlib
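
To confirm that the packages landed in the active environment, a short check like the following may help. It is a sketch that relies only on the standard library's importlib.metadata; the names packages and installed are illustrative.

```python
from importlib.metadata import PackageNotFoundError, version

# Packages installed by the pip command above.
packages = ["numpy", "pandas", "scikit-learn", "matplotlib"]

installed = {}
for name in packages:
    try:
        installed[name] = version(name)
    except PackageNotFoundError:
        installed[name] = None  # not installed in the active environment

for name, ver in installed.items():
    print(f"{name}: {ver if ver else 'NOT INSTALLED'}")
```

If a package shows as not installed, the virtual environment is probably not activated in the current shell.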

3. NumPy — Numerical Computing

Most ML algorithms, including neural networks, operate on arrays and matrices of numbers. NumPy provides fast n-dimensional arrays:

import numpy as np

X = np.array([[1, 2], [3, 4], [5, 6]])  # 3 samples, 2 features
y = np.array([3, 7, 11])                   # targets

# Mean of each feature column
print(X.mean(axis=0))  # [3. 4.]

# Dot product — foundation of linear models
weights = np.array([0.5, 1.0])
predictions = X @ weights
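
As a follow-up sketch, the same arrays can show how good a given set of weights is. The helper mse is a hypothetical name introduced here; it computes the mean squared difference between the predictions and the targets.

```python
import numpy as np

X = np.array([[1, 2], [3, 4], [5, 6]])  # same 3 samples, 2 features as above
y = np.array([3, 7, 11])                # same targets as above

def mse(weights):
    """Mean squared error of the linear predictions X @ weights against y."""
    predictions = X @ weights
    return np.mean((predictions - y) ** 2)

print(X @ np.array([0.5, 1.0]))   # [2.5 5.5 8.5]
print(mse(np.array([0.5, 1.0])))  # about 2.92
print(mse(np.array([1.0, 1.0])))  # 0.0, since every target equals x1 + x2
```

Weights of [1.0, 1.0] fit this toy data exactly, which is the kind of relationship a learning algorithm is meant to discover on its own.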

4. Working with Data — pandas

Real datasets arrive as CSV, JSON, or database tables. pandas wraps NumPy with labeled columns:

import pandas as pd

df = pd.read_csv("housing.csv")
print(df.head())
print(df.describe())  # mean, std, quartiles

# Handle missing values
df["bedrooms"] = df["bedrooms"].fillna(df["bedrooms"].median())

# Feature matrix and target vector
X = df[["sqft", "bedrooms", "age"]]
y = df["price"]
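
Since housing.csv is not included here, the following sketch reproduces the same steps on a small in-memory CSV. The four rows are invented for illustration, and only the column names are taken from the example above.

```python
import io
import pandas as pd

# Hypothetical stand-in for housing.csv; the values are invented for illustration.
csv_text = """sqft,bedrooms,age,price
1400,3,20,240000
1600,,15,275000
1700,4,30,290000
1100,2,45,180000
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.isna().sum())  # count of missing values per column

# Fill the missing bedroom count with the column median (3.0 here).
df["bedrooms"] = df["bedrooms"].fillna(df["bedrooms"].median())

X = df[["sqft", "bedrooms", "age"]]  # feature matrix
y = df["price"]                      # target vector
print(X.shape, y.shape)  # (4, 3) (4,)
```

Assigning the result of fillna back to the column avoids the in-place form, which recent pandas versions discourage on a selected column.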

5. Supervised Learning Workflow

  1. Collect & clean data — handle missing values and outliers, and encode categorical features.
  2. Split data — training set to learn, test set to evaluate generalization.
  3. Choose a model — start simple (linear regression) before complex models.
  4. Train — fit parameters to minimize a loss function.
  5. Evaluate — measure performance on unseen data.
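
The five steps above can be sketched end to end in a few lines. The data is synthetic (generated from a known linear rule plus noise, which is an assumption made for illustration), so the learned weights can be compared with the true ones.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# 1. Collect data: synthetic here, y = 2*x1 - x2 + 5 plus small noise (an assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 2 * X[:, 0] - X[:, 1] + 5 + rng.normal(scale=0.1, size=200)

# 2. Split: hold out 20% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 3. Choose a simple model, and 4. train it on the training set only.
model = LinearRegression()
model.fit(X_train, y_train)

# 5. Evaluate on data the model has not seen.
test_mse = mean_squared_error(y_test, model.predict(X_test))
print("Learned weights:", model.coef_, "bias:", model.intercept_)
print("Test MSE:", test_mse)
```

The learned weights should land close to 2 and -1 and the bias close to 5. Real data rarely behaves this neatly, so step 1 usually takes most of the effort.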

6. Linear Regression

Linear regression predicts a continuous target as a weighted sum of features plus a bias term. Given features \(x\) and target \(y\), we model:

\[ \hat{y} = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b = \mathbf{w}^T \mathbf{x} + b \]

We find weights that minimize the Mean Squared Error (MSE):

\[ J(\mathbf{w}, b) = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}^{(i)} - y^{(i)})^2 \]

Figure: Linear regression finds the line that minimizes squared vertical distances to the data points.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# random_state fixes the shuffle so the split (and the score) is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
print("R² score:", model.score(X_test, y_test))
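
To show what "minimizing MSE" means mechanically, here is a minimal gradient descent sketch in NumPy. It is an illustration only: scikit-learn's LinearRegression solves the least-squares problem directly rather than iterating like this. The data is synthetic and noise-free, and the learning rate and step count are hand-picked assumptions.

```python
import numpy as np

# Synthetic, noise-free data (an assumption): y = 3*x1 - 2*x2 + 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + 1

w = np.zeros(2)       # weights, one per feature
b = 0.0               # bias term
learning_rate = 0.1   # hypothetical step size, chosen by hand
m = len(y)

for _ in range(500):
    error = X @ w + b - y                         # y_hat - y for every sample
    w -= learning_rate * (2 / m) * (X.T @ error)  # gradient of J with respect to w
    b -= learning_rate * (2 / m) * error.sum()    # gradient of J with respect to b

final_mse = np.mean((X @ w + b - y) ** 2)
print("w:", w, "b:", b, "MSE:", final_mse)
```

Each step moves the parameters a little in the direction that reduces J(w, b). On this data the weights should approach 3 and -2 and the bias should approach 1.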

Try our Interactive Regression Playground to see MSE update as you add points.

7. Logistic Regression for Classification

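When the target is categorical (spam/not spam, disease/healthy), we use logistic regression. It computes the same kind of weighted sum as linear regression, written here as \(\theta^T x\) with \(\theta\) collecting the weights and bias, and passes it through the sigmoid function to output a probability between 0 and 1: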

\[ h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}} \]
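
from sklearn.linear_model import LogisticRegression

# y_train and y_test must hold class labels here (e.g. 0/1 for healthy/disease),
# not the continuous prices used in the regression example.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
proba = clf.predict_proba(X_test)  # one column of probabilities per class
preds = clf.predict(X_test)        # most probable class for each sample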
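
The link between the sigmoid formula and predict_proba can be checked directly. The sketch below uses synthetic binary labels (an assumption made for illustration) and a hand-written sigmoid helper, then rebuilds the class-1 probability from the fitted weights and intercept.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic binary labels (an assumption): class 1 when x1 + x2 > 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

# Linear score theta^T x, built from the fitted weights and intercept.
z = X @ clf.coef_.ravel() + clf.intercept_[0]
manual_proba = sigmoid(z)
sklearn_proba = clf.predict_proba(X)[:, 1]  # probability of class 1

print(sigmoid(0.0))  # 0.5
print("Manual and library probabilities agree:",
      np.allclose(manual_proba, sklearn_proba))
```

A score of 0 maps to a probability of 0.5, which is why the decision boundary of a binary logistic regression sits where the weighted sum equals zero.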

8. Evaluation Metrics

Regression: MSE, RMSE, MAE, R². Classification: accuracy, precision, recall, F1-score, ROC-AUC.

  • Precision — of predicted positives, how many are correct?
  • Recall — of actual positives, how many did we find?
  • F1 — harmonic mean of precision and recall.
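
from sklearn.metrics import classification_report, confusion_matrix

print(confusion_matrix(y_test, preds))       # rows: actual class, columns: predicted class
print(classification_report(y_test, preds))  # precision, recall, F1 per class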
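
To see where these numbers come from, the following sketch computes precision, recall, and F1 by hand from a confusion matrix. The ten labels are made up for illustration.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up labels for illustration: 1 = positive class, 0 = negative class.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# For binary 0/1 labels, scikit-learn lays the matrix out as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # 3 / 5 = 0.6
recall = tp / (tp + fn)     # 3 / 4 = 0.75
f1 = 2 * precision * recall / (precision + recall)  # about 0.667

print(tn, fp, fn, tp)  # 4 2 1 3
print(precision, recall, f1)
```

Accuracy here is 7 out of 10, yet precision and recall differ. That gap is why accuracy alone can be misleading, especially when one class is much rarer than the other.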

9. Bias, Variance & Overfitting

Underfitting (high bias): model too simple to capture patterns. Overfitting (high variance): model memorizes training noise and fails on new data.
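
One way to observe both failure modes is to fit polynomials of increasing degree to noisy data and compare training and test error. This is a sketch under stated assumptions: the sine-plus-noise data, the sample sizes, and the degrees 1, 4, and 15 are all chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (an assumption): a sine curve plus Gaussian noise.
rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(0, 1, size=(n, 1))
    y = np.sin(2 * np.pi * x[:, 0]) + rng.normal(scale=0.2, size=n)
    return x, y

x_train, y_train = make_data(30)
x_test, y_test = make_data(200)

train_mse, test_mse = {}, {}
for degree in (1, 4, 15):  # too simple, reasonable, very flexible
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    train_mse[degree] = mean_squared_error(y_train, model.predict(x_train))
    test_mse[degree] = mean_squared_error(y_test, model.predict(x_test))
    print(f"degree {degree:2d}: train MSE {train_mse[degree]:.3f}, "
          f"test MSE {test_mse[degree]:.3f}")
```

The straight line (degree 1) should show high error on both sets, which is underfitting. A very flexible model such as degree 15 typically drives training error down while test error stops improving or gets worse, which is the signature of overfitting.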

Golden Rule

Never tune hyperparameters or select models using the test set. Use a validation set or cross-validation on training data only.
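
A minimal sketch of this rule in practice follows. Ridge regression and its alpha hyperparameter are assumptions introduced here only to have something to tune, and the data is synthetic. The key point is that cross_val_score only ever sees the training split.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic regression data (an assumption): 5 features with known weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.5, -2.0, 0.0, 0.5, 3.0]) + rng.normal(scale=0.5, size=200)

# Set the test set aside first; it is not touched during tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Compare candidate values of Ridge's alpha using 5-fold CV on training data only.
candidate_alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
cv_scores = {
    alpha: cross_val_score(Ridge(alpha=alpha), X_train, y_train, cv=5).mean()
    for alpha in candidate_alphas
}
best_alpha = max(cv_scores, key=cv_scores.get)

# Only after the choice is final: fit on all training data, score once on the test set.
final_model = Ridge(alpha=best_alpha).fit(X_train, y_train)
print("Best alpha:", best_alpha)
print("Test R^2:", final_model.score(X_test, y_test))
```

Because the test set played no part in choosing alpha, the final score remains an honest estimate of performance on new data.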

Next step: Continue to ML Intermediate for decision trees, ensembles, and cross-validation.
