1. What is Machine Learning?
Machine Learning (ML) is a subset of artificial intelligence where computers learn patterns from data rather than being explicitly programmed with rules for every scenario. Instead of writing if temperature > 30 then... for thousands of cases, we show the algorithm many examples and let it discover the relationship.
ML powers recommendation systems (Netflix, Spotify), spam filters, medical diagnosis aids, autonomous vehicles, and language models. The unifying idea: generalize from experience.
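To make the contrast between hand-written rules and learned patterns concrete, here is a minimal sketch. The temperature readings, the fan labels, and the function name `rule_based` are invented for illustration, and a depth-1 decision tree is only one convenient choice of learner.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: temperature in °C and whether a fan was on (1) or off (0).
temps = np.array([[18], [22], [25], [29], [31], [33], [36], [40]])
fan_on = np.array([0, 0, 0, 0, 1, 1, 1, 1])

def rule_based(temperature):
    """Explicit programming: a human picks the threshold."""
    return int(temperature > 30)

# Learning: the algorithm picks the threshold from the examples.
tree = DecisionTreeClassifier(max_depth=1, random_state=0)
tree.fit(temps, fan_on)
learned_threshold = tree.tree_.threshold[0]  # split value at the root node

print("Learned threshold:", learned_threshold)
print("Rule says:", rule_based(32), "| model says:", tree.predict([[32]])[0])
```

The learned threshold falls between 29 and 31, the two readings where the label changes, so nobody had to type the number 30 into the program.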
Three Types of Learning
- Supervised: learn from labeled data (input → known output).
- Unsupervised: find structure in data without labels.
- Reinforcement: learn through rewards received from an environment.
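The three types can be sketched side by side. Everything below is a toy assumption made for illustration: the data points, the choice of `LinearRegression` and `KMeans`, and the two-armed bandit with its epsilon-greedy agent standing in for reinforcement learning.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Supervised: inputs X come with known outputs y.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])
reg = LinearRegression().fit(X, y)
supervised_pred = reg.predict([[5.0]])[0]  # the data follow y = 2x exactly

# Unsupervised: only inputs, no labels; the algorithm groups similar points.
points = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]])
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)

# Reinforcement: act, observe a reward, update estimates (two-armed bandit).
true_reward_prob = np.array([0.2, 0.8])  # hidden from the agent
value_estimates = np.zeros(2)
pull_counts = np.zeros(2)
for step in range(500):
    # Explore 10% of the time, otherwise exploit the best current estimate.
    if rng.random() < 0.1:
        arm = int(rng.integers(2))
    else:
        arm = int(np.argmax(value_estimates))
    reward = float(rng.random() < true_reward_prob[arm])
    pull_counts[arm] += 1
    # Incremental mean of the rewards seen so far for this arm.
    value_estimates[arm] += (reward - value_estimates[arm]) / pull_counts[arm]

print("Supervised prediction for x=5:", supervised_pred)
print("Cluster labels:", clusters)
print("Estimated arm values:", value_estimates)
```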
2. Python Environment Setup
Python is the lingua franca of ML thanks to libraries like NumPy, pandas, and scikit-learn. Install Python 3.10+ and create a virtual environment:
python -m venv ml-env
# Windows:
ml-env\Scripts\activate
# macOS/Linux:
source ml-env/bin/activate
pip install numpy pandas scikit-learn matplotlib
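Once the install finishes, a quick check like the following can confirm that the environment is usable. This is a sketch using only the standard library; the file name you save it under (for example a hypothetical check_env.py) is up to you.

```python
import sys
from importlib import metadata

# The guide assumes Python 3.10 or newer.
python_ok = sys.version_info >= (3, 10)
print("Python:", sys.version.split()[0], "(OK)" if python_ok else "(older than 3.10)")

# Look up the installed version of each package without importing it.
packages = ["numpy", "pandas", "scikit-learn", "matplotlib"]
versions = {}
for name in packages:
    try:
        versions[name] = metadata.version(name)
    except metadata.PackageNotFoundError:
        versions[name] = None  # not installed in this environment

for name, version in versions.items():
    print(f"{name}: {version if version else 'NOT INSTALLED'}")
```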
3. NumPy — Numerical Computing
Neural networks and ML algorithms operate on matrices. NumPy provides fast n-dimensional arrays:
import numpy as np
X = np.array([[1, 2], [3, 4], [5, 6]]) # 3 samples, 2 features
y = np.array([3, 7, 11]) # targets
# Mean of each feature column
print(X.mean(axis=0)) # [3. 4.]
# Dot product — foundation of linear models
weights = np.array([0.5, 1.0])
predictions = X @ weights
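Continuing with the same small arrays, the sketch below shows two operations that come up constantly: measuring how far predictions are from targets, and standardizing features with broadcasting. The variable names `errors`, `mse`, and `X_scaled` are introduced here for illustration.

```python
import numpy as np

X = np.array([[1, 2], [3, 4], [5, 6]], dtype=float)  # 3 samples, 2 features
y = np.array([3, 7, 11], dtype=float)                # targets
weights = np.array([0.5, 1.0])

predictions = X @ weights      # shape (3,): one prediction per sample
errors = predictions - y       # element-wise difference
mse = np.mean(errors ** 2)     # mean squared error of these weights

# Broadcasting: the (2,) mean and std are applied to every row of the (3, 2) array.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(predictions)             # [2.5 5.5 8.5]
print(mse)
print(X_scaled.std(axis=0))    # each column now has unit standard deviation
```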
4. Working with Data — pandas
Real datasets arrive as CSV, JSON, or database tables. pandas wraps NumPy with labeled columns:
import pandas as pd
df = pd.read_csv("housing.csv")
print(df.head())
print(df.describe()) # mean, std, quartiles
# Handle missing values
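# Assign the result back; calling fillna(..., inplace=True) on a selected column
# is deprecated in recent pandas versions and may not modify df
df["bedrooms"] = df["bedrooms"].fillna(df["bedrooms"].median())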
# Feature matrix and target vector
X = df[["sqft", "bedrooms", "age"]]
y = df["price"]
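If you do not have a housing.csv at hand, the same steps can be tried on a small in-memory table. The rows below are made up and merely stand in for the file; the column names match the ones used above.

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for housing.csv (values are invented).
df = pd.DataFrame({
    "sqft": [850, 1200, 1500, 2100],
    "bedrooms": [2, np.nan, 3, 4],
    "age": [30, 12, 8, 2],
    "price": [150_000, 210_000, 260_000, 370_000],
})

missing_before = df.isna().sum()   # missing values per column
print(missing_before)

# Fill the gap with the column median and assign the result back.
df["bedrooms"] = df["bedrooms"].fillna(df["bedrooms"].median())

X = df[["sqft", "bedrooms", "age"]]   # feature matrix
y = df["price"]                       # target vector
print(X.shape, y.shape)
```

The median is used here rather than the mean because it is less sensitive to a few unusually large or small values.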
5. Supervised Learning Workflow
- Collect & clean data — handle missing values and outliers, and encode categorical features.
- Split data — training set to learn, test set to evaluate generalization.
- Choose a model — start simple (linear regression) before complex models.
- Train — fit parameters to minimize a loss function.
- Evaluate — measure performance on unseen data.
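The five steps above can be sketched end to end. The data here are synthetic (generated with `make_regression`, with a few values deliberately blanked out), and the median imputer plus linear regression pipeline is one reasonable arrangement, not the only one.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# 1. Collect & clean: synthetic data with some values knocked out on purpose.
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)
rng = np.random.default_rng(0)
X[rng.integers(0, 200, size=10), 1] = np.nan

# 2. Split: hold out 20% for the final evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 3. Choose a simple model; the imputer learns its medians from training rows only.
model = make_pipeline(SimpleImputer(strategy="median"), LinearRegression())

# 4. Train: fit parameters by minimizing squared error.
model.fit(X_train, y_train)

# 5. Evaluate on data the model has not seen.
test_mse = mean_squared_error(y_test, model.predict(X_test))
test_r2 = model.score(X_test, y_test)
print(f"Test MSE: {test_mse:.1f}, R²: {test_r2:.3f}")
```

Putting the cleaning step inside the pipeline means statistics such as the median are computed from the training rows alone, so nothing about the test set leaks into training.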
6. Linear Regression
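Linear regression predicts a continuous target as a weighted sum of features plus a bias term. Given features \(x\) and target \(y\), we model:
\[ \hat{y} = w^\top x + b \]
We find the weights \(w\) and bias \(b\) that minimize the Mean Squared Error (MSE) over the \(n\) training examples:
\[ \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 \]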
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
print("R² score:", model.score(X_test, y_test))
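To see what "minimize the MSE" means without a library doing the work, the sketch below solves the same least-squares problem with NumPy and compares the result with scikit-learn. The four data points are chosen by hand so that the target equals the sum of the two features, and the helper `mse` is defined here for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny dataset where y = 1*x1 + 1*x2 holds exactly (values chosen by hand).
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [2.0, 7.0]])
y = np.array([3.0, 7.0, 11.0, 9.0])

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    return np.mean((y_true - y_pred) ** 2)

# Least squares with NumPy: append a column of ones so the bias is learned too.
X_with_bias = np.column_stack([X, np.ones(len(X))])
solution, *_ = np.linalg.lstsq(X_with_bias, y, rcond=None)
w, b = solution[:-1], solution[-1]

# scikit-learn should arrive at the same weights and bias.
model = LinearRegression().fit(X, y)
fit_mse = mse(y, X_with_bias @ solution)

print("NumPy   w, b:", w, b)
print("sklearn w, b:", model.coef_, model.intercept_)
print("MSE of the fit:", fit_mse)
```

Both approaches recover weights close to 1 and a bias close to 0, and the MSE is essentially zero because the toy data contain no noise.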
Try our Interactive Regression Playground to see MSE update as you add points.
7. Logistic Regression for Classification
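When the target is categorical (spam/not spam, disease/healthy), we use logistic regression. It passes a weighted sum of the features, \(z = w^\top x + b\), through the sigmoid function, which outputs a probability between 0 and 1:
\[ \sigma(z) = \frac{1}{1 + e^{-z}} \]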
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(max_iter=1000)
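# y_train must hold class labels here (e.g. 0/1), not the continuous prices from the regression example
clf.fit(X_train, y_train)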
proba = clf.predict_proba(X_test) # probability per class
preds = clf.predict(X_test)
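The link between the sigmoid and `predict_proba` can be checked by hand for a two-class problem. The sketch below uses synthetic data from `make_classification` (not the housing data), and the helper `sigmoid` is defined here for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic binary classification data.
X, y = make_classification(n_samples=300, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Linear score z = w·x + b, then sigmoid(z) gives the probability of class 1.
z = X_test @ clf.coef_[0] + clf.intercept_[0]
manual_proba = sigmoid(z)
sklearn_proba = clf.predict_proba(X_test)[:, 1]
probabilities_match = np.allclose(manual_proba, sklearn_proba)

# predict() returns class 1 when that probability exceeds 0.5.
manual_preds = (manual_proba > 0.5).astype(int)
predictions_match = np.array_equal(clf.predict(X_test), manual_preds)

print(probabilities_match, predictions_match)
```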
8. Evaluation Metrics
Regression: MSE, RMSE, MAE, R². Classification: accuracy, precision, recall, F1-score, ROC-AUC.
- Precision — of predicted positives, how many are correct?
- Recall — of actual positives, how many did we find?
- F1 — harmonic mean of precision and recall.
from sklearn.metrics import classification_report, confusion_matrix
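print(confusion_matrix(y_test, preds))  # rows: actual classes, columns: predicted classes
print(classification_report(y_test, preds))  # precision, recall, F1 per class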
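The definitions above can be worked through on a handful of labels. The ten labels and predictions below are hypothetical, picked so that the counts are easy to verify by eye.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical labels: 1 = positive (e.g. spam), 0 = negative.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # 3 of the 5 predicted positives are correct
recall = tp / (tp + fn)      # 3 of the 4 actual positives were found
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# The library functions give the same numbers.
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```

Accuracy on this example is 7/10, which shows why it can mislead: it looks acceptable even though two of every five positive predictions are wrong.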
9. Bias, Variance & Overfitting
Underfitting (high bias): model too simple to capture patterns. Overfitting (high variance): model memorizes training noise and fails on new data.
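One common way to see both failure modes is to fit polynomials of increasing degree to noisy data and compare training and test error. The noisy sine curve, the degrees 1, 4, and 15, and the 50/50 split below are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: a sine curve plus Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=40).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.2, size=40)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.5, random_state=0
)

results = {}
for degree in (1, 4, 15):
    # Higher degree means a more flexible model.
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(x_train))
    test_mse = mean_squared_error(y_test, model.predict(x_test))
    results[degree] = (train_mse, test_mse)
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

Typically the degree-1 model shows high error on both sets (underfitting), while the degree-15 model drives training error toward zero yet tends to do worse on the test set (overfitting). Exact numbers depend on the random seed.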
Golden Rule
Never tune hyperparameters or select models using the test set. Use a validation set or cross-validation on training data only.
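In practice the rule might look like the sketch below: the test set is split off first, cross-validation on the training data picks the hyperparameter, and the test set is scored once at the end. The synthetic data, the `Ridge` model, and the `alpha` grid are illustrative assumptions, not recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_regression(n_samples=200, n_features=10, noise=15.0, random_state=0)

# Set the test set aside first; it is not touched again until the very end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 5-fold cross-validation on the training data chooses the hyperparameter.
alpha_grid = [0.01, 0.1, 1.0, 10.0, 100.0]
search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": alpha_grid},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_train, y_train)
best_alpha = search.best_params_["alpha"]

# One final check: R² of the refitted best model on the untouched test set.
test_r2 = search.best_estimator_.score(X_test, y_test)
print("Best alpha:", best_alpha)
print("Test R²:", test_r2)
```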
Next step: Continue to ML Intermediate for decision trees, ensembles, and cross-validation.