Intermediate

Machine Learning — Practitioner Track

⏱ ~16 hours · 📚 14 modules

1. Decision Trees

Decision trees partition feature space with axis-aligned splits. At each node, the algorithm chooses the feature and threshold that most reduce impurity in the resulting child nodes: Gini impurity or entropy for classification, mean squared error (MSE) for regression.
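To make the split criterion concrete, here is a minimal pure-Python sketch of Gini impurity, entropy, and an exhaustive threshold search on a single feature. The helper names (gini, entropy, best_threshold) and the toy data are illustrative assumptions, not part of scikit-learn, and real implementations are far more optimized.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best split `x <= threshold`."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between two equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        # Child impurities are weighted by the share of samples they receive.
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical toy feature: small values are class 0, large values are class 1.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = [0, 0, 0, 1, 1, 1]
print(gini(labels), entropy(labels), best_threshold(values, labels))
```

A full tree repeats this search over every feature at every node and recurses into the two children.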

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(max_depth=5, min_samples_leaf=10)
tree.fit(X_train, y_train)

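Trees are interpretable but prone to overfitting. To control this, limit depth, require a minimum number of samples per leaf, or prune after training (scikit-learn exposes cost-complexity pruning through the ccp_alpha parameter).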
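One way to prune after training in scikit-learn is cost-complexity pruning. The sketch below is only an illustration under stated assumptions: it uses a synthetic dataset from make_classification and picks ccp_alpha by accuracy on a held-out validation split, which is one reasonable selection rule among several.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used here only so the example is self-contained.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Grow an unrestricted tree, then ask for its candidate pruning strengths.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative round-off
    candidate = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    candidate.fit(X_train, y_train)
    score = candidate.score(X_val, y_val)
    if score >= best_score:  # on ties, prefer the larger alpha (smaller tree)
        best_alpha, best_score = alpha, score

pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha)
pruned.fit(X_train, y_train)
print(full.tree_.node_count, pruned.tree_.node_count, best_score)
```

Larger ccp_alpha values remove more of the tree, so the loop trades a little training fit for a simpler model.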

2. Random Forests

Random forests combine hundreds of decorrelated trees, each trained on a bootstrap sample of the rows and restricted to a random subset of features at each split. Averaging their predictions reduces variance, so a forest usually generalizes better than a single tree.

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=42)
rf.fit(X_train, y_train)
importances = rf.feature_importances_  # impurity-based, one value per feature, sums to 1
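The two sources of randomness described above can be sketched without any library. The helper names (bootstrap_indices, random_feature_subset) and the sizes are assumptions chosen for illustration. The sketch also shows why roughly a third of the rows end up "out of bag" for any given tree.

```python
import random


def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def random_feature_subset(n_features, k, rng):
    """Pick k distinct feature indices to consider at one split."""
    return sorted(rng.sample(range(n_features), k))


rng = random.Random(42)
n_samples, n_features = 1000, 16

in_bag = bootstrap_indices(n_samples, rng)
unique_fraction = len(set(in_bag)) / n_samples  # close to 1 - 1/e, about 0.632
out_of_bag = sorted(set(range(n_samples)) - set(in_bag))

# Mirrors max_features="sqrt": consider sqrt(16) = 4 features per split.
features = random_feature_subset(n_features, k=int(n_features ** 0.5), rng=rng)
print(unique_fraction, len(out_of_bag), features)
```

Because each tree sees different rows and different candidate features, their errors are less correlated, which is what makes averaging effective.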

3. Support Vector Machines

SVMs find the hyperplane that maximizes the margin between classes. With the kernel trick, they handle non-linear boundaries:

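\[ K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2\right) \quad \text{(RBF kernel)} \]

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Scale first: the RBF kernel depends on distances, so feature ranges matter
pipe = Pipeline([("scale", StandardScaler()), ("svm", SVC(kernel="rbf", C=1.0))])
pipe.fit(X_train, y_train)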
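The RBF kernel formula can be checked by hand with a few lines of plain Python. This is a sketch for intuition only; the function name rbf_kernel and the sample points are assumptions made here, and SVC computes the kernel internally.

```python
from math import exp


def rbf_kernel(x, y, gamma):
    """K(x, y) = exp(-gamma * ||x - y||^2) for two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return exp(-gamma * sq_dist)


same = rbf_kernel([0.0, 0.0], [0.0, 0.0], gamma=0.5)  # identical points
near = rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5)  # squared distance 2
far = rbf_kernel([0.0, 0.0], [3.0, 4.0], gamma=0.5)   # squared distance 25
print(same, near, far)
```

Similarity is 1 for identical points and decays toward 0 with distance. A larger gamma makes the decay faster, which produces more flexible (and more easily overfit) decision boundaries.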

4. Cross-Validation & Hyperparameter Tuning

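k-fold cross-validation splits the training data into k folds. The model is trained on k−1 folds and validated on the held-out fold, rotating until every fold has been held out once, and the k scores are averaged. This gives a more reliable performance estimate than a single split.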
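The fold rotation can be sketched in plain Python. The helper k_fold_indices is a hypothetical name used for illustration. It splits contiguously without shuffling, which is a simplifying assumption (library implementations such as scikit-learn's KFold also offer shuffling).

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds, no shuffling."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

Every sample is used for validation exactly once and for training k−1 times.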

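from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Baseline: mean macro-F1 over 5 folds with default hyperparameters
baseline = cross_val_score(RandomForestClassifier(random_state=42), X_train, y_train, cv=5, scoring="f1_macro")
print(baseline.mean())

# Grid search: every parameter combination is scored with the same 5-fold scheme
param_grid = {"n_estimators": [100, 200], "max_depth": [5, 10, None]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, scoring="f1_macro")
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)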

5. Feature Engineering

Strong features beat complex models. Techniques include:

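  • Encoding: one-hot for nominal categories; target encoding for high-cardinality ones, fitted inside each cross-validation fold to avoid leakage
  • Scaling: StandardScaler or MinMaxScaler for distance-based models such as SVMs and k-nearest neighbors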
  • Polynomial features: capture interactions
  • Datetime features: hour, day-of-week, is_weekend
  • Binning: discretize continuous variables; this mainly helps linear models, since tree models already learn their own thresholds
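Two of the techniques listed above, datetime features and one-hot encoding, can be sketched with the standard library alone. The function names and the example timestamp are assumptions for illustration. In practice pandas or scikit-learn transformers would usually do this work.

```python
from datetime import datetime


def datetime_features(ts):
    """Derive simple calendar features from a timestamp."""
    day_of_week = ts.weekday()  # Monday is 0, Sunday is 6
    return {
        "hour": ts.hour,
        "day_of_week": day_of_week,
        "is_weekend": day_of_week >= 5,
    }


def one_hot(value, categories):
    """Encode a nominal value as a 0/1 vector over a fixed category list."""
    return [1 if value == c else 0 for c in categories]


feats = datetime_features(datetime(2024, 3, 9, 14, 30))  # a Saturday afternoon
color = one_hot("red", ["blue", "green", "red"])
print(feats, color)
```

Fixing the category list up front matters: the same list must be reused at prediction time so the columns line up with what the model was trained on.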

6. Gradient Boosting

Boosting adds weak learners sequentially, each one fit to the residual errors of the ensemble built so far. XGBoost, LightGBM, and CatBoost are popular implementations and are widely used in structured (tabular) data competitions.
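The residual-fitting loop can be sketched from scratch for squared-error regression on a single feature. This is a simplified illustration, not how XGBoost, LightGBM, or CatBoost are implemented: the weak learner is a one-split stump, and the names fit_stump and boost plus the toy data are assumptions.

```python
def fit_stump(xs, residuals):
    """Fit the best one-split regression stump (squared error) on one feature."""
    best = None
    cuts = sorted(set(xs))
    for lo, hi in zip(cuts, cuts[1:]):
        t = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean


def boost(xs, ys, n_rounds, learning_rate):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps, history = [], []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Shrink each correction by the learning rate before adding it.
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return base, stumps, history


# Hypothetical toy problem: learn y = x^2 on a small grid.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [x * x for x in xs]
base, stumps, mse_history = boost(xs, ys, n_rounds=20, learning_rate=0.5)
print(mse_history[0], mse_history[-1])
```

Training error falls with each round, which is exactly why boosting needs a validation set or regularization to decide when to stop.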

from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(n_estimators=150, learning_rate=0.1, max_depth=4)
gb.fit(X_train, y_train)

Practical Tip

Start with a random forest baseline. If you need more performance, try gradient boosting with early stopping on a validation set.
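One way to get early stopping with scikit-learn's GradientBoostingClassifier is to let it hold out part of the training data internally. The sketch below assumes a synthetic dataset and illustrative settings (a budget of 500 trees, patience of 10 rounds). Other libraries expose early stopping through different parameters.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, used here only so the example is self-contained.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

gb = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on boosting rounds
    learning_rate=0.1,
    max_depth=4,
    validation_fraction=0.2,  # share of the training data held out internally
    n_iter_no_change=10,      # stop after 10 rounds without validation improvement
    random_state=0,
)
gb.fit(X_train, y_train)

test_accuracy = gb.score(X_test, y_test)
print(gb.n_estimators_, test_accuracy)  # n_estimators_ is the number of trees actually fitted
```

Setting a generous n_estimators and letting early stopping choose the final count is usually simpler than tuning the number of trees directly.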

Continue to ML Advanced for neural networks and transformers.