1. Decision Trees
Decision trees partition feature space with axis-aligned splits. At each node, the algorithm chooses the feature and threshold that most reduce impurity in the resulting child nodes: Gini impurity or entropy for classification, mean squared error (MSE) for regression.
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(max_depth=5, min_samples_leaf=10)
tree.fit(X_train, y_train)
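To make the split criterion concrete, here is a minimal sketch of how a single node could pick a threshold on one feature using Gini impurity. The helper names (`gini`, `best_threshold`) and the toy data are illustrative assumptions; this is not scikit-learn's implementation, which is optimized and searches over all features.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best split `value <= threshold`."""
    n = len(values)
    distinct = sorted(set(values))
    best = (None, gini(labels))  # baseline: impurity of the unsplit node
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (t, score)
    return best

# Toy, made-up data: one feature, two classes.
x = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
y = ["a", "a", "a", "b", "b", "b"]
threshold, impurity = best_threshold(x, y)
```

On this toy data the classes separate cleanly, so the chosen threshold falls in the gap between the two groups and both children are pure.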
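Trees are interpretable but prone to overfitting. To control this, limit depth (max_depth), require a minimum number of samples per leaf (min_samples_leaf), or prune after training; scikit-learn supports cost-complexity pruning through the ccp_alpha parameter.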
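One way to prune after training in scikit-learn is cost-complexity pruning. The sketch below assumes synthetic data from `make_classification` and a simple held-out validation split for choosing `ccp_alpha`; both are stand-ins for illustration, and cross-validation would be a reasonable alternative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption); replace with your own.
X, y = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Candidate alphas at which subtrees of the full tree would be collapsed.
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Pick the alpha with the best validation accuracy.
best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative rounding errors
    candidate = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    score = candidate.fit(X_train, y_train).score(X_val, y_val)
    if score >= best_score:  # on ties, prefer the larger alpha (smaller tree)
        best_alpha, best_score = alpha, score

pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha)
pruned_tree.fit(X_train, y_train)
```

Comparing `full_tree.tree_.node_count` with `pruned_tree.tree_.node_count` shows how much of the tree was removed.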
2. Random Forests
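Random forests average the predictions of many trees (typically hundreds), each trained on a bootstrap sample and restricted to a random subset of features at each split. The randomness decorrelates the trees, so averaging them reduces variance, and the ensemble usually outperforms a single tree.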
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=42)
rf.fit(X_train, y_train)
importances = rf.feature_importances_
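The two sources of randomness described above can be sketched with the standard library alone. The sizes and seed are arbitrary assumptions; the snippet only illustrates the sampling step, not tree training.

```python
import math
import random

rng = random.Random(42)
n_samples, n_features = 1000, 16

# Bootstrap: draw n_samples row indices with replacement.
bootstrap_idx = [rng.randrange(n_samples) for _ in range(n_samples)]
in_bag = set(bootstrap_idx)
out_of_bag = set(range(n_samples)) - in_bag  # rows this tree never sees

# Feature subsetting: consider only sqrt(n_features) features at a split.
k = max(1, int(math.sqrt(n_features)))
split_candidates = rng.sample(range(n_features), k)

# Expected to be near 1 - 1/e (about 0.632) for large n_samples.
unique_fraction = len(in_bag) / n_samples
```

Because each tree sees only about 63% of the distinct rows, the remaining out-of-bag rows can serve as a built-in validation set for that tree.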
3. Support Vector Machines
SVMs find the hyperplane that maximizes the margin between classes. With the kernel trick, they handle non-linear boundaries. Kernels such as RBF depend on distances between points, so scale the features before fitting:
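from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
# Scale first: the RBF kernel is distance-based and sensitive to feature ranges
pipe = Pipeline([("scale", StandardScaler()), ("svm", SVC(kernel="rbf", C=1.0))])
pipe.fit(X_train, y_train)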
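The following sketch shows what the RBF kernel computes and why feature scaling changes its output. The points, the `gamma` values, and the per-feature standard deviations are made-up assumptions chosen only to make the effect visible.

```python
import math

def rbf_kernel(x, z, gamma):
    """K(x, z) = exp(-gamma * ||x - z||^2); equals 1.0 when x == z."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def scale(point, stds):
    """Divide each feature by its (assumed) standard deviation."""
    return tuple(v / s for v, s in zip(point, stds))

# Made-up points: (age in years, income in dollars).
p = (30.0, 50_000.0)
q = (60.0, 50_100.0)  # very different age, nearly identical income
r = (30.0, 52_000.0)  # same age, slightly different income

# Unscaled: income dominates the distance, so q looks closer to p than r does.
raw_pq = rbf_kernel(p, q, gamma=1e-6)
raw_pr = rbf_kernel(p, r, gamma=1e-6)

# Scaled: both features contribute comparably, and the ordering flips.
stds = (15.0, 20_000.0)  # assumed per-feature standard deviations
scaled_pq = rbf_kernel(scale(p, stds), scale(q, stds), gamma=0.5)
scaled_pr = rbf_kernel(scale(p, stds), scale(r, stds), gamma=0.5)
```

StandardScaler also subtracts the mean, which does not affect distances, so dividing by the standard deviation is enough for this illustration.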
4. Cross-Validation & Hyperparameter Tuning
k-fold cross-validation splits the training data into k folds. Each fold serves once as the validation set while the model trains on the remaining k−1 folds, and the k scores are averaged. This gives a more reliable performance estimate than a single train/validation split.
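from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
# 5-fold CV estimate for a single configuration
scores = cross_val_score(RandomForestClassifier(random_state=42), X_train, y_train, cv=5, scoring="f1_macro")
print(scores.mean(), scores.std())
# Grid search repeats the same 5-fold CV for every parameter combination
param_grid = {"n_estimators": [100, 200], "max_depth": [5, 10, None]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, scoring="f1_macro")
search.fit(X_train, y_train)
print(search.best_params_)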
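For intuition about what the cv=5 argument does internally, here is a minimal standard-library sketch of k-fold index generation. The function name `k_fold_indices` is hypothetical; in practice scikit-learn's own splitters handle this, including stratification.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold sizes differ by at most one when n_samples is not divisible by k.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
```

Each element of `folds` is one train/validation split; a model would be fit on `train_idx` and scored on `val_idx`, and the k scores averaged.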
5. Feature Engineering
Well-chosen features often matter more than model complexity. Techniques include:
- Encoding: one-hot for nominal categories; target encoding, computed on training folds only to avoid leakage
- Scaling: StandardScaler or MinMaxScaler for distance-based models such as SVMs (tree-based models do not need it)
- Polynomial features: capture interactions
- Datetime features: hour, day-of-week, is_weekend
- Binning: discretize continuous variables, mainly to help linear models capture non-linear effects (tree models already learn their own thresholds)
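A few of the techniques listed above can be sketched without any libraries. The record, field names, and category list below are hypothetical and only serve as an illustration; scikit-learn's OneHotEncoder and PolynomialFeatures would be the usual choices inside a pipeline.

```python
from datetime import datetime

def datetime_features(ts):
    """Derive hour, day-of-week (Monday=0), and a weekend flag from a timestamp."""
    return {
        "hour": ts.hour,
        "day_of_week": ts.weekday(),
        "is_weekend": ts.weekday() >= 5,
    }

def one_hot(value, categories):
    """Encode a nominal value as a 0/1 vector over a fixed category list."""
    return [1 if value == c else 0 for c in categories]

def interaction_features(a, b):
    """Degree-2 polynomial terms for two numeric features."""
    return {"a^2": a * a, "a*b": a * b, "b^2": b * b}

# Hypothetical record; field names and values are illustrative only.
record = {"timestamp": datetime(2024, 3, 9, 14, 30), "color": "green"}
features = datetime_features(record["timestamp"])
features["color_onehot"] = one_hot(record["color"], ["red", "green", "blue"])
```

The category list passed to `one_hot` should be fixed from the training data so that the encoding stays consistent at prediction time.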
6. Gradient Boosting
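Boosting adds weak learners (usually shallow trees) sequentially, each fit to the residual errors of the ensemble built so far. A small learning rate shrinks each learner's contribution, trading more rounds for better generalization. XGBoost, LightGBM, and CatBoost are widely used implementations that perform strongly in structured-data competitions.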
from sklearn.ensemble import GradientBoostingClassifier
gb = GradientBoostingClassifier(n_estimators=150, learning_rate=0.1, max_depth=4)
gb.fit(X_train, y_train)
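The residual-fitting idea can be shown in a few lines for squared-error regression with depth-1 trees (stumps). This is a simplified sketch under those assumptions, with made-up data and hypothetical helper names; real libraries generalize it to arbitrary losses via gradients and add regularization.

```python
def fit_stump(xs, residuals):
    """Best single-threshold regressor (a depth-1 tree) under squared error."""
    best = None
    distinct = sorted(set(xs))
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x <= t else right_mean

def boost(xs, ys, n_rounds=20, learning_rate=0.1):
    """Run boosting and return the training MSE after each round."""
    base = sum(ys) / len(ys)  # initial prediction: the target mean
    preds = [base] * len(ys)
    history = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)  # weak learner fit to current residuals
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return history

# Made-up 1-D regression data with a step between x=3 and x=4.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
mse_history = boost(xs, ys)
```

The training error in `mse_history` shrinks round by round, which is also why boosting can overfit if the number of rounds is not limited.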
Practical Tip
Start with a random forest baseline. If you need more performance, try gradient boosting with early stopping on a validation set.
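As one possible way to apply early stopping, scikit-learn's GradientBoostingClassifier can hold out part of the training data internally via `validation_fraction` and `n_iter_no_change`. The synthetic data and the specific values below are assumptions for illustration, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data (an assumption); replace with your own X_train / y_train.
X_train, y_train = make_classification(n_samples=1000, n_features=20, random_state=0)

gb = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on boosting rounds
    learning_rate=0.1,
    max_depth=4,
    validation_fraction=0.1,  # share of training data held out internally
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=42,
)
gb.fit(X_train, y_train)

rounds_used = gb.n_estimators_  # rounds actually fitted, at most n_estimators
```

If `rounds_used` equals the upper bound, early stopping never triggered, and raising `n_estimators` may be worth trying.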
Continue to ML Advanced for neural networks and transformers.