Experiments: run rigorous ML experiments

Same model, same data, three runs — 0.89, 0.85, 0.87. Without experiment discipline, none of these are defensible.

This guide covers the practices that make ML experiments reproducible, comparable, and auditable: seed control, data splitting, configuration management, tracking, and ablation. The goal is not to track everything — it is to track enough that any result can be explained, challenged, and reproduced. The examples use binary classification with scikit-learn (roc_auc, stratify, predict_proba). The principles apply to regression and multi-class — substitute the metric and splitting strategy.


On this page


Key concepts

  • Experiment: a question you are trying to answer (“does model A beat model B?”, “does feature X help?”). One independent variable, one fixed evaluation protocol.
  • Run: a single execution with a fixed configuration. An experiment produces many runs.
  • Ablation: removing one component at a time to measure its individual contribution. If performance drops, the component matters.
  • Seed: the starting state for random number generators. Fixing seeds makes stochastic processes repeatable.
  • Baseline: a reference model that sets the performance floor and bar. A trivial baseline (e.g., predict the majority class) proves the features carry signal; a simple baseline (e.g., logistic regression) proves the added complexity of a fancier model is justified.
  • Config-driven experiment: all parameters live in configuration files — the config is the single source of truth for what a run did.

Resources

baseline.example.yaml
# Experiment config — binary classification with XGBoost, 10-seed evaluation.

experiment_name: model_comparison

seeds: [42, 123, 456, 789, 1024, 2048, 3333, 5555, 7777, 9999]

data:
  path: data/processed/features_v1.parquet
  target: is_fraud
  test_size: 0.2
  drop_columns: []
  data_version: v1

model:
  name: xgboost
  params:
    n_estimators: 500
    max_depth: 6
    learning_rate: 0.1
    subsample: 0.8
    colsample_bytree: 0.8

cv_splits: 5
scoring: roc_auc
config.example.py
"""Pydantic config schema for experiment configurations."""

import argparse
from pathlib import Path
from typing import Any, Self

import yaml
from pydantic import BaseModel, Field


class DataConfig(BaseModel):
    path: str
    target: str
    test_size: float = Field(default=0.2, gt=0, lt=1)
    drop_columns: list[str] = []
    data_version: str = "v1"


class ModelConfig(BaseModel):
    name: str
    params: dict[str, Any] = {}


class ExperimentConfig(BaseModel):
    experiment_name: str
    seeds: list[int] = Field(default=[42, 123, 456, 789, 1024], min_length=1)
    data: DataConfig
    model: ModelConfig
    cv_splits: int = Field(default=5, ge=2)
    scoring: str = "roc_auc"

    @classmethod
    def from_yaml(cls, path: str) -> Self:
        raw = yaml.safe_load(Path(path).read_text())
        return cls(**raw)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", required=True, help="Path to YAML config")
    args = parser.parse_args()

    cfg = ExperimentConfig.from_yaml(args.config)
    print(cfg.model_dump_json(indent=2))
data_splitting.example.py
"""Data splitting patterns for ML experiments.

Requires config.py (from config.example.py) on the import path.
"""

import argparse
from typing import Any

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.metrics import get_scorer
from sklearn.model_selection import (
    GroupKFold,
    GroupShuffleSplit,
    StratifiedKFold,
    cross_validate,
    train_test_split,
)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

from config import ExperimentConfig


def stratified_split(
    X: pd.DataFrame, y: pd.Series, cfg: ExperimentConfig
) -> tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series]:
    """Hold out a stratified test set before any training or tuning."""
    return train_test_split(
        X, y,
        test_size=cfg.data.test_size,
        stratify=y,
        random_state=cfg.seeds[0],
    )


def stratified_cv(
    pipe: Pipeline, X_train: pd.DataFrame, y_train: pd.Series,
    cfg: ExperimentConfig, seed: int,
) -> dict[str, Any]:
    """Run cross-validation with preprocessing inside the pipeline to prevent leakage."""
    cv = StratifiedKFold(n_splits=cfg.cv_splits, shuffle=True, random_state=seed)
    return cross_validate(pipe, X_train, y_train, cv=cv, scoring=cfg.scoring)


def group_split(
    X: pd.DataFrame, y: pd.Series, groups: pd.Series, cfg: ExperimentConfig
) -> tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series]:
    """Split respecting entity groups — no entity appears in both train and test."""
    gss = GroupShuffleSplit(
        n_splits=1, test_size=cfg.data.test_size, random_state=cfg.seeds[0]
    )
    train_idx, test_idx = next(gss.split(X, y, groups=groups))

    return (
        X.iloc[train_idx], X.iloc[test_idx],
        y.iloc[train_idx], y.iloc[test_idx],
    )


def group_cv(
    pipe: Pipeline,
    X_train: pd.DataFrame,
    y_train: pd.Series,
    groups_train: pd.Series,
    cfg: ExperimentConfig,
) -> dict[str, Any]:
    """Cross-validation respecting groups within the training set."""
    cv = GroupKFold(n_splits=cfg.cv_splits)
    return cross_validate(
        pipe, X_train, y_train, cv=cv, groups=groups_train, scoring=cfg.scoring
    )


def early_stopping_cv(
    X_train: pd.DataFrame,
    y_train: pd.Series,
    estimator: BaseEstimator,
    cfg: ExperimentConfig,
    seed: int,
    preprocessor: TransformerMixin | None = None,
) -> list[float]:
    """Manual CV loop with early stopping — passes eval_set directly to the estimator.

    Scoring uses cfg.scoring via sklearn's get_scorer, so the metric must be
    compatible with predict_proba (e.g. roc_auc, average_precision, neg_log_loss).
    """
    scorer = get_scorer(cfg.scoring)
    cv = StratifiedKFold(n_splits=cfg.cv_splits, shuffle=True, random_state=seed)
    fold_scores: list[float] = []

    for fold, (train_idx, val_idx) in enumerate(cv.split(X_train, y_train)):
        X_fold_train = X_train.iloc[train_idx]
        X_fold_val = X_train.iloc[val_idx]
        y_fold_train = y_train.iloc[train_idx]
        y_fold_val = y_train.iloc[val_idx]

        if preprocessor is not None:
            pre = clone(preprocessor)
            X_fold_train = pre.fit_transform(X_fold_train)
            X_fold_val = pre.transform(X_fold_val)

        model = clone(estimator)
        model.fit(
            X_fold_train, y_fold_train,
            eval_set=[(X_fold_val, y_fold_val)],
            verbose=False,
        )

        score = scorer(model, X_fold_val, y_fold_val)
        fold_scores.append(score)
        print(f"Fold {fold}: {score:.4f} (stopped at iteration {model.best_iteration})")

    return fold_scores


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", required=True, help="Path to YAML config")
    args = parser.parse_args()

    cfg = ExperimentConfig.from_yaml(args.config)
    df = pd.read_parquet(cfg.data.path)
    X = df.drop(columns=[cfg.data.target] + cfg.data.drop_columns)
    y = df[cfg.data.target]

    X_train, X_test, y_train, y_test = stratified_split(X, y, cfg)
    print(f"Train: {len(X_train)}, Test: {len(X_test)}")

    for seed in cfg.seeds:
        print(f"\n--- Seed {seed} ---")

        pipe = Pipeline([
            ("scaler", StandardScaler()),
            ("model", XGBClassifier(**cfg.model.params, random_state=seed)),
        ])
        results = stratified_cv(pipe, X_train, y_train, cfg, seed)
        print(f"Stratified CV: {results['test_score'].mean():.4f} ± {results['test_score'].std():.4f}")

        model = XGBClassifier(
            **cfg.model.params, random_state=seed, early_stopping_rounds=20
        )
        scores = early_stopping_cv(
            X_train, y_train, model, cfg, seed, preprocessor=StandardScaler()
        )
        print(f"Early stopping CV: {sum(scores) / len(scores):.4f}")
train.example.py
"""Multi-seed training with cross-validation and MLflow tracking.

Requires config.py (from config.example.py) on the import path.
"""

import argparse
import subprocess
import tempfile
import time
from pathlib import Path

import mlflow
import numpy as np
import pandas as pd
from sklearn.model_selection import (
    StratifiedKFold,
    cross_val_predict,
    cross_validate,
    train_test_split,
)
from xgboost import XGBClassifier

from config import ExperimentConfig


def require_clean_worktree() -> str:
    """Fail fast if the working tree has uncommitted changes."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        capture_output=True, text=True, check=True,
    )
    if result.stdout.strip():
        raise RuntimeError(
            "Uncommitted changes detected — commit or stash before running experiments.\n"
            f"{result.stdout}"
        )
    commit_hash = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return commit_hash


def main(config_path: str) -> None:
    commit_hash = require_clean_worktree()
    cfg = ExperimentConfig.from_yaml(config_path)

    df = pd.read_parquet(cfg.data.path)
    X = df.drop(columns=[cfg.data.target] + cfg.data.drop_columns)
    y = df[cfg.data.target]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=cfg.data.test_size, stratify=y, random_state=cfg.seeds[0]
    )

    mlflow.set_experiment(cfg.experiment_name)

    for seed in cfg.seeds:
        with mlflow.start_run(run_name=f"{cfg.model.name}_seed{seed}"):
            mlflow.log_params(cfg.model.params)
            mlflow.log_params({
                "seed": seed,
                "model_name": cfg.model.name,
                "scoring": cfg.scoring,
            })
            mlflow.set_tag("git_commit", commit_hash)
            mlflow.set_tag("data_version", cfg.data.data_version)

            start_time = time.monotonic()

            model = XGBClassifier(**cfg.model.params, random_state=seed)
            cv = StratifiedKFold(n_splits=cfg.cv_splits, shuffle=True, random_state=seed)

            results = cross_validate(
                model, X_train, y_train, cv=cv, scoring=cfg.scoring
            )

            oof_preds = cross_val_predict(
                model, X_train, y_train, cv=cv, method="predict_proba"
            )[:, 1]

            mlflow.log_metric("training_duration_seconds", time.monotonic() - start_time)
            mlflow.log_metric(f"mean_{cfg.scoring}", results["test_score"].mean())
            mlflow.log_metric(f"std_{cfg.scoring}", results["test_score"].std())

            for i, score in enumerate(results["test_score"]):
                mlflow.log_metric(f"fold_{cfg.scoring}", score, step=i)

            with tempfile.TemporaryDirectory() as tmp_dir:
                oof_path = Path(tmp_dir) / f"oof_seed{seed}.parquet"
                pd.DataFrame({"oof_pred": oof_preds}).to_parquet(oof_path)
                mlflow.log_artifact(str(oof_path))

    with mlflow.start_run(run_name="config_snapshot"):
        mlflow.log_artifact(config_path)

    print(f"Logged {len(cfg.seeds)} seed runs to experiment '{cfg.experiment_name}'")
    fold_scores = results["test_score"]
    print(f"Last seed CV: {np.mean(fold_scores):.4f} ± {np.std(fold_scores):.4f}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", required=True, help="Path to YAML config")
    args = parser.parse_args()
    main(args.config)
test.example.py
"""Final held-out test evaluation for a frozen experiment config.

Requires config.py (from config.example.py) on the import path.
"""

import argparse
import tempfile
from collections import defaultdict
from pathlib import Path

import mlflow
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix, f1_score, get_scorer, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

from config import ExperimentConfig

THRESHOLD_MIN = 0.0
THRESHOLD_MAX = 1.0


def evaluate_seed(
    X_train: pd.DataFrame,
    y_train: pd.Series,
    X_test: pd.DataFrame,
    y_test: pd.Series,
    cfg: ExperimentConfig,
    seed: int,
    threshold: float,
) -> tuple[dict[str, float], pd.DataFrame]:
    """Fit on the full training split, then evaluate once on the held-out test set.

    Returns metrics dict and confusion matrix as a labeled DataFrame.
    """
    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("model", XGBClassifier(**cfg.model.params, random_state=seed)),
    ])
    pipe.fit(X_train, y_train)

    scorer = get_scorer(cfg.scoring)
    primary_score = scorer(pipe, X_test, y_test)

    y_score = pipe.predict_proba(X_test)[:, 1]
    y_pred = (y_score >= threshold).astype(int)

    metrics = {
        f"test_{cfg.scoring}": primary_score,
        "test_precision": precision_score(y_test, y_pred, zero_division=0),
        "test_recall": recall_score(y_test, y_pred, zero_division=0),
        "test_f1": f1_score(y_test, y_pred, zero_division=0),
    }

    cm_df = pd.DataFrame(
        confusion_matrix(y_test, y_pred),
        index=["actual_negative", "actual_positive"],
        columns=["predicted_negative", "predicted_positive"],
    )

    return metrics, cm_df


def main(config_path: str, threshold: float) -> None:
    if not THRESHOLD_MIN <= threshold <= THRESHOLD_MAX:
        raise ValueError(
            f"Threshold must be between {THRESHOLD_MIN} and {THRESHOLD_MAX}, got {threshold}"
        )

    cfg = ExperimentConfig.from_yaml(config_path)

    df = pd.read_parquet(cfg.data.path)
    X = df.drop(columns=[cfg.data.target] + cfg.data.drop_columns)
    y = df[cfg.data.target]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=cfg.data.test_size, stratify=y, random_state=cfg.seeds[0]
    )

    mlflow.set_experiment(cfg.experiment_name)
    summary: dict[str, list[float]] = defaultdict(list)

    for seed in cfg.seeds:
        with mlflow.start_run(run_name=f"test_{cfg.model.name}_seed{seed}"):
            mlflow.log_params(cfg.model.params)
            mlflow.log_params({
                "seed": seed,
                "model_name": cfg.model.name,
                "scoring": cfg.scoring,
                "threshold": threshold,
            })
            mlflow.set_tag("data_version", cfg.data.data_version)

            metrics, cm_df = evaluate_seed(X_train, y_train, X_test, y_test, cfg, seed, threshold)
            for name, value in metrics.items():
                mlflow.log_metric(name, value)
                summary[name].append(value)

            with tempfile.TemporaryDirectory() as tmp_dir:
                cm_path = Path(tmp_dir) / f"confusion_matrix_seed{seed}.csv"
                cm_df.to_csv(cm_path)
                mlflow.log_artifact(str(cm_path))

            print(
                f"[seed {seed}] "
                f"{cfg.scoring}={metrics[f'test_{cfg.scoring}']:.4f}, "
                f"precision={metrics['test_precision']:.4f}, "
                f"recall={metrics['test_recall']:.4f}, "
                f"f1={metrics['test_f1']:.4f}"
            )

    with mlflow.start_run(run_name="test_summary"):
        mlflow.log_params({"scoring": cfg.scoring, "threshold": threshold})
        for name, values in summary.items():
            mlflow.log_metric(f"mean_{name}", float(np.mean(values)))
            mlflow.log_metric(f"std_{name}", float(np.std(values)))

    print("\nHeld-out test summary")
    for name, values in summary.items():
        print(f"{name}: {np.mean(values):.4f} +/- {np.std(values):.4f}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", required=True, help="Path to YAML config")
    parser.add_argument(
        "--threshold",
        type=float,
        default=0.5,
        help="Decision threshold for precision/recall/F1 on the held-out test set",
    )
    args = parser.parse_args()
    main(args.config, args.threshold)
baselines.example.py
"""Baseline evaluation with DummyClassifier and LogisticRegression.

Requires config.py (from config.example.py) on the import path.
"""

import argparse

import mlflow
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from config import ExperimentConfig


def _make_baselines(seed: int) -> dict[str, BaseEstimator]:
    """Construct baseline estimators with per-seed random state."""
    return {
        "dummy": DummyClassifier(strategy="most_frequent"),
        "logistic": Pipeline([
            ("scaler", StandardScaler()),
            ("model", LogisticRegression(max_iter=1000, random_state=seed)),
        ]),
    }


def main(config_path: str) -> None:
    cfg = ExperimentConfig.from_yaml(config_path)

    df = pd.read_parquet(cfg.data.path)
    X = df.drop(columns=[cfg.data.target] + cfg.data.drop_columns)
    y = df[cfg.data.target]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=cfg.data.test_size, stratify=y, random_state=cfg.seeds[0]
    )

    mlflow.set_experiment(cfg.experiment_name)

    for seed in cfg.seeds:
        baselines = _make_baselines(seed)
        for name, model in baselines.items():
            with mlflow.start_run(run_name=f"baseline_{name}_seed{seed}"):
                mlflow.log_params({"baseline": name, "seed": seed, "scoring": cfg.scoring})

                cv = StratifiedKFold(n_splits=cfg.cv_splits, shuffle=True, random_state=seed)
                scores = cross_val_score(model, X_train, y_train, cv=cv, scoring=cfg.scoring)

                mlflow.log_metric(f"mean_{cfg.scoring}", scores.mean())
                mlflow.log_metric(f"std_{cfg.scoring}", scores.std())

                for i, score in enumerate(scores):
                    mlflow.log_metric(f"fold_{cfg.scoring}", score, step=i)

                print(f"[seed {seed}] {name}: {np.mean(scores):.4f} ± {np.std(scores):.4f}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", required=True, help="Path to YAML config")
    args = parser.parse_args()
    main(args.config)
hparam_search.example.py
"""Hyperparameter search with Optuna: search, freeze, evaluate.

Requires config.py (from config.example.py) on the import path.
"""

import argparse
from collections.abc import Callable
from pathlib import Path
from typing import Any

import mlflow
import numpy as np
import optuna
import pandas as pd
import yaml
from sklearn.metrics import get_scorer
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

from config import ExperimentConfig


def make_objective(
    X_train: pd.DataFrame, y_train: pd.Series, cfg: ExperimentConfig
) -> Callable[[optuna.Trial], float]:
    """Create an Optuna objective that logs each trial to MLflow."""

    def objective(trial: optuna.Trial) -> float:
        params = {
            "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
            "max_depth": trial.suggest_int("max_depth", 3, 10),
            "learning_rate": trial.suggest_float("learning_rate", 0.001, 0.3, log=True),
            "subsample": trial.suggest_float("subsample", 0.5, 1.0),
            "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
        }

        with mlflow.start_run(nested=True, run_name=f"trial_{trial.number}"):
            mlflow.log_params(params)

            pipe = Pipeline([
                ("scaler", StandardScaler()),
                ("model", XGBClassifier(**params, random_state=cfg.seeds[0])),
            ])
            cv = StratifiedKFold(n_splits=cfg.cv_splits, shuffle=True, random_state=cfg.seeds[0])
            scores = cross_val_score(pipe, X_train, y_train, cv=cv, scoring=cfg.scoring)
            mean_score = scores.mean()

            mlflow.log_metric(f"val_{cfg.scoring}", mean_score)

        return mean_score

    return objective


def freeze_best_params(cfg: ExperimentConfig, best_params: dict[str, Any], output_path: str) -> None:
    """Write the best params into a new config YAML — the reproducible record."""
    best_cfg = cfg.model_copy(deep=True)
    best_cfg.model.params = best_params

    out = Path(output_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(yaml.dump(best_cfg.model_dump(), default_flow_style=False))
    print(f"Frozen config written to {output_path}")


def evaluate_on_test(
    X_train: pd.DataFrame,
    y_train: pd.Series,
    X_test: pd.DataFrame,
    y_test: pd.Series,
    best_params: dict[str, Any],
    cfg: ExperimentConfig,
) -> None:
    """Multi-seed test evaluation on held-out data — the final, unbiased estimate."""
    scorer = get_scorer(cfg.scoring)
    test_scores: list[float] = []

    for seed in cfg.seeds:
        pipe = Pipeline([
            ("scaler", StandardScaler()),
            ("model", XGBClassifier(**best_params, random_state=seed)),
        ])
        pipe.fit(X_train, y_train)
        test_scores.append(scorer(pipe, X_test, y_test))

    with mlflow.start_run(run_name="final_test_evaluation"):
        mlflow.log_params(best_params)
        mlflow.log_metric(f"mean_test_{cfg.scoring}", np.mean(test_scores))
        mlflow.log_metric(f"std_test_{cfg.scoring}", np.std(test_scores))

    print(f"Test {cfg.scoring}: {np.mean(test_scores):.4f} ± {np.std(test_scores):.4f}")


def main(config_path: str, n_trials: int) -> None:
    cfg = ExperimentConfig.from_yaml(config_path)

    df = pd.read_parquet(cfg.data.path)
    X = df.drop(columns=[cfg.data.target] + cfg.data.drop_columns)
    y = df[cfg.data.target]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=cfg.data.test_size, stratify=y, random_state=cfg.seeds[0]
    )

    mlflow.set_experiment(cfg.experiment_name)

    with mlflow.start_run(run_name="hparam_search"):
        study = optuna.create_study(
            direction="maximize",
            sampler=optuna.samplers.TPESampler(seed=cfg.seeds[0]),
        )
        study.optimize(make_objective(X_train, y_train, cfg), n_trials=n_trials)

        mlflow.log_params(study.best_params)
        mlflow.log_metric(f"best_val_{cfg.scoring}", study.best_value)

    print(f"Best val {cfg.scoring}: {study.best_value:.4f}")
    print(f"Best params: {study.best_params}")

    frozen_path = f"configs/best_{cfg.model.name}.yaml"
    freeze_best_params(cfg, study.best_params, frozen_path)

    evaluate_on_test(X_train, y_train, X_test, y_test, study.best_params, cfg)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", required=True, help="Path to YAML config")
    parser.add_argument("--n-trials", type=int, default=100, help="Number of Optuna trials")
    args = parser.parse_args()
    main(args.config, args.n_trials)
ablation.example.sh
#!/usr/bin/env bash
# Run all ablation variants. Expects configs/ablation_*.yaml and train.py.

set -euo pipefail

CONFIGS=(configs/ablation_*.yaml)

if [ "${#CONFIGS[@]}" -eq 0 ] || [ ! -f "${CONFIGS[0]}" ]; then
    echo "No ablation configs found matching configs/ablation_*.yaml" >&2
    exit 1
fi

for config in "${CONFIGS[@]}"; do
    echo "=== Running: ${config} ==="
    python train.py --config "$config"
done

echo "All ablation variants complete."

Reproducibility contract

A result that cannot be reproduced cannot be defended. Reproducibility has five layers: source control, data immutability, seed control, environment pinning, and awareness of determinism boundaries.

Source control

MLflow automatically logs the git commit hash, but this is a false guarantee if you run experiments with uncommitted changes. The “dirty working tree” vulnerability means the logged hash points to code that is different from what actually ran.

To enforce strict source control, either fail the run if git status --porcelain is not empty, or explicitly log the git diff as an MLflow artifact.

Data immutability

Pinning code and environment is useless if the underlying data changes. Treat training data as immutable. Overwriting a file like features_v1.parquet permanently breaks reproducibility for every prior experiment that relied on it.

Use append-only storage or a data versioning tool like DVC to enforce data immutability. If a dataset must change, write it as features_v2.parquet and update the experiment configuration.

Seeds

Without fixed seeds, the same model trained twice produces different scores — random forests sample features randomly, data shuffling changes the split order, and stochastic estimators start from different states. Seeds anchor the random number generators to a known starting point.

Control randomness by passing random_state explicitly to every estimator and splitter. Scikit-learn’s randomness flows through numpy, and each random_state parameter creates an independent random number generator scoped to that object — no shared global state, no accidental coupling between components:

from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier

cv = StratifiedKFold(n_splits=cfg.cv_splits, shuffle=True, random_state=seed)
model = RandomForestClassifier(n_estimators=500, random_state=seed)

Global seeding with np.random.seed() is a legacy pattern that modifies shared state and causes unintended side effects across unrelated components — avoid it. PYTHONHASHSEED is sometimes mentioned in reproducibility guides, but it controls Python set iteration order — irrelevant for scikit-learn, whose randomness flows through numpy.

A single seed tells you the pipeline runs end-to-end. It does not tell you whether the result is robust — a different seed might give a substantially different score. To make a defensible claim, you need the distribution.

Running multiple seeds is mandatory for publishable results. 3–5 seeds is often insufficient: conclusions can flip with different seeds, and statistical power is too low to detect real differences.

For classical ML with cross-validation, each seed × fold combination produces an observation, so K-fold CV × S seeds gives K × S paired data points. With 5-fold CV and 10 seeds you get 50 paired observations — enough for robust statistical testing.

Box plot showing ROC AUC score distributions across 10 seeds for three models

Setting Minimum seeds Paired observations (5-fold CV)
Exploratory 5 25
Publishable comparison 10 50
High-stakes or noisy 20+ 100+

If the standard deviation across seeds is large relative to the difference between models, the comparison is inconclusive — add more seeds or reconsider the experimental design. See Evaluation: statistical significance for guidance on minimum sample sizes and statistical tests to determine whether a difference is real.

Environment pinning

Pin everything that can change between machines or over time:

  • Python version: specify in pyproject.toml and the Dockerfile (ARG PYTHON_VERSION=3.13).
  • Dependencies: use uv.lock. The Dockerfile enforces this with uv sync --locked — if the lockfile is missing or stale, the build fails.
  • scikit-learn version: a minor version upgrade can change default hyperparameters, convergence criteria, or internal random number consumption. A model trained on 1.4 may not reproduce identically on 1.5 even with the same seed. The lockfile handles this, but document the version in your experiment logs.

The Dockerfile from the Docker guide already enforces Python version and dependency pinning via uv sync --locked.

Determinism boundaries

Full determinism is not always achievable — and sometimes not worth the cost. Know where it breaks:

  • Nested parallelism: RandomForestClassifier with random_state set is deterministic regardless of n_jobs. But nesting parallel calls — for example GridSearchCV(n_jobs=-1) wrapping RandomForestClassifier(n_jobs=-1) — can produce non-deterministic results because the joblib worker scheduling is not guaranteed to be consistent. Fix by parallelizing at one level only (typically the outer loop).
  • Floating-point non-associativity: (a + b) + c ≠ a + (b + c) in floating point. Different hardware or parallelism strategies can change the order of operations. This is typically negligible for classical ML but can compound in iterative algorithms (gradient boosting over many rounds, iterative SVD).

Practical stance: enforce seeds (per-estimator random_state on every estimator and splitter), pin the environment (see Environment pinning), and accept that bit-exact reproduction across different hardware is sometimes infeasible. Document the platform when it matters.


Data splitting

The split protocol determines what evidence you can claim. Get it wrong and metrics look good in the notebook but collapse in production — the model was learning from information it should never have seen.

Three sets, three purposes

  • Training set: used to fit the model. Every parameter update happens here.
  • Validation set: used to select hyperparameters and compare model variants. In cross-validation, each fold’s held-out portion serves as the validation set — no separate validation split is needed.
  • Test set: held out from all decisions. Evaluated once, at the end, on the final selected configuration only. If you use the test set to pick between models or tune hyperparameters, it becomes a second validation set and your reported metrics are optimistically biased.

Split before anything else. The test set must be created before any preprocessing, feature engineering, or model training. The snippets below use cfg, a Pydantic config object — defined in Configuration management:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=cfg.data.test_size, stratify=y, random_state=cfg.seeds[0]
)

train_test_split shuffles the data before splitting (shuffle=True by default). This is correct for i.i.d. tabular data — without shuffling, if the dataset is sorted by date, by class, or by source, the split is biased and the test set is not representative. stratify=y goes further: it shuffles while preserving the class distribution in both sets, so neither set ends up with a skewed label ratio. This is critical for imbalanced datasets where a plain random split could leave one class underrepresented in the test set. When stratify is set, shuffle is always on.

random_state controls the shuffle order. The test split uses a fixed seed (cfg.seeds[0]) so the test set is identical across all seed runs — comparisons are fair because every model variant is evaluated against the same held-out data.

All cross-validation, hyperparameter search, and model comparison happens on (X_train, y_train) only. The test set (X_test, y_test) is not touched until final evaluation.

For time-series or sequential data, shuffling must be disabled — random splits create future leakage where the model trains on future data and predicts the past. Use time-based splits instead: train on data before a cutoff date, validate and test on data after. scikit-learn provides TimeSeriesSplit for cross-validation on temporal data.

No leakage

Data leakage means information from the validation or test set bleeds into training — the model gets answers it would not have in production. This is the most common source of overly optimistic results. Common sources:

  • Preprocessing fit on the full dataset: fitting a scaler, encoder, or imputer on all data before splitting leaks test-set statistics into training.
  • Feature selection on the full dataset: selecting features based on correlation with the target using all data gives the model advance knowledge of validation and test patterns.
  • Target encoding without proper folds: encoding categorical features using target means from the full dataset leaks the target into every fold.

The fix: use a scikit-learn Pipeline to bundle preprocessing and model into a single estimator. When passed to cross_validate or GridSearchCV, the pipeline fits preprocessing on each training fold and transforms the validation fold — no leakage:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", XGBClassifier(**cfg.model.params, random_state=seed)),
])

# cross_validate fits the pipeline on each training fold
# and evaluates on the held-out fold — preprocessing is inside the loop
results = cross_validate(pipe, X_train, y_train, cv=cv, scoring=cfg.scoring)

If your preprocessing is too complex for a pipeline (for example, heavy feature engineering with external data), use the training indices from each fold explicitly and transform the validation indices separately. The rule is the same: fit on train, transform on validation, never the reverse.

Group-aware splits

If the dataset contains multiple samples per entity — multiple images per patient, multiple transactions per user, multiple readings per sensor — a random split scatters the same entity across train and test. The model memorizes entity-specific patterns rather than learning generalizable ones: with patient data, it learns patient A’s baseline lab values rather than the disease signal. Reported metrics can be inflated by 5–30%.

train_test_split does not support group constraints. Use GroupShuffleSplit for the initial split and GroupKFold for cross-validation:

from sklearn.model_selection import GroupShuffleSplit, GroupKFold

# Initial split respecting groups
gss = GroupShuffleSplit(n_splits=1, test_size=cfg.data.test_size, random_state=cfg.seeds[0])
train_idx, test_idx = next(gss.split(X, y, groups=groups))

X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

# Cross-validation respecting groups
cv = GroupKFold(n_splits=cfg.cv_splits)
results = cross_validate(pipe, X_train, y_train, cv=cv, groups=groups[train_idx], scoring=cfg.scoring)

If your data has no natural grouping (each row is an independent observation), standard train_test_split and StratifiedKFold are correct. When groups and class imbalance both matter, use StratifiedGroupKFold.

Early stopping

Iterative algorithms like gradient boosting (XGBoost, LightGBM) or deep learning require a separate validation set inside the training loop to monitor performance and halt training before overfitting begins. This is distinct from the validation folds used in cross-validation for model selection — early stopping uses a dedicated split within each training fold.

When using cross-validation, passing this internal validation set through a scikit-learn Pipeline can be notoriously tricky, as pipelines do not natively map sample-level fit_params to the estimator for the held-out fold.

To implement early stopping during cross-validation properly, you must either manually iterate over the fold indices and pass eval_set=[(X_val, y_val)] directly to the estimator’s fit method, or use a library-specific integration (like xgboost.sklearn):

scorer = get_scorer(cfg.scoring)
cv = StratifiedKFold(n_splits=cfg.cv_splits, shuffle=True, random_state=seed)

for fold, (train_idx, val_idx) in enumerate(cv.split(X_train, y_train)):
    X_fold_train, X_fold_val = X_train.iloc[train_idx], X_train.iloc[val_idx]
    y_fold_train, y_fold_val = y_train.iloc[train_idx], y_train.iloc[val_idx]

    scaler = StandardScaler()
    X_fold_train_scaled = scaler.fit_transform(X_fold_train)
    X_fold_val_scaled = scaler.transform(X_fold_val)

    model = XGBClassifier(**cfg.model.params, random_state=seed, early_stopping_rounds=20)
    model.fit(
        X_fold_train_scaled, y_fold_train,
        eval_set=[(X_fold_val_scaled, y_fold_val)],
        verbose=False,
    )

    score = scorer(model, X_fold_val_scaled, y_fold_val)

The full early stopping CV loop is in data_splitting.example.py (see Resources).

Never use the final test set for early stopping — this turns the test set into a validation set and leaks information directly into the model’s stopping criteria.


Configuration management

Hard-coded hyperparameters buried in a training script are impossible to track and easy to forget. Externalizing parameters into YAML files means each config is a complete, diffable record of what a run did. Pydantic validates on load, provides IDE autocompletion, and serializes to a dict for MLflow. Both Pydantic and PyYAML are already transitive dependencies of MLflow — nothing extra to install.

Config structure

One YAML file per experiment configuration. Each file is a complete, self-contained record of what a run did — readable, diffable, and loggable as an MLflow artifact:

# configs/baseline.yaml
experiment_name: model_comparison

seeds: [42, 123, 456, 789, 1024, 2048, 3333, 5555, 7777, 9999]

data:
  path: data/processed/features_v1.parquet
  target: is_fraud
  test_size: 0.2

model:
  name: xgboost
  params:
    n_estimators: 500
    max_depth: 6
    learning_rate: 0.1

cv_splits: 5
scoring: roc_auc

A production-ready version with additional parameters is in baseline.example.yaml (see Resources). For ablation variants, copy the baseline and change one thing. Duplication is acceptable — readability and reproducibility matter more than DRY for experiment configs.

Config model

Define the config as a Pydantic model. Invalid configs fail immediately on load with a clear error — not silently at runtime halfway through a training run:

from pathlib import Path
from typing import Any

import yaml
from pydantic import BaseModel, Field


class DataConfig(BaseModel):
    path: str
    target: str
    test_size: float = Field(default=0.2, gt=0, lt=1)
    drop_columns: list[str] = []
    data_version: str = "v1"


class ModelConfig(BaseModel):
    name: str
    params: dict[str, Any] = {}


class ExperimentConfig(BaseModel):
    experiment_name: str
    seeds: list[int] = Field(default=[42, 123, 456, 789, 1024], min_length=1)
    data: DataConfig
    model: ModelConfig
    cv_splits: int = Field(default=5, ge=2)
    scoring: str = "roc_auc"

    @classmethod
    def from_yaml(cls, path: str) -> "ExperimentConfig":
        raw = yaml.safe_load(Path(path).read_text())
        return cls(**raw)

The complete config module with CLI validation is in config.example.py (see Resources). Load and use:

config_path = "configs/baseline.yaml"
cfg = ExperimentConfig.from_yaml(config_path)

# IDE autocomplete — cfg.model.name, cfg.data.path, cfg.scoring
# Log model params to MLflow (flat dict, MLflow-compatible)
mlflow.log_params(cfg.model.params)
# Log the config file itself as an artifact (reproduces the run)
mlflow.log_artifact(config_path)

Run experiments by pointing to a config file:

python train.py --config configs/baseline.yaml
python train.py --config configs/ablation_no_text.yaml

Experiment tracking

MLflow records parameters, metrics, and artifacts for every run. The tracking server provides a UI for comparing runs and a programmatic API for querying results.

Autologging

For scikit-learn, mlflow.sklearn.autolog() captures parameters, metrics, and model artifacts automatically — including child runs for GridSearchCV and RandomizedSearchCV, where each parameter combination is tracked as a nested run:

import mlflow
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

mlflow.sklearn.autolog(max_tuning_runs=20)

mlflow.set_experiment(cfg.experiment_name)

with mlflow.start_run(run_name=f"{cfg.model.name}_gridsearch"):
    estimator = XGBClassifier(**cfg.model.params)
    param_grid = {"max_depth": [3, 6, 9], "learning_rate": [0.01, 0.1, 0.3]}

    search = GridSearchCV(estimator, param_grid, cv=cfg.cv_splits, scoring=cfg.scoring)
    search.fit(X_train, y_train)
    # MLflow logs all param combinations as child runs automatically

max_tuning_runs controls how many child runs are logged for hyperparameter searches — it defaults to 5, which is too low for most experiments. Set it to match the number of parameter combinations you want to inspect.

Autolog captures: estimator parameters (via get_params(deep=True)), training metrics, best score, best parameters, and the fitted model artifact. Use autologging with GridSearchCV when the search space is small and enumerable. For larger or continuous search spaces, use Optuna (see Hyperparameter search).

Explicit logging

Add explicit logging for custom metrics, tags, and artifacts that autolog does not cover. The snippet below covers the core loop — config snapshot logging is shown in train.example.py (see Resources). mlflow.log_params() only accepts flat dicts — nested dicts get stringified into unreadable blobs. Log each level explicitly:

for seed in cfg.seeds:
    with mlflow.start_run(run_name=f"{cfg.model.name}_seed{seed}"):
        mlflow.log_params(cfg.model.params)
        mlflow.log_params({"seed": seed, "model_name": cfg.model.name, "scoring": cfg.scoring})
        mlflow.set_tag("data_version", cfg.data.data_version)

        start_time = time.monotonic()

        model = XGBClassifier(**cfg.model.params, random_state=seed)
        cv = StratifiedKFold(n_splits=cfg.cv_splits, shuffle=True, random_state=seed)

        results = cross_validate(model, X_train, y_train, cv=cv, scoring=cfg.scoring)
        oof_preds = cross_val_predict(model, X_train, y_train, cv=cv, method="predict_proba")[:, 1]

        mlflow.log_metric("training_duration_seconds", time.monotonic() - start_time)
        mlflow.log_metric(f"mean_{cfg.scoring}", results["test_score"].mean())
        mlflow.log_metric(f"std_{cfg.scoring}", results["test_score"].std())

        for i, score in enumerate(results["test_score"]):
            mlflow.log_metric(f"fold_{cfg.scoring}", score, step=i)

        with tempfile.TemporaryDirectory() as tmp_dir:
            oof_path = Path(tmp_dir) / f"oof_seed{seed}.parquet"
            pd.DataFrame({"oof_pred": oof_preds}).to_parquet(oof_path)
            mlflow.log_artifact(str(oof_path))

The complete training script is in train.example.py (see Resources).

Category What How
Parameters Seed, model hyperparams, scoring mlflow.log_params(...) or autolog
Metrics CV mean, std, per-fold scores, test metrics, duration mlflow.log_metric(...) or autolog
Artifacts Model, config file, OOF predictions mlflow.log_artifact(...) or autolog
Tags Data version, custom metadata mlflow.set_tag(...) (git hash is auto-logged)

Organizing runs

Structure runs so that comparison is natural:

  • Experiment = comparable runs on the same data. One MLflow experiment per task where runs share the same input dataset and evaluation protocol (e.g., feature_selection_ablation, model_comparison_v2). The organizing principle is input data consistency — if runs use different datasets, they belong in different experiments.
  • Run = one execution. For seed sweeps, each run corresponds to one seed × one configuration. For hyperparameter searches, use a parent run with nested child runs (one per trial) — the MLflow UI renders the hierarchy so you can compare trials within a search and compare searches within an experiment.
  • Params = inputs that affect model behavior. Log seed, model hyperparameters, and scoring metric with log_param() — these are immutable per run and define what the run did.
  • Tags = metadata for filtering. Use tags for data version and any cross-cutting attribute you want to filter or group by later. Git hash is auto-logged by MLflow as mlflow.source.git.commit — no need to set it manually. Note that mlflow.set_tag("data_version", "v1") is a label, not a guarantee — it does not track what the data actually contained. For true data reproducibility, use a data versioning tool like DVC alongside your experiment tracker so that any result can be traced back to the exact dataset that produced it.
  • Run name = human-readable label. Use a descriptive pattern like xgboost_seed42 so the MLflow UI is scannable without clicking into each run.

Avoid dumping unrelated runs into the same MLflow experiment — it makes the UI unusable and comparisons meaningless. When in doubt, create a new experiment. They are cheap.


Workflow: running an experiment

Single run

The minimal loop: load config, train, evaluate, log.

python train.py --config configs/baseline.yaml

A single run validates that the pipeline works end-to-end. It is not evidence of anything — one seed, one configuration, one data split.

Baselines

Before comparing complex models, establish what “good” means. Two baselines set the floor and the bar:

  • Trivial baseline: DummyClassifier(strategy="most_frequent") predicts the majority class. Any model must beat this — if it does not, the features carry no signal or there is a data problem.
  • Simple baseline: LogisticRegression (or Ridge for regression). A tuned linear model is often competitive. If a complex model beats logistic regression by less than the standard deviation across seeds, the added complexity is not justified.
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

baselines = {
    "dummy": DummyClassifier(strategy="most_frequent"),
    "logistic": Pipeline([
        ("scaler", StandardScaler()),
        ("model", LogisticRegression(max_iter=1000, random_state=seed)),
    ]),
}

# Run inside the seed loop (for seed in cfg.seeds:)
for name, model in baselines.items():
    cv = StratifiedKFold(n_splits=cfg.cv_splits, shuffle=True, random_state=seed)
    scores = cross_val_score(model, X_train, y_train, cv=cv, scoring=cfg.scoring)
    print(f"{name}: {scores.mean():.4f} ± {scores.std():.4f}")

The full baseline script with MLflow logging and multi-seed evaluation is in baselines.example.py (see Resources). Run baselines first. If a complex model cannot beat a tuned logistic regression by a meaningful margin, the complexity is not worth the cost.

Error analysis

Before tuning hyperparameters or adding features, understand what the model gets wrong. A model stuck at 0.85 AUC might be failing on a single error category that no amount of hyperparameter tuning will fix — but that a new feature or data cleaning step would resolve in minutes. Error analysis tells you where the ceiling is and what to try next.

The method: take your baseline model’s predictions on the validation set and sample ~100 misclassified examples. For each, write down a short error category — the reason the model got it wrong. Then count:

Error category Count % of errors Ceiling if fixed
Missing address data 34 34% +0.03 AUC
Mislabeled ground truth 22 22% +0.02 AUC
Rare merchant category 18 18% +0.01 AUC
Ambiguous cases 15 15%
Other 11 11%

The “ceiling if fixed” column is the key insight: if 34% of errors come from missing address data, fixing that category (imputation, new feature, data collection) can improve performance by roughly that proportion of the current error. Hyperparameter tuning, by contrast, optimizes within the current representation — it cannot fix a missing feature or a labeling problem.

This analysis directly informs what to try next:

  • Missing or noisy features → feature engineering or data collection, not more tuning.
  • Mislabeled examples → data cleaning before any model changes.
  • Ambiguous cases → irreducible error; do not chase this.
  • Systematic failure on a subgroup → stratified evaluation, targeted features, or more training data for that subgroup.

Use sklearn.metrics.confusion_matrix for a quick overview, then inspect actual examples for the error categories that matter most. Log the error analysis table as an MLflow artifact — it documents the reasoning behind your next experiment, not just the result.

Once the pipeline works and baselines are established, hyperparameter search finds the configuration that best exploits the data. Search is not training — it is meta-optimization over training runs. The search space lives in code (distributions cannot be expressed in YAML); the config file holds fixed parameters. The full lifecycle:

  1. Search — Optuna explores the search space, MLflow logs every trial.
  2. Freeze — write the best params into a new config YAML.
  3. Evaluate — run the frozen config across all seeds as a single run.

The config records what worked. The code defines how you searched. Both are logged to MLflow.

Use Optuna for Bayesian optimization with MLflow tracking:

def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "learning_rate": trial.suggest_float("learning_rate", 0.001, 0.3, log=True),
    }

    with mlflow.start_run(nested=True, run_name=f"trial_{trial.number}"):
        mlflow.log_params(params)

        pipe = Pipeline([
            ("scaler", StandardScaler()),
            ("model", XGBClassifier(**params, random_state=cfg.seeds[0])),
        ])
        cv = StratifiedKFold(n_splits=cfg.cv_splits, shuffle=True, random_state=cfg.seeds[0])
        scores = cross_val_score(pipe, X_train, y_train, cv=cv, scoring=cfg.scoring)
        mean_score = scores.mean()

        mlflow.log_metric(f"val_{cfg.scoring}", mean_score)

    return mean_score

study = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.TPESampler(seed=cfg.seeds[0]),
)
study.optimize(objective, n_trials=100)

The full search-freeze-evaluate lifecycle is in hparam_search.example.py (see Resources).

The inline snippet shows the trial-level logging. In practice, wrap study.optimize(...) inside a parent run (with mlflow.start_run(run_name="hparam_search"):) so each trial is a nested child — the MLflow UI renders this hierarchy for easy comparison. The full nested-run setup is in hparam_search.example.py (see Resources).

Hyperparameter search history showing trial scores and running best with plateau annotation

2. Freeze

Write the best params to a new config file — this is the reproducible record:

import yaml

best_cfg = cfg.model_copy(deep=True)
best_cfg.model.params = study.best_params

config_path = "configs/best_xgboost.yaml"
Path(config_path).write_text(
    yaml.dump(best_cfg.model_dump(), default_flow_style=False)
)

3. Evaluate

Run the frozen config across all seeds to get the validation results you report:

python train.py --config configs/best_xgboost.yaml

Then evaluate the held-out test set — this is the final, unbiased estimate. Use the same pipeline that was used during cross-validation so preprocessing is consistent. Train across all seeds to confirm stability. The --threshold flag also evaluates precision, recall, and F1 at the given operating point:

python test.py --config configs/best_xgboost.yaml --threshold 0.5

The core loop fits each seed and scores with get_scorer(cfg.scoring):

scorer = get_scorer(cfg.scoring)
test_scores = []

for seed in cfg.seeds:
    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("model", XGBClassifier(**study.best_params, random_state=seed)),
    ])
    pipe.fit(X_train, y_train)
    test_scores.append(scorer(pipe, X_test, y_test))

The full held-out evaluation — including thresholded metrics and MLflow logging — is in test.example.py (see Resources).

If the mean test score is substantially lower than the cross-validation score, the model is overfitting to the training distribution. If the standard deviation across seeds is large, the model is unstable. Do not go back and tune — that turns the test set into a validation set. This held-out score is the final check — it confirms the model generalizes. For classifiers, test.example.py evaluates at a given threshold; see Evaluation: threshold tuning for how to choose that threshold.

Small datasets: when the dataset is too small for a reliable held-out test set (roughly n < 1,000), the holdout estimate has high variance. Use nested cross-validation instead: an outer loop evaluates generalization while an inner loop selects hyperparameters. This gives an unbiased performance estimate without sacrificing data. The cost is computational — 5-fold outer × 5-fold inner × 100 trials is expensive, so reserve nested CV for settings where every sample counts.

Search strategy: Bayesian optimization (Optuna’s TPE sampler) outperforms grid search and random search in most settings — it focuses trials on promising regions rather than sampling blindly. Random search is a reasonable baseline when the search space is small. Grid search is wasteful — it spends trials on unimportant dimensions.

Reproducibility: TPESampler(seed=...) makes the search deterministic, but only with sequential execution. Parallel optimization (n_jobs > 1) reseeds internally to avoid duplicate suggestions across workers, breaking reproducibility. The search evaluates each trial on a single seed’s CV folds — this adds variance to the selection. The multi-seed evaluation in step 3 guards against lucky picks: if the frozen config degrades across 10 seeds, the search result was not robust.

Budget: start with 100 trials. Check the optimization history — if the best score plateaus after 30 trials, more trials will not help. If it is still improving at 100 trials, the search space may be too large or the problem too noisy. Be aware of the total cost: 100 Optuna trials × 5-fold CV = 500 model fits for the search alone, plus 10 seeds × 5-fold CV = 50 fits for the final evaluation. Start with fewer seeds or trials during exploration and scale up for the final run.

Ablation study

Ablation answers: “does this component actually help?”

  1. Start from the full system — the configuration with all components enabled. This is your control.
  2. Remove one component at a time and re-run. Each ablation variant changes exactly one thing so the effect is attributable. This is leave-one-component-out (LOCO) ablation — the standard approach. Its limitation: it cannot detect interaction effects. If two components are individually weak but jointly critical (synergy), or if one component compensates when another is removed (masquerade), LOCO will miss it. Use a factorial design when you suspect interactions.
  3. Run each variant across multiple seeds (same seeds for all variants).
  4. Compare to the control using the same metric and statistical test (see Evaluation: comparing results).

Run all variants with a simple loop:

for config in configs/ablation_*.yaml; do
    python train.py --config "$config"
done

A shell runner with config validation is in ablation.example.sh (see Resources). Report ablation results as a table: each row is a variant, columns show the metric mean ± std, and the delta from the control. This is the format reviewers expect:

Variant AUC (mean ± std) Δ vs. control p-value
Full system (control) 0.891 ± 0.012
− text features 0.864 ± 0.015 −0.027 0.003
− interaction terms 0.887 ± 0.013 −0.004 0.42
− feature selection 0.879 ± 0.014 −0.012 0.04

Ablation study bar chart showing AUC delta vs control with significance markers

For statistical tests and how to read comparison tables, see Evaluation: comparing results.


Common pitfalls

  • Single-seed conclusions. One seed proves nothing. A model can win on seed 42 and lose on seed 123. Always run multiple seeds and report the distribution.
  • Overfitting the validation set. If you run 200 hyperparameter trials optimizing validation AUC, the best validation AUC is biased upward. The test set must be held out from the entire search process — evaluate it once, at the end, on the selected configuration only.
  • Comparing across different splits. If model A was evaluated on a random 80/20 split and model B on 5-fold CV, the comparison is meaningless. Fix the evaluation protocol before comparing.
  • Leakage through preprocessing. Fitting a scaler, encoder, or imputer on the full dataset before splitting leaks information from the test set into training. All preprocessing must be fit inside the cross-validation loop, on the training fold only.
  • Selective reporting. Running many experiments and reporting only the best result is p-hacking. Report all variants, including the ones that did not work — failed experiments are information.
  • Confounding seed with configuration. If you change the learning rate and the seed at the same time, you cannot attribute the result to either. Change one variable at a time, or use a full factorial design with multiple seeds per configuration.
  • Not logging enough to reproduce. If you cannot re-run a result six months later from the logged config, the logging was insufficient. Log the complete config, the git hash, and the environment. Test reproduction before you need it.
  • Not analyzing errors before iterating. Jumping from a baseline score to hyperparameter tuning skips the most informative step. Error analysis on ~100 misclassified examples often reveals that the dominant error category (missing features, mislabeled data, a broken preprocessing step) cannot be fixed by tuning — and that fixing it improves performance more than any hyperparameter change would (see Error analysis).
  • Not documenting compute resources. A result that takes 8 GPU-hours to reproduce but does not say so wastes the reviewer’s time — or makes the result irreproducible for anyone without equivalent hardware. Log runtime, hardware (CPU/GPU, memory), and total compute cost alongside metrics.

results matching ""

    No results matching ""