Think of it this way: a recipe is the algorithm. The dish that comes out is the model. You can follow the same recipe with different ingredients and get a different dish. Similarly, the same algorithm trained on different data produces a different model – one that may be excellent, mediocre, or completely wrong depending on the quality of what went into it.
Algorithm vs Model: The Distinction That Actually Matters
| Concept | What It Is | Analogy | Example |
|---|---|---|---|
| Algorithm | The learning procedure – rules for how to adjust based on data | A recipe | Random Forest, Gradient Boosting, Backpropagation |
| Model | The trained artifact – weights, rules, or structure learned from data | The cooked dish | A .pkl file, a neural network with fixed weights, a fitted decision tree |
| Training | Running the algorithm on data to produce the model | Cooking | Fitting RandomForestClassifier on your dataset |
| Inference | Using the trained model to make predictions on new data | Serving the dish | model.predict(new_customer_data) |
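To make the table concrete, here is a minimal sketch in plain Python. It assumes the simplest possible algorithm, ordinary least squares on one feature; the names `fit_line` and `LineModel` are hypothetical and chosen only for illustration. The function is the algorithm, the object it returns is the model, and calling `predict` is inference.

```python
import pickle
from dataclasses import dataclass, asdict


@dataclass
class LineModel:
    """The trained artifact: two learned numbers and nothing else."""
    slope: float
    intercept: float

    def predict(self, x: float) -> float:
        # Inference: apply the learned parameters to a new input.
        return self.slope * x + self.intercept


def fit_line(xs, ys) -> LineModel:
    """The algorithm: ordinary least squares for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return LineModel(slope=slope, intercept=mean_y - slope * mean_x)


# Training: the same algorithm on different data yields different models.
model_a = fit_line([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])  # learns y = 2x
model_b = fit_line([1.0, 2.0, 3.0, 4.0], [5.0, 4.0, 3.0, 2.0])  # learns y = 6 - x

# Saving the model stores the learned parameters, not the algorithm.
blob = pickle.dumps(asdict(model_a))
restored = LineModel(**pickle.loads(blob))
print(restored.predict(10.0))
```

The pickled blob here plays the role of the .pkl file in the table: it contains only what was learned, so it can be loaded and used for inference without rerunning training.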
The Model Lifecycle: From Training to Retirement
1. Data Collection and Preparation
Garbage in, garbage out is the most repeated phrase in machine learning – and the most ignored. A model is only as good as the data it learned from. Data preparation is often estimated to consume 60-80% of a data scientist’s time; it includes cleaning missing values, encoding categorical variables, normalizing scales, and splitting into train/validation/test sets.
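Three of the preparation steps named above can be sketched with the standard library alone. This is an illustration under simple assumptions: missing values are marked as `None`, mean imputation and min-max scaling are the chosen techniques, and the `ages` column and the 70/15/15 split are hypothetical.

```python
import random
from statistics import mean


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly into [0, 1]. Assumes they are not all equal."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def split_rows(rows, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle once, then cut into train / validation / test partitions."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


ages = impute_mean([22, None, 35, 41, None, 58])  # hypothetical column
ages_scaled = min_max_scale(ages)
train, val, test = split_rows(range(100))
print(len(train), len(val), len(test))
```

In a real pipeline the imputation mean and the scaling range would normally be computed on the training split only and then reused on the other two, so that no information from validation or test data leaks into training.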
2. Training
The algorithm iterates through training data, adjusting internal parameters (weights, thresholds, split points) to minimize a loss function – a measure of how wrong the current model’s predictions are. In iterative methods such as gradient descent, each pass through the full training dataset is called an epoch, and training commonly stops when performance on a held-out validation set stops improving (early stopping).
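The loop described above can be sketched as follows. This is one possible instance, not the only one: it assumes a one-feature linear model, mean squared error as the loss, full-batch gradient descent, and a hypothetical `patience` setting that stops training once the validation loss has not improved for ten epochs. The data is synthetic and noise-free.

```python
def mse(w, b, data):
    """Loss function: mean squared error of the line y = w*x + b."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def train(train_data, val_data, lr=0.1, max_epochs=500, patience=10):
    w, b = 0.0, 0.0
    best_loss, best_w, best_b = float("inf"), w, b
    stale = 0        # epochs since the validation loss last improved
    epochs_run = 0
    n = len(train_data)
    for _ in range(max_epochs):
        epochs_run += 1
        # One epoch: a full pass over the training data.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in train_data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in train_data) / n
        w -= lr * grad_w
        b -= lr * grad_b

        # Check progress on data the gradient never sees.
        val_loss = mse(w, b, val_data)
        if val_loss < best_loss - 1e-9:
            best_loss, best_w, best_b, stale = val_loss, w, b, 0
        else:
            stale += 1
            if stale >= patience:   # early stopping
                break
    return best_w, best_b, epochs_run


# Synthetic data generated from y = 3x + 1.
train_data = [(x, 3 * x + 1) for x in (0.0, 0.5, 1.0, 1.5, 2.0)]
val_data = [(x, 3 * x + 1) for x in (0.25, 0.75, 1.25, 1.75)]
w, b, epochs_run = train(train_data, val_data)
print(round(w, 2), round(b, 2), epochs_run)
```

The function returns the parameters from the best validation epoch rather than the last one, which is the usual reason for tracking `best_w` and `best_b` separately.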
3. Validation and Hyperparameter Tuning
Hyperparameters are the settings of the algorithm itself – how many trees in a forest, how deep each tree grows, the learning rate. These are not learned from data; they are set by the practitioner. Grid search, random search, and Bayesian optimization are common methods for finding the hyperparameter combination that produces the best-performing model on the validation set.
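Grid search is the easiest of the three to show in a few lines. The sketch below assumes a toy one-feature k-nearest-neighbours regressor with two hypothetical hyperparameters, `k` and `weighted`; every combination is scored on a validation set and the lowest error wins.

```python
from itertools import product


def knn_predict(train, x, k, weighted):
    """Predict y at x from the k nearest training points (one feature)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    if weighted:
        # Closer neighbours count more; the small constant avoids 1/0.
        weights = [1.0 / (abs(px - x) + 1e-9) for px, _ in nearest]
    else:
        weights = [1.0] * len(nearest)
    return sum(w * py for w, (_, py) in zip(weights, nearest)) / sum(weights)


def val_mse(train, val, **params):
    """Validation error for one hyperparameter combination."""
    return sum((knn_predict(train, x, **params) - y) ** 2 for x, y in val) / len(val)


def grid_search(train, val, grid):
    """Try every combination in the grid; keep the lowest validation error."""
    names = list(grid)
    results = []
    for combo in product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        results.append((val_mse(train, val, **params), params))
    best_score, best_params = min(results, key=lambda r: r[0])
    return best_score, best_params, results


# Hypothetical data from y = x^2; validation points lie between training points.
train = [(x, x * x) for x in range(0, 20, 2)]
val = [(x, x * x) for x in range(1, 20, 4)]
grid = {"k": [1, 2, 5], "weighted": [False, True]}

best_score, best_params, results = grid_search(train, val, grid)
print(best_params, round(best_score, 3))
```

The cost grows multiplicatively with each added hyperparameter (here 3 x 2 = 6 fits), which is the main reason random search and Bayesian optimization are preferred for larger search spaces.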
4. Testing on Held-Out Data
The test set is data the model has never seen – not during training, not during validation. This is the final, honest estimate of how the model will perform in the real world, provided it is consulted only once, after all tuning is finished. A model that performs brilliantly on training data but poorly on test data has overfit – it memorized rather than learned.
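The gap between training and test error is easy to see with a model that memorizes by construction. The sketch below assumes hypothetical noisy data from y = 2x and a one-nearest-neighbour "memorizer"; both are illustrative choices, not a recommended setup.

```python
import random

rng = random.Random(0)


def sample(n):
    """Hypothetical noisy data: y = 2x plus Gaussian noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    return [(x, 2 * x + rng.gauss(0, 1)) for x in xs]


def memorizer(train):
    """A model that memorizes: it returns the y of the closest training x."""
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]


def mse(predict, data):
    return sum((predict(x) - y) ** 2 for x, y in data) / len(data)


train_set, test_set = sample(50), sample(50)
model = memorizer(train_set)

# Every training point is its own nearest neighbour, so the error is zero.
train_error = mse(model, train_set)
# On unseen points the memorized noise does not transfer.
test_error = mse(model, test_set)
gap = test_error - train_error
print(train_error, round(test_error, 2))
```

A perfect training score paired with a clearly worse test score is the signature of overfitting; a model that had learned the underlying trend would typically show similar errors on both sets.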
5. Deployment
A trained model saved to disk is not yet useful. Deployment means wrapping it in an API, embedding it in an application, or integrating it into a data pipeline so that real users or real systems can call it. This step involves software engineering skills that are separate from model training – containerization, API design, latency optimization, and load handling.
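As a rough illustration of "wrapping it in an API", the sketch below uses only Python's built-in `http.server`. The `MODEL` weights, the JSON request shape (`{"features": [...]}`) and the port are all assumptions made for the example; a production service would more likely use a dedicated web framework, plus authentication, logging, and timeouts.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stand-in for a model loaded from disk at start-up (hypothetical weights).
MODEL = {"weights": [0.4, -0.2], "bias": 0.1}


def predict(features):
    return sum(w * x for w, x in zip(MODEL["weights"], features)) + MODEL["bias"]


def handle_predict(body: bytes):
    """Validate the request body and return (HTTP status, JSON payload)."""
    try:
        features = json.loads(body)["features"]
        if len(features) != len(MODEL["weights"]):
            raise ValueError("wrong number of features")
        return 200, {"prediction": predict([float(x) for x in features])}
    except (ValueError, KeyError, TypeError) as exc:
        # Malformed JSON, missing key, or non-numeric input.
        return 400, {"error": str(exc)}


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_predict(self.rfile.read(length))
        data = json.dumps(payload).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port=8000):
    """Blocks forever; call this from a script entry point."""
    HTTPServer(("127.0.0.1", port), PredictHandler).serve_forever()
```

Keeping the validation and prediction logic in `handle_predict`, separate from the HTTP plumbing, means it can be tested without starting a server. Calling `serve()` starts the endpoint locally.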
6. Monitoring and Drift Detection
A deployed model degrades over time as the real world changes. A fraud detection model trained on 2022 fraud patterns may perform poorly against 2025 tactics. Model drift occurs when the distribution of input features, or the relationship between those features and the outputs, changes in the real world. Production monitoring tracks prediction distributions and triggers retraining when performance drops.
Types of Models by Output
| Model Type | What It Outputs | Real-World Example | Common Algorithms |
|---|---|---|---|
| Classifier | A category or class label | Spam / not spam; disease present / absent | Logistic Regression, Random Forest, SVM, Neural Nets |
| Regressor | A continuous number | House price, sales forecast, temperature | Linear Regression, XGBoost, SVR |
| Clustering model | Group assignments for unlabelled data | Customer segments, document topics | K-Means, DBSCAN, Gaussian Mixture |
| Ranking model | Ordered list by relevance or score | Search results, product recommendations | LambdaMART, learning-to-rank models |
| Generative model | New synthetic data (text, images, audio) | ChatGPT responses, Midjourney images | LLMs (Transformers), GANs, Diffusion models |
| Anomaly detection | Flag of unusual or outlier observations | Fraud transaction, equipment failure signal | Isolation Forest, Autoencoders, One-Class SVM |
How Models Are Evaluated: The Metrics That Matter
Accuracy is the most misunderstood metric in machine learning. A model that predicts ‘not fraud’ for every transaction achieves 99.9% accuracy on a dataset where fraud is 0.1% of cases – and catches zero fraud. The right metric depends on what matters in your specific context.
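The arithmetic behind that example is worth checking once by hand. The snippet below assumes a hypothetical dataset of 100,000 transactions, 100 of them fraudulent, and a "model" that always predicts not fraud.

```python
# Hypothetical dataset: 100,000 transactions, 0.1% of them fraudulent.
labels = [1] * 100 + [0] * 99_900        # 1 = fraud, 0 = not fraud
predictions = [0] * len(labels)          # always predicts "not fraud"

correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

# Recall: of the actual fraud cases, how many were flagged?
caught = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = caught / sum(labels)

print(f"accuracy = {accuracy:.1%}, recall = {recall:.0%}")
# prints: accuracy = 99.9%, recall = 0%
```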
| Metric | Used For | What It Measures | When It Matters Most |
|---|---|---|---|
| Accuracy | Classification | % of correct predictions overall | Balanced classes only |
| Precision | Classification | Of predicted positives, how many are real? | High cost of false alarms (spam filters) |
| Recall | Classification | Of actual positives, how many were caught? | High cost of missing cases (cancer screening) |
| F1 Score | Classification | Harmonic mean of precision and recall | Imbalanced classes |
| AUC-ROC | Classification | Model’s ability to separate classes across thresholds | Ranking quality; can look optimistic under heavy class imbalance, where precision-recall curves are often more informative |
| RMSE | Regression | Square root of the mean squared prediction error | Large errors are disproportionately costly |
| MAE | Regression | Average absolute prediction error | Outliers should not dominate the score |
| NDCG | Ranking | Quality of ranking order | Search, recommendations |
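Several rows of the table can be written out in a few lines each. The definitions below are the standard ones; for NDCG the sketch assumes the linear-gain variant with a log2 position discount, which is one of several in common use. All data values are made up for illustration.

```python
import math


def precision_recall_f1(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def ndcg(relevances):
    """NDCG for one ranked list of graded relevance scores."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0


# A cautious classifier: one positive flagged, two missed, no false alarms.
p, r, f1 = precision_recall_f1([1, 1, 1, 0, 0, 0], [1, 0, 0, 0, 0, 0])

# Same total absolute error, spread evenly versus concentrated in one outlier.
actual = [10, 10, 10, 10]
even = [11, 11, 11, 11]
outlier = [10, 10, 10, 14]
print(mae(actual, even), mae(actual, outlier))    # identical
print(rmse(actual, even), rmse(actual, outlier))  # the outlier case is larger

print(ndcg([3, 2, 1]), ndcg([1, 2, 3]))           # ideal order versus reversed
```

The regression pair shows the difference the table describes: MAE treats the two prediction sets as equally good, while RMSE penalizes the one that concentrates its error in a single large miss.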
Model Drift: Why Yesterday’s Model Fails Tomorrow
Model drift is the gradual degradation of a deployed model’s performance as the world changes. There are two main types:
Data drift (covariate shift): The distribution of input features changes. Example: a model trained on desktop user behaviour degrades as most users switch to mobile.
Concept drift: The relationship between features and the target variable changes. Example: what constitutes fraudulent behaviour changes as attackers adapt to your defences.
Monitoring for drift requires tracking prediction distributions, feature distributions, and real-world outcomes over time. When metrics fall below defined thresholds, the model is retrained on fresh data. In mature MLOps pipelines, this retraining can be triggered automatically.
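One common way to track a feature distribution is the Population Stability Index (PSI), which compares a baseline sample with a recent one bin by bin. The text above does not prescribe a specific statistic, so PSI is an assumption here, as are the ten equal-width bins and the 0.2 alert threshold, which is only a widely quoted rule of thumb.

```python
import math


def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline and a recent sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Values outside the baseline range fall into the edge bins.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    p, q = proportions(expected), proportions(actual)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


# Hypothetical feature values at training time and in production.
baseline = [i % 100 for i in range(1000)]
shifted = [v + 30 for v in baseline]

ALERT_THRESHOLD = 0.2  # rule of thumb, not a universal constant
for name, recent in [("unchanged", baseline), ("shifted", shifted)]:
    score = psi(baseline, recent)
    print(name, round(score, 3), "retrain?" if score > ALERT_THRESHOLD else "ok")
```

PSI only detects data drift. Catching concept drift additionally requires comparing predictions against real-world outcomes once those labels arrive.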
The Gap Between a Model and a Product
This is where many data science projects die quietly. A model with 89% accuracy in a Jupyter notebook is not a product. The remaining work – productionising – is often underestimated and underfunded:
- Latency: Does it respond in milliseconds (required for real-time applications) or seconds (acceptable for batch)?
- Explainability: Can you tell a customer or regulator why the model made a decision? In many jurisdictions this is a legal requirement for decisions in areas such as finance, healthcare, and hiring.
- Fairness auditing: Does the model discriminate against protected groups? Bias in training data produces biased outputs.
- Fallback logic: What happens when the model is unavailable or confidence is below threshold?
- Versioning: How do you roll back to a previous model if the new one performs worse in production?
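Three of the items above (latency, fallback logic, and versioning) can be sketched together. Everything here is hypothetical: models are assumed to be callables returning a `(label, confidence)` pair, the 0.8 threshold is arbitrary, and `ModelRegistry` is an in-memory stand-in for a real model registry.

```python
import time


def safe_default(features):
    """Hypothetical fallback: route the case to a human instead of guessing."""
    return {"label": "manual_review", "source": "fallback"}


def predict_with_fallback(model, features, min_confidence=0.8):
    """Use the model's answer only if it responds and is confident enough."""
    started = time.perf_counter()
    try:
        label, confidence = model(features)
    except Exception:
        result = safe_default(features)          # model unavailable or crashed
    else:
        if confidence >= min_confidence:
            result = {"label": label, "source": "model"}
        else:
            result = safe_default(features)      # model answered but is unsure
    result["latency_ms"] = (time.perf_counter() - started) * 1000
    return result


class ModelRegistry:
    """Keeps every deployed version so a bad release can be rolled back."""

    def __init__(self):
        self._versions = []

    def deploy(self, model):
        self._versions.append(model)

    def rollback(self):
        if len(self._versions) < 2:
            raise RuntimeError("no earlier version to roll back to")
        return self._versions.pop()

    @property
    def active(self):
        return self._versions[-1]


def model_v1(features):
    return "approve", 0.95


def model_v2(features):
    return "approve", 0.55   # a new release that turns out to be less certain


registry = ModelRegistry()
registry.deploy(model_v1)
registry.deploy(model_v2)
unsure = predict_with_fallback(registry.active, {"amount": 120})

registry.rollback()          # v2 underperforms, so return to v1
confident = predict_with_fallback(registry.active, {"amount": 120})
print(unsure["source"], confident["source"])
```

Recording `latency_ms` on every call gives the monitoring layer the data it needs to check the response-time budget from the first bullet.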
Even the best models can fail in production, not because the machine learning was wrong, but because the surrounding engineering, governance, and monitoring infrastructure was not built. A mediocre model with excellent production infrastructure often delivers more business value than a brilliant model deployed carelessly.
