What is logistic regression in machine learning?

What logistic regression actually is

Logistic regression is one of the oldest and most resilient tools in the supervised-learning toolbox. It predicts the probability of a binary outcome by mapping a linear combination of input features through the logistic (sigmoid) function. The output is a number between 0 and 1, which a decision threshold then turns into a class label. That is the entire model — and the reason it remains a default baseline in production systems where interpretability and calibration matter more than raw accuracy.

The name is misleading in a useful way. Despite the word “regression”, the technique is a classifier. What gets regressed is the log odds of the positive class against the input features, not the class label itself. We mention this early because most confusion about logistic regression traces back to people expecting it to behave like linear regression on a binary target — which it deliberately does not.

How the sigmoid and log odds fit together

The mechanism is worth seeing in one piece. The model computes a weighted sum of the inputs — $z = w_0 + w_1 x_1 + \dots + w_n x_n$ — and then squashes $z$ through the sigmoid $\sigma(z) = 1 / (1 + e^{-z})$. The output is interpretable as the estimated probability that the example belongs to the positive class.
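The forward pass described above fits in a few lines. This is a minimal sketch rather than a production implementation: the weights, bias, and inputs are hypothetical values chosen only for illustration, and `predict_proba` is a name invented here, not a library function.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(weights: list[float], bias: float, x: list[float]) -> float:
    """Weighted sum of the inputs, squashed through the sigmoid."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)

# Hypothetical parameters and inputs (not taken from any fitted model).
weights = [0.8, -0.5]
bias = -0.2
x = [2.0, 1.0]

p = predict_proba(weights, bias, x)  # z = -0.2 + 1.6 - 0.5 = 0.9
label = int(p >= 0.5)                # threshold turns the probability into a class
print(f"p = {p:.3f}, label = {label}")
```

With these numbers $z = 0.9$, so the predicted probability is roughly 0.71 and the example is labelled positive at a 0.5 threshold.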

The other direction is just as important. If $p$ is the predicted probability, then $\log(p / (1-p)) = z$. That left-hand side is the log odds, and the model is linear in the log-odds space. This is the structural assumption every logistic regression carries with it: the relationship between the features and the log odds is additive and linear. If that assumption is violated badly enough — non-monotonic effects, strong interactions, threshold behaviour — the model will silently underperform regardless of how much data you throw at it.
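The step from the sigmoid to the log odds is short algebra, using only the definitions already given:

$$
p = \frac{1}{1 + e^{-z}}
\;\Longrightarrow\;
1 - p = \frac{e^{-z}}{1 + e^{-z}}
\;\Longrightarrow\;
\frac{p}{1 - p} = e^{z}
\;\Longrightarrow\;
\log\frac{p}{1 - p} = z = w_0 + w_1 x_1 + \dots + w_n x_n .
$$

Read right to left, a one-unit increase in $x_j$ adds $w_j$ to the log odds, which multiplies the odds $p / (1-p)$ by $e^{w_j}$.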

Parameters are fit by maximum likelihood estimation: we pick the weights that make the observed labels most probable under the model. In practice this is solved with an iterative optimiser (L-BFGS, SAG, or stochastic gradient descent for large datasets). Convexity of the loss is a quiet but significant property: there are no local minima to get stuck in, so any optimum the optimiser reaches is a global one. The one caveat is perfectly separable data, where the unregularised weights grow without bound; regularisation removes that problem. Most modern alternatives (neural networks, gradient-boosted trees) do not give you that guarantee.
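To make the fitting step concrete, here is a small sketch of maximum likelihood by plain gradient descent on a single feature. It is an illustration under stated assumptions, not how production libraries do it: the toy data, the learning rate, the step count, and the function names `mean_log_loss` and `fit` are all made up for this example.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def mean_log_loss(w, b, xs, ys):
    """Negative log-likelihood per example; minimising it maximises the likelihood."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b + w * x)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(xs)

def fit(xs, ys, w=0.0, b=0.0, lr=0.1, steps=3000):
    """Gradient descent; the gradient of the log loss is (p - y) times the input."""
    n = len(xs)
    for _ in range(steps):
        errors = [sigmoid(b + w * x) - y for x, y in zip(xs, ys)]
        w -= lr * sum(e * x for e, x in zip(errors, xs)) / n
        b -= lr * sum(errors) / n
    return w, b

# Hypothetical toy data: labels overlap, so the classes are not perfectly separable.
xs = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]
ys = [0, 0, 1, 0, 1, 1]

w_a, b_a = fit(xs, ys)                  # start from zero
w_b, b_b = fit(xs, ys, w=2.0, b=-3.0)   # start from somewhere else
print(w_a, b_a, mean_log_loss(w_a, b_a, xs, ys))
print(w_b, b_b, mean_log_loss(w_b, b_b, xs, ys))
```

Both starting points should end at essentially the same weights, which is the practical meaning of convexity here.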

When logistic regression is the right tool

There is a recognisable shape to problems where logistic regression is the correct default:

Signal	Why logistic regression fits
Binary or two-stage decision	The model’s native output is a calibrated probability for one class.
Small to mid-size tabular data	Few parameters, low variance, performs well with limited samples.
Need to explain why	Each coefficient has a direct log-odds interpretation.
Auditable or regulated context	Decisions can be traced back to feature contributions.
Probability matters, not just the label	Sigmoid outputs are well-calibrated under mild assumptions.

Credit scoring, clinical risk models, ad click-through prediction, and fraud triage all sit in this band. None of them are technically the hardest problems in machine learning, but they are problems where being wrong is expensive and being inexplicable is worse.

Where it quietly fails

The honest version of this story includes the failure modes, because logistic regression rarely fails loudly. It fails by producing a model that fits, ships, and underperforms in ways the team only notices weeks later.

Non-linear feature-to-log-odds relationships. If risk rises and then falls as a feature increases, a single coefficient cannot represent that. Polynomial features or splines can patch this, but at that point a tree-based model is usually a better choice.
Strong feature interactions. Logistic regression treats features as additive in log-odds space. Cross-features have to be engineered manually. Gradient-boosted trees and neural networks discover them.
High-cardinality categorical data. One-hot encoding inflates dimensionality fast. Target or frequency encoding helps, but introduces leakage risk if you are not careful.
Severe class imbalance. With 1% positives, the model will happily predict the majority class and look accurate. Class weighting, threshold tuning, and proper evaluation under ROC-AUC or precision-recall curves are mandatory, not optional.
Multi-class problems with structure. Multinomial logistic regression handles unordered classes, but if the classes have a real hierarchy or ordering, you usually want a different model family.
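The class-imbalance item in the list above is easy to demonstrate without fitting anything. The sketch below uses hypothetical labels with 1% positives and a stand-in "model" that always predicts the majority class; the numbers are illustrative only.

```python
# Hypothetical labels: 1,000 examples, 1% positive.
y_true = [1] * 10 + [0] * 990
# A degenerate "model" that always predicts the majority class.
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = true_positives / sum(y_true)

print(f"accuracy = {accuracy:.2f}, recall = {recall:.2f}")
```

Accuracy comes out at 0.99 while recall on the positive class is 0.00, which is why accuracy alone cannot be the evaluation metric on data like this.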

The reason these are quiet failures is that the training loss can look fine while the model is fundamentally misspecified. We see this pattern often when teams have moved from a small pilot dataset to a larger production stream and never retested the linearity-in-log-odds assumption.

Logistic regression vs neural networks

This comparison comes up in nearly every machine-learning discussion, usually framed as a contest. It is not. The two models answer different questions about the same data.

Logistic regression assumes a linear decision boundary in feature space and produces a single interpretable weight per feature. A neural network learns a non-linear decision boundary through stacked transformations, at the cost of interpretability and the need for substantially more data and compute. On a problem with 5,000 rows, ten clean features, and a regulatory requirement to explain decisions, logistic regression wins by structure, not by accuracy. On image classification or natural language understanding, it is the wrong tool by construction — the features are not informative until a deeper model has transformed them.

A useful heuristic: if you cannot articulate why a feature should affect the outcome, logistic regression will not save you. If you can, and the relationships are roughly monotonic, logistic regression will probably outperform a deeper model on the metrics that matter — including calibration, training time, and the ability to defend the model in front of a non-technical audience.

Regularisation matters more than people think

Plain logistic regression overfits less than a deep network, but it does overfit, particularly with many correlated features. L2 regularisation (ridge) shrinks coefficients toward zero and is the safe default. L1 regularisation (lasso) drives some coefficients exactly to zero and is the right choice when feature selection is part of the goal. Elastic net combines both. In practice, the choice of regularisation strength affects out-of-sample performance more than most other knobs available to you. In scikit-learn it is controlled by the parameter C, which is the inverse of the strength, so a smaller C means stronger regularisation. Cross-validate it.
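The shrinking effect of an L2 penalty can be seen in a few lines. This is a hand-rolled sketch, not scikit-learn's implementation: `fit_l2`, the penalty weight `lam`, and the toy data are assumptions made for illustration. Here `lam` is the regularisation strength itself, so it plays the opposite role to an inverse parameter such as C.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_l2(xs, ys, lam, lr=0.1, steps=3000):
    """Gradient descent on mean log loss + (lam / 2) * w**2 (bias left unpenalised)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [sigmoid(b + w * x) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n + lam * w
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical one-feature data with overlapping classes.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]

fitted = {lam: fit_l2(xs, ys, lam) for lam in (0.0, 0.1, 1.0)}
for lam, (w, b) in fitted.items():
    print(f"lam = {lam}: w = {w:.3f}, b = {b:.3f}")
```

The coefficient should shrink toward zero as `lam` grows. Choosing the value by cross-validation, as the paragraph above recommends, means repeating this fit on each training fold and scoring on the held-out fold.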

How we use logistic regression in practice

In our consulting engagements, logistic regression is almost never the final model — but it is almost always the first one. We treat it as a diagnostic. If a properly regularised logistic regression with thoughtful features cannot beat the baseline, the problem is usually in the data (labels, leakage, sample selection) rather than in the model class, and switching to a neural network will only hide the underlying issue. If logistic regression performs reasonably and a more complex model only marginally improves on it, we often ship the logistic regression — the operational cost of maintaining a transparent model is far lower than maintaining an opaque one.

We pair this with the usual practitioner discipline: clean train/validation/test splits with no temporal leakage, calibration checks (reliability diagrams, Brier score), and explicit threshold tuning against the business cost matrix rather than the default 0.5 cutoff. The default threshold is rarely the right one.
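Two of those checks are simple enough to sketch directly: the Brier score and a threshold search against a cost matrix. The labels, predicted probabilities, and costs below are hypothetical, and the helper names are invented for this example; in a real project the inputs would come from a held-out validation set.

```python
def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)

def total_cost(y_true, p_pred, threshold, cost_fp, cost_fn):
    """Sum of misclassification costs at a given decision threshold."""
    cost = 0.0
    for y, p in zip(y_true, p_pred):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 0:
            cost += cost_fp   # false positive
        elif predicted == 0 and y == 1:
            cost += cost_fn   # false negative
    return cost

def best_threshold(y_true, p_pred, cost_fp, cost_fn):
    """Grid search over thresholds 0.01 .. 0.99; ties go to the lowest threshold."""
    grid = [i / 100 for i in range(1, 100)]
    return min(grid, key=lambda t: total_cost(y_true, p_pred, t, cost_fp, cost_fn))

# Hypothetical validation labels and predicted probabilities.
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
p_pred = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.55, 0.70, 0.45, 0.90]

# Assumed costs: a missed positive is ten times worse than a false alarm.
cost_fp, cost_fn = 1.0, 10.0

brier = brier_score(y_true, p_pred)
t = best_threshold(y_true, p_pred, cost_fp, cost_fn)
print(f"Brier score = {brier:.3f}")
print(f"best threshold = {t:.2f}, cost = {total_cost(y_true, p_pred, t, cost_fp, cost_fn)}")
print(f"cost at 0.50 = {total_cost(y_true, p_pred, 0.5, cost_fp, cost_fn)}")
```

With these made-up numbers the cheapest threshold sits well below 0.5, because the one positive scored at 0.35 is expensive to miss.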

A short worked example

Consider a churn-prediction model with three features: months since last purchase, support tickets in the last 90 days, and average order value. After fitting on roughly 20,000 customers with a 12% churn rate, a typical logistic regression might yield:

Feature	Coefficient (log odds)	Interpretation
Months since last purchase	+0.42	Each additional month raises churn odds by ~52%
Support tickets (90 days)	+0.18	Each additional ticket raises churn odds by ~20%
Average order value (scaled)	−0.31	Each one-unit increase in scaled order value lowers churn odds by ~27%

Each row is a defensible statement. A product manager can read the model. A compliance reviewer can audit it. A data scientist can spot, immediately, that a coefficient with the wrong sign means something is wrong upstream — possibly a leakage issue, possibly a misencoded feature. That diagnostic property is not incidental; it is the whole reason this technique has survived sixty years of newer methods.
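The percentage readings in the table come from exponentiating each coefficient, since a one-unit change in a feature multiplies the odds by $e^{w}$. A short check, using the coefficients from the table (the dictionary keys are hypothetical feature names):

```python
import math

# Coefficients from the worked example; the key names are illustrative.
coefficients = {
    "months_since_last_purchase": 0.42,
    "support_tickets_90d": 0.18,
    "avg_order_value_scaled": -0.31,
}

odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    pct_change = (ratio - 1.0) * 100.0
    print(f"{name}: odds ratio = {ratio:.2f} ({pct_change:+.0f}% per unit)")
```

The odds ratios are about 1.52, 1.20, and 0.73. Note that these are changes in odds, not in probability; the effect on probability depends on where the customer already sits on the sigmoid.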

How TechnoLynx approaches classification problems

When clients come to us with a binary-decision problem — fraud, churn, default risk, lead qualification — we start by asking what the cost of being wrong looks like in each direction, and whether decisions need to be auditable. The answers usually point clearly to either a logistic-regression-class model with strong feature engineering, or a gradient-boosted approach with careful explainability tooling on top. We deliberately avoid defaulting to deep learning for problems where it adds complexity without measurable lift; in our experience that pattern is one of the more common sources of stalled ML projects.

Logistic regression is not glamorous. It is, however, the model that quietly continues to work after the hype cycle moves on — and that durability is worth taking seriously.

FAQ

Is logistic regression a regression or a classification algorithm? It is used as a classification algorithm despite its name. The “regression” refers to fitting a linear model to the log odds of the positive class, not to predicting a continuous target. The model outputs a probability, which is then thresholded into a class label.

Why use the sigmoid function specifically? The sigmoid is the inverse of the log-odds transformation, which means a linear model on the log odds becomes a probability between 0 and 1 after the sigmoid is applied. It also has a clean derivative that makes maximum likelihood estimation tractable with standard convex optimisers.

When should I use logistic regression instead of a neural network? When the data is tabular and mid-sized, the relationships between features and outcomes are roughly linear in log-odds space, and decisions need to be auditable. For image, text, or audio problems, or when feature interactions are dense and unknown, a neural network or gradient-boosted model is usually a better fit.

Does logistic regression handle multi-class problems? The multinomial extension does. It generalises the sigmoid to the softmax function and fits one set of coefficients per class. For problems with more than a handful of classes, or with ordered or hierarchical class structure, other model families often perform better.

What is the most common mistake when applying it? Trusting the default decision threshold of 0.5. The right threshold depends on the cost of false positives versus false negatives in your specific problem, and it is almost never 0.5 in practice. Calibrate first, then tune the threshold against the business cost matrix.

