Logistic Regression

A supervised learning model that applies the sigmoid function to a linear model, producing a probability estimate:

$$\hat{y} = \sigma(w^\top x + b) = \frac{1}{1 + e^{-(w^\top x + b)}}$$

where $w$ are weights, $b$ is a bias, and $\hat{y}$ is interpreted as $P(y = 1 \mid x)$.
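
As a minimal sketch of that forward pass in NumPy (the weights, bias, and inputs below are made-up illustrative values, not part of the original):

```python
import numpy as np

def sigmoid(z):
    # squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    # linear model followed by sigmoid: P(y = 1 | x) for each row of X
    return sigmoid(X @ w + b)

# illustrative values only
X = np.array([[0.5, 1.2], [-1.0, 0.3]])
w = np.array([0.8, -0.4])
b = 0.1
print(predict_proba(X, w, b))  # probabilities in (0, 1)
```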

Logistic regression is not a classification algorithm on its own - the output is a probability, a real value in $(0, 1)$. It becomes a classifier when combined with a decision rule, e.g., predict class 1 if $\hat{y} > 0.5$.
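
Continuing the sketch above, the classifier is just a threshold on that probability (the 0.5 cutoff is the usual default and can be tuned):

```python
def predict_class(X, w, b, threshold=0.5):
    # decision rule: class 1 if P(y = 1 | x) exceeds the threshold
    return (predict_proba(X, w, b) > threshold).astype(int)
```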

Why not just use linear regression for classification?

Linear regression outputs unbounded values - a point far from the decision boundary can receive a prediction well outside $[0, 1]$, which is meaningless as a class probability. The sigmoid squashes everything into $(0, 1)$, so the output can be read directly as $P(y = 1 \mid x)$, with $P(y = 0 \mid x) = 1 - P(y = 1 \mid x)$.

The key difference is the nonlinearity: logistic regression is essentially the simplest possible neural network - a linear model followed by one nonlinear activation function. Stack more layers with nonlinearities and you get a multi-layer perceptron. Sigmoid is the classic choice, but other functions work too.
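
A rough sketch of that view, assuming sigmoid activations throughout and placeholder layer sizes (not a prescribed architecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression(x, w, b):
    # one linear layer + one nonlinearity
    return sigmoid(w @ x + b)

def two_layer_mlp(x, W1, b1, w2, b2):
    # stack another linear layer + nonlinearity and it becomes an MLP
    h = sigmoid(W1 @ x + b1)      # hidden layer
    return sigmoid(w2 @ h + b2)   # output layer
```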

Optimization

Unlike linear regression, logistic regression has no closed-form solution. We can use gradient descent:

$$w \leftarrow w - \eta \, \nabla_w \mathcal{L}, \qquad b \leftarrow b - \eta \, \frac{\partial \mathcal{L}}{\partial b}$$

With cross-entropy loss, the problem is convex (single global minimum). MSE would make it non-convex: the sigmoid’s flat tails create near-zero gradients when predictions are confidently wrong, trapping gradient descent. Cross-entropy penalizes confident wrong predictions heavily, keeping gradients informative:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\left[\, y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \,\right]$$
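
A minimal sketch of that training loop in NumPy, using the standard cross-entropy gradient $\nabla_w \mathcal{L} = \frac{1}{N} X^\top(\hat{y} - y)$ (learning rate and iteration count are placeholder choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, n_iters=1000):
    # X: (N, d) features, y: (N,) labels in {0, 1}
    N, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_iters):
        y_hat = sigmoid(X @ w + b)          # predicted P(y = 1 | x)
        grad_w = X.T @ (y_hat - y) / N      # gradient of cross-entropy w.r.t. w
        grad_b = np.mean(y_hat - y)         # gradient w.r.t. b
        w -= lr * grad_w                    # gradient descent update
        b -= lr * grad_b
    return w, b
```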

Multi-class extension

For $K$ classes, replace sigmoid with softmax:

$$P(y = k \mid x) = \frac{e^{w_k^\top x + b_k}}{\sum_{j=1}^{K} e^{w_j^\top x + b_j}}$$

Each class $k$ gets its own weight vector $w_k$ and bias $b_k$. Softmax ensures the probabilities sum to 1 across all classes.
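
A quick sketch of the softmax forward pass (NumPy, placeholder shapes), with the usual max-subtraction trick for numerical stability:

```python
import numpy as np

def softmax(z):
    # subtract the row max for numerical stability; each row then sums to 1
    z = z - np.max(z, axis=-1, keepdims=True)
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

def predict_proba_multiclass(X, W, b):
    # W: (d, K) one weight vector per class, b: (K,) one bias per class
    return softmax(X @ W + b)   # (N, K), rows sum to 1
```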