Cross-entropy is an information-theoretic quantity that measures the average cost of describing data drawn from one probability distribution using a code that was optimized for a different distribution. It builds directly on the entropy that Claude Shannon defined in his 1948 paper "A Mathematical Theory of Communication". Shannon’s source coding theorem shows that no lossless code for a source can have an average length below the source’s entropy, and that the best codes come arbitrarily close to that bound when long sequences are encoded together; cross-entropy asks what happens when you use the wrong code.
The answer is that the average length grows. If the true distribution of the data is P but you encode it with a scheme designed for a guessed distribution Q, the expected number of bits per symbol is the cross-entropy between P and Q, which is always at least as large as the true entropy of P and equals it only when Q matches P. The gap between the two is exactly the Kullback-Leibler divergence from P to Q, the penalty for using the wrong model.
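The relationship described above, cross-entropy equals entropy plus Kullback-Leibler divergence, can be checked numerically. The following is a minimal sketch in Python, assuming base-2 logarithms (so the results are in bits) and two made-up distributions over four symbols; the function names and the example values are illustrative and not taken from any particular library.

```python
import math

def entropy(p):
    """Shannon entropy H(P) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Cross-entropy H(P, Q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(P || Q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical example distributions over four symbols.
p = [0.5, 0.25, 0.125, 0.125]  # true distribution P
q = [0.25, 0.25, 0.25, 0.25]   # guessed distribution Q (uniform)

h_p = entropy(p)            # 1.75 bits: best achievable average length
h_pq = cross_entropy(p, q)  # 2.0 bits: average length with the code built for Q
d_pq = kl_divergence(p, q)  # 0.25 bits: the extra cost of the wrong model

print(h_p, h_pq, d_pq)
```

In this example the code built for the uniform guess spends 2 bits on every symbol, while a code matched to P would average 1.75 bits, and the 0.25-bit difference is the divergence.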
This idea has become the default training objective for classification in machine learning. When a classifier outputs a predicted probability distribution over labels, the cross-entropy between the true label distribution and the predicted one measures how surprised the model is, on average, by the correct answer.
Minimizing cross-entropy therefore pushes a model’s predicted probabilities toward the truth, which is why it is the loss function behind most modern classifiers and language models. A general reader can read it as a precise score for how badly a model’s guesses miss reality: confident wrong guesses are penalized most heavily, and a smaller score means the model is placing more probability on what actually happens.
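The classifier case can be illustrated with a small sketch. When the true label distribution puts all its probability on a single correct class, the cross-entropy reduces to the negative logarithm of the probability the model gave to that class. The example below assumes natural logarithms (so the loss is in nats rather than bits) and three hypothetical predictions over three classes; the function name and the numbers are illustrative only.

```python
import math

def cross_entropy_loss(predicted, true_index):
    """Cross-entropy in nats against a one-hot true label:
    the negative log of the probability assigned to the correct class."""
    return -math.log(predicted[true_index])

true_index = 0  # the correct class in this hypothetical example

# Hypothetical predicted distributions over three classes.
confident_right = [0.90, 0.05, 0.05]
unsure = [0.40, 0.30, 0.30]
confident_wrong = [0.05, 0.90, 0.05]

losses = [
    cross_entropy_loss(pred, true_index)
    for pred in (confident_right, unsure, confident_wrong)
]

# Roughly 0.105, 0.916 and 2.996 nats respectively.
print(losses)
```

The loss rises steeply as the probability on the correct class shrinks, so a confident wrong prediction costs far more than an unsure one.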