Cross-entropy measures how many bits (on average) are needed to encode events from distribution P using a code optimized for distribution Q:

H(P,Q) = -∑ P(x) log₂ Q(x)

When P represents the true data distribution and Q the model's predicted distribution, cross-entropy quantifies the inefficiency of encoding with the wrong distribution: the cost is never less than H(P) bits and grows as Q drifts away from P. Lower values indicate closer agreement between the predicted and true distributions.
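
To make the definition concrete, here is a minimal sketch in Python that evaluates the formula for two small discrete distributions. The probabilities are illustrative assumptions, not values from the text; encoding P with its own optimal code yields H(P), while using Q's code yields a larger number of bits on average.

```python
import math

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) * log2 Q(x), in bits."""
    return -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.25, 0.25]   # "true" distribution P (assumed for the example)
q = [0.7, 0.2, 0.1]     # model's predicted distribution Q (assumed)

print(cross_entropy(p, p))  # H(P) = 1.5 bits: P encoded with its own optimal code
print(cross_entropy(p, q))  # H(P, Q) ≈ 1.67 bits: extra cost of using Q's code
```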

Applications in Machine Learning:

  • Classification Loss: Cross-entropy loss trains neural networks to output probability distributions matching true class labels (see the sketch after this list)
  • Natural Language Processing: Measuring model performance in next-token prediction tasks
  • Information Retrieval: Evaluating relevance rankings in search algorithms
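
As a sketch of the classification-loss use case, the snippet below computes the loss from raw model logits and integer class labels using plain NumPy rather than any specific deep-learning framework. The logits and labels are hypothetical, and the natural logarithm is used, as is conventional for training losses; it differs from the log₂ form above only by a constant factor.

```python
import numpy as np

def softmax(logits):
    """Convert logits to probability distributions (numerically stable)."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy_loss(logits, labels):
    """Mean negative log-probability assigned to the true class."""
    probs = softmax(logits)
    true_class_probs = probs[np.arange(len(labels)), labels]
    return -np.mean(np.log(true_class_probs))

logits = np.array([[2.0, 0.5, -1.0],   # batch of 2 examples, 3 classes (assumed values)
                   [0.1, 0.2, 3.0]])
labels = np.array([0, 2])              # true class indices (assumed)
print(cross_entropy_loss(logits, labels))  # lower means predictions match labels better
```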