Activation functions are mathematical functions applied to the output of neurons in a neural network. They introduce non-linearity into the model, enabling it to learn complex patterns and make decisions based on the input data.

Think of activation functions as "switches" that determine whether a neuron should be activated (or "fire") based on its input. Different activation functions have different shapes, which affect how the network learns and generalizes.
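
To see why the non-linearity matters: stacking purely linear layers collapses into a single linear transformation, so without an activation function a deep network is no more expressive than one layer. Below is a minimal NumPy sketch of this; the layer sizes and random weights are arbitrary choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))      # a small batch of 4 inputs with 3 features
W1 = rng.normal(size=(3, 5))     # weights of a first "layer"
W2 = rng.normal(size=(5, 2))     # weights of a second "layer"

# Two linear layers with no activation collapse into one linear map:
two_linear = (x @ W1) @ W2
single_linear = x @ (W1 @ W2)
print(np.allclose(two_linear, single_linear))   # True: no extra expressive power

# Inserting a non-linearity (ReLU here) breaks that equivalence:
relu = lambda z: np.maximum(0.0, z)
with_activation = relu(x @ W1) @ W2
print(np.allclose(with_activation, single_linear))  # False in general
```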

Common activation functions include the following; a short code sketch of each appears after the list:

  • ReLU (Rectified Linear Unit): The workhorse of modern neural networks. It outputs the input directly when it is positive and zero otherwise. Benefits include computational efficiency and helping mitigate the vanishing gradient problem. Ideal for hidden layers in most networks, especially CNNs.
  • Sigmoid: Maps inputs to values between 0 and 1, creating a smooth S-shaped curve. Historically popular but prone to vanishing gradients in deep networks. Best used for binary classification output layers or for gates within specialized architectures like LSTMs.
  • Tanh (Hyperbolic Tangent): Similar to sigmoid but maps inputs to values between -1 and 1, making the outputs zero-centered. This often leads to faster convergence. Useful for hidden layers in recurrent networks and cases where negative outputs are meaningful.
  • Softmax: Converts a vector of values into a probability distribution that sums to 1. Essential for multi-class classification output layers, where each neuron represents the probability of a specific class.
  • Leaky ReLU: A variation of ReLU that allows a small gradient when the input is negative, helping prevent "dead neurons". A useful alternative to standard ReLU when dealing with sparse data.
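
To make the shapes above concrete, here is a minimal NumPy sketch of each function. The sample input values and the Leaky ReLU slope of 0.01 are illustrative assumptions, not fixed conventions.

```python
import numpy as np

def relu(z):
    # Passes positive inputs through unchanged, clamps negatives to zero.
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # Like ReLU, but lets a small fraction of negative inputs through,
    # so the gradient is never exactly zero and neurons are less likely to "die".
    return np.where(z > 0, z, alpha * z)

def sigmoid(z):
    # Squashes any real input into the range (0, 1): a smooth S-shaped curve.
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Squashes inputs into (-1, 1), keeping outputs zero-centered.
    return np.tanh(z)

def softmax(z):
    # Turns a vector of scores into probabilities that sum to 1.
    # Subtracting the max first keeps the exponentials numerically stable.
    exps = np.exp(z - np.max(z))
    return exps / np.sum(exps)

z = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print("relu:      ", relu(z))
print("leaky relu:", leaky_relu(z))
print("sigmoid:   ", sigmoid(z))
print("tanh:      ", tanh(z))
print("softmax:   ", softmax(z), "sums to", softmax(z).sum())
```

In practice, deep learning frameworks ship these as built-in operations; the sketch is only meant to show how simple each mapping is.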