Knowledge Distillation

Knowledge distillation is a model compression technique in which a smaller "student" model is trained to mimic the behavior of a larger "teacher" model. The teacher's soft targets (full probability distributions over classes) carry richer information than one-hot hard labels, because they reveal how the teacher ranks the incorrect classes as well as the correct one, enabling the student to reach comparable performance with fewer parameters and less computation.

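To make the idea concrete, here is a minimal sketch of a distillation loss in PyTorch. The temperature, weighting factor, and tensor shapes are illustrative assumptions, not prescribed values; the core pattern is a KL-divergence term on temperature-softened distributions combined with an ordinary cross-entropy term on the hard labels.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Blend a soft-target loss (from the teacher) with a hard-label loss.

    temperature > 1 softens both distributions so the student can see the
    teacher's relative preferences among wrong classes; alpha balances the
    two terms. Both defaults here are illustrative, not canonical.
    """
    # Soft targets: KL divergence between the softened teacher and student
    # distributions. Scaling by T^2 keeps gradient magnitudes comparable
    # across different temperatures.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # Hard targets: standard cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)

    return alpha * soft_loss + (1.0 - alpha) * hard_loss


# Usage sketch: a batch of 8 examples over 10 classes. In practice the
# teacher runs under torch.no_grad() and only the student is updated.
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```
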
Imagine a master chef teaching an apprentice. Rather than making the apprentice repeat every experiment that shaped the craft, the master passes on refined techniques and shortcuts, so the apprentice achieves similar results without the same years of trial and error.