How to Shrink Large AI Models Using Knowledge Distillation
A method for teaching small, efficient AI models to mimic the complex decision-making patterns of much larger, more powerful neural networks.
Patent Number: US 10289962
Status: Active
Filing Date: June 4, 2015
Grant Date: May 14, 2019
Expiration: ~June 2035 (estimated)
Claims: 23
Assignee: Google LLC
Inventors: Oriol Vinyals, Geoffrey E. Hinton, Jeffrey A. Dean
Citations: 4 forward · 3 backward
What it covers
This patent describes a process called knowledge distillation. First, a large 'cumbersome' model is trained on a dataset to learn complex patterns. Then a smaller 'distilled' model is trained not just to predict the correct answer, but to match the probability distribution (the 'soft outputs') that the large model assigns across all possible answers. Computing those soft outputs with a 'temperature constant' higher than 1 flattens the distribution, so the small probabilities the large model gives to incorrect answers become large enough to influence training. The relationships between those incorrect answers carry more information than a simple right-or-wrong label. As a result, the smaller model can reach performance close to that of the large model while being much faster and lighter, which makes it practical for mobile devices.
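The training objective described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the patent's claimed implementation: the function names, the default `temperature=4.0`, and the `soft_weight` that blends the soft-output term with the ordinary hard-label term are all assumptions chosen for the example.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature; T > 1 flattens the result."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, predicted):
    """Cross-entropy between a target distribution and a predicted one."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

def distillation_loss(teacher_logits, student_logits, true_class,
                      temperature=4.0, soft_weight=0.9):
    """Hypothetical combined loss: soft-output matching plus a hard-label term."""
    # Soft term: the student matches the teacher's softened distribution.
    soft_targets = softmax(teacher_logits, temperature)
    soft_pred = softmax(student_logits, temperature)
    soft_loss = cross_entropy(soft_targets, soft_pred)
    # Hard term: ordinary cross-entropy against the correct class at T = 1.
    hard_pred = softmax(student_logits, 1.0)
    hard_loss = -math.log(hard_pred[true_class])
    # Assumed weighting; the blend is a tunable choice, not a fixed rule.
    return soft_weight * soft_loss + (1.0 - soft_weight) * hard_loss

# Illustrative logits for a 3-class problem (made-up numbers).
teacher = [5.0, 2.0, -1.0]
matched = distillation_loss(teacher, [5.0, 2.0, -1.0], true_class=0)
mismatched = distillation_loss(teacher, [-1.0, 2.0, 5.0], true_class=0)
print(matched < mismatched)
```

A student whose logits agree with the teacher's gets a lower loss than one that ranks the classes in the opposite order, which is the signal a training loop would minimize.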
What it doesn't cover
- Does not cover training models from scratch without a pre-existing cumbersome model.
- Does not cover hardware-specific optimization techniques like model quantization or pruning.
- Does not cover methods where the distilled model is trained using only hard labels (e.g., just the correct class) instead of soft outputs.
- Does not cover architectures where the distilled model has more parameters than the cumbersome model.
The clever bit
The innovation is using a 'temperature' parameter to soften the output distribution, which reveals the 'dark knowledge': the subtle hints about how similar the big model considers different categories to be.
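To make the softening effect concrete, the sketch below applies a temperature-scaled softmax to one set of made-up logits. The class names, the logit values, and the choice of T = 5 are hypothetical and serve only to show the shape of the effect.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature; T > 1 flattens the result."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits from a large model shown a picture of a dog.
classes = ["dog", "cat", "car"]
logits = [8.0, 5.0, 0.0]

sharp = softmax(logits, temperature=1.0)  # nearly all mass on "dog"
soft = softmax(logits, temperature=5.0)   # mass spread to the other classes

for name, p_sharp, p_soft in zip(classes, sharp, soft):
    print(f"{name}: T=1 -> {p_sharp:.3f}, T=5 -> {p_soft:.3f}")
```

At T = 1 the probability for "cat" is only about 0.05 and "car" is close to zero, so a student sees little beyond the winning class. At T = 5 the ranking is unchanged, but "cat" rises to roughly 0.31 while "car" stays well below it, exposing the large model's view that a dog looks more like a cat than like a car.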
Why it matters
This technique is fundamental to modern AI deployment. It allows companies like Google to run sophisticated language models and image classifiers on smartphones and edge devices that lack the massive computing power required by the original, cumbersome models. It effectively bridges the gap between research-grade supercomputing and consumer-grade hardware.
Real-world examples
1. Mobile versions of Google Translate
2. On-device voice recognition on Android phones
3. Lightweight image classification models for mobile apps
Generated by PatentBrief · Not legal advice · patentbrief.org