Submitted by fredlafrite t3_106no9h in MachineLearning
xenotecc t1_j3v3gr0 wrote
Reply to comment by suflaj in [D] Have you ever used Knowledge Distillation in practice? by fredlafrite
How small do you make the student when the teacher is, say, ResNet-101? How do you find a good student/teacher size ratio?
Are there any tricks to knowledge distillation, or is it just the standard vanilla procedure?
suflaj t1_j3vg5tm wrote
I think it's a matter of trial and error. The best ratios I've seen were 1:25, but those concerned transformer networks, which are much sparser than ResNets.
There are some tricks, but they depend on the model. E.g., for transformers, it's not enough to just imitate the last layer. I suspect it's the same for ResNets, given that they're deep residual networks just like transformers.
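For anyone reading later, here is a minimal sketch of what "vanilla" KD plus one intermediate-layer imitation trick can look like in PyTorch. The temperature, loss weights, feature dimensions, and the learned projection layer are illustrative assumptions, not something stated in this thread.

```python
# Minimal knowledge-distillation sketch (hypothetical values throughout).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Vanilla (Hinton-style) KD: soft-target KL + hard-label cross-entropy."""
    # Softened teacher/student distributions; the T^2 factor keeps the
    # gradient scale comparable to the hard-label term.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

def feature_matching_loss(student_feats, teacher_feats, proj):
    """Intermediate-layer imitation: MSE between projected student features
    and teacher features, i.e. one of the 'not just the last layer' tricks."""
    return F.mse_loss(proj(student_feats), teacher_feats)

# Example usage with dummy tensors standing in for one training step.
if __name__ == "__main__":
    batch, num_classes = 8, 10
    student_logits = torch.randn(batch, num_classes, requires_grad=True)
    teacher_logits = torch.randn(batch, num_classes)
    labels = torch.randint(0, num_classes, (batch,))

    # Hypothetical feature widths: student narrower than teacher.
    student_feats = torch.randn(batch, 128, requires_grad=True)
    teacher_feats = torch.randn(batch, 512)
    proj = torch.nn.Linear(128, 512)  # learned projection to match widths

    loss = distillation_loss(student_logits, teacher_logits, labels) \
         + 0.1 * feature_matching_loss(student_feats, teacher_feats, proj)
    loss.backward()
    print(loss.item())
```

In practice the student, teacher, projection, and loss weights would come from your own training setup; this only shows how the two loss terms combine.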
xenotecc t1_j3vp75c wrote
Thank you for the reply!