Submitted by fredlafrite t3_106no9h in MachineLearning
suflaj t1_j3igfzr wrote
Yes, it's the only way to get high-throughput, high-performance models ATM.
With KD and TensorRT you can get close to 100x throughput (compared to eager TF/PyTorch on the full model) with about a 1% performance hit on some models and tasks.
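For reference, a minimal sketch of what vanilla logit distillation looks like in PyTorch, assuming a generic teacher/student pair; the function name, temperature, and loss weighting below are illustrative choices, not from any specific library:

```python
# Sketch of vanilla soft-target (logit) distillation. Names and hyperparameters
# (T, alpha) are illustrative assumptions, not taken from the thread.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    # Soft-target term: KL divergence between temperature-softened distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradient magnitude stays comparable across temperatures
    # Hard-target term: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

if __name__ == "__main__":
    # Dummy example: 8 samples, 10 classes. The teacher's logits would normally
    # come from a frozen, eval-mode teacher under torch.no_grad().
    student_logits = torch.randn(8, 10, requires_grad=True)
    teacher_logits = torch.randn(8, 10)
    labels = torch.randint(0, 10, (8,))
    loss = distillation_loss(student_logits, teacher_logits, labels)
    loss.backward()
    print(loss.item())
```

The TensorRT speedup on top of this comes from exporting the distilled student and running it with fused, lower-precision kernels; the distillation itself only changes the training objective.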
nmfisher t1_j3l4ipq wrote
Echoing this, KD is also very useful for taking a heavyweight GPU model and training a student model that's light enough to run on mobile. Small sacrifice in quality for huge performance gains.
fredlafrite OP t1_j3l65ju wrote
Interesting! Following up on this, do you know at what kinds of companies one could work on this in an applied setting?
Think_Olive_1000 t1_j3qynuz wrote
Neural Magic does work in this space, though I'm not sure about KD specifically.
xenotecc t1_j3v3gr0 wrote
How small do you make the student when the teacher is, say, ResNet101? How do you find a good student/teacher size ratio?
Are there any tricks to knowledge distillation, or is it just the standard vanilla procedure?
suflaj t1_j3vg5tm wrote
I think it's a matter of trial and error. The best ratios I've seen were 1:25, but those were for transformer networks, which are much sparser than ResNets.
There are some tricks, but they depend on the model. E.g., for transformers, it's not enough to just imitate the last layer. I suspect the same holds for ResNets, given that they're deep residual networks just like transformers.
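To illustrate what "imitating more than the last layer" could look like in practice, here is a hedged sketch of hidden-state matching in PyTorch (in the spirit of TinyBERT-style intermediate distillation); the layer pairing, the learned projection, and the dimensions are assumptions made for illustration, not the commenter's exact recipe:

```python
# Sketch of an intermediate-layer distillation term: match (projected) student
# hidden states to teacher hidden states with MSE. Pairing and dims are illustrative.
import torch
import torch.nn as nn

class HiddenStateDistiller(nn.Module):
    def __init__(self, student_dim, teacher_dim):
        super().__init__()
        # Project student hidden states up to the teacher's width before comparing.
        self.proj = nn.Linear(student_dim, teacher_dim)
        self.mse = nn.MSELoss()

    def forward(self, student_hiddens, teacher_hiddens):
        # Both arguments: lists of [batch, seq, dim] tensors, already paired up
        # (e.g. every 4th teacher layer matched to one student layer).
        loss = 0.0
        for s, t in zip(student_hiddens, teacher_hiddens):
            loss = loss + self.mse(self.proj(s), t)
        return loss / len(student_hiddens)

if __name__ == "__main__":
    distiller = HiddenStateDistiller(student_dim=256, teacher_dim=768)
    s_h = [torch.randn(2, 16, 256) for _ in range(3)]  # 3 student layers
    t_h = [torch.randn(2, 16, 768) for _ in range(3)]  # 3 matched teacher layers
    print(distiller(s_h, t_h).item())
```

In practice this term is usually added to the logit-distillation loss above with its own weight, which is one more knob to tune by trial and error.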
xenotecc t1_j3vp75c wrote
Thank you for the reply!