nmfisher t1_j62y29r wrote

Slight tangent - has anyone ever tried "fine-tuning" a large speech recognition model (e.g. Whisper) by feeding it a training set and pruning activations? The idea is that only a subset of weights/activations is necessary for a given speaker/dataset, so you could compress a large model into a smaller one that performs equally well on that subset of data (and then continue training it conventionally). Presumably this would require some degree of sparsity to begin with?
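
A minimal sketch of the idea using PyTorch's built-in pruning utilities; the tiny `model` here is a stand-in for a loaded Whisper checkpoint, and the 50% pruning amount is an arbitrary assumption:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Stand-in for a loaded Whisper checkpoint (which is mostly
# attention/MLP Linear layers).
model = nn.Sequential(nn.Linear(80, 512), nn.GELU(), nn.Linear(512, 512))

# Zero out the smallest-magnitude 50% of Linear weights globally,
# on the assumption that only a subset matters for one speaker.
params_to_prune = [
    (m, "weight") for m in model.modules() if isinstance(m, nn.Linear)
]
prune.global_unstructured(
    params_to_prune,
    pruning_method=prune.L1Unstructured,
    amount=0.5,
)

# The masks hold pruned weights at zero while the survivors keep
# training on the speaker-specific data; afterwards the masks can
# be baked in to get an ordinary (sparse) state dict.
for module, name in params_to_prune:
    prune.remove(module, name)
```

In practice you'd probably derive the pruning criterion from the speaker-specific data (e.g. a gradient-based saliency score) rather than plain weight magnitude, but the mask-then-continue-training loop is the same.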

29

_Ruffy_ t1_j635zdc wrote

Good idea in principle, anyone know more about this or any references?

5

anony_sci_guy t1_j63nj0u wrote

This was exactly my first thought too - free up all those extra parameters and re-randomize them. The catch is that there could be a big distributional gap between the pre-tuned weights and the re-randomized ones, so you'd want different step sizes for the two groups. I've played with it before and ran into exactly this problem, but got too lazy to actually implement a solution. (I'm actually a biologist, so I don't really have the bandwidth to dig into the ML side as much.)
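
A rough sketch of one way to get those different step sizes, assuming a `mask` marking the re-randomized entries and an arbitrary 100x gradient boost (both made up for illustration):

```python
import torch
import torch.nn as nn

layer = nn.Linear(512, 512)  # stand-in for one pre-tuned layer

# Pretend half the entries were pruned and then re-randomized.
mask = torch.rand_like(layer.weight) < 0.5
with torch.no_grad():
    layer.weight[mask] = torch.randn(int(mask.sum())) * 0.02

base_lr, boost = 1e-5, 100.0  # assumed values

# Per-element "step sizes": scale the gradients of the re-randomized
# entries so they effectively train with a much larger learning rate.
def scale_fresh_grads(grad):
    return torch.where(mask, grad * boost, grad)

layer.weight.register_hook(scale_fresh_grads)
optimizer = torch.optim.SGD(layer.parameters(), lr=base_lr)
```

Note this only behaves like a per-element learning rate with plain SGD; an adaptive optimizer like Adam would largely normalize the scaling away.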

3

starfries t1_j64qhqa wrote

Can you elaborate on this? I'm trying something similar, so I'm curious what your results were and if you ran across any literature about this idea.

2

anony_sci_guy t1_j681trq wrote

Yeah, there is some stuff published out there. It's related to pruning (A link to a ton of papers on it); the lottery ticket method handles this well, because you're re-training from scratch, just with a "lucky" selection of the initialized weights. Results-wise, I never got anything to improve, because of the distributional changes caused by re-randomizing a subset of weights in the middle of training. I still saw the same level of performance as without re-randomizing, but that basically just showed that the way I was re-randomizing was neither helping nor hurting, b/c those neurons weren't important...
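
For anyone following along, a bare-bones sketch of the lottery-ticket recipe being described (prune after training, then rewind the surviving weights to their initial values); the toy model and 80% pruning rate are assumptions:

```python
import copy
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 10))
init_state = copy.deepcopy(model.state_dict())  # snapshot at init

# ... train `model` to convergence here ...

# Keep only the largest 20% of trained weights, then rewind the
# survivors to their initial values: the "lucky" subnetwork is
# re-trained from scratch under the same mask.
for name, module in model.named_modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.8)
        with torch.no_grad():
            module.weight_orig.copy_(init_state[f"{name}.weight"])
```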

2

starfries t1_j6l0aeq wrote

Thanks for that resource - I've been experimenting with the lottery ticket method, but that's a lot of papers I haven't seen! Did you initialize the weights as if training from scratch, or did you do something like matching the variance of the old and new weights? I'm intrigued that your method didn't hurt performance - most of the things I've tested were detrimental to the network. I have seen some performance improvements under different conditions, but I'm still trying to rule out confounding factors.
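
For concreteness, the variance-matching option mentioned above might look something like this hypothetical helper (the name and the mean/std matching are my own illustration, not anything from the papers):

```python
import torch

def rerandomize_matched(weight: torch.Tensor, mask: torch.Tensor) -> None:
    """Re-init the pruned entries (mask == True) so their scale matches
    the surviving trained weights rather than the init-time distribution."""
    with torch.no_grad():
        survivors = weight[~mask]
        weight[mask] = (
            torch.randn(int(mask.sum())) * survivors.std() + survivors.mean()
        )
```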

1

anony_sci_guy t1_j6mr4k6 wrote

Glad it helped! The first thing I tried was just re-initializing as at the beginning of training, but I don't remember how much I dug into modifying it before moving on. That's great that you're seeing some improvements, though! Would love to hear how the rest of your experiment goes!! =)

2

ApprehensiveNature69 t1_j651pux wrote

Yep! This is a known technique - if you search for "sparse fine-tuning", lots of papers show up; it's a very valid approach.

2