alkibijad OP t1_j6wuvnc wrote
Reply to comment by TheDeviousPanda in [D] Apple's ane-transformers - experiences? by alkibijad
Can you please elaborate on your answers and quantify them?
I'm most interested in the effort for bullets 2 and 3. In your own experience, did it take hours, days, or weeks?
alkibijad OP t1_j6w7lo3 wrote
Reply to comment by TheDeviousPanda in [D] Apple's ane-transformers - experiences? by alkibijad
That was not the answer I was hoping for, but it's very helpful :)
Do you have any code/repo to share? I can only find the DistilBERT implementation in Apple's repo, and I'd like to see some other examples.
alkibijad OP t1_j462x0f wrote
Reply to comment by suflaj in [D] Is there a distilled/smaller version of CLIP, or something similar? by alkibijad
I was hoping to just fine-tune the model and let the training last days at most. It seems my best option is to wait for distilled Stable Diffusion and use its CLIP encoder, as u/LetterRip mentions.
alkibijad OP t1_j462o4r wrote
Reply to comment by LetterRip in [D] Is there a distilled/smaller version of CLIP, or something similar? by alkibijad
Cool, I wasn't aware of distilled diffusion! That could be useful, thanks for sharing!
alkibijad t1_j42c8g3 wrote
I think it's going to be everywhere, but mostly Bing and Office products. Those are things where it can have an immediate impact.
alkibijad t1_j14x08t wrote
Reply to comment by nuthinbutneuralnet in [D] Simple Questions Thread by AutoModerator
This may not be a direct answer, but it's applicable to many problems:
1. Use the simplest approach first. Here that would be a simple model, e.g. a single flat fully connected layer (a minimal sketch follows this list).
2. Measure the results.
3. If the results aren't good enough, think about what could improve them: a different model architecture, a different training procedure, obtaining more data...
4. Iterate (go back to 2).
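Here's a minimal sketch of that flat fully connected baseline in PyTorch; the input size, hidden size, batch size, and number of classes are just placeholders for illustration:

```python
import torch
import torch.nn as nn

# Placeholder dimensions: ~1K input features, binary target.
N_FEATURES = 1024
N_CLASSES = 2

# Simplest baseline: feed the full flat feature vector through
# a small fully connected network.
baseline = nn.Sequential(
    nn.Linear(N_FEATURES, 256),
    nn.ReLU(),
    nn.Linear(256, N_CLASSES),
)

x = torch.randn(32, N_FEATURES)  # a batch of 32 examples
logits = baseline(x)             # shape: (32, N_CLASSES)
```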
Also:
`creating linear or embedding layers for each feature group before combining them together` - this adds extra prior knowledge to the network, so it may help... but in theory the network should be able to figure this out on its own: combinations that don't make much sense will end up with weights close to zero. That's why I'd advise starting without it, and only adding it later if the flat baseline isn't good enough (a rough sketch of the grouped variant is below).
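For comparison, here's a rough sketch of the per-group variant; the group sizes and projection width are made up for illustration:

```python
import torch
import torch.nn as nn

class GroupedModel(nn.Module):
    """Projects each (hypothetical) feature group separately
    before combining them, as discussed above."""

    def __init__(self, group_sizes=(400, 400, 224), group_dim=64, n_classes=2):
        super().__init__()
        self.group_sizes = group_sizes
        # One linear projection per feature group.
        self.group_projections = nn.ModuleList(
            nn.Linear(size, group_dim) for size in group_sizes
        )
        self.head = nn.Linear(group_dim * len(group_sizes), n_classes)

    def forward(self, x):
        # Split the flat feature vector into its groups,
        # project each one, then concatenate and classify.
        groups = torch.split(x, self.group_sizes, dim=-1)
        projected = [proj(g) for proj, g in zip(self.group_projections, groups)]
        return self.head(torch.cat(projected, dim=-1))

model = GroupedModel()
logits = model(torch.randn(32, 1024))  # shape: (32, 2)
```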
1K+ features: in some cases that's a lot of features, in others it's not that big a number... but it may make sense to reduce the feature count with a dimensionality reduction technique (see the sketch below).
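For dimensionality reduction, something as simple as PCA from scikit-learn is a reasonable first try; the data shapes and variance threshold here are only examples:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 10k samples with ~1K features each.
X = np.random.randn(10_000, 1024)

# Keep enough components to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)  # (10_000, number_of_kept_components)
```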
alkibijad OP t1_j7f1b2e wrote
Reply to comment by vade in [D] Apple's ane-transformers - experiences? by alkibijad
Looking forward to hearing their experiences!