baffo32

baffo32 t1_jdrhj77 wrote

I was still confused by your response. What I'm thinking is that if you wanted a model to behave as if it had been given different pretraining data, you would probably first finetune on the different bulk data, and only then finetune on the target task, such as instruction following.

Instruction following is of course still just predicting the next word: on data where the next word is obedient to the instructions preceding it.
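The two-stage idea can be sketched with a toy next-word model (everything below is invented for illustration; real finetuning would be gradient descent on a neural network, not bigram counting):

```python
from collections import Counter, defaultdict

# Toy sketch: stage 1 finetunes a bigram next-word model on bulk text,
# stage 2 continues on instruction-formatted text, so "instruction
# following" is literally next-word prediction on data where the next
# word obeys the preceding instruction.

class BigramModel:
    def __init__(self):
        self.counts = defaultdict(Counter)

    def finetune(self, corpus):
        for sentence in corpus:
            words = sentence.split()
            for prev, nxt in zip(words, words[1:]):
                self.counts[prev][nxt] += 1

    def predict_next(self, word):
        if not self.counts[word]:
            return None
        return self.counts[word].most_common(1)[0][0]

model = BigramModel()
# stage 1: bulk "different pretraining" data
model.finetune(["the cat sat on the mat", "the dog sat on the rug"])
# stage 2: instruction-style data, where the next word is obedient
model.finetune(["INSTRUCTION say hello RESPONSE hello",
                "INSTRUCTION say hello RESPONSE hello"])
print(model.predict_next("RESPONSE"))  # "hello"
```

The order matters: the bulk data shapes the model's general behavior first, and the smaller task-specific data adjusts it last.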

1

baffo32 t1_jcronvh wrote

- offloading and accelerating (moving some parts to memory-mapped disk or GPU RAM; this can also make for quicker loading)

- pruning (removing parts of the model that didn’t end up impacting outputs after training)

- further quantization below 4 bits

- distilling to a mixture of experts?

- factoring and distilling parts out into heuristic algorithms?

- finetuning to specific tasks (e.g. distilling/pruning out all information related to irrelevant languages or domains); this would likely make it very small

EDIT:

- numerous techniques published in papers over the past few years

- distilling into an architecture not limited by e.g. a constraint of being feed-forward
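A minimal sketch of the magnitude-pruning item from the list above (pure NumPy; the function name and threshold choice are my own assumptions, not any particular library's API):

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.5):
    """Zero out the smallest-magnitude fraction of weights.

    After training, many weights barely impact the outputs; magnitude
    pruning simply drops the smallest ones. A real pipeline would
    re-check accuracy and often finetune afterwards.
    """
    flat = np.abs(weights).ravel()
    k = int(flat.size * sparsity)
    if k == 0:
        return weights.copy()
    # the k-th smallest absolute value becomes the cutoff
    threshold = np.partition(flat, k - 1)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

w = np.array([[0.1, -2.0], [0.05, 3.0]])
print(magnitude_prune(w, sparsity=0.5))  # zeros the 0.1 and 0.05 entries
```

The zeroed weights can then be stored in a sparse format, which is where the size savings come from.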

3

baffo32 t1_j8zbmua wrote

DRY is a very basic software engineering principle: include only one copy of every sequence of code. It looks like machine learning people did not learn this, as they weren't trained as software engineers. DRY stands for "don't repeat yourself"; when it is not respected, software becomes harder and slower to maintain, improve, or bugfix the larger and older it gets.
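A toy before/after illustration of the principle (the example is invented and has nothing to do with any real codebase):

```python
# Not DRY: the same normalization logic pasted twice. A bugfix or
# improvement now has to be applied in two places.
def normalize_scores(xs):
    total = sum(xs)
    return [x / total for x in xs]

def normalize_weights(ws):
    total = sum(ws)
    return [w / total for w in ws]

# DRY: one copy of the logic, reused everywhere it is needed.
def normalize(values):
    """Scale values so they sum to 1 (single shared implementation)."""
    total = sum(values)
    return [v / total for v in values]

print(normalize([1, 3]))  # [0.25, 0.75]
```

The duplicated version looks harmless at two copies; at hundreds of copies, as in a large model library, every change multiplies.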

2

baffo32 t1_j8vsq9s wrote

It looks like there is emotional or funded influence here: counterintuitive votes, strange statements stated as facts.

Duplicated code makes for a very _unhackable project_, because one has to learn the code-duplication systems and add functionality to every copy for each change that would otherwise be a single factored edit. It does make for _hackable examples_, but the codebase doesn't seem to know where to draw the line at all.

The library looks like it was made entirely without an experienced lead software engineer. As a corporation they should have one.


HuggingFace, please understand that software developers find DRY to be hackable; the two terms usually go together. Stating it the other way around reads like a contradiction, like fake news trying to manipulate people by ignoring facts.

4

baffo32 t1_j8vrc4d wrote

HuggingFace recently implemented a PEFT library that reimplements the core functionality of AdapterHub. AdapterHub had reached out to them to contribute and integrate their work, but this failed in February of last year ( https://github.com/adapter-hub/adapter-transformers/issues/65#issuecomment-1031983053 ). When Hugging Face was asked how the new work related to the old, it was sad to see that they had done it completely independently, ignoring the past outreach ( https://github.com/huggingface/peft/issues/92#issuecomment-1431227939 ). The reply reads to me as if they are implementing the same featureset, unaware that it is the same one.

I would like to know why this didn't go better. The person who spearheaded AdapterHub for years appears to be one of the most prominent PEFT researchers, with published papers. It looks as if they were tossed out into the snow. I can only imagine that management never learned of the outreach, or, equally likely, that they have no idea how to work with other projects to refactor concepts from multiple codebases together, or don't consider doing so worthwhile. It would have been nice to see at least lip service paid.

The library and hub are not complex. Is there a community alternative conducive to code organization, or do we need to start yet another?

Sometimes I think it would make sense to train language models to transform the code, organize it, and merge things, using techniques like LangChain and ChatGPT, to integrate future work into a more organized system.

Projects where everyone can work together are best.

6