CKtalon t1_jctb1c0 wrote
Reply to Best GPUs for pretraining roBERTa-size LLMs with a $50K budget, 4x RTX A6000 v.s. 4x A6000 ADA v.s. 2x A100 80GB by AngrEvv
Do not be tricked by memory pooling. NVLink might not really improve performance on the A6000s by much (different case for the A100s).
I think it will be a tough choice between 2x A100 and 4x RTX 6000 Ada.
CKtalon t1_jc9hm91 wrote
Reply to [D] Choosing Cloud vs local hardware for training LLMs. What's best for a small research group? by PK_thundr
I don't think a $40K budget can get you a machine with 256GB of VRAM. It's barely enough for 8x RTX 6000 Ada, and that's ignoring the high-end workstation/server-grade CPU and motherboard you would need to support 8 cards.
CKtalon t1_jbnccl7 wrote
Reply to [D] Is it possible to train LLaMa? by New_Yak1645
If you have a few thousand A100s, sure? The dataset is fairly easily obtainable.
The next difficulty is the technical knowhow to train such LLMs.
CKtalon t1_jbdjaxa wrote
Reply to comment by Taenk in [D] Can someone explain the discrepancy between the findings of LLaMA and Chinchilla? by __Maximum__
Instead of choosing a huge model and having it undertrained because of a limited compute budget, use their estimates to choose the biggest model that your compute budget can train properly. It doesn't necessarily mean that a small model trained on a larger dataset will naturally beat a bigger model.
CKtalon t1_jbaogg3 wrote
Reply to [D] Can someone explain the discrepancy between the findings of LLaMA and Chinchilla? by __Maximum__
Chinchilla just says, for a given amount of compute, what the optimal amount of training data is to give the best bang for your buck. It doesn't mean the model converges to its 'best performance' once it reaches the Chinchilla-optimal token count. Ergo, you can keep training if you have plenty of budget.
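As a rough illustration of that trade-off, here is a back-of-the-envelope sketch using the commonly cited Chinchilla rules of thumb (training compute C ≈ 6·N·D FLOPs and roughly 20 tokens per parameter); the constants are approximations, not the paper's exact fit.

```python
# Back-of-the-envelope Chinchilla-style sizing.
# Assumed rules of thumb: training compute C ~= 6 * N * D FLOPs,
# and a compute-optimal token count D ~= 20 * N.
def chinchilla_optimal(compute_flops: float) -> tuple[float, float]:
    """Return (params, tokens) that roughly balance a given compute budget."""
    # With D = 20 * N:  C = 6 * N * (20 * N) = 120 * N^2  =>  N = sqrt(C / 120)
    n_params = (compute_flops / 120) ** 0.5
    n_tokens = 20 * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    # Roughly Chinchilla itself: 70B params trained on 1.4T tokens.
    compute = 6 * 70e9 * 1.4e12
    n, d = chinchilla_optimal(compute)
    print(f"{compute:.2e} FLOPs -> ~{n / 1e9:.0f}B params, ~{d / 1e12:.1f}T tokens")
```

Past that point, extra tokens still help the fixed-size model; they are just no longer the compute-optimal way to spend the FLOPs.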
CKtalon t1_jacill4 wrote
Since you are working on smaller experiments, get a single 4090. NVLink is overhyped.
CKtalon t1_jabpgds wrote
Reply to [D] Training transformer on RTX2060 by ahiddenmessi2
A ~10^7-10^8 parameter model should be possible.
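For a rough sense of why that range fits, here is a sketch using the usual fp32 Adam memory rule of thumb (~16 bytes per parameter for weights, gradients, and optimizer state, activations excluded); the numbers are assumptions, not measurements on an RTX 2060.

```python
# Rough training-memory estimate, assuming fp32 Adam:
# ~16 bytes/parameter (4 weights + 4 grads + 8 optimizer state),
# ignoring activations, which scale with batch size and sequence length.
def training_memory_gb(n_params: float, bytes_per_param: float = 16.0) -> float:
    return n_params * bytes_per_param / 1e9


for n_params in (1e7, 1e8):
    print(f"{n_params:.0e} params -> ~{training_memory_gb(n_params):.1f} GB before activations")

# An RTX 2060 has 6 GB of VRAM, so ~1e8 params (~1.6 GB of states) still leaves
# room for activations at modest batch sizes and sequence lengths.
```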
CKtalon t1_jabg9b5 wrote
Reply to comment by Etterererererer in [P] [R] Neural Network in Fortran! by Etterererererer
Fortran is pretty much a dead language, though. People still use Fortran for their DFT computations only because no one has ported those codes to a modern language. Since you are picking up a new language anyway, just pick up Python to get the required ecosystem support, or you will have a lot of trouble finding help.
CKtalon t1_jabc6cm wrote
Reply to [P] [R] Neural Network in Fortran! by Etterererererer
Sounds like these were written back in the days when GPUs didn't exist. I believe a consumer GPU these days can beat any supercomputer from that era.
CKtalon t1_j9r2k9j wrote
Reply to [P] What are the latest "out of the box solutions" for deploying the very large LLMs as API endpoints? by johnhopiler
Probably FasterTransformer with Triton Inference Server.
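For illustration, a minimal sketch of querying such a Triton endpoint with the standard `tritonclient` HTTP API; the model name and the input/output tensor names and dtypes are assumptions that depend on how your FasterTransformer model repository is configured.

```python
# Minimal sketch of querying a Triton Inference Server endpoint over HTTP.
# The model name and tensor names/dtypes below are assumptions; they depend
# entirely on the model repository's config.pbtxt.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

input_ids = np.array([[1, 2, 3, 4]], dtype=np.uint32)   # tokenized prompt (placeholder)
request_len = np.array([[64]], dtype=np.uint32)         # tokens to generate

inputs = [
    httpclient.InferInput("input_ids", input_ids.shape, "UINT32"),
    httpclient.InferInput("request_output_len", request_len.shape, "UINT32"),
]
inputs[0].set_data_from_numpy(input_ids)
inputs[1].set_data_from_numpy(request_len)

result = client.infer(model_name="fastertransformer", inputs=inputs)
print(result.as_numpy("output_ids"))  # output tensor name is also config-dependent
```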
CKtalon t1_j9kfuhx wrote
Reply to comment by buzzz_buzzz_buzzz in Bummer: nVidia stopping support for multi-gpu peer to peer access with 4090s by mosalreddit
Funny how the RTX 6000 Ada doesn't have NVLink either.
CKtalon t1_j8e0j48 wrote
Reply to comment by N3urAlgorithm in GPU comparisons: RTX 6000 ADA vs Hopper h100 by N3urAlgorithm
You’ll have to use something like DeepSpeed to split the layers across multiple GPUs. Of course, if the model can fit on one GPU, then you can go crazier with bigger batch sizes.
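As a sketch of what that looks like, here is a minimal DeepSpeed ZeRO stage 3 setup, which shards parameters, gradients, and optimizer state across GPUs; the model, batch size, and learning rate are placeholders, not recommendations.

```python
# Minimal sketch: sharding a model across GPUs with DeepSpeed ZeRO stage 3,
# which partitions parameters, gradients, and optimizer state across ranks.
# Launch with something like `deepspeed train.py`; everything below is a placeholder.
import deepspeed
import torch

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "zero_optimization": {"stage": 3},
}

model = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16),
    num_layers=24,
)

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
# engine.forward / engine.backward / engine.step then replace the usual training loop calls.
```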
CKtalon t1_j8dbtpk wrote
The RTX 6000 Ada has no NVLink. Speed-wise, 2x RTX 6000 Ada should be roughly equal to 1x H100, judging by last gen's A6000 vs A100. 4x RTX 6000 Ada should be faster and has more VRAM than a single H100.
One thing to take note of is the likely lack of a Tensor Memory Accelerator on the RTX 6000 Ada, which is present on the H100, if you plan on training FP8 models.
CKtalon t1_j7j14ab wrote
Reply to Wouldn’t it be a good idea to bring a more energy efficient language into the ML world to reduce the insane costs a bit?[D] by thedarklord176
Most inference/MLOps solutions don't really use Python, even though Python is used to develop the model.
Stuff like Nvidia's Triton Inference Server is used for the speedup.
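One common way this plays out is exporting the Python-trained model to a portable format and serving it from a non-Python runtime; below is a minimal sketch with `torch.onnx.export`, using a placeholder model and shapes.

```python
# Sketch: export a PyTorch model to ONNX so inference can run outside Python,
# e.g. in ONNX Runtime (C++/C#/Java) or behind Triton. Model and shapes are placeholders.
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
).eval()
dummy_input = torch.randn(1, 128)

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["logits"],
    dynamic_axes={"input": {0: "batch"}, "logits": {0: "batch"}},
)
```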
CKtalon t1_j7eba6d wrote
Whatever Meta has put out in the past year has been fairly disappointing compared to what's already available: OPT, NLLB, Galactica. It probably advanced the field with the knowledge gleaned from producing these models, but for production they all feel half-baked and lacking polish. It was like they were rushing something out to meet some KPI.
So yes, I find LeCun is being petty, given that his team can't seem to deliver something 'good' to the general public.
CKtalon t1_j70ht51 wrote
Basically job scopes will change due to the boost in efficiency.
The mediocre in any field will potentially be kicked out or priced out by AI.
More domain experts will be needed to vet the AI output and guide the improvement of AI (using RLHF) for probably decades to come. Generalists will likely be replaced by AI with time.
CKtalon t1_j695owv wrote
Reply to comment by NoFairYouCheated in [R] SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot by Secure-Technology-78
No. There are blog posts about it performing quite badly: https://www.surgehq.ai/blog/how-good-is-hugging-faces-bloom-a-real-world-human-evaluation-of-language-models
Then based on the Chinchilla paper, you can kind of infer that it's a result of undertraining.
CKtalon t1_j682dxf wrote
Reply to comment by PleasantBase6967 in [D] Laptop recommendations for ML by PleasantBase6967
Don’t bother. MPS support is terrible; TensorFlow's GPU support is better in comparison.
However, the MBA is good for fast and efficient CPU prototyping, which you can then ship off to a Linux workstation or a cloud instance with discrete Nvidia GPUs.
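A small sketch of keeping one prototype script portable across those machines, picking CUDA, MPS, or CPU depending on what's available (placeholder model and data):

```python
# Sketch: pick whichever accelerator the current machine offers, so the same
# prototype runs on an M-series MacBook (MPS), a CUDA workstation, or plain CPU.
import torch

if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

model = torch.nn.Linear(16, 4).to(device)
x = torch.randn(8, 16, device=device)
print(device, model(x).shape)
```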
CKtalon t1_j62n9yw wrote
Reply to comment by data-drone in [R] SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot by Secure-Technology-78
About 10-12 times more than the tokens seen.
CKtalon t1_j62hmsr wrote
Reply to [R] SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot by Secure-Technology-78
Before people get their hopes up, BLOOM and OPT are known to be seriously undertrained (not Chinchilla-optimal, BLOOM more so than OPT), so it’s possible that most of the weights were useless to begin with. The results of this paper seem to imply that.
CKtalon t1_j62c6t5 wrote
Reply to comment by madmax_br5 in [D] Moving away from Unicode for more equal token representation across global languages? by madmax_br5
GPT can already model multiple languages with a 30k vocabulary, just at the cost of a high token count per (non-English) word. So increasing it to 200k will ease most of the burden. It definitely won't bring other languages completely to parity with English, since there's ultimately a hard limit on each language's corpus.
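To see that per-word token cost concretely, here is a hedged sketch comparing an English-centric BPE tokenizer against a large-vocabulary multilingual one on an arbitrary non-English sentence; the specific checkpoints and text are just illustrations.

```python
# Sketch: token cost of the same non-English sentence under an English-centric
# BPE tokenizer (gpt2) vs a large-vocabulary multilingual one (xlm-roberta-base,
# ~250k vocab). The sentence and checkpoints are arbitrary illustrations.
from transformers import AutoTokenizer

text = "机器学习正在改变世界"  # "Machine learning is changing the world"

for name in ("gpt2", "xlm-roberta-base"):
    tokenizer = AutoTokenizer.from_pretrained(name)
    ids = tokenizer.encode(text, add_special_tokens=False)
    print(f"{name}: {len(ids)} tokens")
```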
CKtalon t1_j625s3n wrote
Reply to [D] Moving away from Unicode for more equal token representation across global languages? by madmax_br5
The tokenizer just saw a predominantly English corpus, so it naturally tokenized most common English words as single tokens and left words from other languages split into subwords.
They could increase the vocabulary size to something like 250,000 from the current 30+k, but that would require retraining.
CKtalon t1_j5y9deu wrote
Reply to comment by manubfr in Few questions about scalability of chatGPT [D] by besabestin
There's also the rumor mill that Whisper was used to gather a bigger text corpus from videos to train GPT-4.
CKtalon t1_j5y87e5 wrote
Reply to comment by manubfr in Few questions about scalability of chatGPT [D] by besabestin
People often quote Chinchilla about performance, claiming that there's still a lot of performance to be unlocked, when we do not actually know how GPT-3.5 was trained. GPT-3.5 could very well be Chinchilla-optimal, even though the first version of davinci was not. We know OpenAI has retrained GPT-3, given the context length going from 2048 to 4096 to the apparent 8000ish tokens for ChatGPT.
CKtalon t1_jdhkdgc wrote
Reply to Cuda out of memory error by Rishh3112
Code seems fine unless your batch size is too huge. Try running on CPU and see how much RAM is used and debug from there?
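A minimal sketch of that kind of debugging on the GPU side, sweeping batch sizes and logging peak memory with PyTorch's built-in counters (the tiny model here is a placeholder for the code in question):

```python
# Sketch: measure peak GPU memory for one training step at several batch sizes,
# to see how close each one gets to the VRAM limit.
import torch

device = torch.device("cuda")
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 10),
).to(device)
loss_fn = torch.nn.CrossEntropyLoss()

for batch_size in (8, 64, 512):
    torch.cuda.reset_peak_memory_stats(device)
    x = torch.randn(batch_size, 1024, device=device)
    y = torch.randint(0, 10, (batch_size,), device=device)
    loss_fn(model(x), y).backward()
    model.zero_grad(set_to_none=True)
    peak_gb = torch.cuda.max_memory_allocated(device) / 1e9
    print(f"batch {batch_size}: peak {peak_gb:.2f} GB")
```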