CKtalon t1_jctb1c0 wrote
Reply to Best GPUs for pretraining roBERTa-size LLMs with a $50K budget, 4x RTX A6000 v.s. 4x A6000 ADA v.s. 2x A100 80GB by AngrEvv
Do not be tricked by memory pooling. NVLink might not really improve performance on the A6000s by much (different case for the A100s).
I think it will be a tough choice between 2x A100 and 4x RTX 6000 Ada.
CKtalon t1_jc9hm91 wrote
Reply to [D] Choosing Cloud vs local hardware for training LLMs. What's best for a small research group? by PK_thundr
I don't think a $40K budget can get you a machine with 256GB of VRAM. It's barely enough for 8x RTX 6000 Ada, and that's ignoring the high-end workstation/server-grade CPU and motherboard you would need to support 8 cards.
CKtalon t1_jbnccl7 wrote
Reply to [D] Is it possible to train LLaMa? by New_Yak1645
If you have a few thousand A100s, sure? The dataset is fairly easily obtainable.
The next difficulty is the technical knowhow to train such LLMs.
CKtalon t1_jbdjaxa wrote
Reply to comment by Taenk in [D] Can someone explain the discrepancy between the findings of LLaMA and Chinchilla? by __Maximum__
Instead of choosing a huge model and having it undertrained because of a limited compute budget, use their estimates to choose the biggest model that your compute budget can train properly. It doesn't necessarily mean that a small model trained on a larger dataset will naturally beat a bigger model.
CKtalon t1_jbaogg3 wrote
Reply to [D] Can someone explain the discrepancy between the findings of LLaMA and Chinchilla? by __Maximum__
Chinchilla just says, for a given amount of compute, what the optimal amount of training data is to give the best bang for your buck. It doesn't mean the model converges to its 'best performance' once it reaches the Chinchilla-optimal token count. Ergo, you can keep training if you have plenty of budget.
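As a rough illustration of that trade-off, here is a back-of-the-envelope sketch using the commonly cited Chinchilla rules of thumb (training compute C ≈ 6·N·D FLOPs and roughly 20 tokens per parameter); the constants are approximations, not the paper's exact fit.

```python
# Back-of-the-envelope Chinchilla-style sizing.
# Assumed rules of thumb: training compute C ~= 6 * N * D FLOPs,
# and a compute-optimal token count D ~= 20 * N.
def chinchilla_optimal(compute_flops: float) -> tuple[float, float]:
    """Return (params, tokens) that roughly balance a given compute budget."""
    # With D = 20 * N:  C = 6 * N * (20 * N) = 120 * N^2  =>  N = sqrt(C / 120)
    n_params = (compute_flops / 120) ** 0.5
    n_tokens = 20 * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    # Roughly Chinchilla itself: 70B params trained on 1.4T tokens.
    compute = 6 * 70e9 * 1.4e12
    n, d = chinchilla_optimal(compute)
    print(f"{compute:.2e} FLOPs -> ~{n / 1e9:.0f}B params, ~{d / 1e12:.1f}T tokens")
```

Past that point, extra tokens still help the fixed-size model; they are just no longer the compute-optimal way to spend the FLOPs.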
CKtalon t1_jacill4 wrote
Since you are working on smaller experiments, get a single 4090. NVLink is overhyped.
CKtalon t1_jabpgds wrote
Reply to [D] Training transformer on RTX2060 by ahiddenmessi2
A ~10^7-10^8 parameter model should be possible.
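For a rough sense of why that range fits, here is a sketch using the usual fp32 Adam memory rule of thumb (~16 bytes per parameter for weights, gradients, and optimizer state, activations excluded); the numbers are assumptions, not measurements on an RTX 2060.

```python
# Rough training-memory estimate, assuming fp32 Adam:
# ~16 bytes/parameter (4 weights + 4 grads + 8 optimizer state),
# ignoring activations, which scale with batch size and sequence length.
def training_memory_gb(n_params: float, bytes_per_param: float = 16.0) -> float:
    return n_params * bytes_per_param / 1e9


for n_params in (1e7, 1e8):
    print(f"{n_params:.0e} params -> ~{training_memory_gb(n_params):.1f} GB before activations")

# An RTX 2060 has 6 GB of VRAM, so ~1e8 params (~1.6 GB of states) still leaves
# room for activations at modest batch sizes and sequence lengths.
```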
CKtalon t1_jabg9b5 wrote
Reply to comment by Etterererererer in [P] [R] Neural Network in Fortran! by Etterererererer
Fortran is pretty much a dead language, though. People still use Fortran for their DFT computations only because no one has ported those codes to a modern language. Since you are picking up a new language anyway, just pick up Python to get the required ecosystem support, or you will have a lot of trouble finding help.
CKtalon t1_jabc6cm wrote
Reply to [P] [R] Neural Network in Fortran! by Etterererererer
Sounds like these were written back in the days when GPUs didn't exist. I believe a consumer GPU these days can beat any supercomputer from that era.
CKtalon t1_j9r2k9j wrote
Reply to [P] What are the latest "out of the box solutions" for deploying the very large LLMs as API endpoints? by johnhopiler
Probably FasterTransformer with Triton Inference Server.
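For illustration, a minimal sketch of querying such a Triton endpoint with the standard `tritonclient` HTTP API; the model name and the input/output tensor names and dtypes are assumptions that depend on how your FasterTransformer model repository is configured.

```python
# Minimal sketch of querying a Triton Inference Server endpoint over HTTP.
# The model name and tensor names/dtypes below are assumptions; they depend
# entirely on the model repository's config.pbtxt.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

input_ids = np.array([[1, 2, 3, 4]], dtype=np.uint32)   # tokenized prompt (placeholder)
request_len = np.array([[64]], dtype=np.uint32)         # tokens to generate

inputs = [
    httpclient.InferInput("input_ids", input_ids.shape, "UINT32"),
    httpclient.InferInput("request_output_len", request_len.shape, "UINT32"),
]
inputs[0].set_data_from_numpy(input_ids)
inputs[1].set_data_from_numpy(request_len)

result = client.infer(model_name="fastertransformer", inputs=inputs)
print(result.as_numpy("output_ids"))  # output tensor name is also config-dependent
```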
CKtalon t1_j9kfuhx wrote
Reply to comment by buzzz_buzzz_buzzz in Bummer: nVidia stopping support for multi-gpu peer to peer access with 4090s by mosalreddit
Funny how the RTX 6000 Ada doesn't have NVLink either.
CKtalon t1_j8e0j48 wrote
Reply to comment by N3urAlgorithm in GPU comparisons: RTX 6000 ADA vs Hopper h100 by N3urAlgorithm
You’ll have to use something like DeepSpeed to split the layers across multiple GPUs. Of course, if the model can fit on one GPU, then you can go crazier with bigger batch sizes.
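As a sketch of what that looks like, here is a minimal DeepSpeed ZeRO stage 3 setup, which shards parameters, gradients, and optimizer state across GPUs; the model, batch size, and learning rate are placeholders, not recommendations.

```python
# Minimal sketch: sharding a model across GPUs with DeepSpeed ZeRO stage 3,
# which partitions parameters, gradients, and optimizer state across ranks.
# Launch with something like `deepspeed train.py`; everything below is a placeholder.
import deepspeed
import torch

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "zero_optimization": {"stage": 3},
}

model = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16),
    num_layers=24,
)

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
# engine.forward / engine.backward / engine.step then replace the usual training loop calls.
```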
CKtalon t1_j8dbtpk wrote
The RTX 6000 Ada has no NVLink. Speed-wise, 2x RTX 6000 Ada should be roughly equal to 1x H100, judging by last gen's A6000 vs A100. 4x RTX 6000 Ada should be faster and has more VRAM than a single H100.
One thing to take note of is the likely lack of a Tensor Memory Accelerator on the RTX 6000 Ada, which is present on the H100, if you plan on training FP8 models.
CKtalon t1_j7j14ab wrote
Reply to Wouldn’t it be a good idea to bring a more energy efficient language into the ML world to reduce the insane costs a bit?[D] by thedarklord176
Most inference/MLOps solutions don't really use Python, even though Python is used to develop the model.
Stuff like Nvidia's Triton Inference Server is used for the speedup.
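One common way this plays out is exporting the Python-trained model to a portable format and serving it from a non-Python runtime; below is a minimal sketch with `torch.onnx.export`, using a placeholder model and shapes.

```python
# Sketch: export a PyTorch model to ONNX so inference can run outside Python,
# e.g. in ONNX Runtime (C++/C#/Java) or behind Triton. Model and shapes are placeholders.
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
).eval()
dummy_input = torch.randn(1, 128)

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["logits"],
    dynamic_axes={"input": {0: "batch"}, "logits": {0: "batch"}},
)
```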
CKtalon t1_j7eba6d wrote
Whatever Meta has put out in the past year has been fairly disappointing compared to what's already available: OPT, NLLB, Galactica. It probably advanced the field with the knowledge gleaned from producing these models, but for production they all feel half-baked and lacking polish. It was like they were rushing something out to meet some KPI.
So yes, I find LeCun is being petty, given that his team can't seem to deliver something 'good' to the general public.
CKtalon t1_j70ht51 wrote
Basically job scopes will change due to the boost in efficiency.
The mediocre in any field will potentially be kicked out or priced out by AI.
More domain experts will be needed to vet the AI output and guide the improvement of AI (using RLHF) for probably decades to come. Generalists will likely be replaced by AI with time.
CKtalon t1_j695owv wrote
Reply to comment by NoFairYouCheated in [R] SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot by Secure-Technology-78
No. There are blog posts about it performing quite badly: https://www.surgehq.ai/blog/how-good-is-hugging-faces-bloom-a-real-world-human-evaluation-of-language-models
Then based on the Chinchilla paper, you can kind of infer that it's a result of undertraining.
CKtalon t1_j682dxf wrote
Reply to comment by PleasantBase6967 in [D] Laptop recommendations for ML by PleasantBase6967
Don’t bother. MPS support is terrible; TensorFlow's GPU support is better in comparison.
However, the MBA is good for fast and efficient CPU prototyping, which you can then ship off to a Linux workstation or a cloud instance with discrete Nvidia GPUs.
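A small sketch of keeping one prototype script portable across those machines, picking CUDA, MPS, or CPU depending on what's available (placeholder model and data):

```python
# Sketch: pick whichever accelerator the current machine offers, so the same
# prototype runs on an M-series MacBook (MPS), a CUDA workstation, or plain CPU.
import torch

if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

model = torch.nn.Linear(16, 4).to(device)
x = torch.randn(8, 16, device=device)
print(device, model(x).shape)
```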
CKtalon t1_j62n9yw wrote
Reply to comment by data-drone in [R] SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot by Secure-Technology-78
About 10-12 times more than the tokens seen.
CKtalon t1_j62hmsr wrote
Reply to [R] SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot by Secure-Technology-78
Before people get their hopes up, BLOOM and OPT are known to be seriously undertrained (not Chinchilla-optimal, BLOOM more so than OPT), so it’s possible that most of the weights were useless to begin with. The results of this paper seem to imply that.
CKtalon t1_j62c6t5 wrote
Reply to comment by madmax_br5 in [D] Moving away from Unicode for more equal token representation across global languages? by madmax_br5
GPT can already model multiple languages with a 30k vocabulary, just at the cost of a high token count per (non-English) word. So increasing it to 200k will ease most of the burden. It definitely won't bring other languages completely to parity with English, since there's ultimately a hard limit on each language's corpus.
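To see that per-word token cost concretely, here is a hedged sketch comparing an English-centric BPE tokenizer against a large-vocabulary multilingual one on an arbitrary non-English sentence; the specific checkpoints and text are just illustrations.

```python
# Sketch: token cost of the same non-English sentence under an English-centric
# BPE tokenizer (gpt2) vs a large-vocabulary multilingual one (xlm-roberta-base,
# ~250k vocab). The sentence and checkpoints are arbitrary illustrations.
from transformers import AutoTokenizer

text = "机器学习正在改变世界"  # "Machine learning is changing the world"

for name in ("gpt2", "xlm-roberta-base"):
    tokenizer = AutoTokenizer.from_pretrained(name)
    ids = tokenizer.encode(text, add_special_tokens=False)
    print(f"{name}: {len(ids)} tokens")
```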
CKtalon t1_j625s3n wrote
Reply to [D] Moving away from Unicode for more equal token representation across global languages? by madmax_br5
The tokenizer just saw a predominantly English corpus, so it naturally tokenized most common English words as single tokens and left words from other languages split into subwords.
They could increase the vocabulary size to something like 250,000 from the current 30+k, but that would require retraining.
CKtalon t1_j5y9deu wrote
Reply to comment by manubfr in Few questions about scalability of chatGPT [D] by besabestin
There's also the rumor mill that Whisper was used to gather a bigger text corpus from videos to train GPT-4.
CKtalon t1_j5y87e5 wrote
Reply to comment by manubfr in Few questions about scalability of chatGPT [D] by besabestin
People often quote Chinchilla about performance, claiming that there's still a lot of performance to be unlocked, when we do not actually know how GPT-3.5 was trained. GPT-3.5 could very well be Chinchilla-optimal, even though the first version of davinci was not. We know OpenAI has retrained GPT-3, given the context length going from 2048 to 4096 to the apparent 8000ish tokens for ChatGPT.
CKtalon t1_jdhkdgc wrote
Reply to Cuda out of memory error by Rishh3112
Code seems fine unless your batch size is too huge. Try running on CPU and see how much RAM is used and debug from there?
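A minimal sketch of that kind of debugging on the GPU side, sweeping batch sizes and logging peak memory with PyTorch's built-in counters (the tiny model here is a placeholder for the code in question):

```python
# Sketch: measure peak GPU memory for one training step at several batch sizes,
# to see how close each one gets to the VRAM limit.
import torch

device = torch.device("cuda")
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 10),
).to(device)
loss_fn = torch.nn.CrossEntropyLoss()

for batch_size in (8, 64, 512):
    torch.cuda.reset_peak_memory_stats(device)
    x = torch.randn(batch_size, 1024, device=device)
    y = torch.randint(0, 10, (batch_size,), device=device)
    loss_fn(model(x), y).backward()
    model.zero_grad(set_to_none=True)
    peak_gb = torch.cuda.max_memory_allocated(device) / 1e9
    print(f"batch {batch_size}: peak {peak_gb:.2f} GB")
```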