Submitted by fraktall t3_11b4bim in singularity
Some notes from HN user saurabh20n:
Quick notes from first glance at paper https://research.facebook.com/publications/llama-open-and-ef...:
* All variants were trained on 1T-1.4T tokens, which is a good token count relative to their parameter counts by the Chinchilla scaling rule. Code is 4.5% of the training data (similar to others). [Table 2]
* They note the GPU hours as 82,432 (7B model) to 1,022,362 (65B model). [Table 15] GPU-hour rates vary, but taking a range of $1 to $4, the 7B model would have cost ~$82-330k and the 65B something in the range of ~$1-4M (rough arithmetic in the sketch after these notes). They also note their total time spent for all models: "we used 2048 A100-80GB for a period of approximately 5 months" [sec 6, pg 10]
* The 65B model's performance is broadly comparable to PaLM-540B. No small feat, though it could also reflect the benefits of a good model-size-to-token-count ratio [Tables 3,4,5,6]. Their conjecture for underperforming on MMLU (multitask language understanding) relative to PaLM-540B and Chinchilla-70B is a smaller fraction of books and academic data in the training set.
* Math and code tasks: on math tasks they are substantially worse than Minerva (comparing their 65B to Minerva 62B; they lose hands down to Minerva 540B) [Table 7]. On code tasks they are broadly competitive with PaLM-540B (HumanEval and MBPP evals) [Table 8]
* Surprising that instruction fine-tuning occupies such a small part of the paper (sec 4, pg. 7)
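For anyone who wants to sanity-check the cost and Chinchilla numbers above, here's a back-of-the-envelope Python sketch. The $1-$4 per GPU-hour range and the ~20-tokens-per-parameter Chinchilla rule of thumb are assumptions of this comment, not figures from the paper; only the GPU hours and token counts come from Tables 15 and 2.

```python
# Back-of-the-envelope check of the training-cost and Chinchilla-ratio claims above.
# Assumptions: $1-$4 per A100-hour (a guess), ~20 tokens/parameter as a rough
# Chinchilla-optimal rule of thumb (not from the LLaMA paper).

GPU_HOURS = {"LLaMA-7B": 82_432, "LLaMA-65B": 1_022_362}   # Table 15
TOKENS    = {"LLaMA-7B": 1.0e12, "LLaMA-65B": 1.4e12}      # Table 2
PARAMS    = {"LLaMA-7B": 7e9,    "LLaMA-65B": 65e9}

LOW_RATE, HIGH_RATE = 1.0, 4.0          # assumed $/GPU-hour
CHINCHILLA_TOKENS_PER_PARAM = 20        # rough rule of thumb

for name in GPU_HOURS:
    cost_lo = GPU_HOURS[name] * LOW_RATE
    cost_hi = GPU_HOURS[name] * HIGH_RATE
    optimal_tokens = PARAMS[name] * CHINCHILLA_TOKENS_PER_PARAM
    print(f"{name}: compute cost ~${cost_lo:,.0f}-${cost_hi:,.0f}; "
          f"trained on {TOKENS[name]:.1e} tokens vs ~{optimal_tokens:.1e} Chinchilla-optimal")
```

Running this gives ~$82k-$330k for the 7B and ~$1.0M-$4.1M for the 65B, and shows the 65B at 1.4T tokens is close to the ~1.3T a 20-tokens-per-parameter rule would suggest, while the 7B is trained far past that point.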
Announcement link: https://ai.facebook.com/blog/large-language-model-llama-meta-ai/
Easyldur t1_j9wpkdf wrote
Thanks for the detailed explanation.
This could follow the path of Stable Diffusion: a smaller, open-source model comparable in performance to the bigger DALL-E, which in turn gave birth to the more-than-exceptional Midjourney.
Let's see!