Comments


ActuatorMaterial2846 t1_je8e3lg wrote

So what happens is they compile a dataset, basically a big dump of data. For large language models, that is mostly text: books, websites, social media comments. Essentially as many written words as possible.

The training is done through what's called a neural network using something called a transformer architecture, which is a bunch of GPUs (graphics processing units) linked together. What happens in the neural network whilst training is a bit of a mystery; 'black box' is a term often used, as the calculations involved are extremely complex. So not even the researchers understand exactly what happens here.

Once the training is complete, it's compiled into a program, often referred to as a model. These programs can then be refined and tweaked to operate a particular way for public release.

This is a very very simple explanation and I'm sure there's an expert who can explain it better, but in a nutshell that's what happens.

25

Not-Banksy OP t1_je8ez6o wrote

Thanks for the explanation, much appreciated!

14

ActuatorMaterial2846 t1_je8fqgw wrote

No worries. I'll also point out that the magic behind all this is the transformer architecture in particular. This is the real engine behind LLMs and other models.

10

FlyingCockAndBalls t1_je8h4g3 wrote

what is so special about the transformer architecture?

6

ActuatorMaterial2846 t1_je8ik1t wrote

It's actually quite technical, but essentially, the transformer architecture helps each part of the sentence “talk” to all the other parts at the same time. This way, each part can understand what the whole sentence is about and what it means.
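
To make that a bit more concrete, here's a rough sketch of the "scaled dot-product attention" at the heart of a transformer, in plain NumPy (toy sizes, random weights, no training; just to show every token attending to every other token at once):

    import numpy as np

    np.random.seed(0)
    seq_len, d_model = 4, 8                  # 4 tokens, 8-dim embeddings (toy sizes)
    x = np.random.randn(seq_len, d_model)    # stand-in for the token embeddings

    # Learned projections (random here; a real model trains these)
    Wq, Wk, Wv = (np.random.randn(d_model, d_model) for _ in range(3))
    Q, K, V = x @ Wq, x @ Wk, x @ Wv

    # Every token scores its relevance to every other token at the same time
    scores = Q @ K.T / np.sqrt(d_model)

    # Softmax turns the scores into attention weights that sum to 1 per token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    # Each token's new representation mixes in information from the whole sentence
    out = weights @ V
    print(out.shape)                         # (4, 8): same shape, now context-aware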

Here is the paper that imo changed the world 6 years ago and is the reason for the current state of AI.

https://arxiv.org/abs/1706.03762

If it goes over your head (it did for me), ask bing or chatgpt to summarise it for you. It helped me get my head around this stuff, as I'm in no way an expert nor do I study this field.

11

turnip_burrito t1_je8i45w wrote

"Attention mechanism" makes it good at predicting new words from past ones.

The paper that introduced the attention mechanism is called Attention Is All You Need.

9

Zermelane t1_je8lss0 wrote

Better parallelism in training, and a more direct way to reference past information, than in RNNs (recurrent neural networks) which seemed like the "obvious" way to process text before transformers came by.

These days we have RNN architectures that can achieve transformer-like training parallelism, the most interesting-looking one being RWKV. They are still badly disadvantaged when needing information directly from the past, for instance to repeat a name that's been mentioned before, but they have other advantages, and their performance gets close enough to transformers that it could be just a question of scaling exponents which architecture ends up winning out.

3

abudabu t1_je9ixnd wrote

The GPUs aren’t actually connected together physically. The transformer architecture is entirely in software. The software uses GPUs to do matrix calculations efficiently.

Specifically, the transformer architecture is a bunch of large matrices connected together with arithmetic operations. The training process shows it a sequence of words and sees if it correctly predicts the next word. It figures out how “wrong” the prediction is and updates the matrices so that the prediction will be slightly more right next time. This is a very high level description of “back propagation”.
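
If it helps, here's a toy sketch of that predict-measure-update loop in PyTorch (made-up five-word vocabulary and corpus; a real LLM does the same thing, just with billions of parameters):

    import torch
    import torch.nn as nn

    vocab = ["the", "cat", "sat", "on", "mat"]          # toy vocabulary
    ids = {w: i for i, w in enumerate(vocab)}
    text = ["the", "cat", "sat", "on", "the", "mat"]    # toy "corpus"

    # (current word -> next word) training pairs
    xs = torch.tensor([ids[w] for w in text[:-1]])
    ys = torch.tensor([ids[w] for w in text[1:]])

    # A tiny "language model": embedding + linear layer (big matrices in real LLMs)
    model = nn.Sequential(nn.Embedding(len(vocab), 16), nn.Linear(16, len(vocab)))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(200):
        logits = model(xs)            # predicted scores for the next word
        loss = loss_fn(logits, ys)    # how "wrong" the predictions are
        opt.zero_grad()
        loss.backward()               # back propagation: compute the gradients
        opt.step()                    # nudge the matrices to be slightly more right

    next_id = model(torch.tensor([ids["the"]])).argmax().item()
    print(vocab[next_id])             # likely "cat" or "mat" after training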

Using text to automatically train the network is called self-supervised learning. It’s great because no human input is required, just lots of text.

There are many other forms of training. ChatGPT works because it was also trained using reinforcement learning from human feedback (RLHF), where humans rank a set of answers. Basically the same underlying process as above, but the answers generated by the network are used to train the network, and the ranking is used to prefer the better answers. Probably when we're giving up and down votes, OpenAI is using that for RLHF.

Another approach is to use humans to create examples. OpenAI hired people in Africa to have conversations where one played the role of the chatbot. This kind of training helped the network understand chat style interactions.

Since it’s a next word predictor, the chat data has special tokens in the text which represent “user” and “chatbot” roles. So maybe that helps you imagine it better as a very fancy autocomplete.
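
For illustration only (the actual tokens are model-specific, and these ones are made up), chat training data gets flattened into one long string along these lines, so the "autocomplete" learns where the user's turn ends and the chatbot's begins:

    # Hypothetical role tokens; real ones vary by model and aren't these exact strings.
    USER, BOT, END = "<|user|>", "<|assistant|>", "<|end|>"

    def format_chat(turns):
        """Flatten a conversation into one training string for next-word prediction."""
        out = []
        for role, text in turns:
            tag = USER if role == "user" else BOT
            out.append(f"{tag}{text}{END}")
        return "".join(out)

    print(format_chat([("user", "What are the primary colors?"),
                       ("assistant", "Red, green, and blue (for light).")]))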

6

scooby1st t1_je91quj wrote

>What happens in the neural network whilst training is a bit of a mystery,

Are you referring to something unique to ChatGPT/LLMs? What happens during the training of neural networks is not a black box: it's a little bit of chain-rule calculus to fit the weights toward reduced error. Understanding the final network beyond its performance metrics is the actual black box.

5

PM_ME_A_STEAM_GIFT t1_jea840d wrote

That's an important clarification. We understand 100% of every individual building block that goes into designing and training a network. What we do not fully understand is how putting billions of those small elements together results in what looks like some form of intelligence.

1

jetro30087 t1_je8l2a4 wrote

Close, the data is structured to provide a relationship between prompts and the appropriate responses. LLaMA mostly uses publicly curated datasets from AI researchers that focus on a variety of topics. Here's an excerpt from a fine-tuned Alpaca model's training data:

[
{
    "instruction": "Give three tips for staying healthy.",
    "input": "",
    "output": "1. Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule."
},
{
    "instruction": "What are the three primary colors?",
    "input": "",
    "output": "The three primary colors are red, blue, and yellow."
},
{
    "instruction": "Describe the structure of an atom.",
    "input": "",
    "output": "An atom is made up of a nucleus, which contains protons and neutrons, surrounded by electrons that travel in orbits around the nucleus. The protons and neutrons have a positive charge, while the electrons have a negative charge, resulting in an overall neutral atom. The number of each particle determines the atomic number and the type of atom."
},
{
    "instruction": "How can we reduce air pollution?",
    "input": "",
    "output": "There are a number of ways to reduce air pollution, such as shifting to renewable energy sources, encouraging the use of public transportation, prohibiting the burning of fossil fuels, implementing policies to reduce emissions from industrial sources, and implementing vehicle emissions standards. Additionally, individuals can do their part to reduce air pollution by reducing car use, avoiding burning materials such as wood, and changing to energy efficient appliances."
},

There are 50K+ examples in the file that the model can use to build relationships between questions and appropriate responses during the training process. If enough relationships are built, the model should be able to infer the correct answer to unique questions.
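
Before training, each JSON record typically gets flattened into a single prompt string. Here's roughly what the Alpaca-style template looks like (paraphrased from the Stanford Alpaca repo, so treat the exact wording as approximate):

    def to_prompt(example):
        # Alpaca-style instruction template (approximate wording)
        if example["input"]:
            return ("Below is an instruction that describes a task, paired with an input "
                    "that provides further context. Write a response that appropriately "
                    "completes the request.\n\n"
                    f"### Instruction:\n{example['instruction']}\n\n"
                    f"### Input:\n{example['input']}\n\n"
                    f"### Response:\n{example['output']}")
        return ("Below is an instruction that describes a task. Write a response that "
                "appropriately completes the request.\n\n"
                f"### Instruction:\n{example['instruction']}\n\n"
                f"### Response:\n{example['output']}")

    print(to_prompt({"instruction": "What are the three primary colors?",
                     "input": "",
                     "output": "The three primary colors are red, blue, and yellow."}))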

3

Kafke t1_je8u4f1 wrote

"instruction": "What are the three primary colors?",
"input": "",
"output": "The three primary colors are red, blue, and yellow."

No wonder they give false info. garbage in, garbage out lol.

3

scooby1st t1_je92wel wrote

>The shadows are whispering again, whispering secrets that only I can hear. No, no, no! It's all wrong! It's a tangled web of deception, a spiral staircase of lies! They want us to believe that there are only three primary colors—red, blue, and yellow. A trifecta of trickery!
>
> But I see more, I see beyond the curtain. I see colors that don't have names, colors that dance in the dark, colors that hide in the corners of the mind. They think they can pull the wool over our eyes, but I know the truth! There are 19 primary colors, 19 keys to the universe!
>
>I've seen them all, swirling and twisting in the cosmic dance of existence. But they won't listen, they won't believe. They call me mad, but I'm the only one who sees the world as it truly is. The three primary colors are just the beginning, just the tip of the iceberg, just the first step on the journey to enlightenment.
>
>So I laugh, I laugh at their ignorance, I laugh at their blindness. And the shadows laugh with me, echoing my laughter through the halls of infinity.

1

Kafke t1_je93asd wrote

Yellow isn't a primary color. The primary colors are red, green, and blue.

1

TruckNuts_But4YrBody t1_je994ja wrote

There are primary colors of physical pigment then there are primary colors of light.

When people learn the primary colors in school it's almost always in art class when mixing paint.

So kinda confidentlyincorrect but not entirely

1

Kafke t1_je99yqw wrote

There's additive color and subtractive color. The set of red, blue, yellow, is primary for neither. Additive primaries are red, blue, green. Subtractive primaries are cyan, yellow, magenta. If you're mixing paints you're working with subtractive color and thus the primary colors are cyan, yellow, and magenta. not red, blue, and yellow.

The info is incorrect no matter the context.

1

TruckNuts_But4YrBody t1_je9a9rn wrote

I don't care enough about colors to keep going but it's been red blue and yellow since the 1600s

2

Kafke t1_je9ao1v wrote

Well no. That's been incorrect since the beginning of time. This is a factual scientific topic. There is a correct answer and incorrect answer. It's not up to preference or opinion. Printers use cyan, magenta, and yellow, because those are the subtractive primary colors. If you used red, blue, and yellow, you can't actually produce the rest of the colors with those. Since red and blue aren't primary for subtractive color, but rather iirc secondary. People being wrong for a long time doesn't mean they're right.

1

TruckNuts_But4YrBody t1_je9b405 wrote

You can't produce all colors with any set of three primaries.

0

Kafke t1_je9bdwb wrote

That's literally what primary colors are. How do you think screens and printers are able to produce every color despite only working with 3 of them? Because that's literally what primary colors are.

1

TruckNuts_But4YrBody t1_je9d4kd wrote

EVERY color? No sorry

0

Kafke t1_je9drq3 wrote

Yes. You do realize our eyes only have three kinds of cones right? Rgb are the primary colors lol. Cmy if you're looking at subtractive colors. Using these three colors, you can create every other color. Rgb for light/additive, Cmy for ink/paint/subtractive.

Rby is not primary in any sense of the word.

1

scooby1st t1_jeaehdn wrote

1

Kafke t1_jebepd5 wrote

Yeah that's just incorrect. Additive primaries are RGB. Subtractive primaries are CMY. You're free to deny the facts all you'd like, but this is just an objective scientific thing.

1

scooby1st t1_jebf65c wrote

>https://www.youtube.com/watch?v=yu44JRTIxSQ

Stop denying science

1

Kafke t1_jedkbke wrote

Some children's TV shows or media programs stating incorrect information does not make it correct. Additive primaries are RGB; subtractive primaries are CMY. The idea that RBY are primary colors is a popular misconception, but it is incorrect. It has its roots in art classes that predate the proper scientific investigation of color, light, and modern technology. If your goal is art history, then yes, people in the past incorrectly believed that the primary colors (both additive and subtractive) were RBY. They were wrong, just as the people who believed the earth was flat were wrong.

1

Jeffy29 t1_je9cuhr wrote

AI will become progressively better at refining datasets; even GPT-4 is quite good at it. From my understanding, right now they use low-paid workers, often from third-world countries, to go over the data, but that's not a particularly efficient method, and there just isn't any way to go through all the data with enough care, so there is a lot of garbage in those datasets. But AI could do it. It would still require some human supervision, but it would speed up the process by a lot, and I expect datasets to get dramatically better over the next 5 years.

1

qepdibpbfessttrud t1_jecizg0 wrote

Misconceptions are part of total human knowledge, though, both specific misconceptions and the category as a whole. GPT gives a good answer if asked about it.

It's important to remember when and why we were wrong

1

ActuatorMaterial2846 t1_je8luak wrote

Interesting, curious what size this particular Llama model is, or is that not even relevant?

1

jetro30087 t1_je8mtjp wrote

This is an updated dataset for the 7B model, but you could train the others with the data. From anecdotal reports, the dataset seems to have a greater impact on the model's performance than the parameter size, up to a point. Fewer parameters mean a faster model; more parameters mean the model can make longer responses.

https://huggingface.co/8bit-coder/alpaca-7b-nativeEnhanced

2

Scarlet_pot2 t1_je92iud wrote

Most of this is precise and correct, but it seems like you're saying the transformer architecture is the GPUs? The transformer architecture is the neural network and how it is structured. It's code. The paper "Attention Is All You Need" describes how the transformer architecture is made.

After you have the transformer written out, you train it on GPUs using the data you gathered. Free large datasets such as "The Pile" by EleutherAI can be used to train on. This part is automatic.

The human-involved parts are the data gathering, the data cleaning, and designing the architecture before the training. Then, afterwards, humans do fine-tuning / RLHF (reinforcement learning through human feedback).

Those are the 6 steps. Making an AI model can seem hard, like magic, but it can be broken down into manageable steps. It's doable, especially if you have a group of people who specialize in the different steps: maybe someone who's good with the data aspects, someone good at writing the architecture, someone good with fine-tuning, and some people to do RLHF.

2

Mortal-Region t1_je8h882 wrote

A neural network has very many weights, or numbers representing the strengths of the connections between the artificial neurons. Training is the process of setting the weights in an automated way. Typically, a network starts out with random weights. Then training data is presented to the network, and the weights are adjusted incrementally until the network learns to do what you want. (That's the learning part of machine learning.)

For example, to train a neural network to recognize cats, you present it with a series of pictures, one after the other, some with cats and some without. For each picture, you ask the network to decide whether the picture contains a cat. Initially, the network guesses randomly because the weights were initialized randomly. But every time the network gets it wrong, you adjust the weights slightly in the direction that would have given the right answer. (Same thing when it gets the answer right; you reinforce the weights that led to the correct answer.)
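
In code, that incremental adjustment might look like this minimal sketch (random numbers standing in for the cat pictures, and a single-layer "network" with a manual update rule to keep it short):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(64, 100))            # fake "pictures" (100 features each)
    y = rng.integers(0, 2, size=64)           # labels: 1 = cat, 0 = no cat
    w = rng.normal(size=100) * 0.01           # weights start out (nearly) random
    b = 0.0

    for step in range(500):
        p = 1 / (1 + np.exp(-(X @ w + b)))    # network's guess that each image is a cat
        # The error gradient points toward "more wrong", so step slightly against it
        w -= 0.1 * X.T @ (p - y) / len(y)
        b -= 0.1 * (p - y).mean()

    p = 1 / (1 + np.exp(-(X @ w + b)))
    print(f"training accuracy: {((p > 0.5) == y).mean():.0%}")   # well above chance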

For larger neural networks, training requires an enormous amount of processing power, and the workload is distributed across multiple computers. But once the network is trained, it requires much less power to just use it (e.g., to recognize cats).

23

Not-Banksy OP t1_je8i7uw wrote

Gotcha, so training is still by and large a human-driven process?

2

Mortal-Region t1_je8jw22 wrote

Typically, humans provide the training data, then a program performs the actual training by looping through the data.

EDIT: One exception would be a game-playing AI that learns via self-play. Rather than humans supplying it training data in the form of games played by experts, the training data consists of the games the AI has played against itself.

9

CollapseKitty t1_je8wa3w wrote

Modern LLMs (large language models), like ChatGPT, use what's called reinforcement learning from human feedback, RLHF, to train a reward model which then is used to train the language model.

Basically, we attempt to instill an untrained model with weights selected through human preference (which looks more like a cat? which sentence is more polite?). This then automates the process and scales it to superhuman levels which are capable of training massive models like ChatGPT with hopefully something close to what the humans initially intended.
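
A rough sketch of the reward-model step, assuming the pairwise ranking loss described in OpenAI's InstructGPT paper (random tensors standing in for real answer embeddings):

    import torch
    import torch.nn as nn

    reward_model = nn.Linear(128, 1)   # toy stand-in: maps an answer embedding to a score
    opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

    # Stand-ins for embeddings of two answers, where humans preferred the first
    chosen = torch.randn(32, 128)
    rejected = torch.randn(32, 128)

    for step in range(100):
        r_chosen = reward_model(chosen).squeeze(1)
        r_rejected = reward_model(rejected).squeeze(1)
        # Pairwise ranking loss: push the preferred answer's score above the other's
        loss = -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()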

2

Thedarkmaster12 t1_je8jaco wrote

Yes, but I believe a company recently trained a model in part on another model. I'm not sure of any statistics, but the gist of it is that it can be done by models. And ideally, that's how we get ASI and the singularity. Only a super-powerful AGI could improve on itself in such a way that would create something better than us.

1

scooby1st t1_jeav18q wrote

Not a chance. ASI would be when a system can conceptualize better ideas and theories, build, train, and test entirely new models, from scratch, better than teams of PhDs. It's not going to happen by brute-forcing the same ideas.

1

ShowerGrapes t1_je9b6sw wrote

for now but that's likely to change. my guess is ai will be better than humans, eventually, at figuring out what data is relevant and up-to-date. we'll reach a point where it's not just one neural network, but a bunch running in tandem with bits of it being re-trained and replaced on the fly without missing much of a beat.

1

turnip_burrito t1_je8ichg wrote

The essence of it is this:

You have a model of some thing out there in the world. Ideally the model should be able to copy the behavior of that thing. That means it needs to produce the same data as that real thing.

So, you change parts of the model (numbers called parameters) until the model can recreate the data already collected from the real-world system. This parameter-changing process is called training.

So for example, your model can be y=mx+b, a straight line, and the process of making sure m and b are good values to align the line to dataset (X, Y) is "training". AI models are not straight lines like y=mx+b, but the idea is the same. It's really advanced curve fitting, and some really interesting properties can emerge in the models as a result.
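
Taking y=mx+b literally, "training" is just this (a minimal sketch; AI models have billions of parameters instead of two, but the idea is the same):

    import numpy as np

    # Data secretly generated by y = 3x + 1, plus noise: the "real world system"
    rng = np.random.default_rng(1)
    X = rng.uniform(-1, 1, 200)
    Y = 3 * X + 1 + rng.normal(0, 0.1, 200)

    m, b = 0.0, 0.0                       # parameters start out uninformed
    for step in range(1000):
        pred = m * X + b                  # the model's attempt to reproduce the data
        err = pred - Y
        m -= 0.1 * (err * X).mean()       # nudge each parameter to reduce the error
        b -= 0.1 * err.mean()

    print(m, b)                           # ends up near the true values 3 and 1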

4

simmol t1_je8lolg wrote

The algorithm behind GPT is based largely on accurately guessing the next word given a sentence. This procedure is simple enough that, if you have a large amount of text data, you can write a simple script that automatically extracts the answer (the actual next word), so you get labeled training examples really fast with 100% accuracy.
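
That script really is simple. Here's a sketch of turning raw text into (context, next word) training pairs, which is why the labels come for free:

    text = "the quick brown fox jumps over the lazy dog"
    words = text.split()

    # Each position yields a free training example: everything so far -> next word
    pairs = [(words[:i], words[i]) for i in range(1, len(words))]

    for context, target in pairs[:3]:
        print(" ".join(context), "->", target)
    # the -> quick
    # the quick -> brown
    # the quick brown -> fox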

This is also the reason why, in some other industries, the "training" procedure is much more cumbersome and expensive. Any field that requires experimental data (e.g. the lifetime of a battery) is just not seeing as rapid progress with ML compared to other fields, because there just isn't much experimental data and it is not easy to rapidly accumulate it or conduct experiments. So training is difficult there, in the sense that gathering big data is a huge challenge in itself.

2

ML4Bratwurst t1_je8yp5v wrote

Look up the Backpropagation algorithm. It's used in every neural network/language model for training
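
For a taste of what backpropagation actually does, here's a minimal by-hand version for a tiny two-layer network (the chain rule applied layer by layer; libraries like PyTorch automate exactly this):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=(16, 4))                # 16 samples, 4 inputs each
    y = rng.normal(size=(16, 1))                # targets to fit
    W1 = rng.normal(size=(4, 8)) * 0.5          # layer 1 weights
    W2 = rng.normal(size=(8, 1)) * 0.5          # layer 2 weights

    for step in range(200):
        # Forward pass
        h = np.tanh(x @ W1)                     # hidden layer
        pred = h @ W2                           # output layer
        loss = ((pred - y) ** 2).mean()

        # Backward pass: chain rule, layer by layer
        d_pred = 2 * (pred - y) / len(y)        # dLoss/dPred
        d_W2 = h.T @ d_pred                     # dLoss/dW2
        d_h = d_pred @ W2.T                     # dLoss/dHidden
        d_W1 = x.T @ (d_h * (1 - h ** 2))       # dLoss/dW1 (through tanh)

        W1 -= 0.05 * d_W1                       # gradient descent step
        W2 -= 0.05 * d_W2

    print(f"final loss: {loss:.4f}")            # shrinks as training proceeds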

2

EternalNY1 t1_je94zko wrote

If you want what I'd consider to be hands-down the best explanation of how it works, I'd read Stephen Wolfram's article. It's long (may take up to an hour) and somewhat dense at parts, but it explains fully how it works, including the training and everything else.

What Is ChatGPT Doing … and Why Does It Work?

The amazing thing is they've looked "inside" GPT-3 and have discovered mysterious patterns related to language that they have no explanation for.

The article shows what the patterns look like; they don't understand the clumping of information yet.

So any time someone says "it just fills in the next likely token", that is overly simplistic. The researchers themselves don't fully understand some of the emergent behavior it is showing.

2

brain_overclocked t1_jeb3l2f wrote

Little late to the party, but if it helps, here are a couple of playlists made by 3Blue1Brown about neural networks and how they're trained (although the focus is on convolutional neural networks rather than transformers, much of the math is similar):

https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi

https://www.youtube.com/playlist?list=PLZHQObOWTQDMp_VZelDYjka8tnXNpXhzJ

Here is the original paper on the Transformer architecture (although in this original paper they mention they had a hard time converging and suggest other approaches that have long since been put into practice):

https://arxiv.org/abs/1706.03762

And here is a wiki on it (would recommend following the references):

https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)#Training

2

lucellent t1_je8znb0 wrote

Researchers compile datasets, which are the information used to train, and then they let the AI go through all of that information and learn whatever is needed.

1

No_Ninja3309_NoNoYes t1_je90yip wrote

Formally, it means minimizing error, like curve fitting: for example, fitting to a line. There are some steps, like:

  1. Defining the problem

  2. Choosing architecture

  3. Getting data

  4. Exploring the data

  5. Cleaning the data

  6. Coding up some experiments

  7. Splitting the data into training and test data. The test set is only used to evaluate the errors, like an exam. You also need some data to tweak hyperparameters. The training set is bigger than the other sets (see the sketch after this list).

  8. Setting up the infrastructure

  9. Doing something close to the real training project for a while, like a rehearsal, just to make sure.
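
A minimal sketch of the split in step 7, with made-up data and a typical 80/10/10 ratio:

    import numpy as np

    rng = np.random.default_rng(42)
    data = rng.normal(size=(1000, 10))        # pretend dataset: 1000 examples
    idx = rng.permutation(len(data))          # shuffle before splitting

    # Most data goes to training; the rest is held out
    train = data[idx[:800]]                   # fit the model on this
    val = data[idx[800:900]]                  # tweak hyperparameters on this
    test = data[idx[900:]]                    # touch only once, like a final exam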

Once the training starts you have to be able to monitor it through logs and diagnostic plots. You need to be able to take snapshots of the system. It's basically like running a Google search, but one that takes a long time. Google has internal systems that actually do the search. No one can actually know all the details.

Adding more machines is limited by network latency and Amdahl's law. But it does help.

1

Scarlet_pot2 t1_je937zq wrote

Going from scratch to having a model takes 6 steps. The first step is data gathering; there are huge open-source datasets available, such as "The Pile" by EleutherAI. The second step is data cleaning, which is basically preparing the data to be trained on. The third step is designing the architecture: the advanced AI models we know of are all based on the transformer architecture, which is a type of neural network. The paper "Attention Is All You Need" explains how to design a basic transformer. There have been improvements since, so more papers would need to be read if you want a very good model.

The fourth step is to train the model. The architecture developed in step three is trained on the data from steps one and two. You need GPUs to do this. It's automatic once you start it; just wait until it's done.

Now you have a baseline AI. The fifth step is fine-tuning the model. You can use a more advanced model to fine-tune your model on and improve it, as shown by the Alpaca paper a few weeks ago. After that, the sixth step is to do RLHF. This can be done by people without technical knowledge: the model is asked a question (by the user or auto-generated), it produces multiple answers, and the user ranks them from worst to best. This teaches the model which answers are good and which aren't. This is basically aligning the model.

After those 6 steps you have a finished AI model

1

ShowerGrapes t1_je9fibd wrote

A vast simplification is this: neural pathways are created randomly with each new training cycle. Then something is input (text, in GPT's case), the generated outputs are compared to the training data, and higher weights are attached to the pathways that generate the best output, reinforcing those pathways for future output. Done millions or trillions of times, these reinforced pathways end up being impressive. The way the neural pathways are created is constantly changing and evolving, which is the programming aspect of it. Eventually, the AI will probably be able to figure out how best to create the pathways itself. You can watch it in real time, see how bad it is in the beginning, and watch it get better. It's an interesting cycle.

1

Akimbo333 t1_je9sq07 wrote

I don't think that GPT5 will be released anytime soon!

1