Submitted by Lajamerr_Mittesdine t3_ycipui in MachineLearning

Paper: https://arxiv.org/abs/2210.11610

Abstract:

>Large Language Models (LLMs) have achieved excellent performance on various tasks. However, fine-tuning an LLM requires extensive supervision. Humans, on the other hand, may improve their reasoning abilities by self-thinking without external inputs. In this work, we demonstrate that an LLM is also capable of self-improving with only unlabeled datasets. We use a pre-trained LLM to generate "high-confidence" rationale-augmented answers for unlabeled questions using Chain-of-Thought prompting and self-consistency, and fine-tune the LLM using those self-generated solutions as target outputs. We show that our approach improves the general reasoning ability of a 540B-parameter LLM (74.4%->82.1% on GSM8K, 78.2%->83.0% on DROP, 90.0%->94.4% on OpenBookQA, and 63.4%->67.9% on ANLI-A3) and achieves state-of-the-art-level performance, without any ground truth label. We conduct ablation studies and show that fine-tuning on reasoning is critical for self-improvement.
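In outline, the recipe the abstract describes looks roughly like the sketch below. The helper names (`llm.sample_cot`, `llm.fine_tune`), the sample count, the temperature, and the agreement cutoff are placeholders for illustration, not the authors' actual code:

```python
from collections import Counter

def self_improve(llm, unlabeled_questions, k=32, min_agreement=0.5):
    """Sketch of the recipe: sample k chain-of-thought rationales per
    unlabeled question, majority-vote the final answers (self-consistency),
    keep the rationales that reach the majority answer, and fine-tune on
    them. `llm.sample_cot` and `llm.fine_tune` are hypothetical stand-ins."""
    training_examples = []
    for question in unlabeled_questions:
        # k independent (rationale, answer) samples at nonzero temperature
        # (values here are illustrative, not necessarily the paper's).
        samples = [llm.sample_cot(question, temperature=0.7) for _ in range(k)]
        votes = Counter(answer for _, answer in samples)
        majority_answer, count = votes.most_common(1)[0]
        # Agreement cutoff for "high-confidence" questions -- an assumption
        # in this sketch, not necessarily the paper's exact filter.
        if count / k < min_agreement:
            continue
        for rationale, answer in samples:
            if answer == majority_answer:
                # The self-generated rationale + answer becomes the target output.
                training_examples.append((question, f"{rationale} The answer is {answer}."))
    # Fine-tune the same model on its own filtered outputs.
    llm.fine_tune(training_examples)
    return llm
```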

200

Comments


say_wot_again t1_itn823q wrote

From the abstract, it seems very similar to common self supervised techniques in computer vision. The difference is that in the case of computer vision SSL, you use the model's confident outputs on normal data to train its performance on heavily augmented data, whereas here you use the model's performance on "chain of thought" prompts to train its performance on normal prompts. But either way, the principle of "use the model's high confidence outputs on easy examples to train it on hard examples" stays the same. It's always cool to see this sort of cross pollination between vision and NLP, though the title seems designed to conjure up images of Westworld or Ex Machina.

Edit: it appears one massive difference is that in vision, the augmentations come from the modeler, whereas here the chains of thought actually come from the model's outputs. So it's leveraging the inherent randomness in LLM outputs to generate new training data, relying on the idea that answers that appear frequently in the output are likelier to be correct. This IS pretty cool, and meaningfully different from the vision SSL case insofar as it requires much less manual intervention.

61

DeezNUTSampler t1_itq1l2d wrote

Can you link works in Computer Vision SSL which incorporate this principle “use model’s high confidence outputs on easy examples to train it on hard examples”? It is not obvious to me how this would work. For example, in contrastive learning the objective is to learn view invariant representations. Two views of an object, augmented differently, are pushed together in representation space by minimizing the distance between them as our loss function. Which one would constitute the easy/hard example here?

5

say_wot_again t1_itrmhsx wrote

Here's an example of what I had in mind. Pseudo-labels for unlabeled data are generated on the clean images, but the student model is trained on a strongly augmented version of the image. It's not contrastive learning, because the objective is still explicitly object detection; "easy vs. hard" here is the original image vs. the strongly augmented one.
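Roughly what that looks like in code, in the spirit of FixMatch-style pseudo-labeling. Shown for classification rather than detection for brevity; the augmentation functions and the confidence threshold are placeholders:

```python
import torch
import torch.nn.functional as F

def weak_aug(images):    # placeholder: e.g. random horizontal flip
    return images

def strong_aug(images):  # placeholder: e.g. RandAugment, color jitter, cutout
    return images

def pseudo_label_loss(teacher, student, unlabeled_images, tau=0.9):
    """Confident teacher predictions on the clean/weakly augmented view
    supervise the student on a strongly augmented view of the same images."""
    with torch.no_grad():
        probs = F.softmax(teacher(weak_aug(unlabeled_images)), dim=-1)
        confidence, pseudo_labels = probs.max(dim=-1)
        mask = confidence >= tau  # keep only high-confidence ("easy") examples
    logits = student(strong_aug(unlabeled_images))
    # Train on the "hard" (heavily augmented) view against the pseudo-labels.
    return F.cross_entropy(logits[mask], pseudo_labels[mask])
```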

3

hiptobecubic t1_itoswbc wrote

Didn't the Greeks try this? It's a mess until you have an epiphany and realize that you have to verify the truth of a statement before you start building on top of it.

15

red75prime t1_itoxv2y wrote

The Greeks arguably got the rules of logic out of this.

10

Pwhids t1_itn9glu wrote

They show that the large LMSI models can be distilled into smaller models while maintaining accuracy, but I wonder what size model is necessary for the LMSI training itself to be viable. They only show results for 540B. Would be very curious to see a study on whether there is a certain model size where this kicks in.

13

Material_Opening7336 t1_itmdpvs wrote

Very impressive. Thank you for sharing your paper. I will let you know if I have any questions.

3

ReasonablyBadass t1_itolj2g wrote

Basic question: chain-of-thought prompting already generates its own prompts for the next step, right? So this also generates answers?

3

Lajamerr_Mittesdine OP t1_itomfs6 wrote

CoT simply breaks down a problem into multiple interconnected solution statements to arrive at one conclusive answer.

You can prompt a CoT model to go down different reasoning structures and arrive at different answers (sometimes wrong), but those are all independent of one another.
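To make that concrete, here's a toy illustration. The prompt follows the usual few-shot CoT format, and the sampled continuations are made up:

```python
cot_prompt = """Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 balls each is 6 balls. 5 + 6 = 11.
The answer is 11.

Q: A cafeteria had 23 apples. They used 20 to make lunch and bought 6 more.
How many apples do they have?
A:"""

# Sampling the same prompt several times at temperature > 0 gives independent
# reasoning paths, which may disagree, e.g.:
#   "They used 20 of the 23, leaving 3. 3 + 6 = 9. The answer is 9."   (correct)
#   "23 - 20 = 3, and 3 + 6 = 9. The answer is 9."                     (correct)
#   "They bought 6 more, so 20 + 6 = 26. The answer is 26."            (wrong)
# Each path is independent of the others; the paper keeps the answer that
# appears most often (self-consistency) rather than trusting any single path.
```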

Note that this is fine-tuning an existing LLM.

This fine-tuning is driven in part by a hypermodel that helps rank solutions. The highest-ranked solutions are then used to fine-tune the model even further, so it becomes a better reasoner using its own generated answers.

So the model uses its own understanding to generate CoT solution statements. The hypermodel ranks those statements, and the existing model can then be fine-tuned on the newly generated positive and negative solutions, reinforcing what correct solution statements look like and what incorrect ones look like as well.
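For reference, in the paper itself the "ranking" is just self-consistency, i.e. majority voting over the sampled final answers. A minimal sketch of that selection step, written with the ranker as a swappable piece (that pluggable-scorer framing is mine, not the paper's):

```python
from collections import Counter

def rank_by_self_consistency(samples):
    """Score each sampled (rationale, answer) pair by how often its final
    answer appears across all samples for the same question. A learned
    'hypermodel' ranker could be dropped in here instead (hypothetical)."""
    votes = Counter(answer for _, answer in samples)
    return [(rationale, answer, votes[answer] / len(samples))
            for rationale, answer in samples]

# Usage sketch: keep only the paths whose answer won the vote as positive
# fine-tuning targets for the next round.
# scored = rank_by_self_consistency(samples)
# best_answer = max(scored, key=lambda item: item[2])[1]
# positives = [(r, a) for r, a, _ in scored if a == best_answer]
```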

Future work: So what is limiting the LLM from eventually getting to ~100%? The bottleneck preventing this from compounding is the hypermodel that has to accurately rank the solutions. Theoretically, if you had a perfect ranker black box, you could eventually get to ~100%. So what you would want in future work is either a more accurate ranker overall, or some way to continuously improve the ranker hypermodel in an unsupervised fashion, just like this method improves the LLM itself.

Personal Opinion: What this is really doing is picking low-hanging fruit: it prompts the LLM for reasoning it already understands in different contexts and more reliably surfaces that reasoning as the highest-ranked solution across a broader range of problems. It's not learning entirely new concepts.

10

shazvaz t1_itnadll wrote

You want skynet? You want the singularity? This is how you get there.

Nice knowing ya folks.

−31