Complex_Candidate_28 t1_j675z5i wrote

The purpose of the experiments is not to compare the performance between them. The goal is to compare the mechanisms behind them. So it doesn't affect the conclusion itself. The point is to use the same set of examples for analysis.

1

cthorrez t1_j67aa39 wrote

If the goal is the mechanism rather than the performance, why tune the seed for performance in the first place? The examples used don't change the mechanism.

2

Complex_Candidate_28 t1_j67aytx wrote

Because for small LMs, ICL is unstable, i.e., it sometimes degrades to classifying all examples into one category. The protocol tries to ensure that ICL is analyzed when it works well. (For much larger LMs, the performance variance would be much smaller, so this step can be skipped.)
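
To make the "degrades to one category" failure concrete, here is a rough sketch of how such a check could look. This is only an illustration, not the actual protocol: the function names, the 0.95 threshold, and the toy predictions are all hypothetical.

```python
from collections import Counter


def is_collapsed(predictions, threshold=0.95):
    """Return True if a single label makes up at least `threshold` of the predictions.

    `threshold` is a hypothetical cutoff chosen for illustration.
    """
    if not predictions:
        raise ValueError("predictions must be non-empty")
    # Count of the most frequent predicted label.
    top_count = Counter(predictions).most_common(1)[0][1]
    return top_count / len(predictions) >= threshold


def usable_seeds(predictions_by_seed, threshold=0.95):
    """Keep the seeds whose ICL predictions did not collapse to one class."""
    return sorted(
        seed
        for seed, preds in predictions_by_seed.items()
        if not is_collapsed(preds, threshold)
    )


# Toy data (made up): seeds 0 and 2 predict a single class for every example.
predictions_by_seed = {
    0: ["pos"] * 20,
    1: ["pos"] * 11 + ["neg"] * 9,
    2: ["neg"] * 20,
}
print(usable_seeds(predictions_by_seed))
```

Filtering on collapse like this, rather than picking the best-scoring seed, would only exclude runs where ICL is not working at all.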

1

cthorrez t1_j67csjx wrote

That's an interesting topic that I think deserves further investigation. On the surface it sounds like the size of the LM impacts the mechanism by which the LM is able to "secretly perform gradient descent".

Is finetuning similarly unstable for small LMs?

1