
cegras t1_jdsd89g wrote

I don't see how it's possible to avoid just memorizing the internet, which is full of enough questions and discussions to simulate convincing Q&As. Suppose a team had invented an algorithm or heuristic that truly avoided data contamination (https://aisnakeoil.substack.com/p/gpt-4-and-professional-benchmarks). What you'd have, then, is something that can separate content into logically similar but orthogonal realizations. That would be an incredible tool and worth a prize in its own right.

1

pengo t1_jdt6iv2 wrote

> Then what you have is something that can separate content into logically similar, but orthogonal realizations.

Like a word vector? The thing every language model is based on?

1
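The word-vector idea above can be made concrete: words (or whole problem statements) are mapped to vectors, and cosine similarity measures relatedness. A toy sketch with hand-made 3-d vectors (real embeddings are learned and have hundreds of dimensions; these numbers are illustrative only):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy, hand-picked vectors -- not from any trained model.
vec = {
    "list":   [0.9, 0.1, 0.0],
    "array":  [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

print(cosine(vec["list"], vec["array"]))   # high: related concepts
print(cosine(vec["list"], vec["banana"]))  # low: unrelated concepts
```

Whether such geometric similarity amounts to "separating logically similar but orthogonal realizations" is exactly what the thread is debating.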

cegras t1_jdta9mj wrote

More like the ability to know that 'reversing a linked list' and 'linked list cycle and traversal problems' involve the same concept but are different problems, and to separate those into train/test. Clearly they haven't figured that out, because ChatGPT is contaminated, and their (opaquely disclosed) ways of addressing the issue don't seem adequate at all.

3
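The point above is that surface-level dedup doesn't catch conceptual overlap. A minimal sketch of a naive contamination filter (token-set Jaccard similarity with a hypothetical 0.5 threshold, not OpenAI's actual method, which isn't fully disclosed):

```python
def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity; a crude near-duplicate check."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

train_item = "Write a function to reverse a singly linked list."
test_item  = "Detect whether a linked list contains a cycle."

# Surface overlap is low even though both exercise the same concept,
# so a threshold-based filter (0.5 here, chosen arbitrarily) keeps both
# and the test item slips into a "clean" benchmark.
sim = jaccard(train_item, test_item)
print(sim < 0.5)  # True: the pair passes the naive dedup filter
```

Catching such pairs would require comparing at the level of concepts rather than tokens, which is the hard problem the comment is pointing at.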