Submitted by Balance- t3_124eyso in MachineLearning
cegras t1_je0jsud wrote
Reply to comment by MrFlamingQueen in [N] OpenAI may have benchmarked GPT-4’s coding ability on its own training data by Balance-
Do you know if ChatGPT was allowed to ingest PDFs found on the internet? Even if not, I'm sure there are many sections of famous textbooks reproduced in HTML or parsable form.
ianitic t1_je0mjqx wrote
Oh, I haven't tested this on textbooks, but I have asked ChatGPT to give me pages of a novel and it did, word for word. I suspect it had to have trained on PDFs? I'm highly surprised I haven't seen any news of authors/publishers suing yet tbh.
Based on the above test, though, it's obvious whether or not a book is part of its training set.
currentscurrents t1_je12d3k wrote
Nobody knows exactly what it was trained on, but there exist several datasets of published books.
>I'm highly surprised I haven't seen any news of authors/publishers suing yet tbh.
They still might. But they don't have a strong motivation; it doesn't really directly impact their revenue because nobody's going to sit in the ChatGPT window and read a 300-page book one prompt at a time.
mcilrain t1_je1a7cl wrote
Current tech could be used to allow you to ask an AI assistant to read you a book.
DreamWithinAMatrix t1_je3c6kl wrote
There was that time Google was taken to court for scanning and indexing books for Google Books or whatever and Google won:
https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,_Inc.
MrFlamingQueen t1_je0w3ut wrote
Not sure about the training corpus, but like you mentioned, there are a ton of other forms of textbooks and solution manuals to textbook problems on things like GitHub, Stack Exchange, etc.
mcilrain t1_je19vif wrote
Even if it didn't ingest PDFs it probably ingested websites that scraped PDFs to spam search engine results.