
t1_j99o9kf wrote

Oh good, maybe this will result in “fair use” being defined to explicitly not include some asshole scraping the Internet and dumping everything they find into their copyright-washing “AI”.

12

t1_j9adicl wrote

I cannot see any possible way to define fair use the way you’re saying which wouldn’t have massive unintended effects. If you want to propose that, you’re going to need to be a hell of a lot more specific than “dumping into an AI” when describing what you think should actually be prohibited.

12

t1_j9ag5if wrote

Why not? Just say scraping is fine for research and private models. As soon as you release the model to the public or try to monetize it, it's outside of fair use. Just like Nintendo going after passion-project games that are similar in theme, style, and mechanics: you can't just take other people's work and make money off of it.

−5

t1_j9allgk wrote

How do you define a model? What statistics are you and are you not allowed to scrape and publish? Comments like yours speak to a misunderstanding of what training is with respect to a work, which is simply nudging some numbers according to the statistical relationships within the text. That’s an incredibly broad category of operations.
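To make "nudging some numbers according to the statistical relationships within the text" concrete, here's a minimal sketch (my own toy illustration, not any company's actual training code) where "training" is nothing more than updating bigram counts over scraped text:

```python
from collections import defaultdict

def train(counts, text):
    """'Training' here is just incrementing bigram counts --
    nudging numbers to reflect which word tends to follow which."""
    words = text.split()
    for a, b in zip(words, words[1:]):
        counts[a][b] += 1
    return counts

counts = defaultdict(lambda: defaultdict(int))
train(counts, "the cat sat on the mat")
# counts["the"] now records what followed "the" in the source text
```

Any rule that outlaws "dumping text into a model" has to say why this counting operation is infringement while, say, publishing word-frequency statistics isn't.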

For example, if I scrape a large number of pages, and analyze the number of incoming and outgoing links, and how those links relate to other links, in order to build a model that lets me match a phrase to a particular webpage and assess its relevance, is that fair use?

If not, you just outlawed search engines. If so, what principle are you using to distinguish that from model training?
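The link-analysis model in that search-engine example is essentially PageRank. A toy sketch (the graph and all names are illustrative only):

```python
def pagerank(links, damping=0.85, iters=50):
    """Toy PageRank: score pages purely from the scraped link structure."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        # every page keeps a baseline share, plus shares from its in-links
        new = {p: (1 - damping) / len(pages) for p in pages}
        for p, outs in links.items():
            if not outs:
                continue
            share = damping * rank[p] / len(outs)
            for q in outs:
                if q in new:
                    new[q] += share
        rank = new
    return rank

# Tiny scraped link graph: a <-> b, and c links to a
graph = {"a": ["b"], "b": ["a"], "c": ["a"]}
ranks = pagerank(graph)
# "a" scores highest because it has the most incoming links
```

That model is built entirely by scraping other people's pages and aggregating statistics about them, which is exactly why "scraping into a model" alone can't be the line.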

Edit: Gotta love when someone downvotes you in less time than it would take to actually read the comment. Genuine discourse right there.

9

t1_j9apjid wrote

Your argument neglects the business side of the situation, which explains the motivation to allow one use and disallow the other… if I run a content website, a search engine crawling the site to generate results that send traffic my way is beneficial to both parties; it's symbiotic.

Alternatively, if I run a content site that an AI company crawls and then uses to train a model which eliminates would-be visitors' need for my site, it's parasitic.

0

t1_j9avebb wrote

I'm not neglecting anything. I'm asking for some semblance of precision in defining model training out of fair use. The purpose and character of use, and the effect on the market are already factors in fair use decisions, but that's a lot more complicated of an issue than "AI models can't scrape content." It's specific to the application, and even for ChatGPT specifically, it would be pretty murky.

3

t1_j9awd5k wrote

Except that's what was missing from your original point, but either way I gave you a starting point… if it's beneficial for both parties and both parties consent (which content site operators do via robots.txt instructions), no one has a problem. In the AI case it's beneficial to the AI creator/owner but harmful to the content owner, since the AI competes with them using their own content, so it shouldn't be considered fair use.
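For anyone unfamiliar, the consent mechanism being referred to is the robots.txt file served at a site's root. A site operator can allow ordinary search crawlers while opting out of an AI training crawler; the sketch below uses GPTBot, the user agent OpenAI documents for its crawler, though honoring the file is voluntary on the crawler's part:

```
# robots.txt — allow ordinary crawlers, opt out of OpenAI's GPTBot
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
```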

−1

t1_j9b9muc wrote

>Except that’s what was missing from your original point

Again, it's not missing from my original point, because my original point was to ask how the commenter above was distinguishing these cases. You've given a possible answer. That's an answer to my question, not a rebuttal.

I don't think that answer is very compelling, though. Arguing that an explicitly unreliable chat bot that hallucinates as often as it tells the truth is somehow a competitor to news media etc. is a tall order.

1

t1_j9biixs wrote

I didn't present it as a rebuttal; I added important context that was missing from your question and that makes the answer much clearer.

And these things are unreliable now, but Microsoft and others are dumping billions of dollars into making them better, and they're doing it for profit. Waiting around until they're perfected before fighting the ongoing unfair use of copyrighted content is a surefire way to lose that fight.

1

t1_j9bl2kk wrote

What they're dumping money into now on this front are AI-enhanced search engines, which are complementary to the content they're trained on.

1

t1_j9brj6f wrote

That's all they've launched on this front; that doesn't mean it's all they're working on.

1

t1_j9bhoq8 wrote

Search providers like Google don't just show you links, though. They also show potentially relevant excerpts, so you often don't even need to visit the linked site to get what you were after, and they show previews of images in image search, etc.

Determining exactly where to draw the line of what to consider fair-use for things like this is a highly complex and dynamic issue. Web search engines are (by necessity) parasitic as well but that alone neither makes them bad nor illegal.

Parasitic is also not the "bad" counterpart of symbiotic. Symbiosis is the umbrella term: a parasitic relationship is itself a form of symbiosis that benefits one party at the other's expense, while mutualism is the form that benefits both. So calling something symbiotic says nothing by itself about which side(s) benefit. I think exploitative would be a more appropriate word for the relationships you're describing.

3

t1_j9bqacf wrote

Those relevant excerpts and similar features have been pretty detrimental to search click-through rates in certain areas (they're known as "no-click" searches in the industry)… but the alternative is to block Google's bots entirely, which isn't viable if you operate a content site, since Google has an effective monopoly on search. Also, those features do still link out to the content they're showing on the SERP, whereas the chat AI doesn't, and it gives the appearance that it's the source of the information.

Your point about vocabulary is fair

2

t1_j9an2db wrote

I'm speaking about using copyrighted art, music, etc. I understand what training is. I also understand the steps companies take to prevent even the perception that they're training on copyrighted material: they either generate synthetic data or purchase entire libraries from stock photo sites. OpenAI and, by extension, Microsoft are hoping they can get enough people on their side by saying, "Nothing is copyright if you think about it," so they can do whatever they like.

−5

t1_j9ang4f wrote

None of what you said addresses anything I said in my comment.

6

t1_j9aog2w wrote

Because I'm not talking about defining a model, I'm talking about scraping copyrighted material. Why would I change the subject to your strawman argument?

−5

t1_j9avgd7 wrote

So you think that search engines should be considered illegal copyright infringement? You say that you're just referring to scraping content, which is a necessary part of how a search engine works. So I'm forced to assume that the answer is yes.

0

t1_j9axo6n wrote

Lol, you're the one bringing search engines into this for some reason. It's a disingenuous argument and way off base from my point, which is why I'm not responding to it. You've also found all my comments and responded to them aggressively like a good shill

0

t1_j9b8p1r wrote

>You've also found all my comments and responded to them aggressively like a good shill

Are you talking about this? You replied to me.

I mean, Jesus Christ. Anyway, I'm done trying to explain the concept of unintended consequences to you.

2

t1_j9bn2ue wrote

You don’t need permission to read, memorise, analyse, synthesise, learn from, paraphrase, praise or criticise copyrighted text. You need permission to reproduce it. It isn’t obvious to me that a statistical model would need to reproduce the data it is studying.

1

t1_j9a5bby wrote

Yeah, reading things and then using the information is the most deplorable action any actor can do. Thank god humans are above such disgusting behaviour. Imagine the dystopia we would be living in otherwise.

6