yaosio t1_iuz4uo0 wrote on November 4, 2022 at 2:04 AM

There is an argument that co-pilot outputting open source code without credit or the license breaks the license. It will output stuff from open source projects verbatim (I can't find the link, maybe it was in Twitter? I can't back this up.), so this isn't a case where the code is inspired by the code, it really is the code and has to abide by the license.

One solution without messing with co-pilot training or output is to have a second program look at code being generated to see if it's coming from any of the open source projects on gitbub and let the user know so they can abide by the license.

CapaneusPrime t1_iuzgsq3 wrote on November 4, 2022 at 3:39 AM

>There is an argument that co-pilot outputting open source code without credit or the license breaks the license. It will output stuff from open source projects verbatim (I can't find the link, maybe it was in Twitter? I can't back this up.), so this isn't a case where the code is inspired by the code, it really is the code and has to abide by the license.

There is an argument that this doesn't matter (from GitHub's perspective).

It's already been pretty well established that AI can be trained on copyrighted photos without issue.

That said image generating AI can produce works which infringe on copyright. So, Copilot could certainly produce code covered by a license. Which would then possibly lead to the Copilot user being in violation of the license.

That said...

While code is copyrighted, the protections of that copyright aren't absolute.

For instance, I don't think anyone would doubt that there are examples of code under license which includes elements lifted from elsewhere without attribution—stackoverflow, etc—for which the author would not have a valid claim of authorship.

But, even for people who wrote their own code, 100% by scratch, there are limitations.

If the copying is a very small element of the whole it's less likely to be problematic.

If the code represents a standard method of doing something or if there's only a few ways to accomplish what the code does, it's not likely to be able to be copyrightable.

Now, the vast majority of my programming work is done in purely functional programming languages—object-oriented languages have much more opportunity for creative expression. I write a lot of code implementing algorithms, most of which are very complex, and I'd be very hard pressed to justify claiming the copyright on most of the code I write.

Regardless of how clever I think some of my code may be, I'm also certain that any other competent person implementing the same algorithm would end up with code >95% essentially identical to mine.

Honestly, I don't see this lawsuit going anywhere, as I understand it, any copied snippets are fairly short and standard.

Alikont t1_iv0oi1g wrote on November 4, 2022 at 12:33 PM

> includes elements lifted from elsewhere without attribution—stackoverflow

Users of Stackoverflow sign that their code snippets are public and no attribution is required as a part of Stackoverflow TOS

CapaneusPrime t1_iv0t6pr wrote on November 4, 2022 at 1:12 PM

You missed the point, I'm not making a spurious "whataboutism" claim.

Attribution is required by copyright.

If I take a snippet of code from stackoverflow and put it in my open source project, that's fine.

Nobody is saying it isn't.

What isn't fine is slapping a license on a file which includes that code without specifying that that code isn't subject to the license—that's claiming ownership of something which isn't yours and trying to attach a license to it.

Beyond that, you really need to re-read the stackoverflow ToS, because they don't quite say what you seem to think.

Takahashi_Raya t1_iv0ammn wrote on November 4, 2022 at 10:03 AM

>It's already been pretty well established that AI can be trained on copyrighted photos without issue.

It hasn't that is why ghetty has blocked ai and the art world is incredibly hatefull against AI and moving in the same way the creators started this lawsuit. There is a reason why university's have law and ethics classes regarding AI where it is explicitly told to not train on anything that is not public domain or licensed.

The fact facial recognition waa trained on millions of foto's that where present on facebook is still a sore sting in many people's minds. Dont confuse AI startups ignoring ethics and laws with reality.

If this lawsuit is a succes expect the ai tech world to be on fire very quickly. IP lawyers are frothing at their mouths for a while to get a slice of this

farmingvillein t1_iv1x778 wrote on November 4, 2022 at 5:41 PM

> It hasn't that is why ghetty has blocked ai

You are right that OP is wrong (re:whether this is a settled legal issue)...but let's not pretend that ghetty [sic] doing so has to do with anything than attempted revenue maximization on their part.

Successful, prolific AI art devalues their portfolio of images, and they know that.

Takahashi_Raya t1_iv20ewh wrote on November 4, 2022 at 6:01 PM

I mean that is very much part of it but it is indeed not the only reason ghetty did that.

farmingvillein t1_iv29u6c wrote on November 4, 2022 at 7:03 PM

Getty and Shutterstock literally turned around and partnered with generative AI companies--who do exactly what you flag as a problem--to sell images on their platforms.

Takahashi_Raya t1_iv2c6qi wrote on November 4, 2022 at 7:19 PM

Getty and shutterstock partnered with OpenAI (creators of Dall-E) and with BRIA. Both company's who's training data has been confirmed to be ethically sourced and only contain public DOMAIN images and images they have licenses too.

the ones who are under scrutiny from community's are Midjourney, stablediffusion & novelAI. when it comes to image gen due to them not adhering to the ethics in AI data usage.

OpenAI is mentioned in the current main topic of Co-Pilot as well due to microsoft using their codex model as part of co-pilot but that doesn't change that Dall-E is ethically used.

CapaneusPrime t1_iv0npgi wrote on November 4, 2022 at 12:26 PM

You are wrong.

https://towardsdatascience.com/the-most-important-supreme-court-decision-for-data-science-and-machine-learning-44cfc1c1bcaf

Ronny_Jotten t1_iv0vo01 wrote on November 4, 2022 at 1:31 PM

That decision wasn't about copyrighted photos. It was about Google creating a books search index, which was allowed as fair use - just like their scanning of books for previews is. That's an entirely different situation than if Google had trained an AI to write books for sale, that contained snippets or passages from the digitized books.

The latter certainly would not be considered fair use under the reasoning given by the judge in the case. He found that the search algorithm maintained:

> consideration for the rights of authors and other creative individuals, and without adversely impacting the rights of copyright holders

and that its incorporation into the Google Books system works to increase the sales of the copyrighted books by the authors. None of this can be said about Microsoft's product. It would seem to clearly fail the tests for fair use.

CapaneusPrime t1_iv1lheh wrote on November 4, 2022 at 4:26 PM

>That decision wasn't about copyrighted photos.

And every knowledge person agrees this protects images as well.

Training a generative AI does not adversely impact the rights of artists.

This is really transformative fair use.

waffles2go2 t1_iv1c4z3 wrote on November 4, 2022 at 3:25 PM

>https://medium.com/@brianjleeofcl/this-piece-should-be-retracted-ca740d9a36fe

Relevant bits - perhaps spouting off with a N=1 isn't the best look...

In practice, when SCOTUS denies the petition, the ruling made by the relevant appellate court is a legal precedent only within the the district (Second) where the circuit court has made its ruling. This means that a different court—say, the Ninth, which includes Silicon Valley—could go ahead and issue a ruling that directly opposes that of the Second. At this point, it becomes more likely that SCOTUS would grant cert since it would be a problem that under the same federal legal code, two opposing versions of case law could exist; after which the court would hear arguments and then finally issue a decision. Until that hypothetical occurs, there is no precedent set by a SCOTUS decision to note in this matter.

So a programmer who doesn't understand the law should take a harder look at what they post on Reddit unless the like being totally owned...

CapaneusPrime t1_iv1k6gu wrote on November 4, 2022 at 4:18 PM

I'm not a programmer and I do understand the law.

killver t1_iv0rejo wrote on November 4, 2022 at 12:57 PM

> It's already been pretty well established that AI can be trained on copyrighted photos without issue.

This is one of the biggest misconceptions in AI at this point. This is just not true.

[deleted] t1_iv0xj39 wrote on November 4, 2022 at 1:45 PM

[deleted]

Brilliant_Aspect_201 t1_ivdtb7p wrote on November 7, 2022 at 5:48 AM

Using peoples work for ANYTHING without permission is ILLEGAL!

CapaneusPrime t1_iv0t82o wrote on November 4, 2022 at 1:12 PM

https://towardsdatascience.com/the-most-important-supreme-court-decision-for-data-science-and-machine-learning-44cfc1c1bcaf

killver t1_iv0ycz4 wrote on November 4, 2022 at 1:51 PM

If you trust a random blog, go ahead.

This ruling was for a very specific use case that cannot be generalized, and also only applies to US, even only a specific district. It is also totally unclear how it applies to generative model, which even the blog cited recognizes.

The AI community just loves to trust this as it is the easy and convenient thing to do.

Also see a reply to this post you shared: https://medium.com/@brianjleeofcl/this-piece-should-be-retracted-ca740d9a36fe

CapaneusPrime t1_iv1kpa2 wrote on November 4, 2022 at 4:21 PM

👌 Good luck with that.

multiedge t1_iwiakg3 wrote on November 15, 2022 at 9:02 PM

There's also a problem of this being used to scam Microsoft. I mean, I could license my code and publish it on github, then create another several github accounts and reuse that licensed code to be picked up by Copilot. I will then have legal grounds to sue them for using my licensed code.

chatterbox272 t1_iv4kbwb wrote on November 5, 2022 at 6:34 AM

>It will output stuff from open source projects verbatim

I've seen this too, however only in pretty artificial circumstances. Usually in empty projects, and with some combination of exact function names/signatures, detailed comments, or trivially easy blocks that will almost never be unique. I've never seen an example posted in-context (in an existing project with it's own conventions) where this occurred.

>One solution without messing with co-pilot training or output is to have a second program look at code being generated to see if it's coming from any of the open source projects on gitbub and let the user know so they can abide by the license.

This kinda exists, there is a setting to block matching open-source code although reportedly it isn't especially effective (then again, I've only seen this talked about by people who also report frequent copy-paste behaviour, something I've not been able to replicate in normal use).

multiedge t1_iwib7xx wrote on November 15, 2022 at 9:06 PM

There's also a problem of this being a scam. I mean, I could license my code and publish it on github, then create another several github accounts and reuse that licensed code to be picked up by Copilot. I will then have legal grounds to sue them for using my licensed code.

TiredOldCrow t1_iuzp1y3 wrote on November 4, 2022 at 5:00 AM

I appreciate that the legendary "fast inverse square root" code from Quake 3 gets produced verbatim, comments and all, if you start with "float Q_rsqrt".

float Q_rsqrt( float number )
{
	long i;
	float x2, y;
	const float threehalfs = 1.5F;

	x2 = number * 0.5F;
	y  = number;
	i  = * ( long * ) &amp;y;                       // evil floating point bit level hacking
	i  = 0x5f3759df - ( i &gt;&gt; 1 );               // what the fuck? 
	y  = * ( float * ) &amp;i;
	y  = y * ( threehalfs - ( x2 * y * y ) );   // 1st iteration
//	y  = y * ( threehalfs - ( x2 * y * y ) );   // 2nd iteration, this can be removed

	return y;
}

I'm interested in how practical it will be for a motivated attacker to poison a code generation models with vulnerable code. Also curious to what extent these models produce code that only works with outdated and vulnerable dependencies -- a problem you'll also run into if you naively copy old StackOverflow posts. I've recently been working on threat models in natural language generation, but it seems like threat models in code generation are also going to be interesting.

Edit: Not John Carmack!

ClearlyCylindrical t1_iv0d7hp wrote on November 4, 2022 at 10:36 AM

the q_rsqrt being produced verbatim is probably due to identical code existing in many areas of the training data.

dojoteef t1_iv0hfoe wrote on November 4, 2022 at 11:25 AM

Slightly off-topic: I'm a huge John Carmack fan, but he isn't the author of that code. It's just part of engine code that his company released for the game Quake 3 Arena. For details, check out:

https://www.beyond3d.com/content/articles/8/

TiredOldCrow t1_iv0s1bg wrote on November 4, 2022 at 1:02 PM

Great read, thanks for that. Updated the comment.

fmai t1_iv0xglf wrote on November 4, 2022 at 1:44 PM

Even if it is judged to be illegal, I hope that countries quickly come to pass new legislation that makes it legal in the future.

race2tb t1_iv3xozn wrote on November 5, 2022 at 2:29 AM

Not going to matter in the longer term. These early models are still in their generative infancy. You will not be able to tell at all in the future where the code came from unless you ask for it verbatim from an known example.

[deleted] t1_iuz456p wrote on November 4, 2022 at 1:59 AM

[deleted]

pseudorandom_user t1_iuzm0en wrote on November 4, 2022 at 4:28 AM

I wonder if anyone is considering adding a clause to their open source license stating that it can't be used to train ai language models.

ReasonablyBadass t1_iuzpvgh wrote on November 4, 2022 at 5:10 AM

Wouldn't be mich better to state any AI derived code from this will automatically be Open Source?

chasingourselves t1_iv0gju2 wrote on November 4, 2022 at 11:15 AM

The problem is that many open source licenses are mutually incompatible— e.g. Affero GPL vs BSD vs GPLv3. So you’d need per-snippet code licensing.

farmingvillein t1_iv1ye88 wrote on November 4, 2022 at 5:49 PM

Which seems like a solvable, albeit terribly painful, problem?

If this is the direction that things end up going, honestly this will ultimately only be massively in the favor of OpenAI (and a small # of very well-funded competitors), as it will create a very, very painful barrier to entry.

pm_me_your_ensembles t1_iv0ea9x wrote on November 4, 2022 at 10:50 AM

You can opt out of Github's source collecting program.

[deleted] t1_iv3f3a3 wrote on November 4, 2022 at 11:58 PM

[removed]

multiedge t1_iwibyuo wrote on November 15, 2022 at 9:11 PM

I could do that, then I would also create several github accounts and reuse my licensed code to be picked up by copilot. I could then sue microsoft for using my code >:)

multiedge t1_iwibhdi wrote on November 15, 2022 at 9:08 PM

Just repeating what I said somewhere else, but there's also a chance this could be a scam as well. I mean, I could license my code and publish it on github, then create another several github accounts and reuse that licensed code to be picked up by Copilot. I will then have legal grounds to sue them for using my licensed code.

multiedge t1_iwibp9t wrote on November 15, 2022 at 9:09 PM

The bigger issue might be that there's movement from this anti-AI group of people who peddle irresponsible and uniformed stuff regarding AI. https://www.youtube.com/watch?v=IQJVWN_-jB8

themrzmaster t1_iv1r1vk wrote on November 4, 2022 at 5:02 PM

Your knowldge comes from open source projects. That does’nt mean you are violating licenses when you write “new” code.

killver t1_iv0z4kt wrote on November 4, 2022 at 1:56 PM

I am even more concerned that they send my non-public / proprietary code back to evaluate the responses and save it to improve the models. I still could not find a clear statement that they are not doing it.

FoundationPM t1_iuzw25q wrote on November 4, 2022 at 6:28 AM

Source code in Github should not be used to train a commercial product. Why would they do that? To benefit the human being? I doubt. It doesn't belong to Github even if under GNU/MIT or any licence.

[deleted] t1_iv0xvpx wrote on November 4, 2022 at 1:47 PM

[deleted]

FranciscoJ1618 t1_iuzn2va wrote on November 4, 2022 at 4:39 AM

The end of programmers is very close, but I think this was going to happen regardless of AI. Programming communities have always acted against their own self-interest, with some kind of cult mindset and ignoring basic economics rules, in particular those related to free software (software libre). They'll learn the hard way that more programmers = lower salary and sharing your source code was a very stupid idea.

Comments