Those claiming AI training on copyrighted works is “theft” misunderstand key aspects of copyright law and AI technology. Copyright protects specific expressions of ideas, not the ideas themselves. When AI systems ingest copyrighted works, they’re extracting general patterns and concepts - the “Bob Dylan-ness” or “Hemingway-ness” - not copying specific text or images.
This process is akin to how humans learn by reading widely and absorbing styles and techniques, rather than memorizing and reproducing exact passages. The AI discards the original text, keeping only abstract representations in “vector space”. When generating new content, the AI isn’t recreating copyrighted works, but producing new expressions inspired by the concepts it’s learned.
This is fundamentally different from copying a book or song. It’s more like the long-standing artistic tradition of being influenced by others’ work. The law has always recognized that ideas themselves can’t be owned - only particular expressions of them.
Moreover, there’s precedent for this kind of use being considered “transformative” and thus fair use. The Google Books project, which scanned millions of books to create a searchable index, was ruled legal despite protests from authors and publishers. AI training is arguably even more transformative.
While it’s understandable that creators feel uneasy about this new technology, labeling it “theft” is both legally and technically inaccurate. We may need new ways to support and compensate creators in the AI age, but that doesn’t make the current use of copyrighted works for AI training illegal or unethical.
For those interested, this argument is nicely laid out by Damien Riehl in FLOSS Weekly episode 744. https://twit.tv/shows/floss-weekly/episodes/744
If they can base their business on stealing, then we can steal their AI services, right?
Pirating isn’t stealing but yes the collective works of humanity should belong to humanity, not some slimy cabal of venture capitalists.
Unlike regular piracy, accessing “their” product hosted on their servers using their power and compute is pretty clearly theft. Morally correct theft that I wholeheartedly support, but theft nonetheless.
Is that how this technology works? I’m not the most knowledgeable about tech stuff honestly (at least by Lemmy standards).
There are self-hosted LLMs (e.g. via Ollama), but for the purposes of this conversation, yeah - they’re centrally hosted, compute-intensive software services.
Also, ingredients to a recipe aren’t covered under copyright law.
Ingredients to a recipe may well be subject to copyright, which is why food writers make sure their recipes are “unique” in some small way, different enough to avoid accusations of direct plagiarism.
E: removed unnecessary snark
This process is akin to how humans learn by reading widely and absorbing styles and techniques, rather than memorizing and reproducing exact passages.
Machine learning algorithms are not people and are not ingesting these works the same way a person does. This argument is brought up all the time and just doesn’t ring true. You’re defending the unethical use of copyrighted works by a giant corporation with a metaphor that doesn’t have any bearing on reality; in an age where artists are already shamefully undervalued. Creating art is a human process with the express intent of it being enjoyed by other humans. Having an algorithm do it is removing the most important part of art; the humanity.
If ChatGPT was free I might see their point but it’s not so no. If you’re making money from someone’s work you should pay them.
You’re making an indie movie on your iPhone with friends. You sell one ticket. You now owe: Apple, Joseph Nicéphore Niépce’s estate (inventor of the camera), every cinematographer who first devised the type of shots you’re using, the writers since the beginning of time that created the types of story elements in the script, the mathematicians and scientists that developed lens technology, the car manufacturers that aided your ability to transport you to the set, the guy whose YouTube tutorial you watched to figure out lighting, etc, etc, etc.
Your black and white framing appears to provide a clear ethical framework until you dig a millimeter into it. The reality is that society only exists because of the work that all of the individuals within it produce. Things like copyright are an adapter to our capitalistic economy, ensuring that people’s work that can be copied is protected enough that they have the opportunity to make money off of it. It exists so somebody else can’t immediately turn around and sell the same book someone else wrote, or just change a few words and do the same. This protection was meant to last 15 to 20 years, and then the work would enter the public domain for anyone to copy and rewrite as they please.
Current copyright is an utter bastardization of its intended use. Massive corporations are trying to act like they’re fighting for the little guy to own their IP forever. But they buy up all that IP for pennies compared to how they turn around and commoditize it. Then they own all of what society produces in perpetuity. They can sit on their dragon hoards and laugh as they gobble up any new creation that strays too close. And people wonder why everything is a sequel of a sequel of a sequel owned by massive corporations.
I was trying to keep it simple.
I would have paid them by purchasing the iPhone and whatever software I used. I paid for the car that transported me. I would have paid for my education. People can also give their work away for free if they want, or be compensated by ads as in the case of YouTube or FOSS.
Current copyright is an utter bastardization of its intended use. Massive corporations are trying to act like they’re fighting for the little guy to own their IP forever. But they buy up all that IP for pennies compared to how they turn around and commoditize it. Then they own all of what society produces in perpetuity. They can sit on their dragon hoards and laugh as they gobble up any new creation that strays too close. And people wonder why everything is a sequel of a sequel of a sequel owned by massive corporations.
What do you think ChatGPT is trying to do? It’s already being used to churn out shitloads of garbage content. They’re not making things better.
By that rationalization, OpenAI is paying their Internet bill, and for a copy of Dune, so they’re free to use any content they acquired to make their product better. Your original argument wasn’t akin to, “Shouldn’t someone using an iPhone pay for one?” It was “Shouldn’t Apple get a cut of everything made with the iPhone?”
You could make the argument that people use ChatGPT to churn out garbage content, sure, but a lot of cinephiles would accuse your proverbial indie movie of being the same and blame Apple for creating the iPhone and enabling it. If you want to make that argument, go ahead. But don’t pretend it has anything to do with people getting paid fairly for what they made.
ChatGPT is enabling people to make more things, easier, to get paid. And people, as always, are relying on everything that was created before them as a basis for their work. Same as when I go to school and the professor shows me lots of different works to learn from. The thousands of students in that class didn’t pay for any of that stuff. The professor distilled it and presented it and I paid him to do it.
The problem is that they didn’t pay for the content they’ve acquired and they’re selling it to others. The creators are not being compensated and may not want to participate in AI development at all. If the creators agree to it then fine but most do not. Just look at what’s happening with art. People are scraping all of an artist’s work to create AI pictures in their style and impersonate them. That’s not okay.
Look… All I have to say is… Support the Internet Archive!
(please)
This process is akin to how humans learn by reading widely and absorbing styles and techniques, rather than memorizing and reproducing exact passages.
Like fuck it is. An LLM “learns” by memorization and by breaking down training data into their component tokens, then calculating the weight between these tokens. This allows it to produce an output that resembles (but may or may not perfectly replicate) its training dataset, but produces no actual understanding or meaning–in other words, there’s no actual intelligence, just really, really fancy fuzzy math.
Meanwhile, a human learns by memorizing training data, but also by parsing the underlying meaning and breaking it down into the underlying concepts, and then by applying and testing those concepts, and mastering them through practice and repetition. Where an LLM would learn “2+2 = 4” by ingesting tens or hundreds of thousands of instances of the string “2+2 = 4” and calculating a strong relationship between the tokens “2+2,” “=,” and “4,” a human child would learn 2+2 = 4 by being given two apple slices, putting them down to another pair of apple slices, and counting the total number of apple slices to see that they now have 4 slices. (And then being given a treat of delicious apple slices.)
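To make the “calculating a relationship between tokens” part concrete, here’s a toy sketch. It is not how a real LLM works (those use neural networks with learned weights, not raw counts), and the whitespace tokenizer, the function names and the made-up corpus are all my own assumptions for illustration. It only shows that seeing a string many times makes its continuation the strongest prediction.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count how often each token is followed by each other token."""
    counts = defaultdict(Counter)
    for line in corpus:
        tokens = line.split()  # naive whitespace "tokenizer" (assumption)
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def predict_next(counts, token):
    """Return the most frequent follower of `token`, or None if unseen."""
    followers = counts.get(token)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical training data: the correct string dominates, a wrong one is rare.
corpus = ["2+2 = 4"] * 1000 + ["2+2 = 5"] * 3
model = train_bigram(corpus)

print(predict_next(model, "2+2"))  # the follower seen most often after "2+2"
print(predict_next(model, "="))    # the follower seen most often after "="
```

Nothing in there counts apple slices; the “answer” is just whichever token followed most often in the training strings.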
Similarly, a human learns to draw by starting with basic shapes, then moving on to anatomy, studying light and shadow, shading, and color theory, all the while applying each new concept to their work, and developing muscle memory to allow them to more easily draw the lines and shapes that they combine to form a whole picture. A human may learn off other people’s drawings during the process, but at most they may process a few thousand images. Meanwhile, an LLM learns to “draw” by ingesting millions of images–without obtaining the permission of the person or organization that created those images–and then breaking those images down to their component tokens, and calculating weights between those tokens. There’s about as much similarity between how an LLM “learns” compared to human learning as there is between my cat and my refrigerator.
And YET FUCKING AGAIN, here’s the fucking Google Books argument. To repeat: Google Books used a minimal portion of the copyrighted works, and was not building a service to compete with book publishers. Generative AI is using the ENTIRE COPYRIGHTED WORK for its training set, and is building a service TO DIRECTLY COMPETE WITH THE ORGANIZATIONS WHOSE WORKS THEY ARE USING. They have zero fucking relevance to one another as far as claims of fair use. I am sick and fucking tired of hearing about Google Books.
If you put a gazillion monkeys on a typewriter they can write Shakespeare.
If you train one ai for a ton of epochs it can write Shakespeare.
All pure mathematical coincidence.
If you put a gazillion monkeys on a typewriter they can write Shakespeare.
This is a mathematical curiosity borne out of pure randomness. An LLM trained on a dataset to generate similar content is quite the opposite of randomness.
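For a sense of scale, here’s a rough back-of-the-envelope sketch of the “pure randomness” side. It assumes a hypothetical 27-key typewriter (26 letters plus space) with every key equally likely; the function name and the phrase are just for illustration.

```python
from fractions import Fraction


def random_typing_probability(text, alphabet_size=27):
    """Chance that uniformly random keystrokes produce `text` in one attempt."""
    return Fraction(1, alphabet_size) ** len(text)


phrase = "to be or not to be"  # 18 characters including spaces
p = random_typing_probability(phrase)

# Exact value is 1 / 27**18; print as a float to see the order of magnitude.
print(float(p))
```

Every extra character divides the odds by another 27, which is why the monkeys need a gazillion tries. A trained model is built to do the opposite: put most of its probability on sequences that look like its training data.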
The whole point of copyright in the first place, is to encourage creative expression, so we can have human culture and shit.
The idea of a “teensy” exception so that we can “advance” into a dark age of creative pointlessness and regurgitated slop, where humans doing the fun part has been made “unnecessary” by the unstoppable progress of “thinking” machines, would be hilarious, if it weren’t depressing as fuck.
The whole point of copyright in the first place, is to encourage creative expression
…within a capitalistic framework.
Humans are creative creatures and will express themselves regardless of economic incentives. We don’t have to transmute ideas into capital just because they have “value”.
Sorry buddy, but that capitalistic framework is where we all have to exist for the foreseeable future.
Giving corporations more power is not going to help us end that.
I don’t think they’re advocating for more capitalism.
I’d agree, but here’s one issue with that: we live in reality, not in a post-capitalist dreamworld.
Creativity takes up a lot of time from the individual, while a lot of us are already working two or even three jobs, all on top of art. A lot of us have to heavily compromise on a lot of things, or even give up our dreams because we don’t have the time for that. Sure, you get the occasional “legendary metal guitarist practiced so much he even went to the toilet with a guitar”, but many are so tired from their main job, they instead just give up.
Developing a game while having a full-time job feels like crunching 24/7, while only around 4 hours actually go towards that goal, which includes work done on my smartphone at my job. Others just outright give up. This shouldn’t be the norm for up and coming artists.
Honestly, that’s why open source AI is such a good thing for small creatives. Hate it or love it, anyone wielding AI with the intention to make new expression will be much more safe and efficient to succeed until they can grow big enough to hire a team with specialists. People often look at those at the top but ignore the things that can grow from the bottom and actually create more creative expression.
One issue is that many open source AI projects also try to ape whatever the big ones are doing at the moment, the most outrageous example being one that generates a fake timelapse for AI art.
There are also tools that were created specifically with artists in mind, but they’re less popular because the average person can’t use them as easily as the prompter machines, and they don’t promise the end of “people with fake jobs” (boomers like generative AI for this reason).
Humans are indeed creative by nature, we like making things. What we don’t naturally do is publish, broadcast and preserve our work.
Society is iterative. What we build today, we build mostly out of what those who came before us built. We tell our versions of our forefathers’ stories, we build new and improved versions of our forefathers’ machines.
A purely capitalistic society would have infinite copyright and patent durations: this idea is mine, it belongs to me, no one can ever have it, my family and only my family will profit from it forever. Nothing ever improves because improving on an old idea devalues the old idea, and the landed gentry can’t allow that.
A purely communist society immediately enters whatever anyone creates into the public domain. The guy who revolutionizes energy production making everyone’s lives better is paid the same as a janitor. So why go through all the effort? Just sweep the floors.
At least as designed, our idea of copyright is a compromise. If you have an idea, we will grant you a limited time to exclusively profit from your idea. You may allow others to also profit at your discretion; you can grant licenses, but that’s up to you. After the time is up, your idea enters the public domain, and becomes the property and heritage of humanity, just like the Epic of Gilgamesh. Others are free to reproduce and iterate upon your ideas.
I think you have your janitor example backwards. Spending my time revolutionizing energy production sounds much more enjoyable than sweeping floors. Same with designing an effective floor sweeping robot.
That’s the reason we got copyright, but I don’t think that’s the only reason we could want copyright.
Two good reasons to want copyright:
- Accurate attribution
- Faithful reproduction
Accurate attribution:
Open source thrives on the notion that: if there’s a new problem to be solved, and it requires a new way of thinking to solve it, someone will start a project whose goal is not just to build new tools to solve the problem but also to attract other people who want to think about the problem together.
If anyone can take the codebase and pretend to be the original author, that will splinter the conversation and degrade the ability of everyone to find each other and collaborate.
In the past, this was pretty much impossible because you could check a search engine or social media to find the truth. But with enshittification and bots at every turn, that looks less and less guaranteed.
Faithful reproduction:
If I write a book and make some controversial claims, yet it still provokes a lot of interest, people might be inclined to publish slightly different versions to advance their own opinions.
Maybe a version where I seem to be making an abhorrent argument, in an effort to mitigate my influence. Maybe a version where I make an argument that the rogue publisher finds more palatable, to use my popularity to boost their own arguments.
This actually happened during the early days of publishing, by the way! It’s part of the reason we got copyright in the first place.
And again, it seems like this would be impossible to get away with now, buuut… I’m not so sure anymore.
—
Personally:
I favor piracy in the sense that I think everyone has a right to witness culture even if they can’t afford the price of admission.
And I favor remixing because the cultural conversation should be an active read-write two-way street, not just passive consumption.
But I also favor some form of licensing, because I think we have a duty to respect the integrity of the work and the voice of the creator.
I think AI training is very different from piracy. I’ve never downloaded a mega pack of songs and said to my friends “Listen to what I made!” I think anyone who compares OpenAI to pirates (favorably) is unwittingly helping the next set of feudal tech lords build a wall around the entirety of human creativity, and they won’t realize their mistake until the real toll booths open up.
I think AI training is very different from piracy. I’ve never downloaded a mega pack of songs and said to my friends “Listen to what I made!”
I’ve never done this. But I have taken lessons from people for instruments, listened to bands I like, and then created and played songs that certainly are influenced by all of that. I’ve also taken a lot of art classes, and studied other people’s painting styles and then created things from what I’ve learned, and said “look at what I made!” Which is far more akin to what AI is doing than what you are implying here.
So what if it’s closer? It’s still not an accurate description, because that’s not what AI does.
Because what they are describing is just straight up theft, while what I described is so much closer to how one trains an AI. I’m afraid that what comes out of this AI hysteria is that copyright gets more strict and even humans copying a style becomes illegal.
Well that all doesn’t matter much. If AI is used to cause harm, it should be regulated. If that frustrates you then go get the laws changed that allow shitty companies to ruin good ideas.
I’m sympathetic to the reflexive impulse to defend OpenAI out of a fear that this whole thing results in even worse copyright law.
I, too, think copyright law is already smothering the cultural conversation and we’re potentially only a couple of legislative acts away from having “property of Disney” emblazoned on our eyeballs.
But don’t fall into their trap of seeing everything through the lens of copyright!
We have other laws!
We can attack OpenAI on antitrust, likeness rights, libel, privacy, and labor laws.
Being critical of OpenAI doesn’t have to mean siding with the big IP bosses. Don’t accept that framing.
Disagree. These companies are exploiting an unfair power dynamic they created that people can’t say no to, to make an ungodly amount of money for themselves without compensating people whose data they took without telling them. They are not creating a cool creative project that collaboratively comments on or remixes what other people have made, they are seeking to gobble up and render irrelevant everything that they can, for short term greed. That’s not the scenario these laws were made for. AI hurts people who have already been exploited and industries that have already been decimated. Copyright laws were not written with this kind of thing in mind. There are potentially cool and ethical uses for AI models, but OpenAI and Google are just greed machines.
Edited * THRICE because spelling. oof.
You drank the kool-aid.
This process is akin to how humans learn by reading widely and absorbing styles and techniques, rather than memorizing and reproducing exact passages.
Many people quote this part saying that this is not the case and this is the main reason why the argument is not valid.
Let’s take a step back and not put in discussion how current “AI” learns vs how human learn.
The key point for me here is that humans DO PAY (or at least are expected to…) to use and learn from copyrighted material. So if we’re equating “AI” method of learning with humans’, both should be subject to the same rules and regulations. Meaning that “AI” should pay for using copyrighted material.
Do we expect people to pay to learn from copyrighted but freely accessible works?
In general — yes. Most of the time they do so by subjecting their eyeballs or ears to ads. Do you think it’s a good idea to flood AI models with ads as well?
don’t humans normally use adblockers? Or the library?
The vast majority do not. We’re in a pretty tech savvy bubble here on Lemmy.
Point is that accessing a website with an adblocker has never been considered a copyright violation.
As others have said, it isn’t inspired always, sometimes it literally just copies stuff.
This feels like it was written by someone who invested their money in AI companies because they’re worried about their stocks
It doesn’t copy. It abstracts the works into math, finds relationships between them, and then comes back with something built from those relationships.
It’s not the same as what humans do, but it’s not just copying either.
It’s pretty much copying lol. It has no idea about patents, or unique ideas. It basically just takes every unique idea and pretends it invented them because it doesn’t understand
And that’s the problem
It’s basically just something that tries to take credit for everything.
The problem with your argument is that it is 100% possible to get ChatGPT to produce verbatim extracts of copyrighted works. This has been suppressed by OpenAI in a rather brute force kind of way, by prohibiting the prompts that have been found so far to do this (e.g. the infamous “poetry poetry poetry…” ad infinitum hack), but the possibility is still there, no matter how much they try to plaster over it. In fact there are some people, much smarter than me, who see technical similarities between compression technology and the process of training an LLM, calling it a “blurry JPEG of the Internet”… the point being, you wouldn’t allow distribution of a copyrighted book just because you compressed it in a ZIP file first.
ML techniques have been very useful in compression, yes, but it’s sort of nuts to say that a data structure is storing particularized creative works in a particularly identifiable manner when it encodes only relationships between values (sometimes overly so for certain regions of its latent space/embedding space/semantic space/whatever you want to call it right now) rather than the value sequences themselves, i.e. contiguous copyright-protected works.
Except that, again, as is literally written in the comment you’re directly replying to, it has been shown that AI can reproduce copyrightable works word for word, showing that it objectively and necessarily is storing particular creative works in a particularly identifiable manner, whether or not that manner is yet known to humans.
No, it isn’t storing that information in that sequence. What is happening is that it is overly encoding those particular sequential relationships along some arbitrary but tightly mapped semantic concepts represented by dimensions in a massive vector space. It is storing copies of the information in the way that inadvertent copying of music might be based on “memorized” music listened to by the infringing artist in the past.
Not what I said. I used the exact language the above commenter used because it was specific and accurate. Also, inadvertent copyright violation is still copyright violation under US law. I’m not the biggest fan of every application of that law, but the ability to keep large corporations from ripping off small artists and creators is one that I think is good and useful under the global economic system we live under currently.
Yes, inadvertent copying is still copying, but it would be copying in the output and is not evidence of copying happening in the creation of the model. That was why I used the music example, because it is rather probative of where there could be grounds for copyright infringement related to these model architectures. This may not seem an important distinction, but it has significant consequences on who is ultimately liable and how.
This would be a good point, if this is what the explicit purpose of the AI was. Which it isn’t. It can quote certain information verbatim despite not containing that data verbatim, through the process of learning, for the same reason we can.
I can ask you to quote famous lines from books all day as well. That doesn’t mean that you knowing those lines means you infringed on copyright. Now, if you were to put those to paper and sell them, you might get a cease and desist or a lawsuit. Therein lies the difference. Your goal would be explicitly to infringe on the specific expression of those words. Any human that would explicitly try to get an AI to produce infringing material… would be infringing. And unknowing infringement… well there are countless court cases where both sides think they did nothing wrong.
You don’t even need AI for that, if you followed the Infinite Monkey Theorem and just happened to stumble upon a work falling under copyright, you still could not sell it even if it was produced by a purely random process.
Another great example is the Mona Lisa. Most people know what it looks like and if they had sufficient talent could mimic it 1:1. However, there are numerous adaptations of the Mona Lisa that are not infringing (by today’s standards), because they transform the work to the point where it’s no longer the original expression, but a re-expression of the same idea. Anything less than that is pretty much completely safe infringement wise.
You’re right though that OpenAI tries to cover their ass by implementing safeguards. Which is to be expected because it’s a legal argument in court that once they became aware of situations they have to take steps to limit harm. They can indeed not prevent it completely, but it’s the effort that counts. Practically none of that kind of moderation is 100% effective. Otherwise we’d live in a pretty good world.
Y’all should really stop expecting people to buy into the analogy between human learning and machine learning i.e. “humans do it, so it’s okay if a computer does it too”. First of all there are vast differences between how humans learn and how machines “learn”, and second, it doesn’t matter anyway because there is lots of legal/moral precedent for not assigning the same rights to machines that are normally assigned to humans (for example, no intellectual property right has been granted to any synthetic media yet that I’m aware of).
That said, I agree that “the model contains a copy of the training data” is not a very good critique–a much stronger one would be to simply note all of the works with a Creative Commons “No Derivatives” license in the training data, since it is hard to argue that the model checkpoint isn’t derived from the training data.
a much stronger one would be to simply note all of the works with a Creative Commons “No Derivatives” license in the training data, since it is hard to argue that the model checkpoint isn’t derived from the training data.
Not really. First of all, creative commons strictly loosens the copyright restrictions on a work. The strongest license is actually no explicit license i.e. “All Rights Reserved.” No derivatives is already included under full, default, copyright.
Second, derivative has a pretty strict legal definition. It’s not enough to say that the derived work was created using a protected work, or even that the derived work couldn’t exist without the protected work. Some examples: create a word cloud of your favorite book, analyze the tone of news article to help you trade stocks, or produce an image containing the most prominent color in every frame of a movie, create a search index of the words found on all websites on the internet. All of that is absolutely allowed under even the strictest of copyright protections.
Statistical analysis of copyrighted materials, as in training AI, easily clears that same bar.
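For anyone who hasn’t seen what that kind of analysis looks like, here’s a minimal sketch of two of the examples above: the word counts behind a word cloud and a tiny search index. The tokenizer, the function names and the two placeholder “documents” are my own assumptions, not anyone’s actual pipeline.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(text):
    """Word counts, i.e. the raw data behind a word cloud."""
    return Counter(tokenize(text))


def build_index(docs):
    """Map each word to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


# Made-up placeholder documents, standing in for real works.
docs = {
    "doc1": "the cat sat on the mat",
    "doc2": "the dog sat on the log",
}

freqs = word_frequencies(docs["doc1"])
index = build_index(docs)

print(freqs.most_common(3))
print(index["sat"])
```

Both outputs are derived from the documents and couldn’t exist without them, but neither one lets you read the documents back in order.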
The problem with your argument is that it is 100% possible to get ChatGPT to produce verbatim extracts of copyrighted works.
Exactly! This is the core of the argument The New York Times made against OpenAI. And I think they are right.
The examples they provided were for very widely distributed stories (i.e. present in the data set many times over). The prompts they used were not provided. How many times they had to prompt was not provided. Their results are very difficult to reproduce, if not impossible, especially on newer models.
I mean, sure, it happens. But it’s not a generalizable problem. You’re not going to get it to regurgitate your Lemmy comment, even if they’ve trained on it. You can’t just go and ask it to write Harry Potter and the Goblet of Fire for you. It’s not the intended purpose of this technology. I expect it’ll largely be a solved problem in 5-10 years, if not sooner.
Fully agree. I understand why there are many technological doomers out there and I think AI may be the most deserving of a critical eye. But the immense benefits of being able to manufacture intelligence are undeniable. That NECESSITATES the AI being able to observe anything and everything in the world that it can. That’s how any known intelligence has ever learned and there’s no scientific basis for an intelligence coming into existence knowing everything about the world without it ever being taught about it.
Now I’ve heard a lot of criticism of AI. Some really legitimate concerns about their place in the future (and ours). As well as the ethics of this important technology originating in the private hands of mega corps that historically have not had our best interest at heart. But the VAST majority of criticism has been about how it’s not useful or is just an avenue for copyright abuse. Which at best, is just completely missing the point. But at worst, is the thinly veiled protests of people made very uncomfortable that the status quo is being upset.
Bullshit. AI are not human. We shouldn’t treat them as such. AI are not creative. They just regurgitate what they are trained on. We call what it does “learning”, but that doesn’t mean we should elevate what they do to be legally equal to human learning.
It’s this same kind of twisted logic that makes people think Corporations are People.
Here’s an experiment for you to try at home. Ask an AI model a question, copy a sentence or two of what they give back, and paste it into a search engine. The results may surprise you.
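If you’d rather do that check offline against a text you already have, a rough sketch is to count how many word sequences in the output also show up in the reference. The function names, the 5-word window and the sample strings below are my own assumptions; a search engine obviously does something far more involved.

```python
def ngrams(text, n):
    """Set of all n-word sequences in the text (lowercased, whitespace split)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def verbatim_overlap(output, reference, n=5):
    """Fraction of the output's n-grams that also appear in the reference."""
    out = ngrams(output, n)
    if not out:
        return 0.0  # output shorter than n words: nothing to compare
    return len(out & ngrams(reference, n)) / len(out)


# Made-up strings for illustration only.
reference = "the quick brown fox jumps over the lazy dog"
copied = "quick brown fox jumps over the lazy"
paraphrased = "a fast brown fox leaps over a sleepy dog"

print(verbatim_overlap(copied, reference))
print(verbatim_overlap(paraphrased, reference))
```

A score near 1 means the output is mostly lifted word for word; a score near 0 means that, at this window size, nothing matches. It says nothing about paraphrase or style.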
And stop comparing AI to humans but then giving AI models more freedom. If I wrote a paper I’d need to cite my sources. Where the fuck are your sources ChatGPT? Oh right, we’re not allowed to see that but you can take whatever you want from us. Sounds fair.
Can you just give us the TLDR?
AI Chat bots copy/paste much of their “training data” verbatim.
It’s not a breach of copyright or other IP law not to cite sources on your paper.
Getting your paper rejected for lacking sources is also not infringing on your freedom. Being forced to pay damages and delete your paper from any public space would be infringement of your freedom.
Why wouldn’t they charge their so many corporate customers more? They supposedly are providing their services to US government and military, just charge them extra and pay the publishers.
They intentionally keep their prices low to out-compete other companies and then complain about it. If they passed their actual costs on to their customers, you would see how quickly they would lose the market, because open source models would out-compete them.