How AI poisoning is fighting the bots that hoover up data without permission


Gone are the days when the web was dominated by humans posting social media updates or exchanging memes. Earlier this year, for the first time since such records have been kept, web-browsing bots, rather than humans, accounted for the bulk of web traffic.

Well over half of that bot traffic comes from malicious bots that, for instance, hoover up personal data left unprotected online. But an increasing proportion comes from bots sent out by artificial intelligence companies to gather data for their models or to respond to user prompts. Indeed, ChatGPT-User, a bot powering OpenAI’s ChatGPT, is now responsible for 6 per cent of all web traffic, while ClaudeBot, an automated crawler developed by AI company Anthropic, accounts for 13 per cent.


The AI companies say such data scraping is vital to keep their models up to date. Content creators feel differently, however, seeing AI bots as tools for copyright infringement on a grand scale. Earlier this year, for example, Disney and Universal sued AI company Midjourney, arguing that the tech firm’s image generator plagiarises characters from popular franchises like Star Wars and Despicable Me.

Few content creators have the money for lawsuits, so some are adopting more radical methods of fighting back. They are using online tools that make it harder for AI bots to find their content – or that manipulate it in a way that tricks bots into misreading it, so that the AI begins to confuse images of cars with images of cows, for example. But while this “AI poisoning” can help content creators protect their work, it might also inadvertently make the web a more dangerous place.

Copyright infringement

For centuries, copycats have made a quick profit by mimicking the work of artists. It is one reason why we have intellectual property and copyright laws. But the arrival in the past few years of AI image generators such as Midjourney or OpenAI’s DALL-E has supercharged the issue.

A central concern in the US is what is known as the fair use doctrine. This allows samples of copyrighted material to be used under certain conditions without requesting permission from the copyright holder. Fair use law is deliberately flexible, but at its heart is the idea that you can use an original work to create something new, provided it is altered enough and doesn’t have a detrimental market effect on the original work.

Many artists, musicians and other campaigners argue that AI tools are blurring the boundary between fair use and copyright infringement, to the detriment of content creators. For instance, it isn’t necessarily detrimental for someone to draw a picture of Mickey Mouse in, say, The Simpsons’ universe for their own entertainment. But with AI, anyone can now spin up large numbers of such images quickly, and in a manner whose transformative nature is questionable. Once those images exist, it would be easy to produce a range of T-shirts based on them, for example, which would cross from personal to commercial use and fall outside the protection of the fair use doctrine.

Keen to protect their commercial interests, some content creators in the US are taking legal action. The Disney and Universal lawsuit against Midjourney, launched in June, is just the latest example. Others include an ongoing legal battle between The New York Times and OpenAI over alleged unauthorised use of the newspaper’s stories.


Disney sued AI company Midjourney over its image generator, which they say plagiarises Disney characters

Photo 12/Alamy

The AI companies strongly deny any wrongdoing, insisting that data scraping is permissible under the fair use doctrine. In an open letter to the US Office of Science and Technology Policy in March, OpenAI’s chief global affairs officer, Chris Lehane, warned that rigid copyright rules elsewhere in the world, where there have been attempts to provide stronger copyright protections for content creators, “are repressing innovation and investment”. OpenAI has previously said it would be “impossible” to develop AI models that meet people’s needs without using copyrighted work. Google takes a similar view. In an open letter also published in March, the company said, “Three areas of law can impede appropriate access to data necessary for training leading models: copyright, privacy, and patents.”

However, at least for the moment, it seems the campaigners have the court of public opinion on their side. When the site IPWatchdog analysed public responses to an inquiry about copyright and AI by the US Copyright Office, it found that 91 per cent of comments contained negative sentiments about AI.

What may not help AI firms gain public sympathy is a suspicion that their bots are sending so much traffic to some websites that they are straining resources and perhaps even forcing some websites to go offline – and that content creators are powerless to stop them. There are techniques content creators can use to opt out of having bots crawl their websites, such as editing a small file at the root of the site, known as robots.txt, to state that bots are banned. But there are indications that bots can sometimes disregard such requests and continue crawling anyway.
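
For illustration, a site owner wanting to turn such crawlers away might add directives along the following lines to robots.txt. The user-agent names shown here (GPTBot for OpenAI, ClaudeBot for Anthropic) are the tokens those crawlers are generally understood to identify themselves with, though the exact list a given site needs may differ:

    # Ask named AI crawlers not to fetch anything from this site
    User-agent: GPTBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    # Everyone else may crawl as normal
    User-agent: *
    Allow: /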

AI data poisoning

It is little wonder, then, that new tools are being made available to content creators that offer stronger protection against AI bots. One such tool was launched this year by Cloudflare, an internet infrastructure company that provides its users protection against distributed denial-of-service (DDoS) attacks, in which an attacker floods a web server with so much traffic that it knocks the site itself offline. To combat AI bots that may pose their own DDoS-like risk, Cloudflare is fighting fire with fire: it produces a maze of AI-generated pages full of nonsense content so that AI bots expend all their time and energy looking at the nonsense, rather than the actual information they seek.

The tool, known as AI Labyrinth, is designed to trap the 50 billion requests a day from AI crawlers that Cloudflare says it encounters on the websites within its network. According to Cloudflare, AI Labyrinth should “slow down, confuse, and waste the resources of AI crawlers and other bots that don’t respect ‘no crawl’ directives”. Cloudflare has since released another tool, which asks AI companies to pay to access websites or else be blocked from crawling their content.
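
As a rough sketch of the decoy-maze idea (this is not Cloudflare’s implementation; the bot check and the filler text below are placeholders), a web server can route suspected AI crawlers into an endless series of generated pages whose links lead only to more generated pages:

    # Minimal sketch of a decoy "labyrinth" for AI crawlers. This is not
    # Cloudflare's implementation: the bot check and filler text below are
    # placeholders, and a real deployment would detect bots far more carefully.
    import random
    from flask import Flask, request

    app = Flask(__name__)

    def looks_like_ai_crawler(user_agent: str) -> bool:
        # Placeholder check based on a few publicly documented crawler names.
        return any(bot in user_agent for bot in ("GPTBot", "ClaudeBot", "CCBot"))

    def nonsense_page(depth: int) -> str:
        # Every decoy page links only to more decoy pages, so a crawler that
        # follows links digs itself deeper instead of reaching real content.
        words = ["lorem", "ipsum", "quantum", "turnip", "zephyr", "ballast"]
        filler = " ".join(random.choices(words, k=200))
        links = "".join(
            f'<a href="/maze/{depth + 1}/{random.randint(0, 9999)}">more</a> '
            for _ in range(5)
        )
        return f"<html><body><p>{filler}</p>{links}</body></html>"

    @app.route("/")
    @app.route("/maze/<int:depth>/<int:page_id>")
    def serve(depth: int = 0, page_id: int = 0):
        # Humans get the real page; suspected crawlers get the maze.
        if looks_like_ai_crawler(request.headers.get("User-Agent", "")):
            return nonsense_page(depth)
        return "<html><body><p>Real content for human visitors.</p></body></html>"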


An alternative is to allow the AI bots access to online content – but to subtly “poison” it in such a way that it renders the data less useful for the bot’s purposes. The tools Glaze and Nightshade, developed at the University of Chicago, have become central to this form of resistance. Both are free to download from the university’s website and can run on a user’s computer.

Glaze, released in 2022, functions defensively, applying imperceptible, pixel-level alterations, or “style cloaks”, to an artist’s work. These changes, invisible to humans, cause AI models to misinterpret the art’s style. For example, a watercolour painting might be perceived as an oil painting. Nightshade, published in 2023, is a more offensive tool that poisons image data – again, imperceptibly as far as humans are concerned – in a way that encourages an AI model to make an incorrect association, such as learning to link the word “dog” with images of cats. Both tools have been downloaded more than 10 million times.

The Nightshade tool gradually poisons the training data of image generators such as Stable Diffusion XL so that, when prompted for a dog, they produce images of cats

Ben Y. Zhao
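
To give a flavour of the underlying idea (though not the actual Glaze or Nightshade algorithms, which are considerably more sophisticated), a crude “cloak” can be computed by nudging an image’s pixels, within a tiny budget, so that a vision model’s internal features drift towards those of a differently styled image. In this sketch, the model choice, step sizes and file names are arbitrary stand-ins:

    # Toy illustration of a pixel-level "style cloak". This is NOT the Glaze or
    # Nightshade algorithm, only the adversarial-perturbation idea they build on.
    # The model, step sizes and file names are arbitrary choices for this sketch.
    import torch
    import torchvision.models as models
    import torchvision.transforms as T
    from PIL import Image

    extractor = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    extractor.fc = torch.nn.Identity()   # keep the penultimate feature vector
    extractor.eval()

    prep = T.Compose([T.Resize((224, 224)), T.ToTensor()])
    artwork = prep(Image.open("artwork.png").convert("RGB")).unsqueeze(0)    # image to protect (hypothetical file)
    decoy = prep(Image.open("other_style.png").convert("RGB")).unsqueeze(0)  # differently styled image (hypothetical file)

    eps = 4 / 255                        # cap on per-pixel change, small enough to be hard to see
    delta = torch.zeros_like(artwork, requires_grad=True)
    target_features = extractor(decoy).detach()

    for _ in range(50):
        # Push the cloaked image's features towards the decoy's features...
        loss = torch.nn.functional.mse_loss(extractor(artwork + delta), target_features)
        loss.backward()
        with torch.no_grad():
            delta -= (0.5 / 255) * delta.grad.sign()
            delta.clamp_(-eps, eps)      # ...while keeping the pixel changes tiny
            delta.grad.zero_()

    cloaked = (artwork + delta).detach().clamp(0, 1)
    T.ToPILImage()(cloaked.squeeze(0)).save("artwork_cloaked.png")

Scaled up across many images, and with far more careful optimisation, this mismatch between what humans see and what models learn is the kind of effect Glaze and Nightshade exploit.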

The AI poisoning tools put power back in the hands of artists, says Ben Zhao at the University of Chicago, who is the senior researcher behind both Glaze and Nightshade. “These are trillion-dollar market-cap companies, literally the biggest companies in the world, taking by force what they want,” he says.

Using tools like Zhao’s is a way for artists to exert the little power they have over how their work is used. “Glaze and Nightshade are really interesting, cool tools that show a neat method of action that doesn’t rely on changing regulations, which can take a while and might not be a place of advantage for artists,” says Jacob Hoffman-Andrews at the Electronic Frontier Foundation, a US-based digital rights non-profit.

The idea of self-sabotaging content to try to ward off alleged copycats isn’t new, says Eleonora Rosati at Stockholm University in Sweden. “Back in the day, when there was a large unauthorised use of databases – from telephone directories to patent lists – it was advised to put in some errors to help you out in terms of evidence,” she says. For instance, a cartographer might deliberately include false place names on their maps. If those false names later appeared in a map produced by a rival, that would provide clear evidence of plagiarism. The practice still makes headlines today: music lyrics website Genius claimed to have inserted different types of apostrophes into its lyrics transcriptions, which it alleged showed that Google had been using its content without permission. Google denies the allegations, and a court case filed by Genius against Google was dismissed.

Even calling it “sabotage” is debatable, according to Hoffman-Andrews. “I don’t think of it as sabotage necessarily,” he says. “These are the artist’s own images that they are applying their own edits to. They’re fully free to do what they want with their data.”

It is unknown to what extent AI companies are taking countermeasures against this poisoning of the well, whether by ignoring content that is marked as poisoned or by trying to strip the poison from the data. But Zhao’s attempts to break his own system showed that Glaze remained 85 per cent effective against every countermeasure he could devise, suggesting that AI companies may conclude that dealing with poisoned data is more trouble than it’s worth.

Spreading fake news

However, it isn’t just artists with content to protect who are experimenting with poisoning the well against AI. Some nation-states may be using similar principles to push false narratives. For instance, US-based think tank the Atlantic Council claimed earlier this year that Russia’s Pravda news network – whose name means “truth” in Russian – has used poisoning to trick AI bots into disseminating fake news stories.

Pravda’s approach, as alleged by the think tank, involves posting millions of web pages, somewhat as Cloudflare’s AI Labyrinth does. But in this case, the Atlantic Council says, the pages are designed to look like real news articles and are being used to promote the Kremlin’s narrative about Russia’s war in Ukraine. The sheer volume of stories could lead AI crawlers to over-emphasise certain narratives when responding to users: an analysis published this year by US technology firm NewsGuard, which tracks Pravda’s activities, found that 10 major AI chatbots produced text in line with Pravda’s views in a third of cases.


The relative success of such efforts in shifting what chatbots say highlights an inherent problem with all things AI: technology tricks used by actors with good intentions can always be co-opted by bad actors with nefarious goals.

There is, however, a solution to these problems, says Zhao – although it may not be one that AI companies are willing to consider. Instead of indiscriminately gathering whatever data they can find online, AI companies could enter into formal agreements with legitimate content providers and ensure that their products are trained using only reliable data. But this approach carries a price, because licensing agreements can be costly. “These companies are unwilling to license these artists’ works,” says Zhao. “At the root of all this is money.”

Topics: artificial intelligence, ChatGPT