A guide to AI art for artists
(If I've got anything wildly wrong, please let me know and I'll update this post.)
When AI art first hit the web I was amazed by the technology. Then later, when it came out that these image generators were trained on images by living artists scraped from the public web with no consent or compensation, my opinion of it was soured. It took a lot of effort for me to push past that distaste in order to properly research the technology so that I could help myself and others to understand it. This is why I’m compiling all the information I’ve found here. I hope you find it helpful.
Terminology
To start off, there are a lot of different terms out there when it comes to AI nowadays so I’m going to try to define some of them so you can understand what people mean when they use them (and so you can tell when they’re full of shit).
AI
Artificial Intelligence. AI is a big buzzword right now in the tech sector and at times feels like it’s being thrown at anything and everything just to attract investors. Cambridge Dictionary defines it as:
the use or study of computer systems or machines that have some of the qualities that the human brain has, such as the ability to interpret and produce language in a way that seems human, recognize or create images, solve problems, and learn from data supplied to them
It’s kind of what it says on the tin - an artificial, that is, human-created system that has abilities similar to those of intelligent life forms. (I’d argue comparing the abilities of AI solely to those of humans does a disservice to the intelligence of many non-human animals but I digress.)
At the moment, when you read about AI online or in the news, the term is most likely being used to refer to machine learning, which is a type of AI.
Algorithm
The word algorithm describes a process based on a set of instructions or rules used to find a solution to a problem. The term is used in maths as well as computing. For example, the process used to convert a temperature from Fahrenheit to Celsius is a kind of algorithm:
subtract 32
divide by 9
multiply by 5
These instructions must be performed in this specific order.
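To make the idea concrete, here is a minimal sketch of those three steps as a Python function (the function name and test value are just illustrative):

def fahrenheit_to_celsius(f):
    # Apply the steps in order: subtract 32, divide by 9, multiply by 5.
    return (f - 32) / 9 * 5

print(fahrenheit_to_celsius(212))  # prints 100.0, the boiling point of water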
Nowadays on social media “the algorithm” is used to refer to a specific kind of algorithm - a recommendation algorithm - which is a kind of machine learning algorithm.
Machine Learning
Machine learning is a term used to refer to the use of a computer algorithm to perform statistical analysis of data (often very large amounts of it) to produce outputs, whether these are images, text or other kinds of data. Social media recommendation algorithms collect data on the kind of content a user has looked at or interacted with before and use this to predict what other content they might like.
I’ll explain it in very simple terms with an analogy. Consider a maths problem where you have to work out the next number in a sequence. If you have the sequence 2, 4, 6, 8, 10 you can predict that the next number would be 12 based on the preceding numbers each having a difference of 2. When you analyse the data (the sequence of numbers) you can identify a pattern (add 2 each time) then apply that pattern to work out the next number (add 2 to 10 to get 12).
In practice, the kind of analysis machine learning algorithms do is much more complex (social media posts aren’t numbers and don’t have simple relationships with each other like adding or subtracting) but the principle is the same. Work out the pattern in the data and you can then extrapolate from it.
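As a toy illustration of "find the pattern, then extrapolate", here is a sketch that fits a straight line to the number sequence above and predicts the next value. NumPy's polyfit stands in for the far more complex statistical fitting real systems do:

import numpy as np

sequence = [2, 4, 6, 8, 10]

# "Training": fit a degree-1 polynomial (a straight line) to the data we have.
slope, intercept = np.polyfit(range(len(sequence)), sequence, 1)

# "Prediction": extrapolate the learned pattern one step further.
next_value = slope * len(sequence) + intercept
print(round(next_value))  # 12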
The big downside to these algorithms is that, since the rules behind their decision making are not explicitly programmed but are instead derived from data, it can be difficult to figure out why they produce the outputs they do, making them a kind of "black box" system. And the more data a machine learning algorithm is trained on, the harder it becomes for humans to reason about its outputs.
Training Data and Models
Another term you'll come across is "training", or talk of how an AI is "trained". Training data refers to the data used to train the model. Training is the statistical analysis and pattern recognition I talked about above: it transforms a dataset (collections of images and text) into a statistical model that works like a computer program, taking inputs (a text prompt) and producing outputs (images).
As a general rule, the bigger the dataset used for training, the more accurate the outputs of the resulting model. Once a model is created, its data is stored in a completely different format to that of the training data, and the model is many orders of magnitude smaller than the original dataset.
Text-to-image model AKA AI image generator, generative AI
Text-to-image model is the technical term for these AI image generators:
DALL-E (OpenAI)
Midjourney
Adobe Firefly
Google Imagen
Stable Diffusion (Stability AI)
The technology uses a type of machine learning called deep learning (I won't go into this here. If you'd like to read more, good luck; it's very technical). The term text-to-image is simple enough: given a text prompt, the model will generate an image to match the description.
Stable Diffusion
Stable Diffusion is different from other image generators in that its source code and model weights are publicly available. Anyone with the right skills and hardware can run it themselves. I don't think I'd be incorrect in saying that this is the main reason AI art has become so widespread online since Stable Diffusion's release in 2022. For better or worse, open-sourcing this code has democratised AI image generation.
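To give a rough idea of what "anyone can run it" looks like in practice, here is a minimal sketch using the open-source diffusers library. The model name, prompt and hardware assumptions (a CUDA GPU with a few GB of VRAM) are illustrative only:

import torch
from diffusers import StableDiffusionPipeline

# Download the model weights and move them to the GPU.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# Generate one image from a text prompt and save it.
image = pipe("a watercolour painting of a lighthouse at dusk").images[0]
image.save("lighthouse.png")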
I won't go deep into how Stable Diffusion actually works because I don't really understand it myself, but I will talk about the process of acquiring training data and training the models it uses to generate images.
What data is used?
I already talked about training data but what actually is it? And where does it come from? In order to answer this I’m going to take you down several rabbit holes.
LAION-5B
Taking Stable Diffusion as an example, it uses models trained on various datasets made available by the German non-profit research group LAION (Large-scale Artificial Intelligence Open Network). The biggest of these is LAION-5B, which is split by language into several smaller subsets of up to roughly 2 billion images each. LAION describes LAION-5B as "a dataset of 5,85 billion CLIP-filtered image-text pairs". Okay. What does "CLIP-filtered image-text pairs" mean?
CLIP
OpenAI's CLIP (Contrastive Language-Image Pre-training) is (you guessed it) another machine learning model, one trained to match images with the text that correctly describes them. Given an image of a dog, it should match that image with the word "dog". When an image is analysed with CLIP, it outputs an embedding: a long list of numbers that encodes the content of the image. Text can be turned into the same kind of embedding, and comparing the two gives a similarity score (roughly between 0 and 1) describing how well CLIP thinks the text matches the image. An image of a park that happens to show a dog in the background would get a lower score for the text "dog" than a close-up image of a dog. When you get to the section on prompting, it will become clear how this ends up working in image generators.
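To make scoring text against an image concrete, here is a minimal sketch using the Hugging Face transformers library and a published CLIP model. The image file and candidate captions are hypothetical, and the softmax probabilities are only relative to the captions listed:

from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("park_with_small_dog.jpg")  # hypothetical local image
captions = ["a dog", "a park", "a city street"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds the image-text similarity scores;
# softmax turns them into relative probabilities over the candidate captions.
probs = outputs.logits_per_image.softmax(dim=1)
for caption, p in zip(captions, probs[0].tolist()):
    print(f"{caption}: {p:.3f}")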
As I mentioned before, the more images you have in the training data, the better the model will work. The researchers at OpenAI make that clear in their paper on CLIP. They explain how previous research into computer vision didn’t produce very accurate results due to the small datasets used for training, and the datasets were so small because of the huge amount of manual labour involved in curating and labelling them. (The previous dataset they compare CLIP’s performance to, ImageNet, contains a mere 14 million images.) Their solution was to use data from the internet instead. It already exists, there’s a huge amount of it and it’s already labelled thanks to image alt text. The only thing they’d need to do is download it.
It's not stated in the research paper exactly which dataset CLIP was trained on. All it says is that "CLIP learns from text–image pairs that are already publicly available on the internet." That dataset was never released; LAION's earlier text-image pair dataset, LAION-400M, was created as an open attempt to reproduce it.
Common Crawl
The data in LAION-5B itself comes from another large dataset made available by the non-profit Common Crawl which “contains raw web page data, metadata extracts, and text extracts” from the publicly accessible web. In order to pull out just the images, LAION scanned through the HTML (the code that makes up each web page) in the Common Crawl dataset to find the bits of the code that represent images (<img> tags) and pulled out the URL (the address where the image is hosted online and therefore downloadable from) and any associated alternative text, or “alt text”.
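As a rough sketch of the kind of extraction described above, here is how image URLs and alt text can be pulled out of raw HTML using the BeautifulSoup library. LAION's real pipeline works over Common Crawl's archive files at an enormously larger scale; this only illustrates the idea, and the HTML snippet is made up:

from bs4 import BeautifulSoup

html = """
<p>My latest painting:</p>
<img src="https://example.com/art/sunset.png" alt="Oil painting of a sunset over the sea">
"""

soup = BeautifulSoup(html, "html.parser")

# Collect (URL, alt text) pairs for every <img> tag that has a source.
pairs = [
    (img.get("src"), img.get("alt", ""))
    for img in soup.find_all("img")
    if img.get("src")
]
print(pairs)  # [('https://example.com/art/sunset.png', 'Oil painting of a sunset over the sea')]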
A tangent on the importance of image alt text
Alt text is often misused on the web. Its intended purpose is to describe images for visually impaired users, or in case the image fails to load. Let's look at an example.
[Image: a still from Back to the Future Part III showing Doc Brown and Marty McFly standing outside next to the DeLorean]
This image could have the alt text: "A still image from the film Back to the Future III depicting Doc Brown and Marty McFly. They are stood outside facing each other on a very bright sunny day. Doc Brown is trying to reassure a sceptical-looking Marty by patting him on the shoulder. Marty is wearing a garish patterned fringed jacket, a red scarf and a white Stetson hat. The DeLorean time machine can be seen behind them." Good. This is descriptive.
But it could also have the alt text: "Christopher Lloyd and Michael J Fox in Back to the Future III". Okay, but not very specific.
Or even: "Back to the Future III: A fantastic review by John Smith. Check out my blog!" Bad. This doesn't describe the image at all; this text would be better used as the title of the web page.
Alt text can be extremely variable in detail and quality, or not exist at all, which I’m sure will already be apparent to anyone who regularly uses a screen reader to browse the web. This casts some doubt on the accuracy of CLIP analysis and the labelling of images in LAION datasets.
CLIP-filtered image-text pairs
So now, coming back to LAION-5B, we know that "CLIP-filtered image-text pairs" means two things. The images were analysed with CLIP and the embeddings created from this analysis were included in the dataset. These embeddings were then used to check that each image's caption matched what CLIP identified the image as; pairs where the similarity score was too low were dropped from the dataset.
But LAION datasets themselves do not contain any images
So how does this work? LAION states on their website:
LAION datasets are simply indexes to the internet, i.e. lists of URLs to the original images together with the ALT texts found linked to those images. While we downloaded and calculated CLIP embeddings of the pictures to compute similarity scores between pictures and texts, we subsequently discarded all the photos. Any researcher using the datasets must reconstruct the images data by downloading the subset they are interested in. For this purpose, we suggest the img2dataset tool.
In order to train a model for use with Stable Diffusion, you would need to go through a LAION dataset with img2dataset and download all the images. All 240 terabytes of them.
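For a sense of what that looks like, here is a sketch based on the example in the img2dataset README. The metadata file name is a placeholder, the column names are the ones used in LAION's released metadata as far as I know, and actually downloading a full subset needs serious bandwidth and storage:

from img2dataset import download

download(
    url_list="laion_subset.parquet",  # hypothetical local metadata file
    input_format="parquet",
    url_col="URL",                    # column holding the image URL
    caption_col="TEXT",               # column holding the alt text / caption
    output_format="webdataset",
    output_folder="laion_images",
    image_size=256,
    processes_count=8,
    thread_count=32,
)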
LAION have used this argument to wiggle out of a recent copyright lawsuit. The Batch reported in June 2023:
LAION may be insulated from claims of copyright violation because it doesn’t host its datasets directly. Instead it supplies web links to images rather than the images themselves. When a photographer who contributes to stock image libraries filed a cease-and-desist request that LAION delete his images from its datasets, LAION responded that it has nothing to delete. Its lawyers sent the photographer an invoice for €979 for filing an unjustified copyright claim.
Deduplication
In a dataset it’s usually not desirable to have duplicate entries of the same data, but how do you ensure this when the data you’re processing is as huge as the entire internet? Well… LAION admits you kinda don’t.
There is a certain degree of duplication because we used URL+text as deduplication criteria. The same image with the same caption may sit at different URLs, causing duplicates. The same image with other captions is not, however, considered duplicated.
Another reason why reposting art sucks
If you’ve been an artist online for a while you’ll know all about reposts and why so many artists hate them. From what I’ve seen in my time online, the number of times an artist’s work is reposted on different sites is proportional to their online reach or influence (social media followers, presence on multiple sites etc). The more well known an artist becomes, the more their art is shared and reposted without permission. It may also be reposted legitimately, say if an online news outlet ran a story on them and included examples of their art. Whether consensual or not, this all results in more copies of their art out there on the web and therefore, in the training data. As stated above, if the URL of the image is different (the same image reposted on a different website will have a different URL), to LAION it’s not considered duplicated.
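A toy sketch of deduplication by URL plus caption, as described in the LAION quote above, shows why reposts survive: the same artwork at a different URL produces a different key, so it is treated as a new entry. The URLs and caption here are made up:

entries = [
    ("https://artsite.example/mypainting.png", "dragon painting by me"),
    ("https://artsite.example/mypainting.png", "dragon painting by me"),   # true duplicate
    ("https://repost-blog.example/stolen.png", "dragon painting by me"),   # repost, different URL
]

seen = set()
deduplicated = []
for url, caption in entries:
    key = (url, caption)          # URL + text is the deduplication criterion
    if key not in seen:
        seen.add(key)
        deduplicated.append((url, caption))

print(len(deduplicated))  # 2 - the repost survives deduplication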
Now it becomes clear how well-known digital artists such as Sam Yang and Loish have their styles so easily imitated by these models: their art is overrepresented in the training data.
How do I stop my art being used in training data?
Unfortunately for models that have already been trained on historic data from LAION/Common Crawl, there is no way to remove your art and no way to even find out if your art has been used in the training.
Unfortunately again, simply deleting your art from social media sites might not delete the actual image from their servers. It will still be accessible at the same URL as when you originally posted it. You can test this by making an image post on the social media site you want to test. When the image is posted, right click the image and select “open image in new tab”. This will show you the URL of the image in the address bar. Keep this tab open or otherwise keep a record of this URL. Then go back and delete the post. After the post is deleted, often you will still be able to view the image at the URL that you saved.
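If you want to script the same check, a minimal sketch with the requests library is below. The URL is hypothetical; a 200 status code usually means the file is still being served, while a 403 or 404 suggests it has actually been removed:

import requests

image_url = "https://media.example-social.com/your_deleted_image.png"  # hypothetical saved URL
response = requests.head(image_url, allow_redirects=True, timeout=10)
print(response.status_code)  # 200 = still online, 403/404 = likely gone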
If you have your own website where you host your art you can delete your images, or update their URLs so that they are no longer accessible from the URLs that were previously in web crawl data.
HTTP Headers
On your own website you can also use the X-Robots-Tag HTTP header to prevent bots from crawling your website for training data. These values can be used:
X-Robots-Tag: noai
X-Robots-Tag: noimageai
X-Robots-Tag: noimageindex
The img2dataset tool is used to download images from the datasets made available by LAION. Its README states that by default img2dataset will respect the above headers and skip downloading from websites that use them. It must be noted, though, that this default can be overridden, so if an unscrupulous actor wants to scrape your images without your consent, there is nothing technically stopping them.
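If you run your own site, the header can be added in your web server configuration or in the application itself. Below is a minimal sketch using Flask purely as an example; the same X-Robots-Tag value can be set in nginx, Apache or any other server:

from flask import Flask

app = Flask(__name__)

@app.after_request
def add_noai_headers(response):
    # Ask well-behaved crawlers and dataset tools not to use images for AI training.
    response.headers["X-Robots-Tag"] = "noai, noimageai, noimageindex"
    return response

@app.route("/")
def index():
    return "<img src='/static/my_art.png' alt='My artwork'>"

if __name__ == "__main__":
    app.run()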
Glaze
If you can't prevent your images from being crawled, you can prevent all new art that you post from being useful to future models trained from scratch by using Glaze. Glaze is a software tool you run your art through to protect it from being mimicked by image generators. It works by adding subtle perturbations (a "style cloak") that disrupt what machine learning models read from the image while keeping the art looking essentially the same to human eyes.
Watermarks
This defence is a bit of a long shot but worth a try. You may be able to get your art filtered out of training data by adding an obvious watermark. One column included in the LAION dataset is pwatermark, the estimated probability that the image contains a watermark, calculated by a CLIP-based classifier trained on a small set of clean and watermarked images. Images were then filtered out of subsequent datasets using a pwatermark threshold of 0.8, which is pretty high compared to the thresholds for NSFW content (0.3) and non-matching captions (also 0.3). This means that only images with the most obvious watermarks will be filtered out.
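As a rough sketch of the kind of filtering described above, assuming a local copy of LAION metadata in parquet format with pwatermark, punsafe and similarity columns (treat the column names as an assumption; the thresholds are the ones quoted in the text):

import pandas as pd

df = pd.read_parquet("laion_metadata.parquet")  # hypothetical local metadata file

kept = df[
    (df["pwatermark"] < 0.8)    # only drop images with very obvious watermarks
    & (df["punsafe"] < 0.3)     # drop likely NSFW images
    & (df["similarity"] > 0.3)  # caption must roughly match the image
]
print(f"kept {len(kept)} of {len(df)} rows")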
Prompt engineering and how to spot AI art
We’ve covered how AI image generators are trained so now let’s take all that and look at how they work in practice.
Artifacts
You’ve probably gotten annoyed by JPEG compression artifacts or seen other artists whine about them but what is an artifact? A visual artifact is often something unwanted that appears in an image due to technologies used to create it. JPEG compression artifacts appear as solid colour squares or rectangles where there should be a smooth transition from one colour to another. They can also look like fuzziness around high contrast areas of an image.
I’d describe common mistakes in AI image generations as artifacts - they are an unwanted side effect of the technology used to produce the image. Some of these are obvious and pretty easy to spot:
extra or missing fingers or otherwise malformed hands
distorted facial features
asymmetry in clothing design, buttons or zips in odd places
hair turning into clothing and vice versa
nonsense background details or clothing patterning
disconnected horizon line, floor or walls. This often happens when opposite sides are separated by an object in the foreground
Some other artifacts are not strange-looking, but become obvious tells for AI if you have some experience with prompting.
Keyword bleeding
Often, if a colour is used in the text prompt, that colour will end up being present throughout the image. If the prompt describes a character and a background, both elements will contain the colour.
The reason for this should be obvious now that we know how the training data works. This image from LAION demonstrates it nicely:
[Image: screenshot of the clip-retrieval search page showing results for the query "blue cat"]
This screenshot shows the search page for clip-retrieval, a search tool built on an image-text pair dataset created using CLIP. The search term entered is "blue cat", but the results contain not only cats that are blue: they also include cats that aren't blue at all but have blue somewhere else in the image, e.g. a blue background, blue eyes, or a blue hat.
To go on a linguistics tangent for a second, part of the above effect could be due to English adjectives not changing form depending on the noun they refer to. In German, for example, the form of the adjective must match the gender of the noun it describes. Blue is blau and cat is Katze; "blue cat" would be "blaue Katze", because Katze is feminine and blau takes the feminine ending -e. The word for dog, Hund, is masculine, so blau takes the ending -er, giving "blauer Hund". You get the idea.
When a colour is not mentioned in a prompt, and no keyword in the prompt implies a specific colour or combination of colours, the generated images all come out looking very brown or monochrome overall.
Keyword bleeding can have strange effects depending on the prompt. When you use an adjective to describe a specific part of the image, both the adjective and the noun may bleed into other parts of the image. When I tried including "pointed ears" in a prompt, all the images depicted a character with typical elf ears, but the character often also had horns or even animal ears as well.
All this seems obvious when you consider the training data. A character with normal-looking ears wouldn’t usually be described with the word “ears” (unless it was a closeup image showing just the person’s ears) because it’s a normal feature for someone to have. But you probably would mention ears in an image description if the character had unusual ears like an elf or catgirl.
Correcting artifacts
AI artifacts can be corrected, however, with a process called inpainting (also known as generative fill). This is done by taking a previously generated image, masking out the area to be replaced, then running the generation process again with the same or a slightly modified prompt. It can also be used on non-AI-generated images; Google Pixel phones use a kind of generative fill to remove objects from photographs. Inpainting is a little more involved than just prompting, as it requires editing the input image, and it's not offered by most free online image generators. It's what I expect Adobe Firefly will really excel at, since it's already integrated into image editing software (if Adobe can iron out its copyright issues…).
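For those running Stable Diffusion themselves, here is a minimal inpainting sketch with the diffusers library. The model name is a published inpainting checkpoint; the image, mask and prompt are hypothetical, and the mask is simply a black-and-white image where white marks the region to regenerate:

import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("generated_character.png").convert("RGB")
mask_image = Image.open("hand_mask.png").convert("RGB")  # white = area to redo

result = pipe(
    prompt="a hand with five fingers, detailed digital painting",
    image=init_image,
    mask_image=mask_image,
).images[0]
result.save("character_fixed.png")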
Why AI kinda sucks
Since AI image generation is built on large scale statistical analysis, if you’re looking to generate something specific but uncommon you’re not going to have much luck. For example using “green skin” in a prompt will often generate a character with pale skin but there will be green in other parts of the image such as eye colour and clothing due to keyword bleeding.
No matter how specific you are, the generator will never be able to create an image of your original character. You may be able to get something with the same general vibe, but it will never be consistent between prompts and won't get fine details right.
There is a type of fine-tuning for stable diffusion models called LoRA (Low-Rank Adaptation) that can be used to generate images of a specific character, but of course to create this, you need preexisting images to use for the training data. This is fine if you want a model to shit out endless images of your favourite anime waifu but less than useless if you’re trying to use AI to create something truly original.
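For completeness, applying an already-trained LoRA at generation time looks something like the sketch below, using diffusers (the LoRA file name and trigger word are hypothetical). Training the LoRA in the first place is a separate, more involved process that still needs a set of existing images of the character:

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Load LoRA weights fine-tuned on a specific character.
pipe.load_lora_weights("./loras", weight_name="my_character.safetensors")

image = pipe("myoc_charactername standing in a forest, full body").images[0]
image.save("lora_test.png")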
Some final thoughts
The more I play around with stable diffusion the more I realise that the people who use it to pretend to be a human artist with a distinctive style are using it in the most boring way possible. The most fun I’ve personally had with image generation is mixing and matching different “vibes” to churn out ideas I may not have considered for my own art. It can be a really useful tool for brainstorming. Maybe you have a few different things you’re inspired by (eg a clothing style or designer, a specific artist, an architectural style) but don’t know how to combine them. An image generator can do this with ease. I think it’s an excellent tool for artistic research and generating references.
All that being said, I strongly believe use of AI image generation for profit or social media clout is unethical until the use of copyrighted images in training data is ceased.
I understand how this situation has come about. Speaking specifically about LAION-5B the authors say (emphasis theirs):
Our recommendation is … to use the dataset for research purposes. … Providing our dataset openly, we however do not recommend using it for creating ready-to-go industrial products, as the basic research about general properties and safety of such large-scale models, which we would like to encourage with this release, is still in progress.
Use of copyrighted material for research can fall under fair use (or similar research exceptions, depending on the jurisdiction). The problem comes from third parties making use of this research data for commercial purposes, which I believe should be treated as a violation of copyright. So far, litigation against AI companies has not made much progress in settling this.
I believe living artists whose work is used to train AI models must be fairly compensated and the law must be updated to enforce this in a way that protects independent artists (rather than building more armour for huge media companies).
The technology is still new and developing rapidly. Changes to legislation tend to be slow. But I have hope that a solution will be found.
References
“Adobe Firefly - Free Generative AI for Creatives.” Adobe. Accessed 28 Jan 2024.
https://www.adobe.com/uk/products/firefly.html
Andrew. "Stable Diffusion prompt: a definitive guide.” Stable Diffusion Art. 4 Jan 2024.
https://stable-diffusion-art.com/prompt-guide/#Anatomy_of_a_good_prompt
Andrew. "Beginner's guide to inpainting (step-by-step examples)." Stable Diffusion Art. 24 Sep 2023.
https://stable-diffusion-art.com/inpainting_basics/
AUTOMATIC1111. "Stable Diffusion web UI. A browser interface based on Gradio library for Stable Diffusion." GitHub. Accessed 15 Jan 2024.
https://github.com/AUTOMATIC1111/stable-diffusion-webui
“LAION roars.” The Batch newsletter. 7 Jun 2023.
https://www.deeplearning.ai/the-batch/the-story-of-laion-the-dataset-behind-stable-diffusion/
Beaumont, Romain. "Semantic search at billions scale." Medium. 31 Mar 2022.
https://rom1504.medium.com/semantic-search-at-billions-scale-95f21695689a
Beaumont, Romain. "LAION-5B: A new era of open large-scale multi-modal datasets." LAION website. 31 Mar 2022.
https://laion.ai/blog/laion-5b/
Beaumont, Romain. "Semantic search with embeddings: index anything." Medium. 1 Dec 2020.
https://rom1504.medium.com/semantic-search-with-embeddings-index-anything-8fb18556443c
Beaumont, Romain. “img2dataset.” GitHub. Accessed 27 Jan 2024.
https://github.com/rom1504/img2dataset
Beaumont, Romain. “Preparing data for training.” GitHub. Accessed 27 Jan 2024.
https://github.com/rom1504/laion-prepro/blob/main/laion5B/usage_guide/preparing_data_for_training.md
“CLIP: Connecting text and images.” OpenAI. 5 Jan 2021.
https://openai.com/research/clip
“AI.” Cambridge Dictionary. Accessed 27 Jan 2024.
https://dictionary.cambridge.org/dictionary/english/ai?q=AI
“Common Crawl - Overview.” Common Crawl. Accessed 27 Jan 2024.
https://commoncrawl.org/overview
CompVis. "Stable Diffusion. A latent text-to-image diffusion model." GitHub. Accessed 15 Jan 2024.
https://github.com/CompVis/stable-diffusion
duskydreams. “Basic Inpainting Guide.” Civitai. 25 Aug 2023.
https://civitai.com/articles/161/basic-inpainting-guide
Gallagher, James. “What is an Image Embedding?.” Roboflow Blog. 16 Nov 2023.
https://blog.roboflow.com/what-is-an-image-embedding/
"What Is Glaze? Samples, Why Does It Work, and Limitations." Glaze. Accessed 27 Jan 2024.
https://glaze.cs.uchicago.edu/what-is-glaze.html
“Pixel 8 Pro: Advanced Pro Camera with Tensor G3 AI.” Google Store. Accessed 28 Jan 2024.
https://store.google.com/product/pixel_8_pro
Schuhmann, Christoph. "LAION-400-MILLION OPEN DATASET." LAION website. 20 Aug 2021.
https://laion.ai/blog/laion-400-open-dataset/
Stability AI. "Stable Diffusion Version 2. High-Resolution Image Synthesis with Latent Diffusion Models." GitHub. Accessed 15 Jan 2024.
https://github.com/Stability-AI/stablediffusion