Friday, February 14, 2025

For AI art, word order matters, but less than I expected

 kw: art generation, ai art, experiments, word order, semantics

Continuing my Troglodyte series of generated images, I began rewriting the long prompts I'd been using, putting the main description of the room first, followed by a phrase about the cave, followed by other items and details. After a while I noticed that it had become harder to elicit images with lots of cave decoration (stalactites, etc.). I began to wonder whether the elements of a prompt are treated like the ingredients list on a cereal box, ordered by quantity (or, in this application, by importance). For example, here is a prompt I used a year ago for "Cave Dining Room":

A room in a spectacular cave that has many stalactites and stalagmites, with flowstone on the room's walls, fitted out as a grand dining room with a long table and at least twelve chairs, a chandelier over the table, and a buffet stand nearby

This is the prompt I used in the past few days:

A dining room in a spectacular natural cave with stalactites and stalagmites and flowstone, with seating for twenty or more, with buffet to the side and chandeliers from the ceiling, plus a grandfather clock and a large pantry, and the floor is natural stone with a patterned rug under the table

Here are the resulting images, both from Leonardo AI. Note that the settings were not exactly the same, but here my focus is on the difference in the amount of cave decoration.



The upper image is a little more "cavey"; the lower image is the best of a dozen or more attempts to get the feel I wanted. Some of the images had ceilings that looked more like tangled tree roots; others were almost smooth, though rounded and arched.

In some settings Leonardo AI has an option for "AI Enhancement" of the prompt. It also implements the Style and other settings by modifying the prompt internally. I could only get an inkling of the latter phenomenon by saving an image, because about 50 characters of the prompt used are included in the file name. I say "an inkling" because an AI-enhanced prompt balloons to a few hundred characters. In either case, by studying prompt enhancement, I find that enhancement is primarily done by adding adjectives and sometimes adjective-noun groups (such as the phrase "vibrant cinematic photo").

I designed an experiment to see how much word order matters when a prompt consists entirely of nouns:

meadow, mountains, flowers, butterflies, birds

In Leonardo AI's Classic Mode (its other mode is Flow, which I'll say more about in another post), I first used the Flux Schnell model and the Creative style, 16x9 aspect ratio, small image size (1184x672), with a fixed seed of 142857. When I downloaded the image, the file name was

Flux_Schnell_a_surreal_and_vibrant_cinematic_photo_of_meadow_m_0.jpg

Thus, the prompt had been enhanced because of the Creative style. The program also appends a number so it can distinguish repeated uses of a prompt.
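
The file name can be taken apart to recover the visible piece of the prompt. Here is a minimal Python sketch, assuming the pattern is the model name, then the prompt fragment with spaces and punctuation turned into underscores, then the counter. That pattern is inferred from this one file name and is not a documented Leonardo AI format; `parse_leonardo_filename` is a made-up helper name.

```python
import re

# Assumed pattern: <model>_<prompt fragment>_<counter>.jpg
# (inferred from the single file name above; not a documented format)
FILENAME_RE = re.compile(
    r"^(?P<model>Flux_Schnell)_(?P<fragment>.+)_(?P<counter>\d+)\.jpg$"
)

def parse_leonardo_filename(name):
    """Hypothetical helper: split a downloaded file name into its parts."""
    m = FILENAME_RE.match(name)
    if m is None:
        raise ValueError(f"unrecognized file name: {name}")
    return {
        "model": m.group("model").replace("_", " "),
        # Underscores become spaces; any punctuation in the prompt is already lost.
        "fragment": m.group("fragment").replace("_", " "),
        "counter": int(m.group("counter")),
    }

parts = parse_leonardo_filename(
    "Flux_Schnell_a_surreal_and_vibrant_cinematic_photo_of_meadow_m_0.jpg"
)
print(parts["fragment"])       # the recovered piece of the prompt
print(len(parts["fragment"]))  # its length in characters
```

The recovered fragment is 49 characters long. If the comma that presumably followed "meadow" in the prompt was folded into a single underscore, the original piece would have been 50 characters, which agrees with the "about 50 characters" estimate.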

To keep to the bare 5-word prompt, I switched the style to None. Here are the two resulting images, full size (1184x672), enhanced prompt above, 5-word prompt below:


The images are very similar, with some interesting differences. The upper (enhanced) one has no birds; of the five birds in the lower (ordinary) one, the two at upper left replaced ambiguous-looking butterflies, and a butterfly at far left appears to be a bird-butterfly hybrid. The trees are similar, but the mountains have certain differences, and the enhanced image appears more stormy or foggy. Take note of the yellow flower at bottom center. It is one of several elements that persist from image to image, with only one significant variation, which I'll point out later on.

The next two images have "rotated" prompts, first "mountains, flowers, butterflies, birds, meadow" and then "flowers, butterflies, birds, meadow, mountains".


In each image there are three birds flying in the distance, and in the upper image in particular, a couple of rather ambiguous flying things. You may have noticed that all the butterflies are Monarchs. A significant change from the upper to the lower image is the bokeh (out-of-focus look) in the distance in the lower image, whereas the mountains in the upper image are sharp. The next two images are from the next two rotated prompts, "butterflies, birds, meadow, mountains, flowers" and "birds, meadow, mountains, flowers, butterflies".


The upper image has no clearly-defined birds. The five butterflies in a cluster at the top are all distorted, having either birdlike aspects or extra wings. The upper image also has out-of-focus mountains and trees, but in the lower image everything is in focus, plus there are added trees to the left. That completes the prompt rotations.
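
For anyone repeating the experiment with a different word list, the five rotated prompts can be generated rather than typed. This is a minimal Python sketch; the `rotations` function name is only illustrative.

```python
from collections import deque

words = ["meadow", "mountains", "flowers", "butterflies", "birds"]

def rotations(items):
    """Yield every left rotation of items, starting with the original order."""
    d = deque(items)
    for _ in range(len(items)):
        yield list(d)
        d.rotate(-1)  # move the first word to the end

prompts = [", ".join(r) for r in rotations(words)]
for p in prompts:
    print(p)
```

This prints the original prompt followed by the four rotated ones, in the order used here. Rotations cover only 5 of the 120 possible orderings of five words, so they test position in the list without testing which words are neighbors.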

As a further experiment I added a few words to the prompt, and then as a last experiment, added more words to make it a descriptive phrase. The two prompts are:

birds in the foreground, meadow, mountains, flowers, butterflies

birds in the foreground of a mountain meadow with flowers and butterflies


In each image, a single bird is in the foreground, as requested, but not "birds". Nothing is in the sky; all the butterflies are also in the foreground. The big yellow flower has been replaced in the upper image by three smaller flowers. Both images have significant bokeh, in both the background and the immediate foreground. The trees and mountains also differ a bit more from the prior six images than those six differ among themselves.

Note that all of these use the same seed: 142857. It's a favorite number of mine, being the repeating unit of the decimal expansion of 1/7. A couple of times I tested seed consistency by repeating the generation without changing anything. So far as I could tell, the images were pixel-by-pixel identical. So the only thing available to cause differences between the images would be differences in the prompt.
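
The claim about 1/7 is easy to check by long division. The sketch below tracks remainders and stops when one repeats; `repeating_unit` is just an illustrative name.

```python
def repeating_unit(numerator, denominator):
    """Return the repeating digits of numerator/denominator, or "" if it terminates."""
    remainder = numerator % denominator
    seen = {}    # remainder -> index of the digit it produced
    digits = []
    while remainder != 0 and remainder not in seen:
        seen[remainder] = len(digits)
        remainder *= 10
        digits.append(str(remainder // denominator))
        remainder %= denominator
    if remainder == 0:
        return ""  # the expansion terminates, so nothing repeats
    # The cycle starts where this remainder was first seen.
    return "".join(digits[seen[remainder]:])

print(repeating_unit(1, 7))  # 142857
```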

I don't know how a prompt is turned into "tokens", which are numbers that represent conceptual entries in a database of "meanings", as "meanings" might be understood in the context of generative AI. The order in which they occur clearly matters, but not by a great deal. Adding directive words such as "in the foreground", and "glue words" (the little articles, conjunctions, and prepositions between the nouns), also made a difference; otherwise the last two images would be identical.
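
As a toy illustration only: real tokenizers generally split text into word pieces and map each piece to an integer ID, and the model receives those IDs as an ordered sequence. The vocabulary and IDs below are made up, and `toy_tokenize` is a hypothetical stand-in, not the tokenizer Leonardo AI or Flux Schnell actually uses. It shows only that a rotated prompt has the same set of IDs in a different sequence.

```python
# Toy vocabulary with made-up IDs; real tokenizers have tens of thousands of entries.
TOY_VOCAB = {"birds": 1, "meadow": 2, "mountains": 3, "flowers": 4, "butterflies": 5}

def toy_tokenize(prompt):
    """Hypothetical tokenizer: split on commas and spaces, look up IDs (unknown -> 0)."""
    words = prompt.lower().replace(",", " ").split()
    return [TOY_VOCAB.get(w, 0) for w in words]

a = toy_tokenize("meadow, mountains, flowers, butterflies, birds")
b = toy_tokenize("birds, meadow, mountains, flowers, butterflies")

print(a, b)
print(sorted(a) == sorted(b))  # True: same ingredients
print(a == b)                  # False: different order
```

Whatever the real encoding is, the model has at least this much to go on: the same ingredients arriving in a different order.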

I don't know why some images are in focus throughout, while others have various levels of bokeh in the background and/or immediate foreground.

I have learned a couple of things, and uncovered further mysteries about art generation. The adventure continues!
