Polymath at Large: ai experiments

Sunday, April 20, 2025

My Flying People experiment

kw: generated art, ai experiments, surveys, simulated intelligence, prompts, prompt adherence

Introduction

This is quite long so I include headings.

I remembered a science fiction story that I read decades ago, about a planet of creatures that looked a lot like humans, but had wings and flew about. Pondering a way to illustrate the central idea of the story, after some experimentation, I came up with this prompt:

On a planet with low gravity and dense air, many winged men and winged women are flying into and out of very tall buildings with doorways and landing platforms at every level

I was primarily interested in the great variety of imaging Models offered by Leonardo AI, and I had thousands of credits available. I surveyed nearly all the possibilities in Classic State, which yielded 184 images, all based on a particular Seed; more on that anon. I also used the prompt to produce much smaller suites of images with Dall-E3, DesignStudio, Gemini, and ImageFX. To jump ahead to a useful conclusion, I found that Dall-E3 produced images closest to the meaning of the prompt; in the lingo of the field, it has the greatest Prompt Adherence. Here are two examples, the best of the 24 images I gathered from Dall-E3:

I am showing two images because the first has the best people and the second has better buildings. When prompting an art generating program, it takes persistence and cleverness to get good adherence to a complex prompt. BTW, did you notice the figure with wings on backward?

Models and Styles

Leonardo AI (henceforth Leo) has thirteen Models or Presets and each has a number of Styles, as many as 23. This is a list of the 13, with the number of Styles each includes, and the cost in credits per four images, the usual number. The free program assigns 150 credits daily. It is easy to use them up pretty fast! The basic monthly subscription assigns 8,500 credits per month for $12. These pile up fast unless you are very active. I wish there were an in-between subscription level such as 3,000 credits for $5.

The total number of Styles is 61, and the number of Model+Style combinations is 194. I did not quite use them all. Several Styles are named Monochrome or include B&W in their name and those don't interest me.

Seeds

Leo also has the option of using a fixed Seed. Leo's help text says that this Seed is used to generate the "noise" that the reverse-diffusion process starts with to produce an image. That is not all there is to it, because the Seed value also influences the Transformer (the routine that figures out object identity and placement, which defines the goal toward which the diffusion algorithm operates). Seeds in Leo (also DreamStudio) have six digits.

When I set a fixed Seed, I tend to pick Seed values that are a multiple of 999,999/7 or 999,999/13, all of which have six digits except 999,999/13, which is 76923. After a period of noodling around, I found the region around 428571 (3*999,999/7) most interesting and settled on 428575. All the images shown here were produced using this Seed.
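The arithmetic behind these Seed choices is easy to check. Here is a minimal sketch; the function name seed_candidates is my own invention and has nothing to do with Leo's software. It simply lists the whole-number multiples of 999,999/7 and 999,999/13.

```python
def seed_candidates(divisor, top=999_999):
    """Return the multiples of top/divisor, from 1x up to divisor x (hypothetical helper)."""
    base, remainder = divmod(top, divisor)
    if remainder:
        raise ValueError(f"{top} is not evenly divisible by {divisor}")
    return [k * base for k in range(1, divisor + 1)]

sevenths = seed_candidates(7)       # 142857, 285714, 428571, ...
thirteenths = seed_candidates(13)   # 76923, 153846, 230769, ...

# Which candidates fall short of six digits?
short_ones = [s for s in sevenths + thirteenths if len(str(s)) != 6]

print(sevenths)
print(short_ones)
```

The third entry of the sevenths list is 428571, the value I nudged up to 428575.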

Selection 1: Leonardo Lightning Style Gallery

No matter how I may group the images, it would be tedious for the reader to wade through a discourse on 184 images. Thus I picked certain themes. I chose primarily certain Styles across all the Models, but I will start by showing the gamut of Styles for the Model called Leonardo Lightning (or LLightning), which runs faster than the others and costs the least—two credits per Medium-size image (1280x720), while for most Models the cost is 3 or 4, and it goes up to 12, and even higher where larger images are available in certain Models.

Forthwith, the gamut of LLightning images, screen-captured three across from File Explorer, so the file names can be seen:

The last image above is from the next set, from the Model Phoenix 0.9.

LLightning is unique among Leo's Models; 16 of the 21 Styles shown here produced things with wings, but only 7 are winged people. The third Style, Cinematic, has what appear to be human-sized bats, but they may be humans with bat wings. It is hard to tell which, even on the full size image. Two others show beetle-like flyers, two have birds, and the flyers in the other 4 images are unidentifiable. Finally, the images from None and Unprocessed are apparently identical, which I find logical: both claim to be doing nothing "extra" to the Model. This is also seen for the Model Cinematic Kino, the only other Model that offers both None and Unprocessed.

Compared to the other Models, this mix in LLightning is interesting. Four Models (the two Phoenix versions and the two Flux versions) adhere to the "winged" part of the prompt 100%, although only the two Phoenix versions have winged persons in all images, while the two Flux versions have more birds and fewer people. On the other hand, two Models (Graphic Design and Stock Photography) never produced wings on anything. Most of the Models yielded low percentages of winged persons. Anime is unique in a different way. Half of its Styles produced images with winged humans and two Styles have airborne humans without wings (one hopes they are floating, not falling). Only two Anime Styles had no wings at all.

Selection 2: Model Galleries for nine Styles

Style 1 = None (10 uses)

Now I will focus on particular Styles as developed in different Models. The Styles to be presented are those having larger numbers of Models that use them, in descending order by usage numbers. The first is None, meaning no Style was applied. This (non-)Style is available for the largest number of Models (10), with a modification to be mentioned below. I used a search to isolate each set of images, and a quirk of the search function is that the images are presented in reverse order.

In the next-to-last row of images, the first two appear identical. Upon very close inspection I find a few tiny differences. In the row above that, the second and third images are, so far as I can tell, identical. This suggests that behind the scenes Portrait Perfect and Cinematic Kino use the same engine, as do Graphic Design and Illustrative Albedo. So in this case the 10 Models produce 8 unique images (discounting a few nearly invisible differences in one case). These images show what each Model produces when it is not constrained by a Style.

Style 2 = Dynamic (8 uses)

Dynamic is the default Style for the 8 Models that use it.

For the Dynamic Style, the image I like best is for Phoenix 0.9. It is the best match to the image I had in mind after reading the story, so long ago. Comparing all these images with the prior set, I find that for Flux Schnell the Dynamic Style produces the same image as None. Similarly for Flux Dev, Phoenix 1.0 and Phoenix 0.9. For the other Models, these Styles produce significantly different results. Three of the Models—Illustrative Albedo, Cinematic Kino and Lifelike Vision—have what I call "winged structures", although one of them (for Cinematic Kino) looks like an immense beetle.

At this point note that the prompt's "doorways and landing platforms at every level" are seldom found, and when they do appear it is primarily in images from Phoenix 0.9.

Style 3 = Portrait (8 uses)

As we'll see later for the Fashion Style, Portrait often emphasizes a central figure, although in the case of Illustrative Albedo, that figure is a flying structure, looking like a giant crab with 6 wings. The image from Lifelike Vision has a figure with wings that are more like hang glider wings, rather than bird wings. But, wings they are.

Style 4 = Stock Photo (7 uses)

Stock Photo is the only Style used by the Stock Photo Model. Its image is almost identical to the one produced by Cinematic Kino, but not entirely (one must look hard to find the differences). The other 5 Models all yielded winged things with this Style, but only the image for Phoenix 0.9 has winged people.

Style 5 = Ray Traced (7 uses)

Now we can start to see that certain Models, such as the two Phoenix Models and the two Flux Models, have the primary influence. In other cases, the Style seems to be "stronger" than the Model. Other than having a brighter and more colorful appearance, Ray Traced is similar to Stock Photo. For this Style, only the Phoenix Models produced flying people.

Style 6 = Illustration (7 uses plus "Anime Illustration")

It is more clear with these, compared to the prior sets, that the Style is paramount for LLightning, Illustrative Albedo, and Lifelike Vision. Here, Anime Illustration Style in the Anime Model joined the Phoenix Models in producing flying people.

Style 7 = Fashion (7 uses)

Here is a Style that produces winged people, almost 100%! Lifelike Vision's person has a winglike, flowing robe. Perhaps the presence of about 8 wings on the person in the Cinematic Kino image makes up for that… As one might expect from the name, a "fashion model" is front and center.

Style 8 = Creative (7 uses)

Following trends we've seen above, some of these differ from other Styles for a particular Model, while a couple of them are more similar. Only Flux Dev and the Phoenix Models produced flying people. The flying things in the LLightning image seem to be huge insects though one of them has feathered bird wings. I don't know why a scattering of these images feature hot air balloons.

Style 9 = 3D Render (7 uses, with two added sizes)

As mentioned earlier, most of these images were generated for the Medium 16:9 size, which is 1280x720 for most Models, but is 1184x672 for the Phoenix Models and the Flux Models. I did a side experiment to show the effect of changing image size with Phoenix 0.9. You can see in the file names for "LPhoenix09" the numbers 01a, 01b, and 01c. The image sizes are 1184x672, 1376x768, and 1472x832. When I want to use any Phoenix image for a 16:9 wallpaper, these sizes are a little off. Leonardo can upscale an image so it is larger than 1920x1080 (full HD), but then trimming is needed to get an exact ratio. Changing the size causes changes in the overall image, though the three images are similar.
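To see how far "a little off" is, here is a rough sketch that compares each size with 16:9 and works out the trimmed size. The function crop_to_16_9 is my own illustration, not anything Leonardo provides, and it rounds down to whole pixels.

```python
from fractions import Fraction

TARGET = Fraction(16, 9)

def crop_to_16_9(width, height):
    """Largest 16:9 rectangle (rounded down to whole pixels) inside width x height."""
    if Fraction(width, height) > TARGET:
        # Too wide: keep the full height, trim the width.
        return (height * 16 // 9, height)
    # Too tall (or already exact): keep the full width, trim the height.
    return (width, width * 9 // 16)

for w, h in [(1280, 720), (1184, 672), (1376, 768), (1472, 832)]:
    print(f"{w}x{h}: ratio {w / h:.4f} vs {float(TARGET):.4f}, trim to {crop_to_16_9(w, h)}")
```

Only 1280x720 is an exact 16:9. Two of the Phoenix sizes are slightly too tall and one is slightly too wide, so the trim is a handful of pixels in each case; the same proportions apply after upscaling.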

For this Style, the Phoenix Models produced flying people, but the others differed: LLightning and Flux Dev made flying beetles, Flux Schnell made birds, while Illustrative Albedo and Lifelike Vision produced lots of balloons but nothing with wings.

Wrap-Up

The Leonardo AI Models definitely have minds of their own. Only images produced by Phoenix 0.9 had both elements, flying people and towering buildings with landing pads at upper levels. Few other Models had the buildings as I had asked. Generally, the images are somewhat inspired by the prompt, sometimes rather distantly. It would take a lot more investigation, letting the Seed be randomly produced, to see whether any particular Model+Style is capable of better compliance.

I use art generating software as a "commissioned artist". A few of the Models in Leonardo, the more costly ones, are reasonably compliant. Programs other than Leonardo vary in their Prompt Adherence, and Dall-E3 is probably the best. The rest of Leo's Models, in most of their Styles, are fun to experiment with, but are unlikely to yield results that closely match anything but the simplest prompts.

I did not test a rather new way of using Leonardo AI, Flow State. It has a different way of doing things. Neither did I turn on "AI Enhancement" for my prompt in those Models that offer it. What we have here is complex enough already.

The folders of images I produced are a useful Gazetteer of possibilities that I can use in the future to select an image generation routine.

Sunday, March 30, 2025

An image and its squeezed version

kw: ai experiments, simulated intelligence, art generation, photo essays

Time to drop the other shoe. The first image is the square rendering that Gemini produced when asked to create "A desert scene with exaggerated mesas and steep mountains around an alluvial valley, extremely clear air, digital art". I used IrfanView to resize from 2048x2048 to 2048x1152, for use as wallpaper on an HD monitor. Note that the air isn't as clear as I'd have liked, but the training images probably all have haze in the background.

Saturday, March 29, 2025

Squeezing a generated image

kw: ai experiments, simulated intelligence, art generation, photo essays

I find various ways to get around the "square image" limitation of Gemini. Dall-E3 also makes square images initially, but then one at a time you can select "Resize" and "4:3", which actually produces an image 1792x1024, or 7:4. I used Dall-E3 as a test bed, using a prompt that requests a vertically-exaggerated image. It could then be vertically squeezed from square to a 16:9 aspect ratio and still look realistic, or at least pleasing. One of these three images was produced by the Resize function, and then cropped to 16:9, and the other two began as square images that were squeezed by anisotropic resizing in IrfanView to 16:9. Let me know in a comment if you can see which of the three is the unique one, unsqueezed.
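The numbers behind the squeeze can be sketched in a few lines. The helper name squeeze_square is my own, and the 1024-pixel square is an assumption about Dall-E3's starting size; the 2048-pixel square is Gemini's.

```python
from fractions import Fraction

WIDE = Fraction(16, 9)

def squeeze_square(side, target=WIDE):
    """Size of a side x side image after squeezing only its height to the target ratio."""
    return side, round(Fraction(side) / target)

# The "4:3" Resize option really yields 7:4.
resize_ratio = Fraction(1792, 1024)

# How much taller than life the scene must be drawn so the squeeze looks natural.
exaggeration = float(WIDE)

print(resize_ratio)            # 7/4
print(squeeze_square(2048))    # Gemini square
print(squeeze_square(1024))    # assumed Dall-E3 square
print(f"{exaggeration:.2f}x vertical exaggeration")
```

A square image loses not quite half its height in the squeeze, which is why the prompt has to ask for strongly exaggerated relief.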

Thursday, February 20, 2025

DreamStudio Style Gallery

kw: ai experiments, ai art, art generation, styles

I find that strong contrast of elements provides a real test of prompt adherence by an art generating program. This came to light while I was enjoying the Man Cave (Troglodyte) concept, depicting offices, living rooms, kitchens, and so forth inside caves, using the various programs. Generally speaking, Dall-E3, ImageFX, and certain Presets for Leonardo AI produced more pleasing and realistic caves while depicting the "rooms" I prompted for, compared to Gemini and DreamStudio.

I began to investigate the SDXL 1.0 engine in DreamStudio a bit more to see what it takes to induce it to make better caves, choosing it over Gemini because it has more "knobs" I can turn. I decided to first gather a "style book" of the seventeen styles, using a fixed seed (142857, which is 999999/7) and a square aspect ratio. You might find these image montages a useful reference. I grouped them six at a time. The prompt is included in all the file names, right after the Style name and the seed value.

These first six show some commonality: The main desk a little to one side, an opening in the ceiling, and an archway at the back that usually leads deeper into the cave. The number of chairs, the presence of secondary workstations and bookshelves, and the style of flooring are all quite variable. Only the Digital Art style actually has any cave decoration (stalactites). The next six:

The common features seen in the first six are generally present. Fantasy Art style also has stalactites. The Origami style is quite spare, and while a ceiling opening is not seen, the lighting indicates that one is probably present. The last five:

Numbers 16 and 17 are out of order because I neglected to download #16 until after I had downloaded #17, and I have a sequence order of "file date" set in File Explorer. Note that the 3D Model style is the most similar to the Enhance style (#2). The trends seen before continue. Pixel Art style takes a whack at making stalactites.

If I want to make a Troglodyte series using DreamStudio, the best style to use is either Digital Art or Fantasy Art. The other styles mostly produce a cave that looks like the undecorated portions of Mammoth Cave, which resembles a series of long concrete tunnels.

Sunday, February 09, 2025

Binary clock concept

kw: ai experiments, binary time, binary clocks, art generation, simulated intelligence

Near the end of 2024 I was thinking about what clock faces would look like if a culture had a binary concept of time. That is, an 8- or 16- or 32-"hour" day, based on, for example, 32 divisions per "hour" and 32 further subdivisions. Everything based on powers of two. I decided to have several art generating programs attempt to draw a clock dial ("face") with sixteen symbols on it…with no success! Every program stuck to 12- or 24-hour format, with one exception.

This image generated by ImageFX (IFX hereafter) shows a dial with 13 items that look like jewels. Their spacing is rather uneven. Other scales around the dial have as many as 21 symbols, also not evenly spaced. If nothing else, IFX is good at producing unique symbols.

I had in mind a culture, perhaps on another planet, that had no contact with our Babylonian 24-60-60 scheme. They developed a number system based on binary digits. Perhaps their "hands" have four or eight "fingers".

I learned to think in eights and sixteens when I had to write a lot of computer code for two different operating systems. One was based on Octal digits (0-7), and another on the much more common Hexadecimal digits (0,1,2…8,9,A,B,C,D,E,F). My colleagues and I were adept at thinking in 8's and 16's.

Let's consider a 16-hour daytime and 16-hour nighttime. And let's use words from another language to get away from the English terms:

The planet rotates in one nichi composed of 32 jikan.
One jikan contains 32 bu.
One bu contains 32 byoh.

This language has no inflections for pluralization. 32x32x32 means that one nichi is divided into 32,768 byoh. If the inhabitants are of similar size to humans, perhaps their hearts beat at about the same rate as ours, and a byoh would be close to one heartbeat, roughly one second. That implies a nichi of about 32,768 seconds, or 9.1 hours, which is a little under 40% as long as an Earth day. If this is instead intended to pertain to a culture on Earth, with one nichi equal to one Earth day, the byoh would be about 2.64 seconds.
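The arithmetic above can be checked with a short sketch. The one-second byoh in the first case is my assumption, standing in for a resting heartbeat.

```python
EARTH_DAY_SECONDS = 24 * 60 * 60  # 86,400

JIKAN_PER_NICHI = 32
BU_PER_JIKAN = 32
BYOH_PER_BU = 32

byoh_per_nichi = JIKAN_PER_NICHI * BU_PER_JIKAN * BYOH_PER_BU

# Case 1 (assumption): one byoh lasts one second, about one heartbeat.
nichi_seconds = byoh_per_nichi * 1.0
fraction_of_earth_day = nichi_seconds / EARTH_DAY_SECONDS

# Case 2: the nichi is one Earth day, so the byoh stretches to fit.
byoh_seconds_on_earth = EARTH_DAY_SECONDS / byoh_per_nichi

print(byoh_per_nichi)
print(f"{nichi_seconds / 3600:.1f} hours, {fraction_of_earth_day:.0%} of an Earth day")
print(f"{byoh_seconds_on_earth:.2f} seconds per byoh")
```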

This is another of the images IFX offered up. The outer ring has 32 symbols, spaced evenly, so perhaps this symbol set could serve for a full-nichi dial. However, the next ring in has 19 symbols. Perhaps we can posit that the planet's orbit is divided into month-like periods, let's call them tsuki, of 19 nichi each. That's unlikely in a binary-based culture. Ditto for the outer ring of 28 "petals". And I am not sure what to do with the weird spiral. Clearly, I am not going to get far using SI (Simulated Intelligence), at least not yet.

By the way, look carefully at the two gears below the spiral. The teeth don't mesh. IFX "knows" what gears are, and that they have something to do with clocks, but it doesn't "know" how they work.

I created a set of symbols based on numbers used in Sumerian cuneiform:

The basic set is four. Two fours stacked is 8, 4+8 = 12, and two eights are 16. However, it might be better to devise a symbol for zero (a filled circle will do; it's what the Sumerians used), so the "16" would look a lot like our "10". On the other hand, perhaps such a culture would not be ready for a 2-symbol number until after 31 (the bottom two symbols set next to each other, effectively 24+7). Then the single tall wedge followed by the circle would represent 32.

I wanted a way these could be arranged around a circular dial. When we have numeric digits on a clock, they are usually upright, but we place Roman numerals all pointing outward from the center, like the pseudo-Roman symbols in the image above. The next image is a concept of these wedge-digits arranged that way.

Making this dial with PowerPoint was easier than I thought. Right away I noticed that when you turn a shape by its handle, the angle is shown in a status line on the Format menu. 360°/16 = 22.5°, so that was easy. The lines in the diagram help line up the symbols correctly.
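For anyone who would rather type the rotation angles than drag handles, here is a small sketch that lists them. The function dial_angles is my own, and measuring clockwise from the top of the dial is an assumption about how the drawing program reports rotation.

```python
def dial_angles(count=16, offset_steps=0.0):
    """Rotation in degrees for each symbol on an evenly spaced dial.

    offset_steps shifts the whole ring by a fraction of one step.
    """
    step = 360.0 / count
    return [(i + offset_steps) * step for i in range(count)]

angles = dial_angles(16)
half_step = dial_angles(16, offset_steps=0.5)

print(angles)
print(half_step[0])
```

With sixteen symbols the step is 22.5°, and shifting the ring by half a step starts it at 11.25°.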

I let all this percolate for a month or so. Then I went back to the art generator programs and tried a different tack. After much experimentation I settled on this prompt:

A colorful circular dial with sixteen ordered symbols evenly spaced around it

A few hours of playing around yielded lots of interesting images. I'll present 27 of them, from five programs, with a bit of discussion after each set of nine.

These are three each from three programs: Gemini, Dall-E3, and DreamStudio. The prompt for the first Gemini image left out the words "colorful" and "ordered", and "colorful" was added for the other two. Out of a large number of offerings, only the first two had sixteen items, although they are numbers and have repetition. The third is admirably wonky, but nothing like a clock dial, though the outer ring does have twelve items. All other images were produced using the full prompt.

Dall-E3 produced one dial with 20 symbols in the outer ring, and 12 in the inner rings; a dial with 24 symbols plus four knifelike items intervening and an inner ring of 24; and the third image goes off the rails in a big way. If Dall-E3 were to produce a 16-member ring, it would be purely by chance!

DreamStudio went farther off the rails than that, and stayed there. Some of the rings in these images can be counted, and some, not so much. None is 16, and some have an odd number of symbols. Another set:

These were all produced by Leonardo AI, in various Styles. In the top row we have a dial with several sets of 12 items, primarily Roman numerals, secondly a dial with a main ring of 24 symbols, a mixture of Romanesque and "various", and lastly a dial that indeed has 16 symbols of alternating sizes (I like that!), while its inner ring has eight divisions.

In the middle row we first have a ring of 12 larger symbols, then a ring of 16 (Yay!); then a dial with two rings of 14; and the third dial also has 16, with the whole business offset by a half step, or 11.25°.

The bottom row starts with a dial that looks like needlework, and has 16 items. The second dial has an outer ring of 19 symbols, a narrow ring of 18 digits and digit-like symbols, then a compass-rose-like dial with 12 divisions. The last dial has 16 rather complicated symbols, and all the rings within it are also sets of 16. Now we're getting somewhere! Now the final set:

These were all produced by ImageFX. Most of these are easier to count, and several achieved at least one 16-symbol dial. At upper left, the counts are 16 and 10; next to it, they are 15, 9, and 6; and the "rainbow dial" has 16 and 16 only. Another great result.

In the second row we first have a dial with 14 symbols, a narrow ring of such variety I can't determine how to count it, and an inner ring of 8 symbols; secondly an oblique view of a dial with 15 symbols and an inner ring of 7 or 8, unevenly spaced; and thirdly a dial with 16 symbols, a narrow ring of 16 small symbols (maybe pronunciations?), and a narrower uncountable ring of many symbols.

The bottom row starts with an oblique view of a dial with 19 symbols and an inner ring of 12 divisions with barely visible symbols; then a dial with 16 symbols, a narrow ring of 16 numbers (in no particular order but all have 2 digits), a ring with numerous "words", then a very small ring of alternating colors totaling 16 symbols; and finally a dial with 16 symbols, and several rings with eight members each. This one could also prove useful.

Experiments like this show how the training sets of the SI programs affect the images they produce. Getting away from the notion of "clock" made it possible for a couple of the programs to generate images that could be useful to illustrate a clock for binary timekeeping. We get the most interesting results when we explore the edges of such software's capabilities.

Monday, February 03, 2025

AI can't tell time

kw: ai experiments, prompts, ai art, failures

In various corners of the universe of knowledge, SI (Simulated Intelligence) is manifestly ignorant. This can be seen in certain everyday tasks, such as reading (or drawing) an analog clock. I heard mention that most advertising for clocks and watches shows the hands set to 10:10 because ad writers think that has the most attractive appearance. Since SI programs have no knowledge of what clocks even are, or how the hands show time, they are dependent on their training image sets, which are only useful if there is text accompanying the images. I decided to see how various art generators would handle this prompt:

An image of a very decorated mantel clock showing the time as 4:15

I first used Gemini. This is the result, showing my prompt (one has to tell Gemini this is to be an image or picture):

The program did a good job with the decoration. That is its strength. But, sure enough, the clock's hands are pointed at 10:10, or very nearly so. If you look closely, the hour hand is exactly at the 10, where it should be 1/6th of the way to the 11.

Gemini produces only one image at a time, in contrast to all the other programs at my disposal.

I next tried DreamStudio, the program I began using most recently. I set the number of images to make at 2, because I pay for credits, and each image costs something. Using the same prompt:

DreamStudio is playing a trick in the first image. The hands have a "head" at both ends, so the time being indicated is ambiguous, but one interpretation is still 10:10. Though the hands have different shapes, it's also hard to tell hour from minute hand, so eight interpretations are possible! Don't try to teach your kid to read a clock that has such pathological hands!

The second image at least has a hand pointed at the 4, but given that the other is pointed at the 8, indicating 4:40, the hour hand should be about two-thirds of the way from the 4 to the 5.
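Where the hands ought to be is simple arithmetic, which makes the failures more striking. A minimal sketch, with function names of my own choosing:

```python
def hand_angles(hour, minute):
    """Degrees clockwise from 12 for the (hour hand, minute hand) of an analog clock."""
    minute_angle = minute * 6.0                     # 360 degrees / 60 minutes
    hour_angle = (hour % 12) * 30.0 + minute * 0.5  # 30 degrees per hour, plus drift
    return hour_angle, minute_angle

def hour_hand_progress(minute):
    """Fraction of the way the hour hand has moved toward the next numeral."""
    return minute / 60

for h, m in [(4, 15), (10, 10), (4, 40)]:
    print(f"{h}:{m:02d}", hand_angles(h, m), round(hour_hand_progress(m), 3))
```

For 4:15 the minute hand belongs on the 3 and the hour hand a quarter of the way from the 4 to the 5. At 10:10 the hour hand should be a sixth of the way past the 10, and at 4:40 two-thirds of the way past the 4.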

Next victim: ImageFX (driving Imagen 3, the same as Gemini). It is free to use, so I let it run four images. I also left the aspect ratio at 16:9, the setting I usually use with this program.

All hands point to 10:10, the expected result. Next, Leonardo, using the "bare bones" Leonardo Lightning style and the default (Dynamic) substyle:

Here we find an interesting variety of responses.

Upper left: The hands are so nearly the same length it's hard to say if this is 10:10 or 2:50, although whichever hand is the hour hand, it's pointed right at the digit, not advanced as it should be.
Upper right: This looks the most like a real clock. The hour hand is between the 4 and the 5. It still isn't showing 4:15.
Lower left: The hour hand is near the 6, but on the wrong side of it, unless it is just a little too far over and the time should be read as 5:40.
Lower right: 4:40, with a misplaced hour hand, as seen before.

Finally, here is the response from Dall-E3 in Bing:

Assuming I've figured out correctly which hand is which in each case, the times shown are 10:07, 2:50, 10:09 and 12:55.

So there you have it. Not one 4:15 in the bunch.

Thursday, January 16, 2025

Compliance among Auto Art programs

kw: ai experiments, simulated intelligence, automatic art, comparisons, generated art

I like caves. In the post Troglodyte Fantasy I reported on a project to generate images of about a dozen rooms built into cave spaces, using two different art generators. I experimented with several others, and I conclude that the various programs vary significantly in how much they conform to or comply with the details of a prompt. I used long prompts in particular for this project. Here is the first one, which was intended to reify my ideas for a "man cave":

A room in a spectacular cave that has many stalactites and stalagmites, with flowstone on the room's walls, fitted out as an office with a desk and chair and desk lamp and two large computer monitors, with a bookshelf full of books to one side and two smaller side chairs.

Note that the room inventory is one desk, one desk chair, two side chairs, a desk lamp, two computer monitors, and one bookshelf. The milieu is a cave as described.

So far, I have produced images for all the rooms using three programs: Leonardo AI, ImageFX, and most recently DreamStudio. I also produced several versions of the Cave Office image using Gemini, Dall-E3, and Playground. The degree of prompt compliance these programs exhibit is quite variable, both from program to program and within the various "styles" or other toolsets of a program. I show some findings below, first for the four programs that I managed to "persuade" to hit nearly all the goals. Here is an image montage:

DE3: Dall-E3 – Everything is there, plus an extra bookshelf and several extra lamps in addition to a tiny desk lamp. We also see a view outside the cave through an archway.

DS: DreamStudio – There is only one side chair. There is a bonus monitor and floor lamp. However, it took the production of dozens of images to get this one.

GEM: Gemini – No side chairs, but everything else is there. The desk lamp is off, and the cave in general is the darkest one of these four. This was cropped from a square image.

IFX: ImageFX – Everything is there, plus an extra bookshelf and extra desk lamp. I understand that both Gemini and ImageFX use Imagen 3 to generate images, but there must be different training sets in the background.

The other two programs have numerous "style" settings, so in the second montage I showcase two variations for each program:

Leo: Leonardo AI. On the left, style and substyle "Phoenix" and "illustration", which explains the drawn appearance. Everything is there, although the two lamps stand beside rather than on the desk, so there is no real desk lamp. I am not sure what the green tree in the corner is doing there! "Phoenix" is billed as being extra-compliant to prompts.

On the right, style and substyle "Lightning" and "vibrant", so color and contrast are enhanced. It's hard to see where a second monitor might be. Everything else is there, with added chairs and tables and table lamps, like a mini-conference sidebar. Note that Leonardo AI has various levels of credit usage for different styles, and Phoenix costs 2.4 times Lightning, while most other styles cost 1.4 times Lightning, which is promoted as fast and cheap.

PG: Playground. On the left, using the SDXL (Stable Diffusion XL) engine, probably version 1.0. There is only one side chair, but an extra bookshelf opposite, and a smaller bookshelf at the far end of the room.

On the right, using the PG30 (Playground 3.0) engine, which is billed as "very compliant to prompts". That is apparent here. Everything is there, with nothing extra. Sadly, Playground has dropped its image generation interface and announced it is going into graphic design. I'll miss it. It had the most options, but a big learning curve.

This doesn't get very deep into the use of these programs. At present the only program I have paid into is DreamStudio, because they have a pay-as-you-go plan, similar to the one Dall-E2 had. The others have various subscription plans, which I avoid. I haven't tried editing or outpainting with any of these except Playground.

It is likely I could edit an image to add something I think is missing. But I prefer to get an image that is closer to what I want from the start, so little or no editing is needed. In the past I used outpainting to turn a square image into a wide-format image. That is not needed now, except with Gemini. Even there, when asked for "wide format" Gemini produces an image a little zoomed out so you can crop it, and its original images are 2048x2048, which helps.

Friday, November 29, 2024

Troglodyte fantasy

kw: generated images, ai experiments, caves, cave dwellings, underground living

I have been experimenting with image generation "AI" software for just over two years, having first tried out Dall-E2 on November 11, 2022. I frequently use the software to produce various kinds of backgrounds for Zoom meetings, to use with a green screen. Some are forest glades, some are mountain scenes, some are desert scenes, some are views of alien planets, and some that are intended for "business" sessions are laboratories or offices.

I had the idea to make images of an office in a cave. I love caves. My preferred vacation destinations are places such as Carlsbad Caverns, Mammoth Cave, Luray Caverns and the several caves along the Blue Ridge Parkway. I had several art generators produce hundreds of images, and kept about thirty of them. Two are standouts:

This is from Leonardo AI, using the Preset "Illustrative Albedo" and Style "Stylized Illustration", which produced this colorful result. It's a bit fanciful, which appealed to me.

To see an image full size, click on it. Click next to the image to return to this page.

This is from ImageFX (in Google Labs), which does things differently, having an option to choose from numerous adjectives, which then are appended to the prompt. I appended "Cinematic". This looks more like a natural, though dry, cave (I wouldn't want an office in a drippy, living cave!).

A week or two later I decided to generate the rest of the rooms of a cave dwelling, one I might like to live in.

The original prompt for the Cave Office was an extra-long 50-word prompt, about 275 characters. Several of the art generators put the first 32 characters of the prompt in the file name, in addition to the name of the program and, often, the "seed" number (the seed is supposed to allow you to regenerate an image and then modify it in a later session). I keep a text file of long prompts so I can re-use them. This helps with product comparisons. I change the file name to a prompt identifier plus a program ID, the date, and a serial number.

I eventually prepared twelve more "Trog" series prompts, ranging from 17 to 48 words. Note that most of these programs take in no more than 77 "tokens" (an unusual number...), and a "token" can be a word, a syllable, or a punctuation-space combination, so I don't let prompts go much beyond 50 words.

Is Troglodyte a new word to you? It means a person or animal that lives underground. In old literature it was used in a derogatory way to refer to, for example, underground-dwelling "dwarves".

I won't dig deeper into the technicalities. Here I want to showcase the different rooms I dreamed up, and the way each of these two programs responded to the prompt. I present them four rooms at a time, and each "room" is my favorite from 4, or 8, or more images offered up by the art generator. I start with all the Leonardo AI offerings:

Clockwise from upper left:

Entryway. An arch has been built into the cave mouth, and a case of shelves full of pots stands nearby. I asked for a coat closet; this is the only "closet" I was offered! Note that this combination of product and its presets has every room adjacent to a skylight or a cave exit.
Living Room. The floor is paved. I'd have asked for Grow Lights for the houseplants if I'd realized the program would create some.
Kitchen, complete with a window to outside. The prompt included "refrigerator" but none was offered.
Formal Dining Room with a chandelier. The pool is a bonus.

Sitting Room and Library. It takes a dry cave to be a safe place to shelve books.
TV Room. I debated asking for theater-style seating, but opted for this look instead.
Office. The original cave room, which inspired all the others.
Game Room. What's a grand home without a billiards table and some board games?

Hallway to Bedrooms. This was as close as I could come among the Leonardo offerings. The floor is close to natural. I asked for wall sconces, and got lots of them.
Master Bedroom. I asked for a canopy bed, but never received one.
Utility Room and Laundry. The most natural floor of all the rooms. The tool bench is minimal. Though no stairs are evident, this is clearly a "basement" area.
Walk-in Closet. This is intended to be attached to the Master Bedroom. One presumes the view is from an archway in the bedroom.

OK, let's compare the ImageFX offerings, in the same order:

As before, clockwise from top left:

Entryway. Here, the entry is apparently around a bend. Coat closets and a chair make it inviting.
Living Room.
Kitchen. Complete with refrigerator! IFX complies more closely with details in the prompt.
Formal Dining Room with chandelier. The buffet off to the side, with warming pans, is a nice touch.

Sitting Room and Library. I didn't ask for the Oriental rug, but I'm glad it was included.
TV Room. Here the seating is facing the screen, in an informal arrangement.
Office. Of all the rooms, this looks the most like the cave is a backdrop rather than integral.
Game Room.

Hall to Bedrooms. This is what I had in mind.
Master Bedroom. With canopy bed! The knitted rug is as I asked for every time; here it is most evident.
Utility Room and Laundry. A better work bench.
Walk-in Closet. The central dresser is nice. I had asked for both men's and women's clothing to be shown. IFX did so.

The big lesson for me is that tremendous variety is available; it takes lots of experimentation to learn the uses and limitations of each tool. The IFX images tend to be low-key. To make a presentable image, I'll raise the lightness with the Gamma tool in IrfanView, which I use to trim an image and add a signature (as any artist would!). I use Upscayl to double the x and y pixel counts.

If you have a sharp eye, you may note that the aspect ratio of the images differs between the two programs. Both Leonardo AI and ImageFX have various aspect ratios available. I always asked for 16:9, the same as HDTV, which also matches the screens of my computer setup. However, Leonardo AI, for all its Presets except "Phoenix", yields images that are 1368x768, a ratio of 1.78125 or 57:32, slightly wider than 16:9 (1.7777...). ImageFX images are even wider, 1408x768, a ratio of 1.8333... or 11:6.
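
The ratio arithmetic above can be checked with a few lines of Python. This is only a minimal sketch: the function name `reduced_ratio` is made up for illustration, and the pixel sizes are the ones quoted in the paragraph above.

```python
from math import gcd

def reduced_ratio(width, height):
    """Reduce width:height to lowest terms and also return the decimal ratio.

    Hypothetical helper for illustration; not part of any image program.
    """
    g = gcd(width, height)
    return width // g, height // g, width / height

# Pixel sizes quoted in the text above.
sizes = {"Leonardo AI": (1368, 768), "ImageFX": (1408, 768), "True 16:9": (1920, 1080)}

for name, (w, h) in sizes.items():
    rw, rh, decimal = reduced_ratio(w, h)
    print(f"{name}: {w}x{h} reduces to {rw}:{rh} = {decimal:.5f}")
```

Dividing both sides by the greatest common divisor is what turns 1368x768 into 57:32 and 1408x768 into 11:6; neither reduces to 16:9.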

When I make a Zoom background, or a wallpaper for my Screen Saver, I want exactly 16:9, so images must be trimmed. I'll prepare an essay about that later on.
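
As a rough illustration of the trimming arithmetic (a sketch under stated assumptions, not the IrfanView procedure), the Python below finds the largest centered crop that is exactly 16:9. The function name `crop_box_16x9` is hypothetical, and centering the crop is an assumption; a real trim might favor one side of the picture.

```python
def crop_box_16x9(width, height):
    """Return (left, top, right, bottom) of the largest centered crop
    whose sides are exact multiples of 16 and 9.

    Hypothetical helper; centering the crop is an assumption.
    """
    # An exact 16:9 image is 16*k pixels wide and 9*k pixels high.
    k = min(width // 16, height // 9)
    crop_w, crop_h = 16 * k, 9 * k
    left = (width - crop_w) // 2
    top = (height - crop_h) // 2
    return left, top, left + crop_w, top + crop_h

# Image sizes mentioned earlier in this post.
for name, (w, h) in {"Leonardo AI": (1368, 768), "ImageFX": (1408, 768)}.items():
    left, top, right, bottom = crop_box_16x9(w, h)
    print(f"{name}: keep columns {left}..{right}, rows {top}..{bottom} "
          f"({right - left}x{bottom - top})")
```

Because 768 is not a multiple of 9, an exact 16:9 crop gives up a few rows as well as columns: both sizes come out at 1360x765. The box uses the (left, top, right, bottom) ordering that the Pillow library's `Image.crop` accepts, should one want to script the trim.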