Using Local AI On The Command Line To Rename Images (And More)

We all have a folder full of images whose filenames resemble line noise. How about renaming those images with the help of a local LLM (large language model) executable on the command line? All that and more is demonstrated in [Justine Tunney]’s Bash One-Liners for LLMs, a write-up aimed at giving folks ideas and guidance on using a local (and private) LLM to do actual, useful work.

This builds on the recent llamafile project, which turns LLMs into single-file executables. That not only makes them more portable and easier to distribute, but the executables can be called from the command line and write to standard output like any other UNIX tool. Keeping the LLM weights embedded in the same file also makes it simpler to version control them (and therefore the tool’s behavior).

One such tool (the multi-modal LLaVA) is capable of interpreting image content. As an example, we can point it to a local image of the Jolly Wrencher logo using the following command:

llava-v1.5-7b-q4-main.llamafile --image logo.jpg --temp 0 -e -p '### User: The image has...\n### Assistant:'

Which produces the following response:

The image has a black background with a white skull and crossbones symbol.

With a different prompt (“What do you see?” instead of “The image has…”) the LLM even picks out the wrenches, but one can already see that the right pieces exist to do some useful work.

Check out [Justine]’s rename-pictures.sh script, which cleverly evaluates image filenames. If an image’s given filename already looks like readable English (also a job for a local LLM), the image is left alone. Otherwise, the picture is fed to an LLM whose output guides the generation of a new, short, descriptive English filename in lowercase, with underscores for spaces.
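In rough strokes, the idea looks something like this; a minimal sketch rather than [Justine]’s actual script, with the “readable English” check reduced to a simple character test and the filename cleanup done with standard UNIX tools:

```
#!/bin/sh
# Sketch only: rename JPEGs whose names don't already look readable.
# Assumes the llava llamafile from the example below sits in the current
# directory and that its output is just the completion text.
for f in *.jpg; do
  base=${f%.jpg}
  case "$base" in
    *[!a-z_]*) ;;   # contains something besides a-z/underscore: rename it
    *) continue ;;  # already a readable lowercase_name: leave it alone
  esac
  desc=$(./llava-v1.5-7b-q4-main.llamafile --image "$f" --temp 0 -e \
    -p '### User: The image has...\n### Assistant:')
  # crude conversion to a short, lowercase, underscore-separated name
  name=$(printf '%s' "$desc" | tr 'A-Z' 'a-z' | tr -cs 'a-z' '_' \
    | cut -c1-48 | sed 's/^_*//; s/_*$//')
  [ -n "$name" ] && mv -n -- "$f" "${name}.jpg"   # -n: never overwrite
done
```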

What about the fact that LLM output isn’t entirely predictable? That’s easy to deal with: [Justine] suggests always calling these tools with the --temp 0 parameter. Setting the temperature to zero makes the model deterministic, ensuring that the same input always yields the same output.
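An easy way to check that behaviour on your own machine is to run the example above twice and compare the transcripts:

```
# With --temp 0 (greedy sampling) both runs should produce identical text.
./llava-v1.5-7b-q4-main.llamafile --image logo.jpg --temp 0 -e \
  -p '### User: The image has...\n### Assistant:' > run1.txt
./llava-v1.5-7b-q4-main.llamafile --image logo.jpg --temp 0 -e \
  -p '### User: The image has...\n### Assistant:' > run2.txt
diff run1.txt run2.txt && echo "deterministic"
```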

There are more neat examples in Bash One-Liners for LLMs that demonstrate different ways to use a local LLM that lives in a single-file executable, so be sure to give it a look and see if it sparks any new ideas. After all, we have previously shown how automating tasks is almost always worth the time invested.

36 thoughts on “Using Local AI On The Command Line To Rename Images (And More)”

    1. You can process pictures with ML, rename them, and generate thumbnails after that,

      or you can write a script capable of searching for the same name and renaming all instances of that name.

      Regex + grep (or ripgrep) can do a lot for your searches.
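      For instance, once the filenames are descriptive, a quick ripgrep pass over the file listing goes a long way (the search terms here are just examples):

      ```
      # list image files, then grep the names themselves for keywords
      rg --files --iglob '*.jpg' | rg -i 'skull|wrench'
      ```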

      1. What I meant was, this isn’t any more useful than just thumbnails on a file browser. It’s less useful.

        That’s because the names that the LM generates are not really searchable. You do not know which search terms are valid and possible, and you don’t know whether the LM actually used the same tags or terms for the same things in different images, so making a wrong guess about what the AI decided to call something will cause you to miss the correct file in your search.

        In other words, to make sure you find something, you have to eyeball through the list of files you have to see if something matches your interest, which is like skimming through the thumbnails, except you’re not seeing what’s in the pictures – only a second-hand description.

        1. I’d have to disagree a bit there; even though the model may not use the exact same terms for everything, it will be a great tool for grouping and filtering real chaos into things that are easier to search.

          So looking for the Jolly Wrencher watermark you like putting on your project pictures with search terms like skull, wrench, and crossbones might also throw up some people holding wrenches, a Jolly Roger flag, that skull-drenched 40K mini collection, etc. But it will almost certainly have eliminated every image from that trip you took, the family gatherings, and so on. And once you have searched with the terms that catch most if not all of your project images, you can eliminate further images with other terms – for instance, if you are getting lots of your beloved Warhammer faction, you can probably filter most of those out because the language model will have tagged them mini, Warhammer, model, or toy – whatever it actually uses. You don’t need to know the exact search terms the model will use any more than you do when searching your favourite webstores and search engines – though it certainly helps if you already have that exhaustive list of ‘correct’ terms.

          Which then means a manual scan of the thumbnails by Mk1 eyeball should be both much faster and easier – spotting the hopefully smaller number of false positives that slipped through will be much easier than trying to pick out the images you wanted from a massive pile of thumbnails.

          The only way just browsing thumbnails would be better is if you really don’t have many images to parse, or whatever you are looking for is so generic and bland that almost every image will match any search term you could use to differentiate it.

          1. If you were looking for the Jolly Wrencher, you’d prefer the image to be tagged “Jolly Wrencher”, not second-guessing and playing twenty questions about what some broken-telephone system happened to label it.

          2. >with some search terms like skull, wrench, and crossbones

            If the actual tag used was “cross-bones” or “crossed bones”, “two bones”, etc. instead, you have just eliminated the results you were looking for. Positive search terms will too easily drop out your desired result.

            In order to not exclude the correct result, you have to use negative search terms: things you know are NOT in the image you’re looking for. This can help you whittle down the results to some manageable number, but if you have thousands and thousands of pictures it’s simply easier to guess where that picture might be and go browsing the thumbnails.

          3. Of course you would prefer it if the AI actually nailed it the first time, but compared to manually searching huge volumes of images, playing 20 questions to filter the results down to a nice manageable number really isn’t a problem.

            Of course the real solution here is to never have a technophobe relative that has left all their personal data scattered everywhere across their computer with no sane filenames and keep your own data organised so you never actually have to search this way… But you only really have control of one of those elements.

          4. > You don’t need to know the search terms the model will use any more than you do when searching your favourite webstores and search engines

            That’s because those search engines have a second language model doing a “meta-search” to look for related search words that might apply, and mixing those results with your actual search.

            But that results in an endless list of “might be” results that are increasingly irrelevant, and you will never get to the end of it. You just hope that the actual result you wanted is within the first two or three pages. If you happened to hit on the right search term, that is more likely, but if you didn’t then you have to find the result among all the “alt” results.

          5. >playing 20 questions to filter the results to a nice manageable number really isn’t a problem.

            Well it is a problem if you never end up with the correct result, because you filtered it out by adding some positive search term that does not happen to apply to the file you’re looking for.

          6. The point is, you can’t trust that you’ll actually find anything in particular with this model.

            You can find some things sometimes, but it’s no automatic archival system.

  1. Yes, but put this data into an EXIF annotation inside the picture file itself. It is a widely used standard and there are a lot of tools to view / sort images based on this data. So, for example, if you want to view all images with a cow in them, you just open your favorite picture-viewing program and it will filter them based on your query. Even console tools can search / sort based on EXIF data.

    I am saving the names of windows visible in screenshots in EXIF,
    and if I generate pictures for some project then the project / client name is also embedded in EXIF. It makes life so much easier when a client calls wanting to know something about an old project and you need to find something 10 years old… These pictures are not creative graphical works; they are photos of devices, schematics, wiring, water damage for the insurance company, etc.

    OR
    as Apple does it with the Photos app: they put all this metadata into an SQLite database after on-device ML processes your pictures, once your Apple device detects that it is connected to a charger.

    ref: https://simonwillison.net/2020/May/21/dogsheep-photos/

    I also use on-device ML for transcription of company calls; videos can have subtitles / transcripts searchable with console tools too.
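    A minimal sketch of that workflow using exiftool (the tag choice and the “cow” query are illustrative assumptions, not necessarily the exact setup described above):

    ```
    # embed a description in the standard EXIF / XMP description tags
    exiftool -overwrite_original \
      -ImageDescription='cow standing in a field' \
      -XMP-dc:Description='cow standing in a field' photo.jpg

    # later: recursively list every image whose description mentions "cow"
    exiftool -r -if '$ImageDescription =~ /cow/i' -p '$Directory/$FileName' .
    ```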

    1. 100% agreed. This is a great time to use Dublin-core metadata, specifically DC:Description, to store the generated descriptions. Long file names are generally not too useful either.

      1. EXIF data is problematic in the sense that many programs like to stuff in personal data, like user account names, “created by” information, etc., that gets leaked across. It can also include data that is a security risk, like embedded program binaries, or be used to smuggle out sensitive information.

        So lots of websites and services, and I personally, just tend to bulk-erase EXIF data if it has no relevance. Most of the time the information is redundant anyhow. I know, for example, that my camera puts out sRGB, so I can assign the correct color profile (or it is assumed anyhow) if needed.

        1. I do not care what other people / services do; having all this data searchable is a personal and business advantage for me, so I am doing it. Taking examples from mediocrity does not lead to excellence.

          You can store this metadata in an external database or, as another commenter suggested, in external XML files, so all your metadata can be safely stored on your devices and still be useful; you can still search for what you need and link the results to files no matter where they are stored.

          And how do you remove personal data embedded inside a filename?

    2. The issue I find with any image “tagging” system is that if the tags are not well defined, unambiguous and fixed, then they don’t help much at all, since they’re not really searchable. Yet if the tag set is well defined, unambiguous and fixed, it tends to be too restrictive, because you don’t know what kind of things you will want to tag in the future.

      The same thing will usually have a variety of synonymous tags. Searching for one may not necessarily bring up all relevant results, but including more synonymous tags will bring up false positives and unwanted noise. Knowing which tags might apply requires you to know the entire tag list, which is typically infeasible because it can be huge and may be changing all the time. It may be that nobody knows the entire tag list in the first place because the tags are arbitrary.

      If you have a tagging system where the tags can be arbitrary and there is no practical way of knowing which tags might apply, then quite often the only hope of ever finding your thing is to search by listing what you don’t want to find. That is, assuming you actually remember what definitely wasn’t in the picture. That way you at least narrow the result down to a few thousand results that may be manually checked.

      So, if you have a natural language model that is trying to tag images, it won’t be useful for you later unless you know exactly what words it will use to describe things. Will it always say “cat”, or will it sometimes say “domestic feline”? Where’s the list of words that might apply? In the end, it just falls back to opening the picture folder in a file browser and quickly skimming through the thumbnails to find what you’re looking for.

      1. This is about the usefulness of structured vs. unstructured data.

        Any (free-form) tag is multiple orders of magnitude better than no tags.
        Synonymous tags – 10,000 documents can be reduced to 50 results which include the synonymous tags. What is the problem? It is useful.

        “Know the entire tag list…” – arbitrary does not mean a 64-bit binary search space, so non-issue.

        You can use ML to search the tags: a search model can find statistically adjacent tags. You can do that even with a Bloom filter. Non-issue.

        What if the ML provides more tags for the same property, so it puts in not only CAT and not only “domesticated feline” but more tags? And what if the ML learns on your text corpus and assigns the tags YOU use for those things?

  2. One could add an extra step with the LLM to convert the description to a more concise filename.

    For example:

    User: Generate short, useful filename for image described as: “The image has a black background with a white skull and crossbones symbol.”

    Assistant: SkullCrossbones_BlackWhiteBG.jpg
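    Chained on the command line, that could look roughly like this; the second, text-only llamafile name is a placeholder and the prompt formats are simplified:

    ```
    desc=$(./llava-v1.5-7b-q4-main.llamafile --image logo.jpg --temp 0 -e \
      -p '### User: The image has...\n### Assistant:')
    ./text-model.llamafile --temp 0 -e \
      -p "### User: Generate a short, useful filename for an image described as: $desc\n### Assistant:"
    ```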

  3. Just downloaded and tried out the system mentioned in the article.

    Truth in advertising: the system is pants at labeling images that it hasn’t been trained on.

    In summary, if you have image files using random characters as names, you can rename them to be images with random words as names, which is somehow better because… AI.

    I recommend that people try out/test AI systems using original content; specifically, not something that the AI was trained on from the net. For my test I had a picture of a combination plasma/welding table I made for the local school, which has never been published on the net.

    It took 23 minutes to assess one image. This is not a deal breaker; I only post it so that people can form realistic expectations. I have CUDA and an older, but still quite beefy, NVIDIA video card installed, so the system should be taking advantage of the GPUs.

    The description was generic and contained errors. The slag bucket was mislabeled as a box fan, the table base color was mislabeled as the top, and so on.

    Thinking that the image size might have affected the run time, I resized the image and tried again.

    Same compute time, and this time the description was completely, inarguably wrong.

    Here’s the image and prompt I used for the test:

    https://hackaday.io/project/5283-potpourri/log/226307-llava-example

    This is not a comment on [Justine]’s work, which is excellent and is bringing AI to the public; I’m only pointing out that the AI itself is not yet ready for prime time.

    1. OxyCuttingTable_onWheels.png

      Seems like artificial intelligence is just coding… poorly optimised and resource/power/hardware intensive, per the aforementioned 23 minutes.

      That said, if I had never practiced oxy cutting on one of these things, I might not be sure what it actually is.

      Bless.

    2. Indeed, and that in a nutshell is the problem with AI in general – it might be very good at something in the conditions it was trained under, but it will also call a lightly modified image of a tortoise an assault rifle… But when you want to sort images that are more in line with the training and not actively hostile, it will work reasonably well.

      Though I also think interpreting that bucket as a fan is not actually a great leap for once – everything about that table says it could be a charcoal forge with a fan for forcing the air in, if you squint a bit, and I’ve never actually seen a table with a funnel and slag bucket like that. Welding tables are just big metal tables full of holes to securely clamp and square your parts, and plasma tables tend to be just a slatted/mesh-type table top, perhaps with a water tray underneath… Don’t get me wrong, it is clearly what it is when you actually look at it, and that design seems like a good compromise in form and function to me, but at a glance it doesn’t look like a welding table to me.

      In your case I’d also suggest that making the image smaller probably created way too many artifacts of that process, which blurred out the data the model would have worked on and helped upset it further.

      1. The problem is that the AI doesn’t know that it doesn’t know, so it will always give you some result no matter how nonsensical.

        It’s just a Markov chain in the end. It will march on to some inevitable result given the input data.

    3. As a test, the short filename from this (description post-processed with LLM) would be “MetalTable_FanWorkshop.jpg” for the first run, and “MetalTable_Workspace_Image” for the second run. Still has the fan, but otherwise not too bad.

    4. Probably the `-ngl 35` option is not appropriate for your environment.
      In my environment (i7-10875H 2.3 GHz, RTX 2060, 64 GB RAM, Win10):

      Without `-ngl 35`, output is:
      ```
      The image features a room with a large metal table in the center. On top of this table, there is an industrial-sized blender or food processor sitting on a stand. The table appears to be made of wood and has a stainless steel appearance.

      In addition to the main focus, there are several chairs placed around the room, with some located near the edges of the image. A bench can also be seen in the background, adding more seating options for those who may use the space.
      ```
      llama_print_timings: total time = 45596.79 ms
      => 46sec

      With `-ngl 35`, output is:
      ```
      The image features a room with a large metal table in the center. On top of this table, there is an industrial-sized blender or food processor sitting on a stand. The table appears to be made of wood and has a stainless steel appearance.

      In addition to the main focus, there are several chairs placed around the room, with some located near the edges of the image. A bench can also be seen in the background, extending across the width of the scene.
      ```

      llama_print_timings: total time = 416162.06 ms
      => 7min

      1. Maybe nobody is interested, but this is a follow-up post.
        Without the `-ngl` option, there is no GPU offload.
        My environment is a notebook, so there is not enough VRAM to afford the `-ngl 35` option.
        On Windows, tinyBLAS is used by default, which is slower than cuBLAS.

        tinyBLAS
        -ngl 27: llama_print_timings: total time = 27794.54 ms
        -ngl 28: llama_print_timings: total time = 27094.44 ms
        -ngl 29: llama_print_timings: total time = 26258.35 ms *fastest* about 26s
        -ngl 30: llama_print_timings: total time = 28569.29 ms
        -ngl 31: llama_print_timings: total time = 66266.51 ms

        cuBLAS
        -ngl 27: llama_print_timings: total time = 11894.43 ms
        -ngl 28: llama_print_timings: total time = 11276.26 ms *fastest* about 11s
        -ngl 29: llama_print_timings: total time = 12093.04 ms

        The output of cuBLAS with the `-ngl 29` option is as follows. The other outputs are identical to tinyBLAS with the `-ngl 35` option.


        The image features a room with a large metal table in the center. On top of this table, there is an industrial-sized blender or food processor, which appears to be made from metal and has a silver appearance. The table occupies most of the space in the room, extending almost all the way across the scene.

        In addition to the main focus on the large table with the blender, there are several chairs placed around the room, some closer to the foreground and others further back. A bench can also be seen near the center of the room, providing additional seating options.
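        For anyone wanting to reproduce these timings, the invocation is presumably the article’s example plus the `-ngl` (GPU layer offload) flag; `table.jpg` stands in for the test image here:

        ```
        ./llava-v1.5-7b-q4-main.llamafile --image table.jpg --temp 0 -e -ngl 29 \
          -p '### User: The image has...\n### Assistant:'
        # the llama_print_timings line quoted above appears at the end of the run
        ```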

    5. Thank you, THANK YOU [PWalsh] for putting in the time to test this and document your work. You have saved a bunch of people a bunch of time. OK, people, now go and use the time saved for something useful :-)

  4. >Setting the temperature to zero makes the model deterministic, ensuring that the same input always yields the same output.

    That’s misleading. “Same input” means the same picture. A different picture of the same thing can still generate a different output, so you can’t rely on the LLM to label the same thing in different pictures the same way.

    That spoils the whole point, because you can’t then search for “a thing” and expect every picture with “a thing” to come up.

    1. If there is no metadata on your images, you need something like this ‘AI’ to create the data to search – the other apps that exist might be useful for searching metadata database locations and stuff embedded in the file, but if that data is missing or erroneous they have no idea that the Jolly Wrencher is anything other than a file called “alphanumbericgobbldygook.jpg”, probably with created-on and modified dates supplied by the filesystem that mark only when it was copied/edited on this device. So congratulations, it found an image file with no useful data attached, to go with who knows how many other similarly blank files.

  5. IMHO the renaming is an awful idea, but I messed with the same software, though I did not compile it. What I did, and what I think is a much better idea, was to take the LM output, format it so it is easy to pick out, and place it at the end of the JPEG metadata so it gets embedded right in the image. I was careful to put it at the end in case there was any previous info there, and I surrounded it with tags so it is easy to pull out if it is not the only thing in there.
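    Something in that spirit, as a sketch using exiftool and made-up [LLAVA] delimiter tags appended to the existing UserComment so any prior info is preserved:

    ```
    desc=$(./llava-v1.5-7b-q4-main.llamafile --image photo.jpg --temp 0 -e \
      -p '### User: The image has...\n### Assistant:')
    old=$(exiftool -s3 -UserComment photo.jpg)   # existing comment, if any
    exiftool -overwrite_original \
      -UserComment="${old} [LLAVA]${desc}[/LLAVA]" photo.jpg
    ```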

    It is kind of neat, but it really needs a lot of local training to be what I would consider useful overall. Looking for a man or a woman or a boy or a dog or a lake is not super useful, and just looking at the Windows thumbnails of the photos is probably going to get you where you want to be faster.

    That being said, at some point in time there will be good and easy software for local training. When it finds features in photos it would ask you to label them. Over time it would start labeling some guesses and you just have to give it a quick yes or no. After a while it will get to know more and more features around you.

    So it would go from “a man and a woman petting a dog in front of their house” and replace that with “Uncle Phil and Aunt Sally petting Fido the dog in front of our house in 2025”, with some of the info possibly coming from other metadata available in the picture, for example the GPS location and the date.

    But with all of that said, I still would not want it messing with the filenames.
