dust42 12 hours ago

To add some numbers, on MBP M1 64GB with ggml-org/gemma-3-4b-it-GGUF I get

  25t/s prompt processing 
  63t/s token generation
Overall processing time per image is ~15secs, no matter what size the image is. The small 4B has already very decent output, describing different images pretty well.

Steps to reproduce:

  git clone https://github.com/ggml-org/llama.cpp.git
  cmake -B build
  cmake --build build --config Release -j 12 --clean-first
  # download model and mmproj files...
  build/bin/llama-server \
    --model gemma-3-4b-it-Q4_K_M.gguf \
    --mmproj mmproj-model-f16.gguf
Then open http://127.0.0.1:8080/ for the web interface

Note: if you are not using -hf, you must include the --mmproj switch or otherwise the web interface gives an error message that multimodal is not supported by the model.

I have used the official ggml-org/gemma-3-4b-it-GGUF quants, I expect the unsloth quants from danielhanchen to be a bit faster.

  • matja 3 hours ago

    For every image I try, I get the same response:

    > This image shows a diverse group of people in various poses, including a man wearing a hat, a woman in a wheelchair, a child with a large head, a man in a suit, and a woman in a hat.

    No, none of these things are in the images.

    I don't even know how to begin debugging that.

    • exe34 3 hours ago

      Means it can't see the actual image. It's not loading for some reason.

      • aendruk 32 minutes ago

        I’m having a hard time imagining how failure to see an image would result in such a misleadingly specific wrong output instead of e.g. “nothing” or “it’s nonsense with no significant visual interpretation”. That sounds awful to work with.

  • astrodude 2 hours ago

    do you have any example images it generated based on your prompts?

    want to have a look before I try

  • zamadatix 6 hours ago

    Are those numbers for the 4/8 bit quants or the full fp16?

    • dust42 5 hours ago

      It is a 4-bit quant gemma-3-4b-it-Q4_K_M.gguf. I just use "describe" as prompt or "short description" if I want less verbose output.

      As you are a photographer, using a picture from your website gemma 4b produces the following:

      "A stylish woman stands in the shade of a rustic wooden structure, overlooking a landscape of rolling hills and distant mountains. She is wearing a flowing, patterned maxi dress with a knotted waist and strappy sandals. The overall aesthetic is warm, summery, and evokes a sense of relaxed elegance."

      This description is pretty spot on.

      The picture I used is from the series L'Officiel.02 (L-officel_lanz_08_1369.jpg) from zamadatix' website.

    • refulgentis 2 hours ago

      n.b. the image processing is by a separate model, basically has to load the image and generate ~1000 tokens

      (source: vision was available in llama.cpp but Very Hard, been maintaining an implementation)

      (n.b. it's great work, extremely welcome, and new in that the vision code badly needed a rebase and refactoring after a year or two of each model adding in more stuff)

danielhanchen 14 hours ago

It works super well!

You'll have to compile llama.cpp from source, and you should get a llama-mtmd-cli program.

I made some quants with vision support - literally run:

./llama.cpp/llama-mtmd-cli -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL -ngl -1

./llama.cpp/llama-mtmd-cli -hf unsloth/gemma-3-12b-it-GGUF:Q4_K_XL -ngl -1

./llama.cpp/llama-mtmd-cli -hf unsloth/gemma-3-27b-it-GGUF:Q4_K_XL -ngl -1

./llama.cpp/llama-mtmd-cli -hf unsloth/unsloth/Mistral-Small-3.1-24B-Instruct-2503-GGUF:Q4_K_XL -ngl -1

Then load the image with /image image.png inside the chat, and chat away!

EDIT: -ngl -1 is not needed anymore for Metal backends (CUDA still yes) (llama.cpp will auto offload to the GPU by default!). -1 means all GPU layers offloaded to the GPU.

  • thenameless7741 13 hours ago

    If you install llama.cpp via Homebrew, llama-mtmd-cli is already included. So you can simply run `llama-mtmd-cli <args>`

  • danielhanchen 9 hours ago

    Ok it's actually better to use -ngl 99 and not -ngl -1. -1 might or might not work!

  • raffraffraff 13 hours ago

    I can't see the letters "ngl" anymore without wanting to punch something.

    • simlevesque 6 hours ago

      That's your problem. Hope you do something about that pent up aggressivity.

    • danielhanchen 13 hours ago

      Oh it's shorthand for number of layers to offload to the GPU for faster inference :) but yes it's probs not the best abbreviation.

      • stavros 10 hours ago

        It probably isn't, not gonna lie.

simonw 15 hours ago

This is the most useful documentation I've found so far to help understand how this works: https://github.com/ggml-org/llama.cpp/tree/master/tools/mtmd...

  • scribu 7 hours ago

    It’s interesting that they decided to move all of the architecture-specific image-to-embedding preprocessing into a separate library.

    Similar to how we ended up with the huggingface/tokenizers library for text-only Tranformers.

banana_giraffe 13 hours ago

I used this to create keywords and descriptions on a bunch of photos from a trip recently using Gemma3 4b. Works impressively well, including going doing basic OCR to give me summaries of photos of text, and picking up context clues to figure out where many of the pictures were taken.

Very nice for something that's self hosted.

  • accrual 13 hours ago

    That's pretty neat. Do you essentially loop over a list of images and run the prompt for each, then store the result somewhere (metadata, sqlite)?

    • banana_giraffe 13 hours ago

      Yep, exactly, just looped through each image with the same prompt and stored the results in a SQLite database to search through and maybe present more than a simple WebUI in the future.

      If you want to see, here it is:

      https://gist.github.com/Q726kbXuN/f300149131c008798411aa3246...

      Here's an example of the kind of detail it built up for me for one image:

      https://imgur.com/a/6jpISbk

      It's wrapped up in a bunch of POC code around talking to LLMs, so it's very very messy, but it does work. Probably will even work for someone that's not me.

      • wisdomseaker 12 hours ago

        Nice! How complicated do you think it would be to do summaries of all photos in a folder, ie say for a collection of holiday photos or after an event where images are grouped?

        • banana_giraffe 12 hours ago

          Very simple. You could either do what I did, and ask for details on each image, then ask for some sort of summary of the group of summaries, or just throw all the images in one go:

          https://imgur.com/a/1IrCR97

          I'm sure there's a context limit if you have enough images, where you need to start map-reducing things, but even that wouldn't be too hard.

          • wisdomseaker 12 hours ago

            Thanks for the reply, I'll see if I can work it out :)

            • sorenjan 6 hours ago

              You might want to extract the location from the image exif data and include in the prompt as well. There are reverse geocoding libraries and services that takes coordinates and return a city, which would probably make for a better summary of a trip.

  • buyucu 4 hours ago

    is gemma 4b good enough for this? I was playing with larger versions of gemma because I didn't think 4b would be any good.

ngxson 12 hours ago

We also support SmolVLM series which delivers light-speed response thanks to its mini size!

This is perfect for real-time home video surveillance system. That's one of the ideas for my next hobby project!

    llama-server -hf ggml-org/SmolVLM-Instruct-GGUF
    llama-server -hf ggml-org/SmolVLM-256M-Instruct-GGUF
    llama-server -hf ggml-org/SmolVLM-500M-Instruct-GGUF
    llama-server -hf ggml-org/SmolVLM2-2.2B-Instruct-GGUF
    llama-server -hf ggml-org/SmolVLM2-256M-Video-Instruct-GGUF
    llama-server -hf ggml-org/SmolVLM2-500M-Video-Instruct-GGUF
  • thatspartan 7 hours ago

    Thanks for landing the mtmd functionality in the server. Like the other commenter I kept poring over commits in anticipation.

  • a_e_k 11 hours ago

    I've been noticing your commits as I skim the latest git commit notes whenever I periodically pull and rebuild. Thank you for all your work on this (and llama.cpp in general)!

  • moffkalast 6 hours ago

    Ok but what's the quality of the high speed response? Can the sub-2.2B ones output a coherent sentence?

thenthenthen 9 hours ago

What has changed in laymans terms? I tried llama.cpp a few months ago and it could already do image description etc?

dr_kiszonka 9 hours ago

Are there any tools that leverage vision for UI development?

Use case: I am working on a hobby project that uses TS/React as frontend. I can use local or cloud LLMs in VSCode but even those with vision require that I take a screenshot and paste it to a chat. Ideally, I would want it all automated until some stop criterion is met (even if only n-iterations). But even an extension that would screenshot a preview and paste it to chat (triggered by a keyboard shortcut) would be a big time-saver.

gitroom 13 hours ago

Man, the ngl abbreviation gets me every time too. Kinda cool seeing all the tweaks folks do to make this stuff run faster on their Macs. You think models hitting these speed boosts will mean more people start playing with vision stuff at home?

  • thenthenthen 4 hours ago

    For sure! Llama.cpp runs great on my 10 year old pc and m1 mac!

a_e_k 11 hours ago

This is excellent. I've been pulling and rebuilding periodically, and watching the commit notes as they (mostly ngxson, I think) first added more vision models, each with their own CLI program, then unified those under a single CLI program and deprecated the standalone one, while bug fixing and improving the image processing. I'd been hoping that meant they'd eventually add support to the server again, and now it's here! Thanks!

nico 14 hours ago

How does this compare to using a multimodal model like gemma3 via ollama?

Any benefit on a Mac with apple silicon? Any experiences someone could share?

  • ngxson 13 hours ago

    Two things:

    1. Because the support in llama.cpp is horizontal integrated within ggml ecosystem, we can optimize it to run even faster than ollama.

    For example, pixtral/mistral small 3.1 model has some 2D-RoPE trick that use less memory than ollama's implementation. Same for flash attention (which will be added very soon), it will allow vision encoder to run faster while using less memory.

    2. llama.cpp simply support more models than ollama. For example, ollama does not support either pixtral or smolvlm

    • roger_ 12 hours ago

      Won’t the changes eventually be added to ollama? I thought it was based on llama.cpp

      • diggan 8 hours ago

        As far as I understand (not affiliated, just a user who peeked at the code), Ollama started out using llama.cpp as a runner for everything. But eventually they wrote their own runner in Golang, which is where they add support for new models. So most models you run via Ollama uses llama.cpp, but new stuff their own Golang runner.

    • danielhanchen 13 hours ago

      By the way - fantastic work again on llama.cpp vision support - keep it up!!

      • ngxson 12 hours ago

        Thanks Daniel! Kudos for your great work on quantization, I use the Mistral Small IQ2_M from unsloth during development and it works very well!!

        • danielhanchen 12 hours ago

          :)) I did have to update the chat template for Mistral - I did see your PR in llama.cpp for it - confusingly the tokenizer_config.json file doesn't have a chat_template, and it's rather in chat_template.jinja - I had to move the chat template into tokenizer_config.json, but I guess now with your fix its fine :)

          • ngxson 12 hours ago

            Ohhh nice to know! I was pretty sure that someone already tried to fix the chat template haha, but because we also allow users to freely create their quants via the GGUF-my-repo space, I have to fix the quants produces from that source

    • nolist_policy 10 hours ago

      On the other hand ollama supports iSWA for Gemma 3 while llama.cpp doesn't. iSWA reduces kv cache size to 1/6.

      • vlovich123 10 hours ago

        What’s iSWA? Can’t find any reference online

        • imtringued 9 hours ago

          Gemma 3 has some layers with a context size of 1024 tokens and others having full length. You need to read the Gemma technical report.

simonw 12 hours ago

llama.cpp offers compiled releases for multiple platforms. This release has the new vision features: https://github.com/ggml-org/llama.cpp/releases/tag/b5332

On macOS I downloaded the llama-b5332-bin-macos-arm64.zip file and then had to run this to get it to work:

  unzip llama-b5332-bin-macos-arm64.zip
  cd build/bin
  sudo xattr -rd com.apple.quarantine llama-server llama-mtmd-cli *.dylib
Then I could run the interactive terminal (with a 3.2GB model download) like this (borrowing from https://news.ycombinator.com/item?id=43943370R)

  ./llama-mtmd-cli -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL -ngl 99
Or start the localhost 8080 web server (with a UI and API) like this:

  ./llama-server -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL -ngl 99
I wrote up some more detailed notes here: https://simonwillison.net/2025/May/10/llama-cpp-vision/
  • ngxson 12 hours ago

    For brew users, you can specify --HEAD when installing the package. This way, brew will automatically build the latest master branch.

    Btw, the brew version will be updated in the next few hours, so after that you will be able to simply "brew upgrade llama.cpp" and you will be good to go!

  • danielhanchen 12 hours ago

    I'm also extremely pleased with convert_hf_to_gguf.py --mmproj - it makes quant making much simpler for any vision model!

    Llama-server allowing vision support is definitely super cool - was waiting for it for a while!

  • ngxson 12 hours ago

    And btw, -ngl is automatically set to max value now, you don't need to -ngl 99 anymore!

    Edit: sorry this is only true on Metal. For CUDA or other GPU backends, you still need to manually specify -ngl

    • danielhanchen 12 hours ago

      OH WHAT! So just -ngl? Oh also do you know if it's possible to auto do 1 GPU then the next (ie sequential) - I have to manually set --device CUDA0 for smallish models, and probs distributing it amongst say all GPUs causes communication overhead!

      • ngxson 12 hours ago

        Ah no I mean we can omit the whole "-ngl N" argument for now, as it is internally set to -1 by default in CPP code (instead of being 0 traditionally), and -1 meaning offload everything to GPU

        I have no idea how to specify custom layer specs with multi GPU, but that is interesting!

        • danielhanchen 12 hours ago

          WAIT so GPU offloading is on by DEFAULT? Oh my fantastic! For now I have to "guess" via a Python script - ie I sum sum up all the .gguf split files in filesize, then detect CUDA memory usage, and specify approximately how many GPUs ie --device CUDA0,CUDA1 etc

          • ngxson 12 hours ago

            Ahhh no sorry I forgot that the actual code controlling this is inside llama-model.cpp ; sorry for the misinfo, the -ngl only set to max by default if you're using Metal backend

            (See the code in side llama_model_default_params())

            • danielhanchen 12 hours ago

              Oh no worries! I re-edited my comment to account for it :)

yieldcrv an hour ago

Finally! Open source multimodal is so far behind closed source options that people don’t even try to benchmark

They’re still doing text and math tests on every new model because it’s so bad

gryfft 15 hours ago

Seems like another step change. The first time I ran a local LLM on my phone and carried on a fairly coherent conversation, I imagined edge inference would take off really quickly at least with e.g. personal assistant/"digital waifu" business cases. I wonder what the next wave of apps built on Llama.cpp and its downstream technologies will do to the global economy in the next three months.

  • LPisGood 14 hours ago

    The “global economy in three month is writing some checks that I don’t know all of the recent AI craze has been able to cash in three years.

    • ijustlovemath 14 hours ago

      AI is fundamentally learning the entire conditional probability distribution of our collective knowledge; but sampling it over and over is not going to fundamentally enhance it, except to, perhaps, reinforce a mean, or surface places we have insufficiently sampled. For me, even the deep research agents aren't the best when it comes to surfacing truth, because the nuance of that is lost on the distribution.

      I think that if we're realistic with ourselves, AI will become exponentially more expensive to train, but without additional high quality data (not you, synthetic data), we're back to 1980s era AI (expert systems), just with enhanced fossil fuel usage to keep up with the TPUs. What's old is new again, I suppose!

      I sincerely hope to be proven wrong, of course, but I think recent AI innovation has stagnated in terms of new things it can do. It's a great tool, when you use it to leverage that distribution (eg, semantic search), but it might not fundamentally be the approach to AGI (unless your goal is to replicate what we can, but less spikey)

      • MoonGhost 14 hours ago

        It's not as simple as stochastic parrot. Starting with definitions and axioms all theorems can be invented and proved. That's in theory, without having theorems in the training set. That's thinking models should be able to do without additional training and data.

        In other words way forward seems to be to put models in loops. Which includes internal 'thinking' and external feedback. Make them use generated and acquired new data. Lossy compress the data periodically. And we have another race of algorithms.

        • GTP 6 hours ago

          > Starting with definitions and axioms all theorems can be invented and proved

          This was the premise of symbolic AI, but this approach seems to have been abandoned now.

      • gryfft 13 hours ago

        It doesn't have to be AGI to have a major economic impact. It just has to beat enough extant CAPTCHA implementations.

        • LPisGood 5 hours ago

          We can already do that today

behnamoh 14 hours ago

didn't llama.cpp use to have vision support last year or so?

  • breput 12 hours ago

    Yes, but this is generalized so it was able to be added to the llama-server GUI as well.

  • danielhanchen 14 hours ago

    Yes they always did, but they moved it all into 1 umbrella called "llama-mtmd-cli"!

bsaul 11 hours ago

great news ! sidenote : Does vision include the ability to read a pdf ?

  • diggan 7 hours ago

    Vision = visual, while PDF is a container of sorts, usually containing images and text. So I guess the short answer is: 50% yes, the other part you can use any LLM for.

    • bsaul 6 hours ago

      i'm asking because openai api has a special endpoint to deal with pdf, different from images.

      Which part of a pdf file can you use LLMs for ? Pdf is a binary format..

      • diggan 6 hours ago

        Yeah, that'd make sense, PDFs aren't images.

        PDF isn't really a binary format, it starts with a text header, structure is mostly text-based objects and you can parse many PDFs as plain-text. They tend to contain embedded binary data though, which is the specific part these vision models can help you with, assuming they're images. The rest a "normal" LLM can parse just fine.

mrs6969 12 hours ago

so image processing there but image generation isn't ?

just trying to understand, awesome work so far.

  • a2128 10 hours ago

    As far as I'm aware there are no open source LLMs that can generate images. There's image generation models like Stable Diffusion but those are not transformer language models so they'd be out of scope for the project

  • zozbot234 10 hours ago

    Do the underlying models support generation? If the support isn't there to begin with, the llama.cpp folks can't do anything about that.

  • Rastonbury 10 hours ago

    Generating images using chat seems cumbersome when you can do it directly with something like stable diffusion

nikolayasdf123 11 hours ago

finally! very important use-case! glad they added it!

jacooper 9 hours ago

Is it possible to run multimodal LLMs using their Vulkan backend? I have a ton of 4gb gpus laying around that only support vulkan.

  • buyucu 8 hours ago

    Yes, llama.cpp has very good Vulkan support.

buyucu 12 hours ago

It was really sad when vision was removed back a while ago. It's great to see it restored. Many thanks to everyone involved!

nurettin 13 hours ago

Didn't we already have vision via llava?

  • nikolayasdf123 11 hours ago

    no, it did not work in llama.cpp

    • woodson 6 hours ago

      Slight correction: It worked in llama.cpp via the CLI tools, but not in the llama-server (OpenAI API compatible interface).

    • nurettin 9 hours ago

      I remember it distinctly working.

      • buyucu 8 hours ago

        they deprecated it 1-1.5 years ago. it's not back.