Ask HN: How expensive are LLMs to query, really?
I'm starting to see posts pop up from well-meaning people worried about the environmental cost of large language models. Just yesterday I saw a meme on social media claiming that "ChatGPT uses 1-3 bottles of water for cooling for every query you put into it."
This seems unlikely to me, but what is the truth?
I understand that _training_ an LLM is very, very expensive. (Although so is spinning up a fab for a new CPU.) But it seems to me the incremental cost of querying a model should be relatively low.
I'd love to see your back-of-the-envelope calculations for how much water and especially how much electricity it takes to "answer a single query" from, say, ChatGPT, Claude-3.7-Sonnet or Gemini Flash. Bonus points if you compare it to watching five minutes of a YouTube video or doing a Google search.
Links to sources would also be appreciated.
Some links:
https://www.sustainabilitybynumbers.com/p/carbon-footprint-c...
https://andymasley.substack.com/p/a-cheat-sheet-for-conversa...
(discussion on lobste.rs - https://lobste.rs/s/bxixuu/cheat_sheet_for_why_using_chatgpt...)
(discussion on HN, 320 comments: https://news.ycombinator.com/item?id=42745847)
These are excellent, thank you!
My M4 Max MacBook can run local inference on a medium-ish Gemini model (32B, IIRC). Power consumption spikes by about 120 watts over idle (with multiple Electron apps, Docker, etc. also running). It generates about 70 tokens/sec and usually responds within 10 to 20 seconds.
So, picking some numbers for a calculation: 4 answers per minute at 120 watts is about 0.5 watt-hours per answer. At that rate, ~200 responses would be enough to drain the (normally quite long-lasting) battery.
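A minimal sketch of that arithmetic in Python (the 120 W and 15 s figures come from the measurement above; the ~100 Wh battery capacity is my assumption, roughly a 16-inch MacBook Pro):

    # Back-of-the-envelope energy cost of one local LLM answer.
    extra_power_w = 120      # measured draw over idle, watts
    seconds_per_answer = 15  # ~4 answers per minute
    battery_wh = 100         # assumed ~16" MacBook Pro battery capacity

    wh_per_answer = extra_power_w * seconds_per_answer / 3600
    answers_per_battery = battery_wh / wh_per_answer

    print(f"{wh_per_answer:.2f} Wh per answer")               # 0.50 Wh
    print(f"~{answers_per_battery:.0f} answers per battery")  # ~200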
How does that compare to the more common Nvidia GPUs? I don't know.
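To connect that back to the water meme from the original question: here's a rough conversion, assuming a data-center water usage effectiveness (WUE) of ~1.8 L/kWh, a commonly cited industry average. Individual facilities vary a lot, and a data-center GPU is not a MacBook, so treat this as a sketch:

    # Convert the per-answer energy estimate into cooling water.
    wue_l_per_kwh = 1.8   # assumed industry-average WUE; varies by facility
    wh_per_answer = 0.5   # from the MacBook estimate above

    liters = (wh_per_answer / 1000) * wue_l_per_kwh
    print(f"~{liters * 1000:.1f} mL of water per answer")  # ~0.9 mL

Even if a data-center query cost 10x the energy of this local run, that would still be milliliters, not bottles.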