Although cloud-based AI services like OpenAI's ChatGPT and Microsoft's Copilot dominate the AI space, local LLMs have also become a popular alternative, especially for those who are tech-savvy or privacy-conscious. Inference engines like Ollama [1], LM Studio [2], and llama.cpp [3] are popular and widely supported by other tools in the ecosystem.
For computers that are capable enough, these engines can use the available processing power and graphics hardware to run LLMs as fast as possible. The capabilities of local LLMs are certainly limited, but you gain the freedom to use them as much as you like without paying the per-token prices that cloud providers charge, which is increasingly important for the token-hungry agentic workflows that are generating so much buzz.
It’s easy to think that these local AIs are essentially free, aside from the initial cost of the required hardware, but what impact might they actually have on our electricity usage, and how do we relate that consumption to things we are more familiar with?
I took the time to understand this using an actual measurement tool: the "Kill A Watt" electricity monitor made by P3 International [4].
Here in Canada we pay about 16.5 cents per kilowatt-hour (kWh) for our electricity on average, though rates vary across provinces [5]. The typical household uses about 900 kWh per month, which amounts to a bill of about $150 CAD, or $110 USD. My modern refrigerator, measured over 24 hours, consumed only 0.78 kWh, or $0.13 CAD a day ($3.90 a month). My kettle drew 1340 watts while boiling 500 ml of water (about 2 cups), but boiling takes only about 3 minutes, which adds up to 0.07 kWh, or $0.012 CAD. A typical LED light bulb consumes 9 watts, which adds up to 0.144 kWh if left on for 16 hours a day, or $0.024 CAD a day ($0.72 a month). These are relatively small costs, but they add up, and a household's usage is usually dominated by water heating, space heating or cooling, and the washer/dryer.
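If you want to sanity-check those figures, the arithmetic is simple enough to script. A minimal Python helper (the 16.5¢/kWh rate is my local average; substitute your own):

```python
def energy_cost(watts, hours, rate_per_kwh=0.165):
    """Return (kWh consumed, cost) for a device drawing `watts` for `hours`."""
    kwh = watts * hours / 1000
    return kwh, kwh * rate_per_kwh

print(energy_cost(1340, 3 / 60))  # kettle for 3 minutes: ~0.067 kWh, ~$0.011
print(energy_cost(9, 16))         # LED bulb for 16 hours: 0.144 kWh, ~$0.024
```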
Though all of these watt and dollar numbers give us a general sense of the cost of appliances, it may be most useful to think about energy usage in terms of light bulbs, since they are the most visual form of electricity use. A typical modern LED light bulb is very efficient, using about 90% less energy than the incandescent bulb it replaces. If we imagine a grid of 12 by 12 light bulbs (144 in total, drawing roughly 1300 watts), and we turn that grid on for 3 minutes, that's almost equivalent to boiling those 2 cups of water.
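A quick check that the bulb grid and the kettle really are comparable:

```python
grid_watts = 12 * 12 * 9           # 144 bulbs at 9 W each: 1296 W
print(grid_watts * 3 / 60 / 1000)  # grid on for 3 minutes: ~0.065 kWh
print(1340 * 3 / 60 / 1000)        # kettle for 3 minutes:  ~0.067 kWh
```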
I also wanted to measure typical PC energy usage as a baseline while idle, watching a video, or playing games. Here are those numbers:
| Scenario | Watts | Equivalent # of light bulbs | CPU Usage (%) | GPU Usage (%) |
|---|---|---|---|---|
| Idle Windows PC with no significant applications running | 46 | 5.1 | 0 | 0 |
| Watching a YouTube video, fullscreen at 1080p | 56 | 6.2 | 2 | 5 |
| Playing Overwatch at 1080p, high quality graphics | 112 | 12.4 | 15 | 30 |
| Playing Diablo 4, in "Performance" mode | 168 | 18.7 | 55 | 39 |
With this in mind, I set out to measure a set of LLMs running under Ollama, using the Kill A Watt monitor. I ran these tests on my desktop PC, running Windows 10, with an Nvidia RTX 4070 Ti with 12GB of video RAM and an Intel i7-4790K CPU with 4 cores. I gave the models the task of translating 900 words of text, extracted from a public domain book on Project Gutenberg, from English to French.
```sh
# Using Ollama to run this task with each model,
# injecting the sample text into the command before the prompt.
ollama run <model> "$(cat sample.txt)" Translate to French
```
| Model | Watts | Equivalent # of light bulbs | Time (seconds) | CPU Usage (%) | GPU Usage (%) |
|---|---|---|---|---|---|
| granite3.3:2b | 265 | 29.4 | 7.24 | 15 | 95 |
| gemma3n:e4b | 246 | 27.3 | 19.52 | 25 | 90 |
| qwen3:8b | 280 | 31.1 | 24.13 | 19 | 100 |
| glm-4.7-flash:q4_K_M | 130 | 14.4 | 294.54 | 34 | 50 |
Monetarily speaking, these simple tests cost just fractions of a cent, but I do find it useful to think of the energy cost of this task as the equivalent of turning about 20 LED light bulbs on for about 30 seconds, depending on which model you happen to use. With agentic workflows taking many minutes, or even hours, you can imagine the energy usage racking up.
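To make that concrete, multiplying each model's measured draw by its runtime gives the energy per translation. A small Python sketch using the numbers from the table above:

```python
runs = {  # model: (watts, seconds), from the table above
    "granite3.3:2b": (265, 7.24),
    "gemma3n:e4b": (246, 19.52),
    "qwen3:8b": (280, 24.13),
    "glm-4.7-flash:q4_K_M": (130, 294.54),
}
BULB_W, RATE = 9, 0.165  # one LED bulb; CAD per kWh

for model, (watts, seconds) in runs.items():
    wh = watts * seconds / 3600  # watt-hours for this run
    cost = wh / 1000 * RATE      # fractions of a cent
    print(f"{model:22s} {wh:5.2f} Wh  ${cost:.5f}  "
          f"= {watts / BULB_W:.0f} bulbs for {seconds:.0f} s")
```

Even the slowest run works out to about a fifth of a cent; it's the long, repeated runs of agentic workflows that change the picture.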
One of the side effects of running AI locally is that the energy impact becomes more direct and apparent. Anyone who has done serious work with local AIs will know that their PC effectively turns into a space heater, since CPUs and GPUs give off a lot of heat when pushed to their limits, and most models utilize 90+% of a GPU's processing power.
Even so, there are externalities to consider. If local AI usage were far more common, households could see increasing electricity use over time, straining power grids and driving up demand and electricity rates, as we've already seen in U.S. cities with new data centres [6]. Then there's the matter of the source of this energy. Most of Canada is fortunate to have a large supply of hydroelectric and nuclear power, but your state or city may not be so lucky.
We're so used to cloud services now that we don't think twice about the energy impact of using a website or app. That trade-off was reasonable with conventional software built on well-optimized algorithms, but AI flips it on its head as we seek dubious productivity gains from models that are inherently power hungry. The effect is multiplied by the big AI companies hiding or subsidizing the true cost of their operations. Despite the enormous valuations and billions being thrown around, OpenAI is expected to run out of money in 18 months [7].
It's difficult to do, but when we use AI we ought to integrate these externalities into our mental models: the energy cost, the copyright lawsuits, the exploited data workers [8], AI psychosis, and even the AI-washing layoffs [9]. The rapid rate of adoption has already exceeded our ability to keep up with the downsides.
The next time you use AI, maybe imagine a bank of light bulbs being turned on for the duration of your query in some anonymous data centre on the outskirts of your city.
I've generally been skeptical of the claims made about LLMs and their ability to reason. They're certainly capable of many useful things, but it seems that much of the hype is based on a theoretical AI that does not yet exist. Talking heads across the board, from the big AI companies and VC pundits to media personalities, entrepreneurs, and technical folk, all quickly make the jump from the limited capabilities of today's LLMs to some future AI that will truly be a general replacement for human labour.
It should be noted that I’m referring to the raw abilities of LLMs. There are certainly many techniques you can layer over LLMs to overcome some of their limitations, but I care about the core capabilities of these systems, because I think they set a baseline that must be recognized and understood.
My own opinion has flip-flopped several times over the past year, but I think I'm now leaning more towards the view that Melanie Mitchell [1][2], François Chollet [3], Murray Shanahan [4], Subbarao Kambhampati [5], and other AI realists espouse: that we're all tricking ourselves into thinking that LLMs are more than what they are, through some combination of projection, anthropomorphization, and a Clever Hans effect [6]. We really want the AI of our dreams, so we're quick to imbue LLMs with these capabilities, especially since they do truly demonstrate many useful traits. LLMs are also trained on vast amounts of data, so they are super-human in that sense. No one person can draw from that amount of knowledge, so of course LLMs seem intelligent to us. They are extremely well-read.
More recently, people are coming to understand that we cannot compare the intelligence LLMs demonstrate to human intelligence. Even though LLMs mimic human intelligence very well, it's easy to forget that the underlying mechanisms are entirely different. This is now being called a "Jagged Frontier" or "Jagged Intelligence": AIs that seem to display human-level intelligence in some aspects, but fail miserably in others [1][7][8]. Unlike human intelligence, where we can expect a baseline of generality and reasoning, LLM capabilities are unpredictable, and they need to be verified empirically for each category of task we want them to perform.
One example of this kind of failure is simple to demonstrate. LLMs, despite all their surprising abilities, are poor at counting. They cannot reliably count paragraphs in articles, words in paragraphs, or even letters in words. Even state-of-the-art LLMs cannot perform this task consistently. For example, if you ask an LLM "How many times does the letter 'r' appear in the word 'barrier'?", there's a good chance it will respond with "There are two 'r's in the word 'barrier'."
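The ground truth, of course, is trivial to compute deterministically, which is also what makes this task easy to score automatically:

```python
>>> "barrier".count("r")
3
```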
It's not difficult to guess why LLMs fail at this task. LLMs are feed-forward systems that process text in chunks. Although they can understand the gist of an instruction through concepts they form in their neural hierarchies, they do not construct plans for carrying out instructions, nor do they have a working memory to perform tasks like counting the way humans do. Instead, LLMs draw from the vast number of examples they are trained on, and they interpolate through those examples to produce an approximate retrieval [9] of an answer to a prompt.
An answer like "There are two 'r's in the word 'barrier'." is also understandable when you think about LLMs as systems designed to mimic human responses. It's not hard to imagine spelling corrections on the Internet, and thus in the LLM corpus, where one person corrects another about the two 'r's in the middle of "barrier". Because the trailing 'r' is implicit, we humans focus on the doubled 'r' and remind each other that, "actually, barrier has two 'r's in it".
Although these failures are understandable in retrospect, they can be surprising when you’ve convinced yourself that LLMs have human-level or super-human intelligence. Nonetheless, we should not simply overlook these failures as quirks or trivial examples because they reveal a core difference in intelligence that optimists should be aware of.
We have a plethora of benchmarks that we test new LLMs with – general knowledge tests (MMLU), math tests (MATH), and common sense tests (HellaSwag). We’ve become used to seeing these benchmark names and acronyms in LLM release announcements but they all feel quite opaque to me. What are they actually testing, how reliable are they, and what do they signify from a human point of reference?
This letter-counting test feels much more approachable to me. It’s immediately apparent that humans can perform this task at close to 100% accuracy, and that an LLM would need to have some baseline of reasoning capability to perform it reliably and in a generalized manner.
To that end, I sought to benchmark various LLMs with this test, and wrote some code [10] to do so. Using ollama [11] and groq [12], I tested several open-source LLMs, measuring their ability to count repeated letters in random words, sampling each LLM several times to measure its accuracy along with a standard deviation.
Specifically, the prompt is:
```
How many times does the letter "L" appear in the word "WORD"?
Respond with JSON in the format { "count": N }
```
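The full harness is in the linked code [10], but the approach boils down to something like this Python sketch. It assumes a local Ollama server on its default port; the word length, trial count, and sample count here are illustrative placeholders, not my exact settings:

```python
import json
import random
import statistics
import string

import requests  # third-party; pip install requests

PROMPT = ('How many times does the letter "{letter}" appear in the word '
          '"{word}"?\nRespond with JSON in the format {{ "count": N }}')

def ask(model, letter, word, url="http://localhost:11434/api/generate"):
    """Ask a local Ollama model once; return its claimed count, or None."""
    resp = requests.post(url, json={
        "model": model,
        "prompt": PROMPT.format(letter=letter, word=word),
        "format": "json",  # constrain Ollama's output to valid JSON
        "stream": False,
    })
    try:
        return json.loads(resp.json()["response"]).get("count")
    except (json.JSONDecodeError, KeyError, AttributeError):
        return None  # a malformed reply counts as a failure

def benchmark(model, trials=50, samples=5):
    """Success rate (mean %) and stddev across repeated sampling runs."""
    rates = []
    for _ in range(samples):
        correct = 0
        for _ in range(trials):
            word = "".join(random.choices(string.ascii_uppercase, k=7))
            letter = random.choice(word)
            if ask(model, letter, word) == word.count(letter):
                correct += 1
        rates.append(100 * correct / trials)
    return statistics.mean(rates), statistics.stdev(rates)

print(benchmark("gemma2:9b"))
```

Scoring is the easy part, since `word.count(letter)` provides the ground truth for free.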
The results are as follows:
| runner | model | success rate (%) | stddev | model size (B params) |
|---|---|---|---|---|
| ollama | gemma2:9b | 68.24 | 1.55 | 9 |
| groq | llama3-70b-8192 | 66.67 | 2.49 | 70 |
| ollama | llama2:13b | 59.68 | 3.32 | 13 |
| groq | llama-3.1-70b-versatile | 51.66 | 4.50 | 70 |
| groq | mixtral-8x7b-32768 | 46.00 | 2.83 | 56 |
| ollama | llama3:8b | 44.08 | 3.25 | 8 |
| ollama | mistral-nemo:12b | 42.72 | 3.21 | 12 |
| ollama | phi3:14b | 35.52 | 2.99 | 14 |
| ollama | deepseek-coder-v2:16b | 33.04 | 2.70 | 16 |
| ollama | llama3.1:8b | 26.40 | 1.24 | 8 |
| ollama | gemma:2b | 22.16 | 2.00 | 2 |
| ollama | qwen2:7b | 20.56 | 2.76 | 7 |
| ollama | gemma:7b | 10.24 | 1.73 | 7 |
This demonstrates several interesting findings. Firstly, even the best models cannot perform the task reliably. Benchmarks like this are important, since a few hand-run tests could trick you into thinking an LLM had this capability. Secondly, when it comes to LLMs, bigger is not always better; it depends on the task. This is Jagged Intelligence in plain sight: gemma2:9b performs better than the state-of-the-art llama-3.1-70b-versatile, which is almost 8x bigger, and oddly, gemma:2b scores 2x better than gemma:7b. Evidently, adding more parameters or more training data into the mix can sometimes cause LLMs to regress in basic capabilities.
There are certainly more directions you could take a benchmark like this. I might continue to test it against the latest cloud-based LLMs (GPT-4o, Claude Opus, Gemini Advanced, etc.). I chose to use local LLMs for the most part because, similar to the ARC Challenge [13], I think it's important to measure LLMs that aren't entirely opaque and can be run independently, to ensure there aren't extra factors at play. Another obvious next step would be to prompt-engineer the task to see what gains you could scrape out of it.
Even if you're a proponent of the benefits of AI or LLMs, it behooves us to know their limitations well, so that we can utilize them effectively. Perhaps we are on the cusp of AGI, but I think the breakthroughs required to go from today's LLMs to machines that can truly reason are not so obvious. Only time will tell if we're in for a hard takeoff, or another AI winter. It's important to sharpen our skeptical minds when we're immersed in a hype bubble. We need to catch ourselves when we make those quick but giant leaps from what we have at hand to the future we all dream of.