As much as some people I know and love want to pretend they don't exist, LLMs are a thing and they're unfortunately not going anywhere. I'm a believer in knowing your enemy, so if this shit is going to take my job I should at least try to understand how to seize the means of production - i.e: self-hosting an LLM.
There's so much terminology new to me so I thought I'd cook up a blog post (not written by an LLM, I can write boring blog posts all by myself!) on what's involved in running LLMs on your own hardware so we aren't totally trapped in Silicon Valley's Skinner box. This post will focus on trying to understand what impacts LLM performance locally and I'll write a second post on the practical side of actually running an LLM locally.
And before you ask, yes, the amount of capital and resources sucked into building datacentres for LLMs scares me, the prospect of general computing becoming out of reach for the average person is worrying, the societal impacts of mass job loss due to LLMs aren't being taken seriously by governments, and neither are the many other externalities of widespread LLM use.
As usual with these types of blog posts, I am not an expert, I am just summarising what I think I've understood from what I've read on the internet. I could be vastly oversimplifying or exaggerating, but I don't know any better. Feel free to contact me if you think I should make a correction.
Training an LLM - feeding the computer loads of information and turning the information into datasets, then turning the data into "tokens" (more about tokens later) and placing them into tensors that a transformer manipulates (I have no idea how to explain tensors and transformers, way out of my league) - is mostly out of scope for us plebs with a handful of GPUs.
When you chat with an LLM, you are doing "inference". It's called inference because the LLM is "inferring" which words, or code, or pixels in an image, should happen next. A core concept of LLMs is that it's all a giant pattern matching machine using probability to decide what pattern to show you.
Tokens are the currency of AI and are the chunks of data the LLMs churn through to give you an output. Kathy Reid has a great talk explaining how important tokens are to LLMs.
When you type into an LLM chat session (aka prompting), or upload a file for the LLM to digest along with your prompt, those words are split into tokens, and each token is mapped to a number the LLM can do stuff with. Tokens are difficult to measure, as a single word is not necessarily a single token. A good explanation of how LLM input is converted into tokens, and the cost per token, is in this corporate blog post from TokensTree. Each model has a different way of converting words into tokens - the same sentence could be 100 tokens with Claude or 120 tokens with ChatGPT.
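A toy sketch of why that happens: every model ships its own vocabulary mapping chunks of text to integer IDs, and different vocabularies chop the same sentence into different numbers of pieces. The two vocabularies below are completely made up for illustration - real tokenizers use byte-pair encoding over vocabularies of tens of thousands of entries.

```python
# Two made-up vocabularies standing in for two different models'
# tokenizers. Real vocabularies are learned from data and are huge.
vocab_a = {"the": 1, "cat": 2, "sat": 3, " ": 4}
vocab_b = {"th": 10, "e": 11, "cat": 12, "s": 13, "at": 14, " ": 15}

def tokenize(text, vocab):
    """Greedy longest-match tokenisation against a vocabulary."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest chunk first, fall back to shorter ones
        for size in range(len(text) - i, 0, -1):
            chunk = text[i:i + size]
            if chunk in vocab:
                tokens.append(vocab[chunk])
                i += size
                break
        else:
            raise ValueError(f"no token for {text[i]!r}")
    return tokens

print(tokenize("the cat sat", vocab_a))  # [1, 4, 2, 4, 3] - 5 tokens
print(tokenize("the cat sat", vocab_b))  # [10, 11, 15, 12, 15, 13, 14] - 7 tokens
```

Same sentence, two different token counts - which is why the same prompt "costs" a different number of tokens on different models.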
Tokens per second is a measurement of how fast an LLM interprets what you told it (aka prompt processing) and how fast it spits back an answer. Apparently around 5 to 10 tokens per second is about the same speed as reading, so for a roughly interactive chat session, that's kinda the benchmark for performance. If you feed the LLM a massive file, it has to convert it into tokens before it can do anything with it. There's loads of different tokenisation techniques, but approximately 3-4 characters is one token when it comes to digesting code.
If you feed the LLM a 10,000 character PHP file for example, it's going to be 2,900-3,000 tokens. But prompts are processed in parallel, whereas output is generated sequentially, so feeding an LLM stuff is much faster than getting stuff out of it. If it spits back a 12,000 character PHP file along with an 800 word description of what it's done, that's going to be like 6,000 tokens, which at 30 tokens per second will take over 3 minutes to appear. The more tokens per second the system can churn through, the faster your results after each prompt.
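The back-of-envelope maths for that wait time, using the rough token estimate from above rather than a real tokenizer count:

```python
# How long a response takes at a given generation speed.
output_tokens = 6_000      # ~12,000-char PHP file plus an 800-word summary (rough estimate)
tokens_per_second = 30     # a plausible speed for modest local hardware

wait_seconds = output_tokens / tokens_per_second
print(f"~{wait_seconds:.0f} seconds ({wait_seconds / 60:.1f} minutes)")  # ~200 seconds
```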
Models are the thing we're talking about here and there is so much jargon used to describe them and the differences between them. There's so much more to this topic than what I've explained here. This is just the tip of the iceberg and new stuff is coming out every few weeks.
In a big model such as Qwen3.5-397B-A17B, the parameter size is 397B - that's 397 billion parameters. A parameter is, as I understand it, one of the numbers (weights) the model learned from the data it was trained on. The more parameters a model has, the more stuff is in it for the model to reference when you ask it something. The more stuff it has, presumably, the better it can guess what to tell you.
The more parameters, the more memory the model requires to run, because the entire model needs to be loaded into memory for it to work. Running an LLM off a disk is far too slow, it has to all happen in memory. 397 billion parameters, stored in 16-bit floating point format, take up 2 bytes each. That's a whopping 794GB of data sitting in RAM for Qwen3.5. There's lots of tricks employed to lower the amount of RAM needed (with various trade-offs), but in general the more model parameters, the more RAM you need.
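The weight-memory maths is simple enough to sketch: parameter count times bytes per parameter.

```python
# Memory needed just for the model weights, before any context window.
params = 397e9           # Qwen3.5's 397 billion parameters
bytes_per_param = 2      # 16-bit (bfloat16/FP16) = 2 bytes each

weight_bytes = params * bytes_per_param
print(f"{weight_bytes / 1e9:.0f} GB of RAM just for the weights")  # 794 GB
```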
You also need RAM to store the context window, which is stored in the key value cache. The context window is basically the prompts you give and the output the LLM provides. The bigger the context window, the longer a conversation you can have with the LLM before it forgets what you told it. The KV cache is, I think, a cache of previous tokens so it doesn't have to process them again, saving time.
Each model handles tokens differently, so the amount of memory a token consumes depends on the model. Working it out is a long-ish formula (see this blog post), but for Qwen3.5-397B-A17B it's 122,880 bytes per token. Fill up the native context window of 262,144 tokens (this model can go up to 1m tokens but I couldn't figure out how) and that's 32GB of RAM just for remembering a session.
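Same deal as the weights - KV cache memory is bytes per token times context length, using the per-token figure quoted above:

```python
# KV cache memory for a full context window.
bytes_per_token = 122_880    # Qwen3.5-397B-A17B's per-token KV cache cost
context_tokens = 262_144     # the model's native context window

kv_bytes = bytes_per_token * context_tokens
print(f"{kv_bytes / 1e9:.0f} GB of RAM for a full context window")  # 32 GB
```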
In Qwen3.5-397B-A17B, the A17B in the name means it is a "mixture of experts" model. It has 397B total parameters but only 17B of them are active per token. Instead of using the entire 397B, a routing network selects a small set of experts to handle each token, and their combined parameters total roughly 17B.
You still need the entire 397B parameters in RAM (either all in the one GPU/CPU or across GPUs/CPUs), but the benefit of MoE is that the GPU only needs to churn through 17B parameters to get an output instead of 397B, speeding things up. Apparently MoE gets pretty close in accuracy to using the entire model, but it depends on how good the routing is.
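A toy sketch of what that routing looks like - purely illustrative, not how any real model's router is implemented: score every expert for the current token, keep only the top-k, and just those experts' parameters do the work.

```python
# Toy mixture-of-experts routing: pick the k best-scoring experts per token.
def route(scores, k=2):
    """Return the indices of the k highest-scoring experts."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# Pretend the routing network produced these scores for one token:
expert_scores = [0.1, 0.7, 0.05, 0.9, 0.2, 0.3, 0.15, 0.4]
active = route(expert_scores, k=2)
print(active)  # [3, 1] - experts 3 and 1 handle this token; the other 6 sit idle
```

All eight "experts" still have to be in memory (you don't know in advance which ones a token will need), but only two are computed against per token - which is the whole speed-up.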
You've probably figured out by now that LLMs need a shitload of RAM. One way to use less RAM is to store the parameters at lower precision. Parameters are just numbers, but the more precise the numbers, the more accurately the prediction machine can predict. If you're willing to accept less accuracy in your predictions, you can use less precise numbers to represent the same parameters.
The "full size" parameters are usually stored as 16-bit floating point numbers, predominately bfloat16/BF16. 16 bits is 2 bytes, so each parameter needs 2 bytes of RAM. If you reduce the precision down to 8 bits, you halve the amount of memory. Use a 4-bit floating point number and you're using even less RAM. I don't know how this impacts the LLM's output in the real world, but there's some pretty smart quantization methods that aim to reduce precision in a way that retains accuracy where it matters the most.
Your CPU/GPU/NPU also needs to support whatever floating point format your chosen LLM is quantized to, so it can process the parameters without having to convert each number into a format it handles natively. The Hugging Face community takes the base models and does quantization of various types. Just look at all the quantizations of Google's Gemma-4 since it was released just a week ago.
This is pretty hard to explain without anthropomorphising the LLM, but reasoning models try to think through a problem before giving an output, by having their own internal monologue. If you've used Opus 4.6 for example, you'll see the steps it takes in trying to formulate an output for you. It can be useful for coding problems that have multi-step requirements.
The downside of reasoning is that it sucks up more tokens (so more of your context window and KV cache!) and takes longer to provide an output, so if you're using a slow piece of hardware, reasoning will make you wait even longer to get an output - but it should, in theory, be more thorough. Reasoning models are also more prone to hallucination or trying too hard.
Training an LLM is serious business, but you can tweak existing models to make predictions more in line with what you expect. Fine tuning is, I think, kinda like forcing the model to behave a certain way. Reinforcement learning is more like encouragement, egging the LLM on by approving or disapproving certain outputs.
I don't know too much about it, but it appears that there's a community of people on Hugging Face that take the base models, tweak them using these techniques and make them available for others to use. There's also a company called Unsloth that has a range of tools so you can do it yourself.
GPUs/CPUs/NPUs are really good at floating point operations. The latest Nvidia B200 GPU can do 2,250 trillion floating point operations per second at BF16 precision. The Apple M5 Max SoC in a laptop I can go down to JB Hi-Fi and buy today can do 50-ish FP16 TFLOPs. Loads of devices have NPUs or AI accelerators in them too, but they may not support BF16 precision or if they do, it's 0.5 or less TFLOPs.
Inference only requires a handful of floating point operations to process each token against each parameter, and even a device capable of 1 TFLOPS of BF16 performance can do a trillion of those operations a second, so the bottleneck for inference isn't compute speed - it's mostly memory bandwidth, particularly for large (hundreds of gigabytes) models.
For every token the LLM outputs (i.e: responses to your prompts), it has to read the weight of every single parameter. If you've got 794GB of parameters sitting in RAM, the computer has to take that 794GB, scan all of it and then do it again for the next token. A 3,000 token output with a 794GB model is 2.38PB of data transfer! If your system is capable of 200GB/s of memory bandwidth, it'll take 11,910 seconds (over 3 hours) to process.
A pair of DDR5-5200 DIMMs manages around 80GB/s of memory bandwidth. On a 794GB model, that's 10 seconds of shuffling data around in RAM per output token, while even the slowest of NPUs sits there waiting 90% of the time. The fancy HBM memory in Nvidia's GPUs is capable of up to 8TB/s, making inference incredibly quick. There are very few scenarios where the GPU/CPU/NPU is 100% occupied, as RAM speed will always be the bottleneck.
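That ceiling is easy to estimate: if every output token has to read every weight once, the best case is bandwidth divided by model size. The 400GB/s middle figure below is my assumption for a high-end unified-memory SoC, not a number from a spec sheet:

```python
# Bandwidth-bound ceiling on generation speed for a dense model:
# every output token streams all the weights through the processor once.
def max_tokens_per_second(bandwidth_gb_s, model_gb):
    return bandwidth_gb_s / model_gb

MODEL_GB = 794  # Qwen3.5-397B-A17B at 16-bit precision
for name, bw in [("DDR5 pair", 80), ("unified SoC", 400), ("HBM3e GPU", 8000)]:
    print(f"{name:>12}: ~{max_tokens_per_second(bw, MODEL_GB):.2f} tokens/sec")
```

This is the dense worst case - with a mixture-of-experts model, only the active parameters (17B of the 397B here) get read per token, which is why MoE helps so much on bandwidth-starved hardware.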
Not only do you need fast RAM, you also need lots of it to hold the parameters and the context window. If the full set of parameters doesn't fit in RAM, you'll be swapping between disk and RAM, and that sucks. A big model and/or a large context window needs hundreds of gigabytes, which also carries the penalty of shuffling all that data between RAM and the CPU/GPU. A smaller model and context window means less shuttling, simply because there's less data - but the trade-off is less "knowledge" in the model or a shorter conversation before it forgets.
There are techniques to require less RAM, like quantization of the model, and techniques to require less shuffling of data in RAM, like mixture of experts, but the overall theme of LLM performance is: get as much RAM as possible, as fast as possible. There's another technique, tensor parallelism, that allows you to run an LLM over multiple GPUs/CPUs, effectively splitting up the model and running smaller chunks in parallel. You can even connect multiple computers together over a high bandwidth link (400gbit networking baby!) and spread the load. Each technique has its caveats though, like reducing accuracy, introducing hallucinations and stuff like that.
