
Setting up local models is kind of a pain.

But it is pretty fun.

As of late, one of my nasty habits has been browsing a small tech recycler in my province called retail.era. I keep finding new bits of used hardware that I can't help but buy. MicroPC to run my print server? Add to cart. 1Gb switch for $10? Add to cart. IBM x3650 M5 2U rackmount server for $25? HELL YEAH.

As part of my latest kick of putting Linux on everything I own, I've been grabbing some of these commodity devices and throwing Debian on them to serve single-use purposes. Is it good for my electricity bill? Uh, probably not. Is it worth it? Probably not either.

I've set up all my computers with some version of Debian. My garage PC (named BigTony) is mainly a music box, and it also gets used to look up service manuals when needed. I have a small print server running Klipper and Pi-hole, and then my main server, which runs nginx serving this website, some media servers, an image server, and a few other things. Among them is also my Ollama server.

I had originally set up Ollama to run on CPU only. The server itself has two Xeon E3s with horrible 1.something GHz clock speeds. Inference on small models wasn't too bad: 2.5B models ran at around 3 tok/s or so.

I was browsing Facebook Marketplace, as one does, and I came across someone selling 2x Nvidia Tesla M10s. That works out to essentially eight GTX 750 Tis, which, to me, sounds like a pretty sweet deal. I realize this hardware is ancient, but I bought them anyways. I can always upgrade the cards later, plus it's more likely that something else in my rig is going to bottleneck them anyways. I picked up the two cards, headed home, and went to throw them in the server. Problem #1 appeared: the server has a strange power connection, not a normal 8-pin, so off to eBay to buy a power cable. I also needed another PCIe riser to support the second card. No problem, ordered.

The cables came, and I installed one of the GPUs. That was when the issues started. The server is running Debian 13, and looking through the documentation on installing NVIDIA drivers, I spent hours and hours trying to figure out which drivers to install. Because it's a Tesla card, the drivers live in a different package, and following the guide installed a very old version of the drivers, which pegged the CUDA version at 11. Not going to work.

Eventually, what I found was that I could simply use

sudo apt install nvidia-tesla-535-driver

That fixed all my issues and got me a working card.
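For anyone else wrestling with a Tesla card on Debian, the rough sequence that got me to a working card looks something like this. This assumes the contrib, non-free, and non-free-firmware components are enabled in your apt sources, and the exact package name can shift between releases, so it's worth checking what apt actually offers first:

apt search nvidia-tesla
sudo apt install nvidia-tesla-535-driver
sudo reboot
nvidia-smi

After the reboot, nvidia-smi should list the M10, which shows up as four separate GPUs, since the card is really four GM107 dies on one board.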

On to installing Ollama. This part is pretty straightforward: just follow the instructions on the homepage. However, I needed to move the location of the models, which I did by editing the /etc/systemd/system/ollama.service file and adding the OLLAMA_MODELS variable (as well as OLLAMA_HOST, to expose it to my LAN).
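For reference, the relevant bit of the service file ends up looking roughly like the snippet below; the model path here is just a placeholder for wherever you actually want the models to live. After editing, a sudo systemctl daemon-reload followed by sudo systemctl restart ollama picks up the changes.

[Service]
Environment="OLLAMA_MODELS=/path/to/your/models"
Environment="OLLAMA_HOST=0.0.0.0"

Binding OLLAMA_HOST to 0.0.0.0 is what exposes it on the LAN, so only do that on a network you trust.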

Once I had that set up, I started exploring models. My results for a single card are below:

Model                  Eval rate
gemma3:12b             4.03 tok/s
deepcoder:14b          3.48 tok/s
qwen2.5-coder:3b       11.44 tok/s
devstral-small-2:24b   2.27 tok/s

For each model, I gave it the prompt: "Write a hello world in C, Do not explain yourself, simply output the code and nothing else." The results were pretty neat, I have to say. I'm very impressed by qwen2.5-coder, even though it's a 3B model and falls apart really quickly on anything complex. It does seem to handle things like understanding diffs fairly well; generating code and one-shotting a problem, not so much.
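If you want to reproduce numbers like these, one way is ollama run's --verbose flag, which prints timing stats, including the eval rate, after each response. Something along these lines should do it (swap in whichever model you're testing):

ollama run --verbose gemma3:12b "Write a hello world in C, Do not explain yourself, simply output the code and nothing else."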

The result with devstral-small-2 was interesting. It's a 24B model, the largest I've tried, but it's still fairly performant on my old-ass hardware.

I know what you're thinking... "Why not just buy an RTX 4060 or something? Even a 3060 would probably perform the same." And the answer is: yeah, you're right. But can you get all of that performance, with 96GB of RAM (right now, that much RAM would probably pay my mortgage), for $125? I think not.

As for my work right now, I'm waiting on my PCIe riser to arrive. It appears to be stuck in transit hell. Once I get the second card in there, my plan is to see how many small models I can run at once and make some sort of chatbot battle arena.
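While I wait, the rough shape of that experiment is easy to sketch against Ollama's HTTP API. A minimal version, assuming the server is reachable on the default port and the models are already pulled, is just to fire the same prompt at a list of models and compare the answers:

for model in qwen2.5-coder:3b gemma3:12b; do
  echo "=== $model ==="
  curl -s http://localhost:11434/api/generate \
    -d "{\"model\": \"$model\", \"prompt\": \"Write a hello world in C\", \"stream\": false}"
done

The real arena would need something to judge the responses (probably me, at least at first), but this is enough to start comparing outputs side by side.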

If you have any questions, or would like to get some help getting your Tesla M10 cards set up, drop me a line on one of my socials!