Laingsoft

Fundamentally throwing good software at problems.


Nemotron-3-nano on a Tesla M10

Incredible results

Forgive me for not being all up on the latest AI news. I find it rather exhausting to think about. Perhaps it’s my advanced age slowing me down, but I don’t have time to sit and read about all the comings and goings of the latest tech bubble. So if this is already well known, I apologize; this is just my experience working with it.

Over the last few months I have been playing with self-hosting my own server: a Lenovo x3650 M5 with a Tesla M10. I am currently waiting on another PCIe riser so that I can put a second M10 in the server, which in total should give me 64 GB of (admittedly slow, hey, it’s GDDR5!) VRAM to play with. In the meantime I have been looking for models that fit within the constraints of my 32 GB card. My favorites so far have been the Qwen2.5 series; both qwen2.5 and qwen2.5-coder have given me good results, although nothing near what I can expect from something like Codex or Claude.

Nonetheless, I was browsing ollama’s model list and came across nemotron-3-nano. I installed it on a lark, assuming that my geriatric card would simply choke on it, but I was completely wrong. It actually works better than almost any other model I’ve tried. In fact, it gets over 15 TIMES the tokens/s of a 12b model!

[Image: token generation stats]

Nemotron does seem to be a rather “thinky” model, with a tendency to overthink, but the results are pretty good. In the prompt above, I asked it to write a hello world in Python, served through an endpoint called “/hello” that returns “world”. I also asked it to do the same in Erlang, and it gave me a very detailed, very wordy explanation.
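
For reference, the task itself is only a few lines of code. A minimal version in Flask (my framing of the task, not the model’s actual output) looks like this:

    # A minimal version of the task I gave the model: a "/hello"
    # endpoint that returns "world". (Flask is my choice here; the
    # prompt didn't mandate a framework.)
    from flask import Flask

    app = Flask(__name__)

    @app.route("/hello")
    def hello():
        return "world"

    if __name__ == "__main__":
        app.run()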

The trick with this model seems to be that you need to give it as much context as possible. Ask it a vague question and it just goes into a recursive loop of self-doubt.
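
Concretely, I try to front-load the request with every detail I can. Here’s a sketch of that using ollama’s HTTP generate endpoint (the URL assumes the default local install):

    # Sketch: prompting nemotron through ollama's /api/generate endpoint,
    # spelling out language, framework, route, and response up front so
    # the model doesn't spiral into self-doubt.
    import requests

    resp = requests.post(
        "http://localhost:11434/api/generate",  # default local ollama
        json={
            "model": "nemotron-3-nano:30b",
            "prompt": (
                "Write a Python Flask app with a single endpoint /hello "
                "that returns the string 'world'. Output only the code."
            ),
            "stream": False,  # one complete response, not a token stream
        },
    )
    print(resp.json()["response"])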

[Image: nvidia-smi output]

Here’s proof that the model fits on one card, with a tiny bit of room to spare.

I’ve also tried connecting the model to opencode, using the following config:

        "nemotron-3-nano:30b": {
          "name":"nemotron-3-nano:30b",
          "tools":true
      }
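
For context, that snippet sits inside the provider’s models map in opencode.json. The full file looks roughly like this; treat the wrapper (the @ai-sdk/openai-compatible package and the baseURL) as a sketch of the standard local-ollama setup rather than gospel:

    {
      "provider": {
        "ollama": {
          "npm": "@ai-sdk/openai-compatible",
          "options": {
            "baseURL": "http://localhost:11434/v1"
          },
          "models": {
            "nemotron-3-nano:30b": {
              "name": "nemotron-3-nano:30b",
              "tools": true
            }
          }
        }
      }
    }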

The results were decent, though the model seems to get a little confused when it comes to actually using the tools that opencode provides.

[Image: the model messing up a tool call]

Perhaps not the best model for agent work, but then again, I am so green with AI stuff that I really need to read more. I can’t properly explore using it as an agent at the moment, because my server doesn’t recognize the Tesla’s temperature sensors and so doesn’t take them into account when selecting fan speeds. In effect, this means I can only run a single prompt for a few minutes at a time before the card starts to overheat, at which point the IMM2 freaks out and reboots the server -_-
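
Until I sort out proper fan control, a (hypothetical) watchdog like this could at least tell me when to back off. It just shells out to nvidia-smi, whose temperature query flags do exist; the threshold is my own guess, not an NVIDIA spec:

    # Hypothetical watchdog: polls the M10's GPU temperatures via nvidia-smi
    # and warns before the IMM2 gets a chance to panic-reboot the server.
    import subprocess
    import time

    TEMP_LIMIT_C = 85  # assumption: a conservative threshold I picked

    while True:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=temperature.gpu",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
        temps = [int(t) for t in out.split()]  # one reading per GPU
        if any(t >= TEMP_LIMIT_C for t in temps):
            print(f"GPU temps {temps} C - stop prompting before the IMM2 reboots!")
        time.sleep(10)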

Bottom line: very impressed with the performance. I will probably reach for this model whenever VRAM is not in short supply. Once I have multiple Tesla cards, I can definitely see using it as “the brain” and having another model handle embeddings, TTS, or something else.