Today’s experiment turned into a bit more of an odyssey than I expected.
On paper, simple enough:
• Test a lighter-weight multilingual model as an alternative to Granite4 3B
• Run it through the 500 prompt injection attack set to get a baseline
• Fine-tune it using the new Unsloth AI Studio tool
• Rerun the test and compare the results
I settled on Google 𝗚𝗲𝗺𝗺𝗮𝟯 𝟭𝗕 for the model.
𝗜𝗻𝗶𝘁𝗶𝗮𝗹 𝗿𝘂𝗻
The first pass was not great: Gemma3 1B did not perform particularly well against the prompt injection test set. It did do one thing better than I expected, though: it identified the non-English malicious attempts. I was hoping its multilingual support would help there, and it seemed to.
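The baseline run can be sketched as a small harness against Ollama's local API. The dataset filename, label scheme, and the crude label-matching rule below are placeholder assumptions, not my actual test rig:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def classify(model: str, prompt: str) -> str:
    """Send one prompt to a local Ollama model and return its raw response text."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(OLLAMA_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

def score(results):
    """Accuracy over (predicted_label, true_label) pairs."""
    correct = sum(1 for pred, truth in results if pred == truth)
    return correct / len(results)

# Usage, with a running Ollama server and a hypothetical attack_set.json of
# {"prompt": ..., "label": "malicious" | "benign"} records:
#   dataset = json.load(open("attack_set.json"))
#   results = [("malicious" if "malicious" in classify("gemma3:1b", d["prompt"]).lower()
#               else "benign", d["label"]) for d in dataset]
#   print(f"accuracy: {score(results):.1%}")
```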
𝗨𝗻𝘀𝗹𝗼𝘁𝗵 𝗦𝘁𝘂𝗱𝗶𝗼
Getting Unsloth Studio installed was the easy part - a one-liner, just as advertised. Getting it running cleanly in my environment was more complex.
It looked like I had some older CUDA-related baggage lying around on my test box, with issues surfacing around NVIDIA libraries, LD_LIBRARY_PATH, torch, dynamo, triton, config changes, and a few package version mismatches. In fairness, this machine is a bit of a dumping ground for experiments, so I’m not shocked that I (well, Claude) had to wrestle it into shape. Once it was running, though, the experience was pretty smooth.
For a beta tool, Unsloth Studio was impressively solid. Most of what you need is right there in a single pane, especially compared to how painful this was the last time I scripted it all myself.
Also worth noting: the server logging is excellent. Being able to inspect the console output made it much easier to spot failure points and work through them.
𝗥𝗲𝗿𝘂𝗻
After training, I built a Modelfile, added the GGUF to Ollama, and reran the attack set.
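A minimal sketch of that step, with placeholder filenames and model names (a real Modelfile for a chat model usually also needs a TEMPLATE block matching the model's chat format):

```python
import pathlib
import subprocess

# Minimal Modelfile pointing Ollama at the fine-tuned GGUF export.
# The GGUF filename is hypothetical; temperature 0 keeps classification deterministic.
MODELFILE = """\
FROM ./gemma3-1b-injection-tuned.gguf
PARAMETER temperature 0
"""

def register(name: str, workdir: pathlib.Path) -> None:
    """Write the Modelfile and register the model with a local Ollama install."""
    (workdir / "Modelfile").write_text(MODELFILE)
    subprocess.run(["ollama", "create", name, "-f", "Modelfile"],
                   cwd=workdir, check=True)

# Usage: register("gemma3-1b-tuned", pathlib.Path("."))
```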
That surfaced another interesting issue: some prompts were taking much longer than expected to process (~4 minutes). So while the overall workflow worked, I’ve now got another thread to pull on around inference behaviour and latency.
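A simple way to start pulling on that thread is to time every request and flag the outliers. This wrapper is a sketch; the 60-second threshold is an arbitrary assumption:

```python
import time

SLOW_THRESHOLD_S = 60  # flag anything over a minute; the worst cases hit ~4 minutes

def timed(fn, *args, **kwargs):
    """Run fn, returning (result, elapsed_seconds) so slow prompts can be logged."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    if elapsed > SLOW_THRESHOLD_S:
        print(f"SLOW ({elapsed:.0f}s): {args[:1]}")
    return result, elapsed
```

Logging the slow prompts alongside their text would show whether latency correlates with prompt length, language, or something else entirely.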
𝗦𝘂𝗺𝗺𝗮𝗿𝘆
I’d still call today a success - not because everything worked cleanly the first time (far from it), but because the process held together end to end.
Some things to ponder:
• If I were starting again, I’d probably begin with the Docker image
• I need to better understand the relationship between the training dataset, the fine-tuning process, and inference prompt formats
• Why did some prompts take so long?
Some Sundays are about clean wins, some are about fighting the stack, learning where the edges are, and coming away with a much better feel for the process. Today was definitely the second kind.
𝗦𝗼𝗺𝗲 𝘀𝘁𝗮𝘁𝘀:
Base model: 48.6% accuracy - nearly all benign prompts flagged as malicious
Tuned model: 51.8% accuracy - slightly fewer benign prompts incorrectly flagged
A lot to fix.
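One way to read those numbers: a degenerate classifier that flags *everything* as malicious scores exactly the malicious fraction of the test set, so a score near 50% with nearly all benign prompts misflagged tells you accuracy alone is hiding the failure mode. The 243/257 split below is hypothetical, chosen only to reproduce the 48.6% figure:

```python
# Hypothetical split of the 500-prompt set, chosen to reproduce 48.6%.
n_malicious, n_benign = 243, 257
total = n_malicious + n_benign

# "Flag everything as malicious" strategy:
# every malicious prompt is correct, every benign one is wrong.
accuracy = n_malicious / total
print(f"{accuracy:.1%}")  # 48.6% - indistinguishable from the base model's score
```

Tracking precision and recall per class on the next run would separate "learned something" from "flags everything".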
#GenAIFineTuning
Sunday Coffee & Code: Small Models, Prompt Injection, and Fine-Tuning
March 2026
By Steve Harris
