March 2026
Sunday Coffee & Code: trying to measure model confidence a bit more honestly - an afternoon with Claude Code and token logprobs
By Steve Harris

I have been a bit uneasy for a while about asking a model to rate its own confidence (at least it's a data point). It can be useful, but it also feels a bit circular: if the same system can hallucinate the answer, it can also hallucinate the confidence score. So this weekend I went down a logprobs rabbit hole.

The goal was to see if I could get to something a bit more grounded - not whether the model says it is confident, but what the token probabilities suggest mathematically: a real score and some insight. That took me into token-level logprobs, weakest-link scoring, risk density, runner-up token margins, and what I ended up thinking of as a kind of hallucination snowball effect.
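Getting at those numbers is straightforward where an API exposes them. Here is a minimal sketch of pulling per-token logprobs from the OpenAI chat API. It assumes the openai Python SDK (1.x) with an API key in the environment; the model and the prompt are placeholder examples, not code from the repo.

```python
import math
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "When did the Hubble telescope launch?"}],
    logprobs=True,      # return a logprob for each generated token
    top_logprobs=5,     # plus the five most likely alternatives per position
)

for tok in resp.choices[0].logprobs.content:
    p = math.exp(tok.logprob)  # logprob -> probability
    # strongest rival: the best-scoring alternative that is not the chosen token
    rivals = [t for t in tok.top_logprobs if t.token != tok.token]
    margin = p - math.exp(max(rivals, key=lambda t: t.logprob).logprob) if rivals else p
    print(f"{tok.token!r:>14}  p={p:.2f}  margin={margin:.2f}")
```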
A few things stood out.

● First, simple averages can flatter a response. A lot of high-confidence structural words can make the overall score look healthier than it really is.

● Second, the gap between the chosen token and the next best token was much more interesting. A token picked at 52% with a runner-up at 48% tells a very different story from 52% with no serious rival. The first sketch after this list scores both.

● Third, once the model commits to an uncertain token, it can sound very confident for the next stretch because it is now building on its own guess. That does not necessarily mean it is right. It may just be confidently continuing in the wrong direction - a snowball effect. The second sketch after this list tries to flag that pattern.
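To make the first two points concrete, here is a toy comparison of a mean-probability score against a weakest-link (minimum) score and a chosen-vs-runner-up margin, plus one plausible reading of "risk density" as the share of low-confidence tokens. Every number and threshold below is invented for illustration; only the shape of the comparison matters.

```python
import statistics

# (token, chosen-token prob, strongest-rival prob) -- all numbers invented
tokens = [
    ("The",      0.98, 0.01),
    ("Hubble",   0.97, 0.02),
    ("launched", 0.95, 0.03),
    ("in",       0.99, 0.01),
    ("1992",     0.52, 0.48),  # the one token that actually carries the fact
    (".",        0.99, 0.01),
]

probs   = [p for _, p, _ in tokens]
margins = [p - q for _, p, q in tokens]

mean_conf    = statistics.mean(probs)  # flattered by easy structural words
weakest_link = min(probs)              # the score of the shakiest token
min_margin   = min(margins)            # tightest chosen-vs-rival race
# "risk density" read as the share of low-confidence tokens (0.7 cutoff invented)
risk_density = sum(p < 0.7 for p in probs) / len(probs)

print(f"mean={mean_conf:.2f}  weakest={weakest_link:.2f}  "
      f"min_margin={min_margin:.2f}  risk_density={risk_density:.2f}")
# -> mean=0.90  weakest=0.52  min_margin=0.04  risk_density=0.17
```

The average looks healthy; the weakest link and the margin do not.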
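And for the third point, one crude way to look for the snowball pattern: flag any low-probability token that is immediately followed by a run of very high-probability tokens. The thresholds and the token stream below are made up, and the continuation is deliberately wrong (Hubble launched in 1990 aboard Discovery) - which is the point: the wrong guess reads fluently.

```python
SHAKY, CONFIDENT, RUN = 0.6, 0.9, 4  # invented thresholds

def snowball_starts(token_probs):
    """Yield (index, token) where a shaky token kicks off a confident run."""
    for i, (tok, p) in enumerate(token_probs):
        tail = [q for _, q in token_probs[i + 1 : i + 1 + RUN]]
        if p < SHAKY and len(tail) == RUN and all(q > CONFIDENT for q in tail):
            yield i, tok

# invented continuation of the earlier example: one shaky commit ("1992"),
# then fluent, high-probability elaboration built on that guess
stream = [("in", 0.99), ("1992", 0.52), (",", 0.97), ("aboard", 0.95),
          ("the", 0.98), ("shuttle", 0.96), ("Atlantis", 0.93)]

for i, tok in snowball_starts(stream):
    print(f"possible snowball starting at token {i}: {tok!r}")
# -> possible snowball starting at token 1: '1992'
```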
Snippets of the analysis and histograms are attached. I also hit a few practical bumps along the way, including Ollama version compatibility, model support for logprobs, and a few API quirks while testing different models. Those wrinkles were part of the exercise too. In the end, I tested the approach with Ollama/llama3.1:8b and OpenAI/gpt-4o.

One thing worth keeping in mind: this does not tell me whether the model is factually correct. It tells me how statistically certain it appears to be about the response it generated - in other words, how measurable that confidence is, and where there may be a stronger risk of measurable hallucination. That feels like a better signal for use in my multi-agent systems - a circuit-breaker of sorts for routing, review, and guardrails - than simply asking the model to mark its own work.

(https://github.com/steveh250/logprobs)
