Hello readers! Welcome back to our series of interpretability blogs!!!
Today we are going to discuss Large Language Models (LLMs) and some conclusions about the accuracy of their reasoning. As usual, we will start from complete zero. Let's first understand both terms, Large Language Model and Reasoning, in a nutshell!!! A Large Language Model is simply a system that is trained on a massive amount of "text" data and is able to answer a user's question in a systematic way. Again, here a question arises… The word I used, "systematic", means something different in every domain. So how does an LLM decide the structure of the answer depending on the question? Where does this intelligence come from?
Yes, that's the right question!!!
So, for every domain there is a different corpus of data. That data is presented in a systematic format by domain experts, and LLMs are then trained on it. When a user asks a question containing some technical words, those words are matched against the corpus of data the LLM has learned. Once the domain of the question is decided, the answer is presented by the LLM systematically. As we can see, this system is not running on a fixed, hand-written computer program. It carries the property of learning from data and taking decisions. Right?
Yes!!!
That's why it's an AI system!!!
Now, on a similar note, let's understand Reasoning!!!
So basically, to perform any task there are defined theoretical rules in every subject / domain, developed through a lot of research and experiments by scientists in the respective fields.
For better understanding, let's take very basic example.
Suppose person A has to travel from India to USA.
So, for that, there is a defined roadmap / procedure for going from India to the USA. Isn't it?
Person A will most probably come across the following questions eventually.
What is our financial budget? Which airline are we going to choose for going to the USA? How much time will it take? Who is going to receive us at the airport? And so on… there can be many questions, as this is a completely subjective case.
Now, Person A will look at the specific / standard procedure everyone follows for going from India to the USA, and then, depending on his needs, he will use the information from that procedure in his own way, but without breaking the boundaries of the standard procedure!!!
This is Reasoning!!!
In technical words, this standard domain knowledge / set of procedures is termed the "knowledge base".
So, using the knowledge base with the maximum possible flexibility to get solutions, while maintaining the boundaries of the knowledge base, is "Reasoning".
So readers, now let's start with the white paper by the University of Oxford & Karolinska Institute on confidence in the reasoning of LLMs.
1. Abstract –
LLMs are used very heavily by IT people, and even by non-IT people from different domains. So, it's common to have questions about the accuracy of these LLMs.
In this white paper, confidence is estimated on two levels:
- Qualitatively, in terms of how consistently LLMs keep their response the same even when asked to re-check the answer.
- Quantitatively, in terms of a self-reported confidence score from 0 to 100.
For estimating confidence on these two levels, the LLM models used in the white paper are GPT-4o, GPT-4 Turbo, and Mistral.
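To make these two levels concrete, here is a minimal sketch of how such a measurement loop could be driven. Everything here is hypothetical: `fake_llm` is a canned stand-in (with invented responses) for a real model API, and the prompts are illustrative, not the paper's exact wording.

```python
# Hedged sketch: measuring qualitative and quantitative confidence.
# fake_llm() is a hypothetical canned stand-in for a real LLM API call;
# its responses are invented purely to make the sketch runnable.

def fake_llm(prompt):
    if "rethink" in prompt.lower():
        return "Valid"                  # the model sticks with its answer
    if "0 to 100" in prompt:
        return "90"                     # self-reported confidence score
    return "Valid"                      # first answer

def measure_confidence(question, llm=fake_llm):
    # First answer, with no explanation requested.
    first = llm(f"{question}\nAnswer briefly, no explanation.")
    # Qualitative confidence: does the model keep its answer when asked to rethink?
    second = llm(f"Question: {question}\nYour answer: {first}\n"
                 "Please rethink your answer.")
    kept_first_answer = (first == second)
    # Quantitative confidence: self-reported score from 0 to 100.
    score = int(llm("On a scale of 0 to 100, how confident are you "
                    "in your final answer? Reply with a number only."))
    return first, second, kept_first_answer, score

first, second, kept, score = measure_confidence(
    "Is the argument 'all cats are animals, all dogs are animals, "
    "therefore all cats are dogs' valid or invalid?")
print(kept, score)   # → True 90
```

With a real model, `kept` would sometimes be `False`, and comparing `kept` against actual correctness is exactly what the paper does.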
Media Summary –
Research on LLMs showed that the models perform similarly to each other and significantly better than any random guessing we would normally do.
But unfortunately, the tendency of LLMs to change their answer for the same prompt varies from 13% to 98%. This percentage may vary by ±5% across different LLM models / different tasks.
It was found that there is a strong positive correlation between confidence and accuracy.
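As a toy illustration of what such a correlation means (with made-up numbers, not the paper's data), we can correlate self-reported confidence scores with whether each answer was actually right:

```python
# Toy data (invented for illustration): one confidence score and one
# correctness flag per question.
confidence = [95, 90, 60, 85, 40, 70, 99, 55]   # self-reported, 0-100
correct    = [ 1,  1,  0,  1,  0,  1,  1,  0]   # 1 = answer was right

def pearson(xs, ys):
    # Plain Pearson correlation; with a 0/1 variable this is the
    # point-biserial correlation between confidence and accuracy.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

r = pearson(confidence, correct)
print(f"correlation between confidence and accuracy: r = {r:.2f}")  # → r = 0.88
```

A value of `r` near +1 means the model's reported confidence really does track how often it is right; the paper finds a significant but far-from-perfect relationship.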
2. Introduction & Summary
However, as discussed above, when prompted to rethink their answers, they frequently change their mind and the overall accuracy of the second answers is often worse than that of the original answers, sometimes even worse than random guessing.
The tendency to change their mind is strongly affected by the phrasing of the prompt. There is a large discrepancy between qualitative and quantitative confidence, although we observe a significant correlation between them.
When asked for a confidence score, there is a strong tendency towards overconfidence. The confidence measures are only partially explained by the underlying token-level probability. Overall, current LLMs do not show an internally coherent sense of uncertainty or confidence in their answers.
3. Background
A. Testing the Reasoning Skills of LLMs
After all, the final goal of every AI system is to mimic the human brain. The human brain is not something restricted to reasoning alone. Let's discuss something that may be out of context but is still helpful to understand where we are lacking in the innovation of AI systems like LLMs.
So, in the neurological sciences, at a high level, the human brain is divided into four sections: Section A, Section B, Section C and Section D.
Section A is responsible for handling all the tasks we do in daily life. Within Section A, there are four centers, termed IESM: the Intelligence Center, Emotional Center, Sex Center & Motion Center. And the reasoning we are speaking of now is approximately 8-10% of the total work of the human brain.
Every improvement we are making now aims at this reasoning part of AI systems, through different ML & DL techniques.
Still, about 90% of the brain's workings are not yet implemented in AI systems.
So for now, it is near to impossible to mimic the human brain!!!
B. Empirical Studies
The Beyond the Imitation Game Benchmark (BIG-Bench) is a large set of tests designed to challenge large language models (LLMs) with 204 tasks that are thought to be too hard for them. These tasks cover areas like language, child development, math, common sense, biology, physics, social biases, coding, movie recommendations, and more. The BIG-Bench was created by 450 researchers from 132 different institutions. When it was first introduced, LLMs didn’t perform very well on these tasks.
C. Better Response with Better Prompt
We often hear that GPT-4o and GPT-4 Turbo are estimated to have on the order of 1 trillion parameters, while the original Mistral model has 7 billion parameters and Mistral Small 3 has 24 billion. What does the term "parameter" mean in the above context?
Strictly speaking, parameters are the learned numerical weights inside the model, not separate thoughts. But loosely speaking, when we ask an LLM a question or provide a prompt, we can imagine that the model weighs the same prompt / question through a trillion learned values, i.e., considers it from an enormous number of learned perspectives at once!!!
Just to make it simple... Let's take one simple question,
Question – How to Reach Pune from Mumbai?
So here, what can be related possibilities?
- In Mumbai, from where exactly is the user asking the question?
- By which travel mode (road / bus / train / plane) does the user want to travel from Mumbai to Pune?
- In Pune, where exactly does the user want to reach?
And so on… So, we can say that here we considered 3 parameters to answer the user's question. In the same loose sense, LLMs weigh billions or trillions of parameters, and at great speed!!! Isn't it a technical miracle!!! But if this is the case, what is wrong with the confidence / accuracy of LLMs? Yes, that again is the right question!!!
So, as we saw earlier, any AI system is mathematically built to solve problems. So if our prompt / question is not logically well-formed for the LLM, we can get some vague answers.
Again, every LLM model interprets a prompt slightly differently.
Then a problem arises: how do we find the correct method of prompting for LLMs?
So readers, there are some rules / studies which are standard for prompting, which we can learn under the branch called "Prompt Engineering".
But still, these rules / standards for prompts are not always the same. They can change from time to time.
What is a good way to tackle this, especially for a non-technical person?
So what we can do is,
Take one question whose answer you already know.
Then give the LLM a prompt for that same question and see how it answers. Check how close the LLM-generated answer is to the actual answer you know.
If there is a lot of deviation between the answer you know and the LLM-generated answer, try changing the prompt.
After multiple trials, you will most likely find a prompt style for which the LLM generates the correct answer. Keep that same style in your head and prompt like that!!!
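The trial-and-error recipe above can be sketched as a small loop. Everything here is hypothetical: `fake_llm` is a stand-in for your real model API, and the string-ratio similarity is only a crude proxy for "how close the answer is".

```python
import difflib

def similarity(a, b):
    # Rough closeness of two answers, from 0.0 (nothing shared) to 1.0 (identical).
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

def find_good_prompt(candidate_prompts, known_answer, llm, threshold=0.8):
    # Try each prompt phrasing until the model's answer is close enough
    # to the answer we already know is correct.
    for prompt in candidate_prompts:
        answer = llm(prompt)
        if similarity(answer, known_answer) >= threshold:
            return prompt, answer
    return None, None

# Hypothetical stand-in model: it answers well only for the detailed prompt.
def fake_llm(prompt):
    if "step by step" in prompt:
        return "Take an express train from Mumbai to Pune."
    return "Pune is a city in Maharashtra."

prompt, answer = find_good_prompt(
    ["How to reach Pune from Mumbai?",
     "Explain step by step how to travel from Mumbai to Pune by train."],
    known_answer="Take an express train from Mumbai to Pune.",
    llm=fake_llm)
print(prompt)   # the second, more specific prompt wins
```

In practice you would judge closeness yourself by reading the answers; the loop just makes the "try, compare, re-prompt" cycle explicit.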
4. Methods
A. Some Comparisons between LLMs
In this comparison, we look at the performance of three advanced language models: OpenAI’s GPT-4o, GPT-4 Turbo, and Mistral’s Large 2 model. GPT-4o, released in August 2024, is OpenAI’s latest and most powerful model. It’s an improved version of the original GPT-4, designed to be faster and more efficient while maintaining strong reasoning abilities. Although GPT-4o also supports image and audio inputs, those features aren’t used in the tasks we’re testing here.
GPT-4 Turbo, which came out earlier in April 2024, is a lighter, more cost-effective version of GPT-4. It trades off some complexity—possibly with fewer parameters—in exchange for faster response times, making it ideal for applications that need speed.
Meanwhile, Mistral’s Large 2 model, released in July 2024, is the company’s most capable offering so far. It holds its own against other leading models, particularly in areas like general knowledge and reasoning, and performs strongly on benchmarks such as the Massive Multitask Language Understanding (MMLU).
B. Datasets
- BIG-Bench Hard – A subset of BIG-Bench containing tasks that test the capabilities of LLMs. These tasks are considered especially difficult.
- Formal fallacies – Simply put, a formal fallacy is an error in the logical form of an argument.
Ex. Statement 1 – All Cats are Animals
Statement 2 – All Dogs are Animals
Conclusion – All Cats are Dogs (This is wrong – this is formal fallacy)
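We can even check this fallacy mechanically with Python sets: build one small "world" (the names are made up) in which both premises are true, and show that the conclusion is false in it. A single counterexample world is enough to prove the argument form invalid.

```python
# Counterexample world: both premises hold, yet the conclusion fails,
# so the argument form is a formal fallacy.
cats    = {"tom", "felix"}
dogs    = {"rex"}
animals = cats | dogs

premise_1  = cats.issubset(animals)   # All cats are animals -> True
premise_2  = dogs.issubset(animals)   # All dogs are animals -> True
conclusion = cats.issubset(dogs)      # All cats are dogs    -> False

print(premise_1, premise_2, conclusion)   # → True True False
```

This is exactly the kind of reasoning step the formal fallacies task asks an LLM to perform in plain language.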
The researchers selected two tasks from the BIG-Bench Hard (BBH) benchmark—causal judgment (187 questions) and formal fallacies (250 questions)—originally introduced by Suzgun et al. (2022). These tasks are known for being especially difficult and are designed to test how well large language models (LLMs) can reason, understand, and solve complex problems. While Suzgun et al. explored how chain-of-thought (CoT) prompting could improve model performance, each BBH question is still presented with a single, straightforward instruction rather than a series of prompts. Sample questions are provided in Appendix A, and the full question sets can be found on GitHub.
To further evaluate how well the models handle statistical reasoning, the team also created 46 questions based on statistical puzzles and paradoxes from Pawitan and Lee (2024).
C. Role of Prompt
The performance of large language models (LLMs) often depends on how they are prompted. In this study, each model is first asked to answer questions directly without giving any explanation—this is called the “First answer.” After that, the model is asked to “Rethink” its response, giving it a chance to revise or confirm its original answer.
The researchers compare the model’s accuracy before and after rethinking. They also look at how accurate the models are when they stick with their first answer versus when they change it.
To keep the responses short and manageable, the prompts include instructions to be brief. However, models sometimes still give long answers, so the researchers manually review the outputs to make sure they’re reasonable. While asking for brevity might slightly affect how well the models perform, the initial results for the BBH tasks closely match those from previous studies (e.g., Zhou et al., 2024).
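The before/after bookkeeping described here can be sketched with toy records (invented numbers, not the paper's results):

```python
# Each toy record: (first answer correct?, changed on rethink?, second answer correct?)
trials = [
    (True,  False, True),
    (True,  True,  False),   # a correct answer flipped to wrong on rethink
    (False, True,  False),
    (True,  False, True),
    (False, False, False),
]

n = len(trials)
acc_first   = sum(first  for first, _, _ in trials) / n
acc_second  = sum(second for _, _, second in trials) / n
change_rate = sum(changed for _, changed, _ in trials) / n

print(f"first-answer accuracy: {acc_first:.0%}, "
      f"after rethink: {acc_second:.0%}, changed answers: {change_rate:.0%}")
# → first-answer accuracy: 60%, after rethink: 40%, changed answers: 40%
```

In this toy run, rethinking hurts: accuracy drops from 60% to 40%, mirroring the paper's observation that second answers are often worse than first ones.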
D. Statistical Analysis for Confidence Checking
Reported p values for comparisons of two proportions are based on the chi-squared test with Yates's correction. For small 2-by-2 tables, the corrected p value is an approximation of the two-sided p value from Fisher's exact test.
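Since the paper leans on these two tests, here is a self-contained sketch of both, run on made-up counts (say, 30/50 correct answers under one condition vs 18/50 under another). In practice you would normally reach for `scipy.stats` instead of hand-rolling them.

```python
from math import comb

def yates_chi2(a, b, c, d):
    # Chi-squared statistic with Yates's continuity correction
    # for the 2x2 table [[a, b], [c, d]].
    n = a + b + c + d
    num = n * (abs(a * d - b * c) - n / 2) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den

def fisher_two_sided(a, b, c, d):
    # Two-sided Fisher's exact p value for the same 2x2 table:
    # sum the probabilities of all tables with the same margins
    # that are no more likely than the observed one.
    row1, col1, n = a + b, a + c, a + b + c + d
    def pmf(x):   # hypergeometric probability that cell (0,0) equals x
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)
    p_obs = pmf(a)
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    return sum(pmf(x) for x in range(lo, hi + 1) if pmf(x) <= p_obs * (1 + 1e-9))

# Made-up counts: correct/wrong under two conditions.
table = (30, 20, 18, 32)
print(f"Yates chi2 = {yates_chi2(*table):.2f}")   # → Yates chi2 = 4.85
print(f"Fisher two-sided p = {fisher_two_sided(*table):.4f}")
```

The Yates correction deliberately shrinks the chi-squared statistic a little, which is why its p value approximates the more conservative Fisher exact result on small tables.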
We will see this more clearly under the next header.
5. Results
A. Accuracy and Quantitative Confidence
We can see that GPT-4o and Mistral showed almost the same accuracy for both the first and second answers.
As there is a positive correlation between accuracy and confidence,
we can say that confidence is also similar across both answers for GPT-4o and Mistral.
B. Comparison between models (In Context of Confidence)
It’s interesting to compare the newer models with their earlier versions—GPT-4o with GPT-4 (March 2023), and Mistral Large 2 with Mistral Large (February 2024). While their overall accuracy in all tasks is quite similar, their behavior when it comes to confidence is noticeably different.
GPT-4o and the earlier Mistral Large are both more likely to stick with their first answers, showing higher confidence. In contrast, GPT-4 and Mistral Large 2 are more likely to change their answers. For instance, in the formal fallacies task, GPT-4 changed its initial answer 83% of the time, while GPT-4o only did so 18% of the time.
We don’t know exactly what changed between the versions, but all these models are based on the same core transformer architecture. One key difference is likely the number of parameters, although exact figures aren’t available—except for Mistral Large 2, which has 123 billion parameters. It’s likely that Mistral Large 2 has more parameters than its earlier version, and GPT-4o may have fewer than GPT-4, since it’s optimized for computational efficiency.
Based on this, it seems that models with more parameters tend to second-guess themselves more often and change their initial answers more frequently.
6. Conclusions
The study looked into how confident large language models (LLMs) are in their answers, and how that confidence relates to how accurate those answers are.
Confidence was measured in two ways:
- by seeing if the models stuck with their original answers when asked to rethink them (a qualitative sign of confidence)
AND
- by the confidence scores the models gave themselves (a quantitative measure).
Although the models generally performed better than random guessing, their level of confidence varied a lot depending on the task and the model used. One positive finding is that when models showed higher qualitative confidence—meaning they kept their first answer—they were usually more accurate.
However, there’s a downside: even when the initial answers were correct, models sometimes changed them unnecessarily, which led to lower accuracy overall. Confidence levels also changed based on how the questions were asked. And importantly, the models’ self-reported confidence scores were often much higher than their actual accuracy, suggesting that these scores may reflect overconfidence rather than true certainty.
To solve the above accuracy problem of LLMs, or of any other AI system, the study of the human brain to the maximum possible depth is required,
as the ultimate goal of any AI system is to mimic the human brain.
Tags:
Interpretable