Systematic Evaluation of Content Quality in LLMs through Token-Level Uncertainty

In this work, we thoroughly analyze token-level uncertainty and introduce a suite of novel uncertainty metrics, offering key insights that enable a more systematic evaluation of LLM-generated content. A major distinction between incorrect and irrelevant content is that in the former, only a few tokens are under-confident, whereas in the latter, the share of under-confident tokens is significantly larger. Building on this observation, we design novel aggregation functions that turn token-level uncertainties into metrics tailored to evaluating LLM-generated content.
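To make the aggregation idea concrete, the sketch below computes per-token surprisal from a sequence of token probabilities and aggregates it in two simple ways, including the fraction of under-confident tokens that distinguishes incorrect from irrelevant content. It is a minimal illustration under stated assumptions, not the paper's exact metrics: the 0.5 confidence threshold, the function names, and the mean/max aggregation modes are all illustrative choices.

```python
import math
from typing import Sequence


def token_uncertainties(token_probs: Sequence[float]) -> list[float]:
    """Per-token uncertainty as negative log-probability (surprisal)."""
    return [-math.log(p) for p in token_probs]


def underconfident_fraction(token_probs: Sequence[float],
                            threshold: float = 0.5) -> float:
    """Fraction of tokens whose probability falls below `threshold`.

    Intuition from the paper: incorrect content tends to contain only a
    few under-confident tokens, while irrelevant content contains a much
    larger share of them, so this fraction can help separate the two.
    The 0.5 threshold here is an illustrative assumption.
    """
    low = sum(1 for p in token_probs if p < threshold)
    return low / len(token_probs)


def aggregate_uncertainty(token_probs: Sequence[float],
                          mode: str = "mean") -> float:
    """Aggregate per-token surprisals into a single sequence-level score."""
    u = token_uncertainties(token_probs)
    if mode == "mean":
        return sum(u) / len(u)
    if mode == "max":
        return max(u)
    raise ValueError(f"unknown aggregation mode: {mode}")


if __name__ == "__main__":
    # Toy probability sequences (illustrative, not from the paper):
    incorrect_like = [0.95, 0.9, 0.2, 0.93, 0.97]    # one under-confident token
    irrelevant_like = [0.4, 0.3, 0.45, 0.35, 0.9]    # mostly under-confident tokens

    print(underconfident_fraction(incorrect_like))    # 0.2
    print(underconfident_fraction(irrelevant_like))   # 0.8
    print(aggregate_uncertainty(incorrect_like, mode="mean"))
```

On these toy inputs, a mostly confident sequence with a single weak token mimics incorrect content and yields a small under-confident fraction, while a broadly under-confident sequence mimics irrelevant content and yields a much larger one.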