We introduce RewardMATH, a benchmark for evaluating the robustness of reward models in mathematical reasoning tasks.
Oct 2, 2024
We introduce TRAIT, a new benchmark consisting of 8K multiple-choice questions designed to assess the personality of LLMs.
Jun 20, 2024
We present Theanine, a framework for lifelong dialogue agents that leverages temporal and causal relations between memories to improve response generation, along with TeaFarm, a counterfactual-driven evaluation scheme.
Jun 16, 2024
We introduce BiGGen Bench, a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks using instance-specific evaluation criteria.
Jun 9, 2024