We introduce RewardMATH, a benchmark for evaluating the robustness of reward models in mathematical reasoning tasks.
Oct 2, 2024
We introduce TRAIT, a new benchmark consisting of 8K multiple-choice questions designed to assess the personality of LLMs.
Jun 20, 2024
We present Theanine, a framework for lifelong dialogue agents that leverages temporal and causal relations between memories to improve response generation, along with TeaFarm, a counterfactual-driven evaluation scheme.
Jun 16, 2024
We introduce BiGGen Bench, a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks using instance-specific evaluation criteria.
Jun 9, 2024