
ReliabilityBench: Measuring the Unpredictable Performance of Shaped-Up Large Language Models Across Five Key Domains of Human Cognition
The research evaluates the reliability of large language models (LLMs) such as GPT, LLaMA, and BLOOM, extensively used across various domains, including education, medicine, science, and administration. As the usage of these models becomes more […]