03.06.2026 New Publication on Evaluating Large Language Models

Evaluating large language models for feature extraction from verbal stimuli: A simulation-based workflow

Authors:
Angelike, T., & Heck, D. W. (2026)

Abstract:
Large Language Models (LLMs) are increasingly used as research tools to facilitate the fast and automated extraction of text features. In psychological studies, they have been used to quantify the degree to which verbal stimulus materials reflect certain psychological constructs. However, the application of LLMs entails a high degree of flexibility regarding prompt design (e.g., instruction details and examples) and model specification (e.g., model family, size, and configuration), which can produce divergent results and threaten the robustness and generalizability of conclusions. To navigate the multiverse of possible choices, we develop a structured workflow for evaluating the quality of LLM feature extraction across diverse model and prompt specifications. Motivated by generalizability theory, the workflow distinguishes between construct variance across stimulus items, method variance due to model and prompt choices, and error variance across repeated iterations. To guide researchers through the planning, execution, and reporting of LLM simulation studies, we introduce an adapted version of the ADEMP template (Aims, Data-generating mechanism, Estimands and targets, Methods, Performance measures), originally developed for methodological simulation research. The template supports two complementary validation strategies: variance decomposition for studying consistency across LLM specifications and external validation against human gold-standard ratings. In a pre-registered case study using locally runnable, open-weight LLMs, we illustrate the workflow by examining the influence of model choice, response format, and prompt examples on the quality of valence and arousal ratings for multi-word expressions. We additionally assess the efficacy of aggregating repeated, stochastic LLM ratings to improve feature extraction quality.
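
The distinction the abstract draws between construct variance (across stimulus items), method variance (across model and prompt specifications), and error variance (across repeated iterations) can be illustrated with a small sketch. This is not the authors' implementation. It assumes a balanced, fully crossed design in which every item is rated by every specification the same number of times, and it uses a simple method-of-moments (ANOVA) estimator with all effects treated as random. The function names `decompose_variance` and `aggregate_iterations`, the array layout, and the simulated values are hypothetical and chosen only for illustration.

```python
import numpy as np


def decompose_variance(ratings):
    """Method-of-moments variance components for a balanced, crossed design.

    ratings: array of shape (n_items, n_specs, n_iterations).
    Assumption (not from the paper): all effects are treated as random.
    """
    y = np.asarray(ratings, dtype=float)
    if y.ndim != 3 or min(y.shape) < 2:
        raise ValueError("need a 3-D array with at least 2 levels per axis")
    n_items, n_specs, n_iter = y.shape

    grand = y.mean()
    item_means = y.mean(axis=(1, 2))   # one mean per stimulus item
    spec_means = y.mean(axis=(0, 2))   # one mean per model/prompt specification
    cell_means = y.mean(axis=2)        # one mean per item-by-specification cell

    # Mean squares of the two-way crossed ANOVA with replication.
    ms_item = n_specs * n_iter * np.sum((item_means - grand) ** 2) / (n_items - 1)
    ms_spec = n_items * n_iter * np.sum((spec_means - grand) ** 2) / (n_specs - 1)
    inter = cell_means - item_means[:, None] - spec_means[None, :] + grand
    ms_inter = n_iter * np.sum(inter ** 2) / ((n_items - 1) * (n_specs - 1))
    resid = y - cell_means[:, :, None]
    ms_error = np.sum(resid ** 2) / (n_items * n_specs * (n_iter - 1))

    # Solve the expected-mean-square equations; negative estimates are set to 0.
    return {
        "item": max(0.0, float(ms_item - ms_inter) / (n_specs * n_iter)),
        "spec": max(0.0, float(ms_spec - ms_inter) / (n_items * n_iter)),
        "item_x_spec": max(0.0, float(ms_inter - ms_error) / n_iter),
        "error": float(ms_error),
    }


def aggregate_iterations(ratings):
    """Average repeated stochastic ratings within each item-by-specification cell."""
    return np.asarray(ratings, dtype=float).mean(axis=2)


if __name__ == "__main__":
    # Simulated ratings (hypothetical values): item, specification, and noise effects.
    rng = np.random.default_rng(1)
    items = rng.normal(0.0, 1.0, size=(100, 1, 1))
    specs = rng.normal(0.0, 0.5, size=(1, 8, 1))
    noise = rng.normal(0.0, 0.7, size=(100, 8, 5))
    ratings = 5.0 + items + specs + noise
    for name, value in decompose_variance(ratings).items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions, a large item component relative to the specification and error components would suggest that ratings mainly reflect differences between stimuli rather than model or prompt choices. Averaging over iterations with `aggregate_iterations` shrinks the error contribution to each cell mean by a factor equal to the number of iterations.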

Angelike, T., & Heck, D. W. (2026). Evaluating large language models for feature extraction from verbal stimuli: A simulation-based workflow. PsyArXiv. https://osf.io/preprints/psyarxiv/xphm9_v1/