PEEM: Prompt Engineering Evaluation Metrics for Interpretable Joint Evaluation of Prompts and Responses in LLMs

Hong, Minki; Lee, Eunsoo; Park, Sohyun; Kim, Jihie

doi:10.1109/ACCESS.2026.3679809

Abstract

Prompt design is a primary control interface for large language models (LLMs), yet standard evaluations largely reduce performance to answer correctness, obscuring why a prompt succeeds or fails and providing little actionable guidance. We propose PEEM (Prompt Engineering Evaluation Metrics), a unified framework for joint and interpretable evaluation of both prompts and responses. PEEM defines a structured rubric with 9 axes, comprising 3 prompt criteria (clarity/structure, linguistic quality, fairness) and 6 response criteria (accuracy, coherence, relevance, objectivity, clarity, conciseness), and uses an LLM-based evaluator to output (i) scalar scores on a 1–5 Likert scale and (ii) criterion-specific natural-language rationales grounded in the rubric. Across 7 benchmarks and 5 task models, PEEM’s accuracy axis strongly aligns with conventional accuracy while preserving model rankings (aggregate Spearman ρ ≈ 0.97, Pearson r ≈ 0.94, p < 0.001). A multi-evaluator study with four models shows consistent relative judgments (pairwise ρ = 0.68–0.85), supporting evaluator-agnostic deployment. Beyond alignment, PEEM captures complementary linguistic failure modes and remains informative under prompt perturbations: prompt-quality trends track downstream accuracy under iterative rewrites, semantic adversarial manipulations induce clear score degradation, and meaning-preserving paraphrases yield high stability (robustness rate ≈ 76.7–80.6%). Finally, using only PEEM scores and rationales as feedback, a zero-shot prompt rewriting loop improves downstream accuracy by up to 11.7 points, outperforming supervised and RL-based prompt-optimization baselines. Overall, PEEM provides a reproducible, criterion-driven protocol that links prompt formulation to response behavior and enables systematic diagnosis and optimization of LLM interactions. © 2013 IEEE.
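The alignment check described in the abstract (rank agreement between the rubric's accuracy axis and conventional accuracy) can be illustrated with a minimal sketch. This is not the authors' code: the function names and all numeric inputs below are hypothetical and chosen only to show how a Spearman rank correlation between the two kinds of score might be computed. Only the nine axis names are taken from the abstract.

```python
from statistics import mean

# The nine rubric axes named in the abstract.
PROMPT_AXES = ("clarity/structure", "linguistic quality", "fairness")
RESPONSE_AXES = ("accuracy", "coherence", "relevance",
                 "objectivity", "clarity", "conciseness")


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-model scores, invented for illustration only:
# mean 1-5 Likert score on the accuracy axis vs. conventional accuracy.
rubric_accuracy = [4.6, 3.9, 4.2, 3.1, 2.8]
benchmark_accuracy = [0.91, 0.74, 0.70, 0.55, 0.48]

rho = spearman(rubric_accuracy, benchmark_accuracy)
```

For these invented inputs the two rankings differ only by one adjacent swap, so `rho` comes out at 0.9. The figures reported in the abstract come from the paper's own experiments and are not reproduced by this sketch.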

Keywords

evaluation; interpretability; large language models; natural language generation; prompt engineering

Title: PEEM: Prompt Engineering Evaluation Metrics for Interpretable Joint Evaluation of Prompts and Responses in LLMs

Authors: Hong, Minki; Lee, Eunsoo; Park, Sohyun; Kim, Jihie

DOI: 10.1109/ACCESS.2026.3679809

Publication date: 2026

Type: Article

Journal: IEEE Access

Volume: 14

Pages: 53581-53600