Much Ado about Prompting:
LLM Classification of Text Messages from Experiments
Can Çelebi · Stefan P. Penczynski
Codebooks & pipeline
Codebooks
The three original codebooks as written for human annotators: Promise (P2), Level-k voting (L1), Level-k coordination (L2).
Paper §4.1 — Data.
Source of the human ground-truth labels.
Interactive
Pipeline
Codebook → filter → separate → compress. The four-step transformation with a word-level diff per step.
Paper §4.2 — Codebook-to-Prompt Pipeline.
Cosine-similarity compression reported in Table A1.
Interactive
Prompt assembly
Build any prompt variant interactively: pick every dimension at once (content, format, framing, examples).
Paper §4.2 & Fig. 1 — Prompt template.
The standardised assembly used in every experiment.
Interactive
By experiment
Content
Classification, experiment and theory sections at verbatim → C1 → C2 compression levels.
Paper §5.1 — Results: Content.
Table 1 (acc. at minimal), Fig. 2 information surfaces.
Interactive
Format
Markdown, plain titles, prose, bulleted list, plaintext — the five tested formats.
Paper §5.2 — Formatting & Framing.
Table 2 (tab:format_framing_sig).
Interactive
Framing
Role persona and general-task line — tested by ablation on rich and minimal prompts.
Paper §5.2 — Formatting & Framing.
Integrated in Table 2.
Interactive
Examples ($n$-shot)
0-shot vs. the same prompt with the codebook's worked examples appended.
Paper §5.3 — Results: Examples.
Table 3 (tab:gpt54_examples).
Interactive
Example formatting
The same worked examples in five styles: markdown, XML, delimiter, arrow, multi-turn.
Appendix D.4 — Example Format.
Table A6 (tab:shot_format_sig).
Interactive
Verbose & noise
Verbatim sections alongside verbose elaborations and matched-length noise.
Appendix F — Information Fidelity
(also §6 discussion). Figs. A10–A11.
Interactive
Model outputs
Example items
Real subject text from each task with GPT 5.4's classification — easy and hard cases side by side.
Appendix D.2 — Example Counts by Label.
Table A4 (tab:app_example_counts); human-rater accuracy 82–88%.
Interactive