Much Ado about Prompting — Companion

Codebooks & pipeline

The three original codebooks as written for human annotators: Promise (P2), Level-k voting (L1), Level-k coordination (L2).

Paper §4.1 — Data. Source of the human ground-truth labels.

Codebook → filter → separate → compress. The four-step transformation with a word-level diff per step.

Paper §4.2 — Codebook-to-Prompt Pipeline. Cosine-similarity compression reported in Table A1.

Prompt assembly

Build any prompt variant interactively: pick every dimension at once (content, format, framing, examples).

Paper §4.2 & Fig. 1 — Prompt template. The standardised assembly used in every experiment.

By experiment

Classification, experiment and theory sections at verbatim → C1 → C2 compression levels.

Paper §5.1 — Results: Content. Table 1 (acc. at minimal), Fig. 2 information surfaces.

Markdown, plain titles, prose, bulleted list, plaintext — the five tested formats.

Paper §5.2 — Formatting & Framing. Table 2 (tab:format_framing_sig).

Role persona and general-task line — tested by ablation on rich and minimal prompts.

Paper §5.2 — Formatting & Framing. Integrated in Table 2.

Examples ($n$-shot)

0-shot vs. the same prompt with the codebook's worked examples appended.

Paper §5.3 — Results: Examples. Table 3 (tab:gpt54_examples).

Example formatting

The same worked examples in five styles: markdown, XML, delimiter, arrow, multi-turn.

Appendix D.4 — Example Format. Table A6 (tab:shot_format_sig).

Verbose & noise

Verbatim sections alongside verbose elaborations and matched-length noise.

Appendix F — Information Fidelity (also §6 discussion). Figs. A10–A11.

Model outputs

Real subject text from each task with GPT 5.4's classification — easy and hard cases side by side.

Appendix D.2 — Example Counts by Label. Table A4 (tab:app_example_counts); human-rater accuracy 82–88%.