HARSH PRATAP SINGH

Which Prompts Actually Work for your Agents?

On a lazy Sunday I found myself wondering: which parts of my prompt actually matter?

Agents (ReAct, tool-calling, multi-step reasoning) depend heavily on their system prompts: role, rules, tool descriptions, few-shot examples. It's easy to bloat them and hard to know what's redundant. Then I found saliency analysis, which gives you numbers: perturb each phrase and see how much the agent's output changes. A large change means the phrase matters; a small change makes it a candidate to cut or simplify. The goal is to find which parts of your agent's system prompt actually drive behaviour, then trim the rest and protect what matters. Simple, yet I don't see many people using it.

This is a sensitive issue

Recent research has quantified just how sensitive LLMs are to prompt formulation: models react strongly to subtle changes in prompt formatting, even after instruction tuning and scaling. The ProSA framework quantified this sensitivity across models and tasks.

This means that two semantically equivalent prompts can produce dramatically different outputs, making prompt engineering a high-stakes optimization problem with no clear gradient signal.

No real traditional debugging to save me

Traditional software debugging relies on:

  1. Deterministic execution - Same input → same output
  2. Inspectable state - Variables, stack traces, breakpoints
  3. Localized effects - Changes propagate predictably

LLM prompts violate all three assumptions:

  1. Outputs are stochastic (using temperature=0 gives more stable comparisons)
  2. Internal model state is opaque (billions of parameters, no interpretable variables)
  3. Token interactions are highly non-local (attention spans the entire context)

Perturbation-based saliency addresses this by treating the LLM as a black-box function and inferring input importance from output changes under controlled edits, without requiring model internals.

The Maths behind this

The Vector Space Model (VSM) represents text as vectors in a high-dimensional space where each dimension corresponds to a distinct term (or, in our case, a character n-gram). Texts with similar content get similar vector representations, so geometric operations (distance, angle) can capture semantic relationships.

The approach below uses character trigrams rather than word-level tokens. A character n-gram is a contiguous sequence of n characters extracted from text. Why? Character-trigram overlap is effective for sentence alignment in text simplification tasks, and it brings several practical benefits:

| Property | Benefit |
|---|---|
| Language-agnostic | Works for any language without tokenizers |
| Typo-robust | Small character changes don't destroy similarity |
| No external dependencies | No NLP libraries required |
| Paraphrase-tolerant | Captures subword patterns that survive rephrasing |
| Computationally efficient | O(L) extraction, sparse representation |
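As a concrete sketch, trigram extraction and cosine similarity fit in a few lines of dependency-free Python (the function names here are mine, not from any library):

```python
from collections import Counter
import math

def trigrams(text: str) -> Counter:
    """Sparse term-frequency vector of character trigrams."""
    t = text.lower()
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

def cosine_sim(a: str, b: str) -> float:
    """Cosine of the angle between two trigram vectors, in [0, 1]."""
    va, vb = trigrams(a), trigrams(b)
    dot = sum(va[g] * vb[g] for g in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0
```

Identical strings score 1.0, strings with no shared trigrams score 0.0, and paraphrases land somewhere in between.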

We treat saliency as divergence: if a phrase is important to the model's output, removing or altering it should cause the output to change significantly. Obviously!

More formally, let P be the full system prompt, p_1, ..., p_N its phrases, P_\i the prompt with phrase p_i perturbed, and f the agent (a function from prompt to output).

The saliency score for phrase p_i is:

S(p_i) = 1 - sim(f(P), f(P_\i))

Interpretation: S(p_i) near 0 means perturbing p_i barely changed the output; near 1 means the output diverged completely.

This formulation treats saliency as output divergence under intervention, a causal notion that measures the counterfactual impact of each phrase.

Perturbation Methods

| Method | What you do | API cost | When to use |
|---|---|---|---|
| Perturbation | Replace phrase with `[...]` | N+1 | Default: fast, keeps sentence structure |
| Omission | Remove phrase entirely | N+1 | Short prompts; you want the effect of full removal |
| Paraphrase | Ask an LLM to rewrite the phrase to be vague, then run the agent | 2N+1 | When you care about semantic content only (slower, more API calls) |

For agentic workflows, perturbation or omission is usually enough; paraphrase is for deeper semantic analysis when needed. The paraphrase method isolates semantic content from structural presence, which makes it the most faithful measure of information contribution: it controls for the structural role of the phrase while zeroing out its semantic payload.
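A minimal sketch of the two cheap methods, assuming the prompt is already split into phrases (the paraphrase variant needs an extra LLM call per phrase, so it is only hinted at in a comment):

```python
def perturb(phrases: list[str], i: int, method: str = "perturbation") -> str:
    """Rebuild the prompt with phrase i masked or removed."""
    if method == "perturbation":
        # Replace with a placeholder: the sentence structure survives.
        kept = phrases[:i] + ["[...]"] + phrases[i + 1:]
    elif method == "omission":
        # Drop the phrase entirely.
        kept = phrases[:i] + phrases[i + 1:]
    else:
        # "paraphrase" would first ask an LLM to rewrite phrases[i] vaguely.
        raise ValueError(f"unknown method: {method}")
    return " ".join(kept)
```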

When to use it?

Use it when a system prompt has grown long enough that you can no longer say which rules are load-bearing: before shipping a trimmed prompt, when an agent starts misbehaving, or when porting a prompt to a new model.

Why this works?

Because it is a controlled intervention on a black box: change one phrase, hold everything else fixed, and observe the output. No gradients or model internals are required, so it works with any API-served model.

Implementation

Treat your agent as a black box: (system_prompt, user_message) → output. Wrap it in one async function that takes (system_prompt, user_message) and returns a string (or a metric).
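For example (the body here is a stub; swap it for your real client call, and note that `run_agent` is a name I made up):

```python
import asyncio

async def run_agent(system_prompt: str, user_message: str) -> str:
    """Black-box entry point: (system_prompt, user_message) -> output string.

    Replace the stub below with your actual agent (an OpenAI/Anthropic
    client call, a LangChain executor, etc.). Everything downstream
    only relies on this signature.
    """
    return f"stub output for: {user_message}"

# usage: asyncio.run(run_agent("You are helpful.", "How do I reset my password?"))
```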

Tokenize the Prompt into Phrases

Split the system prompt into chunks you'll perturb one at a time.

  1. Split at sentence boundaries (. ! ? newline).
  2. If a sentence is longer than ~60 characters, sub-split at commas/semicolons.
  3. Accumulate sub-chunks until each is at least ~35 characters.
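The three rules above can be sketched as follows (the ~60 and ~35 character thresholds come straight from the list; tune them for your prompts):

```python
import re

def split_phrases(prompt: str, max_len: int = 60, min_len: int = 35) -> list[str]:
    """Split a system prompt into perturbable phrase-level chunks."""
    # Rule 1: split at sentence boundaries (., !, ?, newline).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+|\n+", prompt) if s.strip()]
    chunks: list[str] = []
    for sent in sentences:
        # Rule 2: sub-split long sentences at commas/semicolons.
        parts = re.split(r"[,;]\s*", sent) if len(sent) > max_len else [sent]
        buf = ""
        for part in parts:
            # Rule 3: accumulate sub-chunks until each reaches min_len.
            buf = f"{buf}, {part}" if buf else part
            if len(buf) >= min_len:
                chunks.append(buf)
                buf = ""
        if buf:
            chunks.append(buf)
    return chunks
```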

Why phrase-level? Single tokens are too noisy (and too expensive, one API call each), while whole paragraphs are too coarse to tell you which rule matters. Phrases are the natural unit of instruction in a system prompt.

For structured prompts (JSON schemas, code blocks), consider custom tokenizers that respect structure boundaries.

Get the Baseline Output

Run your agent with the full, unmodified system prompt and a representative user message. Save this output; it's your reference for comparison.

Perturb Each Phrase and Measure Divergence

For each phrase i:

  1. Perturb: Replace phrase i with [...] (or remove it entirely for omission method).
  2. Run: Call your agent with the perturbed prompt and the same user message.
  3. Compare: Measure how different the new output is from the baseline.
  4. Score: saliency = 1 - similarity(baseline, perturbed_output).

High saliency = perturbing this phrase changed the output a lot = important phrase.
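Steps 1-4 can be wired together in one loop. A self-contained sketch, using trigram cosine as the similarity function and any async `(system_prompt, user_message) -> str` callable as the agent:

```python
import math
from collections import Counter

def sim(a: str, b: str) -> float:
    """Cosine similarity over character trigrams."""
    va = Counter(a.lower()[i:i + 3] for i in range(len(a) - 2))
    vb = Counter(b.lower()[i:i + 3] for i in range(len(b) - 2))
    dot = sum(va[g] * vb[g] for g in va)
    n = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / n if n else 0.0

async def saliency_scores(agent, phrases: list[str], user_msg: str) -> list[float]:
    """One score per phrase: 1 - sim(baseline output, output with that phrase masked)."""
    baseline = await agent(" ".join(phrases), user_msg)
    scores = []
    for i in range(len(phrases)):
        masked = " ".join(phrases[:i] + ["[...]"] + phrases[i + 1:])
        out = await agent(masked, user_msg)
        scores.append(1.0 - sim(baseline, out))
    return scores
```

Swap `sim` for an embedding-based similarity if you need paraphrase tolerance; nothing else changes.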

Choose a Similarity Function

You need a single number for "how similar are these two outputs?"

The character-trigram cosine above is the zero-dependency default. I prefer sentence embeddings when available (e.g. a sentence-transformers model): they capture paraphrases that trigrams miss, at the cost of an extra dependency and some latency.

Normalise Scores

Raw saliency scores depend on the specific outputs and similarity function. To compare across phrases, min-max normalise the scores to [0, 1] within each run, so "high" and "low" mean the same thing regardless of the metric.
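One common choice is min-max normalisation within a run (a helper of my own, not a library function):

```python
def normalise(scores: list[float]) -> list[float]:
    """Min-max scale saliency scores to [0, 1] so phrases are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Every phrase moved the output equally; nothing stands out.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]
```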

Batch Over Multiple User Messages

Saliency for one user message tells you "importance for this query." To generalize:

  1. Pick 5–10 representative user messages (cover different intents your agent handles).
  2. Run saliency for each user message.
  3. Average the normalised scores per phrase across all user messages.

Now you know which phrases matter on average, not just for one query.
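Averaging is then one line per phrase (rows are user messages, columns are phrases):

```python
def average_scores(per_message: list[list[float]]) -> list[float]:
    """Mean normalised saliency per phrase across several user messages."""
    n = len(per_message)
    return [sum(row[i] for row in per_message) / n for i in range(len(per_message[0]))]
```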

Multiple Runs for Stability

Even with temperature=0, outputs can vary slightly. For production:

  1. Run each perturbation K times
  2. Average the scores across runs.
  3. Compute 95% confidence intervals: mean ± 1.96 * (stdev / sqrt(K)).

Pruning rule: Only drop phrases where the upper bound of the CI is below your threshold (e.g. < 0.3). This ensures you're confident the phrase is low-impact, not just noisy.
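A sketch of the CI and the pruning rule (with the 0.3 threshold from above; `prunable` is my own name for it):

```python
import math
import statistics

def ci95(samples: list[float]) -> tuple[float, float]:
    """95% confidence interval: mean +/- 1.96 * stdev / sqrt(K)."""
    k = len(samples)
    mean = statistics.fmean(samples)
    half = 1.96 * statistics.stdev(samples) / math.sqrt(k) if k > 1 else 0.0
    return mean - half, mean + half

def prunable(samples: list[float], threshold: float = 0.3) -> bool:
    """Drop a phrase only when even the CI upper bound sits below the threshold."""
    return ci95(samples)[1] < threshold
```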

Act on the Results

| Score range | Interpretation | Action |
|---|---|---|
| High (top third) | Critical phrases | Protect; clarify if the agent misbehaves |
| Low (bottom third, CI upper < 0.3) | Redundant or weak | Candidate for pruning |
| Middle | Moderate impact | Keep; revisit later |

Pruning workflow:

  1. Drop low-saliency phrases (where upper CI < threshold).
  2. Re-run your agent on the same and new user messages.
  3. Verify behaviour is unchanged (use your existing evals).
  4. Iterate.

Detect Interactions

Single-phrase saliency assumes independence. To catch conflicts or synergies:

  1. Perturb pairs of phrases together.
  2. Compute: interaction(i, j) = saliency(i+j) - saliency(i) - saliency(j).
  3. Positive = synergy (removing both hurts more than sum). Negative = conflict (removing both hurts less).

Cost is O(N²), so only do this for short prompts or after pruning.
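A sketch of pairwise masking; `agent` and `sim` are the same black-box wrapper and similarity function used for single-phrase saliency:

```python
async def interaction_matrix(agent, sim, phrases: list[str], user_msg: str) -> dict:
    """interaction(i, j) = saliency(i+j) - saliency(i) - saliency(j)."""
    baseline = await agent(" ".join(phrases), user_msg)

    async def score(masked: set) -> float:
        # Saliency of masking every phrase index in `masked` at once.
        prompt = " ".join("[...]" if k in masked else p for k, p in enumerate(phrases))
        return 1.0 - sim(baseline, await agent(prompt, user_msg))

    n = len(phrases)
    single = [await score({i}) for i in range(n)]
    inter = {}
    for i in range(n):
        for j in range(i + 1, n):
            # Positive = synergy, negative = conflict/redundancy.
            inter[(i, j)] = await score({i, j}) - single[i] - single[j]
    return inter
```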


Limitations

  1. Single-phrase saliency assumes independence; the pairwise check above only partially compensates, at O(N²) cost.
  2. Scores are relative to the user messages and similarity function you picked; a phrase that looks prunable on your test queries may matter for a rare one.
  3. Even at temperature=0, outputs are not perfectly deterministic, which is why the confidence-interval pruning rule exists.
  4. Every perturbation is an API call, so cost scales with phrases × messages × runs.

Edit

Actually did the implementation today! It's a fairly involved implementation, and CLI-based for now; maybe I should make it browser-based if I want to post about it?