How I Prompt My LLMs: A 6-Step, Reproducible Workflow for Consistent Outputs

A repeatable prompting workflow to coax reliable, useful outputs from LLMs across tasks, with concrete prompts and guardrails.

I’ve learned the hard way that an LLM is only as useful as the prompts you feed it. In my bootstrapped workflow, prompts are treated like engineering artifacts: versioned, tested, and designed to be repeatable across tasks. This six-step workflow has kept my outputs stable, actionable, and auditable—no magic, just discipline.

If you’re building solo in the SF Bay Area or anywhere else, the goal is the same: reliable results without chasing VC-backed hype. Here’s the workflow I actually use, with concrete prompts, guardrails, and tested patterns you can drop into your own projects.


Step 1 — Define the task and success criteria

Before you touch the model, crystallize what you want and how you’ll know you got it.

  • Task definition: What is the concrete deliverable? A feature spec, a user story, a PRD, an email draft, a bug triage summary?
  • Success criteria: How will you measure “done”? Is it completeness, correctness, tone, length, or a strict schema?
  • Boundaries: What should the model avoid? Jargon, fluff, long-winded explanations, or speculative risks?

A practical exercise I run:

  • Task: Draft a 2-page product spec for a self-serve onboarding flow for a new analytics product.
  • Acceptable outputs: a Markdown document with sections (Overview, Goals, Scope, Metrics, UX Mock, Acceptance Criteria), max 1200 words, no fluff, bullet-point dense.

Concrete prompts to capture Step 1:

  • Task prompt (usable as-is):

    • "You are a senior product engineer. Create a concise product specification for a self-serve onboarding flow for a new analytics product. Output in Markdown with the following sections: Overview, Goals, Scope, Metrics, UX Considerations, Acceptance Criteria. Keep it under 1200 words, avoid fluff, use bullet points where helpful."
  • Success criteria to lock in:

    • "Output must be valid Markdown with the exact sections listed."
    • "At least 4 metrics/KPIs."
    • "Acceptance Criteria include at least 3 user flows and 2 edge cases."

I keep this in a template file in my repo: prompts/task-1/readme.md and a small checklist as a separate file. It sounds small, but having the criteria written down makes later reproducibility trivial.

Code snippet (example checklist in a shell-friendly doc):

# Task 1 — Checklist
[ ] Markdown sections: Overview, Goals, Scope, Metrics, UX, Acceptance Criteria
[ ] 4+ metrics
[ ] 2 edge cases
[ ] <= 1200 words
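The word-count line of that checklist is the easiest one to automate. A minimal sketch in Python (the 1200-word cap is the one from Step 1; `within_word_limit` is a name I’m using for illustration):

```python
def within_word_limit(markdown_text: str, limit: int = 1200) -> bool:
    """Check the '<= 1200 words' criterion from the checklist."""
    return len(markdown_text.split()) <= limit
```

A plain whitespace split slightly over-counts (it includes Markdown syntax like `##` as words), but that errs on the strict side, which is what you want for a hard cap.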

Step 2 — Pick model, temperature, and runtime constraints

Reproducibility starts with a deterministic setup. For most production-like prompting I keep the model and sampling settings tight.

  • Model choice: For output stability, I default to GPT-4 Turbo or a capable local model when privacy is essential.
  • Temperature and top_p: temperature = 0.1–0.2, top_p = 0.95–0.98. The low temperature keeps outputs consistent; a slightly reduced top_p trims the long tail of unlikely tokens while keeping some variety.
  • Max tokens: Set to cover the longest expected answer plus a buffer (e.g., 1500–2500 tokens for a multi-section spec).
  • System prompt discipline: Prefer a strong, role-based system prompt to anchor style and constraints.

Concrete settings I use:

  • OpenAI API: model="gpt-4-turbo", temperature=0.15, top_p=0.95, max_tokens=1800
  • Local LLMs (if privacy matters): llama.cpp 8-bit quantized, set temperature to 0.2, with a strict output length cap in the wrapper
  • Timeout guardrails: if the response is truncated or contains obvious junk, discard and retry with the same seed

Prompt settings and seed in code (Python):

params = {
  "model": "gpt-4-turbo",
  "temperature": 0.15,
  "max_tokens": 1800,
  "top_p": 0.95,
  "seed": 42,  # OpenAI's optional seed parameter; best-effort determinism
}

This part matters: the same seed, same constraints, almost always yields the same structure, even if phrased slightly differently.
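The truncation guardrail from the list above can live in a small wrapper. A sketch, where `call` and `is_valid` stand in for whatever client function and junk-detector you actually use (both names are placeholders):

```python
def call_with_retry(call, is_valid, retries=3):
    """Re-run the same request (same seed, same params) until output passes.

    `call` performs one model request; `is_valid` flags truncated or junk
    output. Both are injected so the model backend stays pluggable.
    """
    last = None
    for _ in range(retries):
        last = call()
        if is_valid(last):
            return last
    raise RuntimeError(f"no valid response after {retries} attempts")
```

Because the seed and sampling params are fixed, a retry is a genuine re-roll of the same request, not a new experiment.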


Step 3 — Build a structured prompt template (the “prompt as code”)

Structure is everything. I rely on a clean, repeatable template that enforces sections, formats, and output schemas.

What goes into the template

  • System role: a crisp description of the model’s role.
  • Task: what the model should deliver.
  • Output format: how the answer should be structured (Markdown sections, JSON, or YAML).
  • Constraints: word count, tone, no fluff, etc.
  • Examples: a couple of short, representative prompts and their ideal outputs.
  • Validation hints: how you’ll verify results, and what to do on mismatch.

A concrete prompt template you can copy:

# System
You are an experienced software engineer and product designer. You produce concise, production-ready outputs. Always provide a Markdown document with explicit sections unless otherwise requested.

# User
Task: {task_description}
Output format: Markdown with sections: Overview, Goals, Scope, Metrics, UX Considerations, Acceptance Criteria.
Constraints: <= 1200 words; no fluff; bullet points where helpful; include edge cases.
Quality checks: If a section is missing or metrics are fewer than 4, you must add them.

# Examples
Example 1:
Task: Draft a feature spec for a self-serve onboarding flow.
Output: (provide Markdown)
Example 2:
Task: Write a short internal memo about reducing deployment risk.
Output: (provide Markdown)

# Assistant

In practice, you replace the placeholders with your actual task before sending. The key is to fix the format and ensure the assistant knows exactly how you want the answer structured.
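Filling the template is a one-liner with `str.format`. A sketch (the template text is abbreviated from the one above, and `render_prompt` is my own naming):

```python
PROMPT_TEMPLATE = """Task: {task_description}
Output format: Markdown with sections: Overview, Goals, Scope, Metrics, UX Considerations, Acceptance Criteria.
Constraints: <= 1200 words; no fluff; bullet points where helpful; include edge cases."""

def render_prompt(task_description: str) -> str:
    """Fill in the task-specific field while the format and constraints stay fixed."""
    return PROMPT_TEMPLATE.format(task_description=task_description)
```

Only the task varies between runs; everything else is pinned, which is what makes outputs comparable across tasks.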

A real example of an output schema inside Markdown:

# Product Specification: Self-Serve Onboarding Flow

## Overview
...

## Goals
...

## Scope
...

## Metrics
- CAC reduction: 12%
- Activation rate: 42%
- Time to onboarding: 3 minutes
- Support tickets in onboarding: 15 per 1000 signups

## UX Considerations
...

## Acceptance Criteria
- Criterion 1
- Criterion 2
- Criterion 3

This explicit structure makes downstream automation straightforward—no guesswork about where to read the data.
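Because the headers are fixed, that downstream automation can be as simple as splitting on `##` headings. A minimal parser sketch, assuming level-two headings exactly as in the schema above (`split_sections` is my own naming):

```python
import re

def split_sections(markdown_text: str) -> dict:
    """Map each '## Heading' to its body text for downstream automation."""
    sections = {}
    current = None
    for line in markdown_text.splitlines():
        m = re.match(r"##\s+(.+)", line)
        if m:
            # Start collecting lines under this section header.
            current = m.group(1).strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {k: "\n".join(v).strip() for k, v in sections.items()}
```

With the sections in a dict, checks like “are there four metrics?” become one-liners instead of regex archaeology.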


Step 4 — Add guardrails and validation prompts

Prompts should fail gracefully if outputs drift. I incorporate explicit guardrails in two layers: content guardrails and validation prompts.

Content guardrails (in the prompt):

  • Output must be in the specified format (Markdown with sections, or strictly JSON).
  • No speculative claims; if data isn’t known, request clarifications or mark as “needs input.”
  • Use concrete metrics and avoid vague adjectives like “great,” “best,” or “amazing.”

Validation prompts (the second pass):

  • After getting the initial draft, run a separate pass that checks for:
    • Completeness: all required sections present.
    • Metrics: at least four metrics, all metric definitions present.
    • Length: under the specified word count.
  • If anything fails, trigger a corrective prompt to fill gaps or re-run with a seed revision.
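The draft-then-validate loop above can be sketched as a small orchestrator, with `generate` and `validate` as placeholder callables for whatever model wrapper and checker you use:

```python
def draft_and_validate(generate, validate, max_retries=2):
    """Two-pass loop: draft, validate, and regenerate until the draft passes.

    `generate` takes optional validator feedback; `validate` returns
    (ok, issues). Both are injected so the backend stays pluggable.
    """
    draft = generate()
    for _ in range(max_retries):
        ok, issues = validate(draft)
        if ok:
            return draft
        draft = generate(feedback=issues)  # corrective pass carries the issues
    return draft  # best effort after exhausting retries
```

Passing the validator's findings back into the corrective prompt is the whole trick: the second pass fixes named gaps instead of regenerating blind.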

Example validation prompt you can reuse:

System:
You are a quality-assurance prompt assistant. Your job is to verify structured outputs.

User:
Review the following Markdown product spec. Ensure it contains sections: Overview, Goals, Scope, Metrics, UX Considerations, Acceptance Criteria. Confirm there are at least four metrics with definitions. If missing, propose exact edits.

Output: (return a short checklist and, if needed, corrected Markdown)

Sample Python gatekeeper snippet:

def validate(markdown_text):
    required_sections = ["Overview", "Goals", "Scope", "Metrics", "UX Considerations", "Acceptance Criteria"]
    # Check for the exact section headers, not just the words appearing in prose.
    ok = all(f"## {section}" in markdown_text for section in required_sections)
    # Count bullet metrics inside the Metrics section only, not across the whole doc.
    metrics_block = markdown_text.split("## Metrics", 1)[-1].split("\n## ", 1)[0]
    metrics_count = sum(1 for line in metrics_block.splitlines() if line.lstrip().startswith("- "))
    return ok, metrics_count >= 4

In practice, I use a small wrapper that calls the model twice: first for draft, second for validation, and it automatically corrects any issues flagged by the validator.


Step 5 — Verification and a lightweight chain-of-thought check

Consistency isn’t just about structure; it’s about truth and usefulness.

What I do in practice:

  • Cross-check facts and numbers: are the metrics realistically achievable? Do I need to adjust the targets?
  • Logical consistency: do the goals align with the scope? Do the acceptance criteria reflect both success and failure modes?
  • Style and tone check: is the voice consistent with my brand (concise, practical, no fluff)?

A practical two-pass flow:

  • Pass 1: Generate the draft with strict format constraints.
  • Pass 2: Run a short “validation” pass asking the model to critique its own draft in a structured way, then apply the recommended edits.

A small, self-contained example prompt for Pass 2:

System:
You are an editor who improves a technical product spec without changing its meaning.

User:
Please critique the following Markdown spec. Identify missing sections, insufficient metrics, or vague language. Propose exact edits in bullet-point form and provide a revised Markdown draft.

Markdown:
<insert draft here>

The revised draft goes back into the workflow for a final, high-confidence output. It’s not about clever trickery; it’s about building a reproducible guardrail against drift.


Step 6 — Reproducibility and version control

If you don’t version your prompts, you’re not truly building with LLMs. I treat prompts like code: versioned, documented, and peer-reviewed in a solo-posture that still maintains a changelog.

Practices I follow:

  • Put prompts in the repo: prompts/task-01.md, prompts/task-01.template, prompts/README.md
  • Version the output schema: the exact section headers, metrics keys, and acceptance criteria format must be stable
  • Seed prompts: keep a master seed prompt that travels across tasks, and only adjust task-specific fields
  • Track results: log the prompt version, model, temperature, and a short summary of the output quality
  • Reproduce a run locally: use a small Dockerfile or a local environment to run the same prompts with the same seeds

Minimal reproducibility script (bash-ish):

#!/usr/bin/env bash
set -euo pipefail

PROMPT_DIR="prompts/task-01"
MODEL="gpt-4-turbo"
TEMPERATURE=0.15
MAX_TOKENS=1800

# Save seed and run (pseudo)
echo "Running with seed 42"
export SEED=42

python run_prompt.py \
  --prompt "$PROMPT_DIR/master.prompt" \
  --model "$MODEL" \
  --temperature "$TEMPERATURE" \
  --max-tokens "$MAX_TOKENS" \
  --seed "$SEED"

The important part is not the exact commands but the discipline: every run should be reproducible, with a clear record of the environment, the seed, and the exact prompt used.


Concrete prompts and guardrails you can copy

Here are two ready-to-use prompts you can adapt. They follow the six-step workflow and are designed to be drop-in for real tasks.

  1. Prompt template for feature-spec drafting
System:
You are an experienced product engineer. Deliver a concise, production-ready Markdown document with sections: Overview, Goals, Scope, Metrics, UX Considerations, Acceptance Criteria. Use bullet points where helpful. No fluff.

User:
Task: Draft a feature spec for a self-serve onboarding flow for a new analytics product. Output in Markdown with the sections listed above. Limit to 1200 words. Include at least 4 metrics with definitions. Edge cases: first-run failures, network interruptions.

Assistant:
  2. Validation prompt (post-draft quality check)
System:
You are a QA prompt assistant. Validate that the provided Markdown includes all required sections, at least four metrics with definitions, and is under 1200 words. If not, propose exact edits and a corrected Markdown draft.

User:
<paste the draft here>

For practical use, you would embed these into a small wrapper that orchestrates the flow: generate draft, run validator, apply edits, generate final version.

Two practical tips I use while integrating these prompts:

  • Output format as ground truth: always enforce the exact section headers and a strict Markdown format. If the model ever deviates, your validator should fix it, not you.
  • Keep a deterministic seed for every run: if a run drifts, you can reproduce it exactly by reusing the same seed, model, and temperature.

Real-world example: a six-step session in practice

I recently used this workflow to draft a product spec for a new analytics onboarding flow. Here’s roughly how it played out:

  • Step 1: Task defined and acceptance criteria captured in a single prompt file. The result: a draft with six sections and 4 metrics, all clearly stated.
  • Step 2: Model and settings locked. GPT-4 Turbo, temperature 0.15, top_p 0.95, max_tokens 1800. Output remained stable across runs.
  • Step 3: The structured template was used, so the draft arrived in neat Markdown already.
  • Step 4: Guardrails ensured there was no fluff and kept the length in check.
  • Step 5: A second-pass validation flagged a missing edge case and suggested adding a latency KPI. The edits were applied automatically.
  • Step 6: The whole run was committed to git with a brief PR-like message. The seed prompt, the model settings, and the final Markdown were all recorded.

Result: a usable spec that I could hand to a designer and a dev team without rewriting. The process felt boring instead of magical, which is exactly what I want when I’m shipping solo.


Takeaways and practical conclusions

  • Treat prompts as code: keep them in version control, document changes, and test them against edge cases.
  • Default to deterministic configurations for repeatable outputs: low temperature, sensible max tokens, and a stable seed.
  • Build a two-pass flow: draft + validator (and, if needed, a self-check pass) to keep drift out.
  • Use structured outputs: Markdown sections or JSON with explicit fields makes downstream automation (CI, tests, deployments) possible.
  • Document the exact success criteria up front. If you can’t measure it, you can’t trust it.

If you’re a solo developer or indie maker, this is how I keep prompts honest, reproducible, and genuinely useful. It’s not glamorous, but it’s leverage you can actually rely on when you’re bootstrapping a product with real customers.


Closing thoughts

I’ve found that the difference between “pretty good” and “actually reliable” prompts is a disciplined workflow, not a clever trick. Start small: pick one repeatable task, lock the six steps, and iterate. You’ll be surprised by how much time you gain and how much more predictable your outputs become.

If you want to see more behind-the-scenes on how I structure prompts, how I version them, or how I validate outputs in production, I’ll share more examples in future posts. And hey, if you’re curious about my day-to-day tools, you can catch me on X at @fullybootstrap.

Until next time, keep your prompts lean, your outputs repeatable, and your bootstrapped products profitable.