How to Hire an AI Developer in 2026 (Playbook + Red Flags)

The filter I use: scoping, skills checklist, sourcing, 5 interview questions with real answers, 2026 rate bands, and the 3 red flags that have burned every client.

Ralph Duin · April 17, 2026 · 8 min read

TL;DR — Most "AI developers" on the market right now are prompt engineers with a job title upgrade. The actual hires that ship are builders who can write the un-sexy plumbing: retries, idempotency, rate limits, eval harnesses, observability. This post is the filter I use — scoping, skills, sourcing, interview questions with real answers, 2026 rate bands, and the three red flags that have burned every client I've rescued.

I've shipped production AI systems for solo founders, funded startups, and enterprises — AppHandoff, MCP Beast, a pile of custom LLM pipelines. I've also been called in to fix hires that didn't work out. The pattern is the same every time: the team hired someone who could prompt, not someone who could build. This post is the playbook that keeps you out of that category.

The 60-second decision table

| Situation | Hire | Budget signal |
| --- | --- | --- |
| You have a bottleneck and one clear use case (support, docs, intake) | Freelance AI engineer, 30-day trial | $150–$300/hr |
| You raised seed, need a founding AI engineer | Full-time senior, generalist | $180k–$240k + equity |
| You're pre-PMF and just need to validate | Part-time freelance, 10–20 hrs/wk | $5k–$15k/mo |
| You have proprietary data + unique domain | Full-time senior + part-time researcher | $250k+ combined |
| You want "AI features" with no idea what | Do not hire yet. Scope first. | $0 — go read the posts below |

See freelance vs agency for the engagement-model breakdown and AI developer for startups for stage-by-stage staffing.

Scope before you open the req

Every failed AI hire I've diagnosed started the same way: the company opened a job posting before writing down what problem the hire would solve. You'll waste $200k and six months if you skip this.

Write this down, in order, before posting a job:

  1. The one workflow this hire will automate. Not "AI strategy." One workflow. "Reduce support ticket triage time from 4min to 30s."
  2. Where the data lives today. Postgres? Google Drive? A pile of PDFs in Notion? If your data isn't accessible, the AI can't use it.
  3. The metric that proves it worked. Tickets/hour, accuracy %, cost per resolution. Pick one number.
  4. Who owns the output. If no human is accountable for the model's mistakes, don't ship it.

If you can't fill in those four lines, you're hiring for a vibe, not a job. Scope first.

What an AI engineer actually does (vs a prompt engineer)

A prompt engineer writes instructions. An AI engineer builds systems that keep working when the model misbehaves. The difference shows up in the same five places every time:

  • Retrieval. RAG isn't a buzzword — it's the pattern that grounds answers in your data and constrains hallucinations. Can the candidate explain chunking, embedding choice, and why naive cosine similarity fails on short queries?
  • Evals. "We tested it by trying a few prompts" is a red flag. Real engineers write eval suites that run on every deploy. Ask how they catch regressions.
  • Observability. Every tool call, every token count, every error category — logged. If they don't track duration_ms and error_type per call, they're flying blind (see the sketch after this list).
  • Cost control. Token usage compounds. Good engineers cap context windows, cache aggressively, and route cheap queries to cheap models.
  • Failure modes. LLMs rate-limit, time out, and return malformed JSON. The handlers I trust in production all look like the rate limiter in MCP Server Architecture — 60/min per caller, structured errors the caller can act on.

If none of that comes up naturally in the interview, you're talking to a prompt engineer.
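To make the observability bullet concrete: a minimal sketch of per-call logging, assuming an OpenAI-style Python client (the chat.completions.create shape). The logged_call wrapper and field names are illustrative, not a library API.

```python
import json
import logging
import time

logger = logging.getLogger("llm")

def logged_call(client, **kwargs):
    """Wrap one LLM call so every invocation emits duration_ms and error_type."""
    start = time.monotonic()
    record = {"model": kwargs.get("model"), "error_type": None}
    try:
        # Assumed OpenAI-SDK-style interface: client.chat.completions.create(**kwargs)
        resp = client.chat.completions.create(**kwargs)
        record["total_tokens"] = resp.usage.total_tokens
        return resp
    except Exception as exc:
        # Bucket failures by exception class so a dashboard can group them:
        # RateLimitError, APITimeoutError, JSONDecodeError, ...
        record["error_type"] = type(exc).__name__
        raise
    finally:
        record["duration_ms"] = round((time.monotonic() - start) * 1000)
        logger.info(json.dumps(record))
```

One structured log line per call is enough to answer "what failed, how often, and how slow" without buying a vendor tool.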

The skills checklist (use this verbatim)

Ship-ready AI engineers in 2026 have:

  • Python + TypeScript fluency. Python for ML/data, TypeScript for the API layer. If they only know one, they'll bottleneck one side.
  • Model-agnostic integration. OpenAI, Anthropic, open-source via Ollama or vLLM. If they're a "GPT person" only, they'll build you into a vendor-lock corner.
  • Vector DB experience. pgvector, Pinecone, Qdrant — doesn't matter which. What matters is they can explain when not to use one.
  • Streaming + async. SSE, WebSockets, background jobs. LLM responses are slow; blocking request/response won't scale.
  • Evals + observability. Named tools: Braintrust, LangSmith, or home-rolled. The answer "we just look at the outputs" is disqualifying (a home-rolled example follows this list).
  • One shipped production system they can demo. Not a demo video. A live URL or a repo you can actually run.

Every item on this list should show up as a specific story in the interview. If they're vague on three or more, pass.
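To make the evals bullet concrete, here's the shape of a home-rolled suite that runs in CI on every deploy. pipeline.generate_answer and the cases are hypothetical stand-ins for your own system; the point is that a regression fails the build.

```python
# eval_suite.py - exit non-zero so CI blocks the deploy on a regression.
import sys

from pipeline import generate_answer  # hypothetical: your system's entry point

CASES = [
    # (question, substring the answer must contain)
    ("What's your refund window?", "30 days"),
    ("Do you support SSO?", "SAML"),
    ("What's the capital of France?", "out of scope"),  # off-topic must refuse
]

def run() -> int:
    failures = []
    for question, expected in CASES:
        answer = generate_answer(question)
        if expected.lower() not in answer.lower():
            failures.append((question, expected, answer[:120]))
    for question, expected, got in failures:
        print(f"FAIL {question!r}: expected {expected!r}, got {got!r}")
    print(f"{len(CASES) - len(failures)}/{len(CASES)} cases passed")
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(run())
```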

Where to actually find them (and where not to)

Not the LinkedIn job board. You'll drown in resume spam from people who took a Coursera class last month. Sources that actually work in 2026:

  • Open-source commits. Go look at who's contributing to LangChain, LlamaIndex, Pydantic AI, or any MCP SDK. Public commit history is the cleanest signal available.
  • AppHandoff's network — I keep a list of people I've shipped with. Describe your problem and I'll either take the work or send you to the right builder.
  • Toptal / Upwork Expert-Vetted. Usable for small scoped projects. Premium pricing but real vetting.
  • Referrals from founders who actually shipped AI. Ask in your YC/Founders/whatever Slack for names — not agencies, specific humans.

Skip: generalist recruiters ("we have AI developers!" — they don't), bootcamp grad pipelines (the curriculum is 18 months behind), and anyone whose LinkedIn changed from "blockchain consultant" to "AI consultant" in 2023.

Five interview questions that filter fakers

Ask these verbatim. I've used them on dozens of candidates; the good ones light up, the fakes stall within 30 seconds.

1. "A user says your AI gave them wrong information. Walk me through how you'd debug it." Weak answer: "I'd rewrite the prompt." Strong answer: "Pull the trace — the exact tokens in and out. Check retrieval: did we fetch the right context? Check the eval suite: does this case match a known regression? Then decide if it's a prompt problem, a retrieval problem, or a model problem — they need different fixes."

2. "How do you stop hallucinations in a customer-facing bot?" Weak: "Better prompts." Strong: "You don't stop them — you constrain and detect. Constrain: RAG with citation-required prompting, function calling for anything structured, refuse-to-answer on low-confidence retrieval. Detect: eval set of known-hard questions run on every deploy, plus logging every 'I don't know' so we learn the coverage gaps."

3. "Walk me through how you'd structure rate-limit errors from OpenAI so the front end can recover gracefully." Weak: "Just retry." Strong: Talks about exponential backoff, surfacing retry_after_ms to the client, queueing at the server, and — ideally — mentions structured error payloads (this is what MCP got right; see MCP vs API).

4. "Explain fine-tuning vs RAG like I'm your CEO." Weak: Uses the words "parameters" and "embeddings" in the first sentence. Strong: "RAG is giving the model an open book at test time. Fine-tuning is teaching it the book once, permanently. RAG is cheap and updates instantly. Fine-tuning is expensive but makes the model faster and more consistent on narrow tasks. Most companies need RAG first and never need fine-tuning."

5. "Show me something you shipped that's running right now." Weak candidates show you a Loom of a demo. Strong candidates give you a URL or a repo and walk you through a real trace.

2026 compensation (US, remote-comparable)

| Role | Full-time base | Freelance hourly | Reality check |
| --- | --- | --- | --- |
| Junior (1–2 yrs, API integrator) | $130k–$160k | $75–$125 | Can wire GPT to your app; will break on production edge cases |
| Senior (3–5 yrs, system builder) | $180k–$240k | $150–$250 | Ships evals, handles failure modes, owns a feature end-to-end |
| Lead / Staff (5+ yrs, architect) | $250k+ (often $300k+ with equity) | $250–$400 | Designs the platform; you hire one, not four |
| Specialist (researcher, eval infra, novel modality) | $300k+ | $400+ | Only hire when the senior hits a ceiling |

Same numbers appear in AI developer for startups and freelance vs agency — if you're comparing posts, they match on purpose.

The trap: paying senior rates for junior output because the candidate used the word "transformer" five times. Filter on shipped work, not vocabulary.

Three red flags that cost me clients before I got the call

I've rescued enough of these to know the pattern:

  1. "Two weeks to production." Production AI takes 4–8 weeks minimum because evals + observability + failure handling take longer than the happy path. Anyone promising two weeks is shipping you a demo, not a product.
  2. Refusal to share prompts or eval sets. If the engineer treats the prompts as their IP and won't commit them to your repo, you're renting a system you can't maintain. Walk.
  3. No eval suite. If they can't show you the test cases the system passes (and fails), they're guessing at quality. Every shipped AI system I trust has an eval suite in CI.

If you see two of three, pivot within the 30-day trial. Don't sunk-cost-fallacy into a $200k hire.

The 30-day trial contract (what to actually ask for)

Before the full-time offer, run a paid trial. Scope looks like this:

  • Week 1: Ship one endpoint end-to-end — input, RAG, LLM call, output, logs, one eval.
  • Week 2: Add observability + cost tracking. Must be able to answer "what did this cost us today?" by end of week (see the sketch after this list).
  • Week 3: Handle failure modes — rate limits, timeouts, malformed output. Errors must surface structured data to the client.
  • Week 4: Deploy. Real URL. Real users (even 5). Measure against the metric you defined in scoping.
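For the week-2 cost question, a minimal sketch, assuming you already record token counts per call (as in the logging example earlier). The per-million-token prices are illustrative; plug in your vendor's current price sheet.

```python
from collections import defaultdict
from datetime import date

# Illustrative $/1M-token rates only; use your vendor's current price sheet.
PRICE_PER_MTOK = {"gpt-4o": {"in": 2.50, "out": 10.00},
                  "gpt-4o-mini": {"in": 0.15, "out": 0.60}}

daily_cost = defaultdict(float)  # date -> dollars

def record_usage(model: str, tokens_in: int, tokens_out: int) -> None:
    """Call this from your per-request logging path."""
    p = PRICE_PER_MTOK[model]
    daily_cost[date.today()] += (tokens_in * p["in"] + tokens_out * p["out"]) / 1_000_000

def cost_today() -> float:
    """Answers 'what did this cost us today?' in one call."""
    return round(daily_cost[date.today()], 2)
```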

If weeks 1–2 slip more than 3 days, you're seeing the red flags early. That's the point of the trial.

Ready to hire?

I run this filter in practice. If you want help scoping, reviewing candidates, or just shipping the first version yourself with a builder who's done it before — describe what you're building and I'll tell you what the first 30 days should look like.
