Building an AI Phone Agent — Part 1: The Webhook Loop
A short series on what I learned while building a voice agent that picks up real phone calls. Day 1: how a server can even hear a phone call to begin with.
The mental model
A “phone agent” is a regular phone number — except instead of ringing a person, it rings a webhook. Whatever your code returns becomes the voice on the other end.
caller → telco → provider (Twilio/Stringee) → your webhook → STT → LLM → TTS → caller
Two ideas to internalize.
1. The provider is a virtual carrier
Twilio, Stringee, Plivo: they’re not “plugins” that telcos like Viettel or AT&T install. They act as carriers in their own right: they hold blocks of phone numbers allocated under the relevant numbering plan (the NANP in North America, national plans following the ITU’s E.164 standard elsewhere) and rent individual numbers to developers. To Viettel, a call to a US Twilio number looks like any other international call; it just happens to terminate inside Twilio’s cloud instead of on someone’s handset.
You can only control numbers you’ve rented from the provider. You cannot point a webhook at someone else’s number — account isolation, regulatory rules, and number ownership block that at every layer.
2. The webhook loop
- You rent a number from the provider’s dashboard.
- You set a webhook URL for incoming calls.
- When someone dials the number, the provider:
  - answers the call on the PSTN side,
  - opens a bidirectional audio stream (Twilio Media Streams / Stringee WebRTC) to your server,
  - sends frames as the caller speaks.
- Your server feeds those frames into STT → LLM → TTS, and streams response audio back.
- The provider plays your audio out to the caller.
Latency target end-to-end: under ~1.2 s per turn. Past 2 s the conversation feels broken.
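One way to keep that target honest is to write the per-turn budget down as data and check it. This is only a sketch: the stage names follow the pipeline above, the two thresholds come from the text, and every per-stage millisecond figure is a made-up placeholder rather than a measurement.

```typescript
// Per-turn latency budget check.
// Only TARGET_MS and BROKEN_MS come from the article; the stage timings
// in exampleTurn are hypothetical placeholders, not measurements.
type Stage = "network" | "stt" | "llm" | "tts";

const TARGET_MS = 1200; // "under ~1.2 s per turn"
const BROKEN_MS = 2000; // "past 2 s the conversation feels broken"

function totalLatencyMs(stages: Record<Stage, number>): number {
  return stages.network + stages.stt + stages.llm + stages.tts;
}

function classifyTurn(totalMs: number): "ok" | "sluggish" | "broken" {
  if (totalMs <= TARGET_MS) return "ok";
  if (totalMs <= BROKEN_MS) return "sluggish";
  return "broken";
}

// Hypothetical turn: 150 + 300 + 500 + 200 = 1150 ms.
const exampleTurn: Record<Stage, number> = {
  network: 150,
  stt: 300,
  llm: 500,
  tts: 200,
};
const exampleVerdict = classifyTurn(totalLatencyMs(exampleTurn));
```

Writing the budget as a sum makes the trade-off visible: any stage that grows has to be paid for by another stage shrinking.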
Wiring up Phase A
Phase A has a deliberately dumb goal: prove the loop end-to-end before touching any AI. Get a real phone to ring, a webhook to fire, and a canned greeting to play back. No STT, no LLM, no TTS; just “can the provider reach my laptop?”
A stable URL for the webhook
Twilio needs a public URL it can POST to. The naive option is cloudflared tunnel --url http://localhost:3001, which gives a random *.trycloudflare.com hostname every restart. Fine for a one-shot, painful when you have to update the Twilio webhook config every time you reboot.
The fix is a named tunnel bound to a real subdomain:
cloudflared tunnel create dialtone
# → tunnel UUID + credentials JSON written to ~/.cloudflared/<uuid>.json
# DNS: CNAME dialtone.leo6103.com → <uuid>.cfargotunnel.com (proxied)
Routing config lives in the repo (.cloudflared/config.yml); credentials stay gitignored:
tunnel: <uuid>
credentials-file: .cloudflared/<uuid>.json
ingress:
  - hostname: dialtone.leo6103.com
    service: http://localhost:3001
  - service: http_status:404
make prod now opens the tunnel; the webhook URL is permanent.
Twilio: number + voice config
- Sign up (verify identity, $15 trial credit).
- Phone Numbers → Buy a Number → US Local, Voice capability, $1.15/mo. Capabilities like SMS/MMS don’t change the rental fee — that’s per-message usage billing later. Toll-Free is $2/mo (different number class, not “more features”).
- Active Numbers → click the number → Voice Configuration → set:
  - A CALL COMES IN: Webhook
  - URL: https://dialtone.leo6103.com/voice
  - HTTP: POST
Skip Emergency Calling (E911) for an inbound-only agent. Skip A2P 10DLC registration — that’s for SMS, not voice.
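The same voice configuration can also be set through Twilio's REST API by POSTing VoiceUrl and VoiceMethod to the IncomingPhoneNumbers resource. The sketch below only builds the request and does not send it. The function name is my own, and the SIDs in the usage line are placeholders; real requests also need HTTP Basic auth with the Account SID and Auth Token.

```typescript
// Builds (but does not send) the REST request that sets a number's
// incoming-call webhook. buildVoiceConfigRequest is a hypothetical helper;
// the SIDs passed to it below are placeholders, not real identifiers.
interface VoiceConfigRequest {
  method: "POST";
  url: string;
  contentType: string;
  body: string;
}

function buildVoiceConfigRequest(
  accountSid: string,
  numberSid: string,
  webhookUrl: string,
): VoiceConfigRequest {
  const fields: Array<[string, string]> = [
    ["VoiceUrl", webhookUrl],
    ["VoiceMethod", "POST"],
  ];
  // Twilio's REST API takes form-encoded bodies, not JSON.
  const body = fields
    .map(([key, value]) => `${encodeURIComponent(key)}=${encodeURIComponent(value)}`)
    .join("&");
  return {
    method: "POST",
    url: `https://api.twilio.com/2010-04-01/Accounts/${accountSid}/IncomingPhoneNumbers/${numberSid}.json`,
    contentType: "application/x-www-form-urlencoded",
    body,
  };
}

// Placeholder SIDs for illustration only.
const voiceConfig = buildVoiceConfigRequest(
  "ACxxxxxxxx",
  "PNxxxxxxxx",
  "https://dialtone.leo6103.com/voice",
);
```

Having this in a script is handy whenever something else rewrites the number's webhook and it has to be put back.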
The Hello-World TwiML
Phase A’s entire server is one route returning static XML:
import { Hono } from 'hono';

// Phase A needs only the one /voice route.
const app = new Hono();

app.post('/voice', (c) => {
const twiml = `<?xml version="1.0" encoding="UTF-8"?>
<Response>
<Say voice="Polly.Joanna">Hello! You reached dialtone, Leo's phone agent. We are still under construction — goodbye for now.</Say>
<Hangup/>
</Response>`;
return c.text(twiml, 200, { 'content-type': 'text/xml' });
});
<Say> runs Amazon Polly inside Twilio’s edge, so the server never touches audio at this layer. That’s the whole point of Phase A: verify the control plane (HTTP webhook, DNS, tunnel, TwiML parsing) before piling on the data plane (real-time WebSocket audio).
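Once the greeting stops being a hard-coded string, the text inside <Say> has to be XML-escaped or a stray & or < will produce TwiML that Twilio cannot parse. A minimal sketch of a builder that does this is below; escapeXml and greetingTwiml are names I made up for illustration, not part of Hono or Twilio's SDK.

```typescript
// Escapes the characters that would break XML text or a double-quoted attribute.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, "&amp;") // must run first so later entities are not re-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Builds the same Say + Hangup response as the static route, with the
// message and voice passed in. Hypothetical helper for illustration.
function greetingTwiml(message: string, voice: string = "Polly.Joanna"): string {
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    "<Response>",
    `  <Say voice="${escapeXml(voice)}">${escapeXml(message)}</Say>`,
    "  <Hangup/>",
    "</Response>",
  ].join("\n");
}

const sampleTwiml = greetingTwiml("Hello from R&D. Press 1 if latency < 2 s.");
```

The route handler can then return greetingTwiml(...) with the same text/xml content type as before.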
Trial gotcha: verified caller ID only
A trial account only accepts inbound calls from numbers you’ve explicitly verified. So tempting alternatives like Viber Out or Skype Out fail silently — Twilio sees a generic carrier caller ID, not your registered number, and drops the call before your webhook fires. For the first smoke test, just dial directly from the verified phone (~$0.50 for a one-minute international call from VN to a US number; Twilio bills inbound in 1-minute increments rounded up).
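The rounding rule is worth seeing as arithmetic, because a 61-second test call costs the same as a 2-minute one. A small sketch, with the per-minute rate left as a parameter since the right figure depends on who is billing you (the 0.5 used below is just the rough VN-to-US figure from above):

```typescript
// Billing in 1-minute increments, rounded up.
function billedMinutes(durationSeconds: number): number {
  return Math.ceil(durationSeconds / 60);
}

// ratePerMinute is whatever your carrier or provider charges; it is a
// parameter here because this sketch does not assume any specific price list.
function callCost(durationSeconds: number, ratePerMinute: number): number {
  return billedMinutes(durationSeconds) * ratePerMinute;
}

const shortCall = callCost(45, 0.5); // billed as 1 minute
const justOver = callCost(61, 0.5);  // billed as 2 minutes
```

For smoke tests this means hanging up at 55 seconds rather than 65 halves the bill.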
First call
Dialed → ringing 2 s
→ trial disclaimer ~10 s ("Press any key to execute your code...")
→ "we are still under construction — dialtone, goodbye for now"
→ hangup
The “we...” cut-off is the trial disclaimer eating the first half of the greeting; in production it disappears. What matters is every layer in the chain fired exactly once, in order: PSTN → Twilio US1 → webhook engine → Cloudflare edge → cloudflared → Hono → TwiML → Polly → caller’s ear.
End of Phase A. The pipe is verified — every layer below the audio pipeline is now known-good.
Testing dead ends
Phase A confirmed the pipe works on a real PSTN call from my own phone. ~$0.50/call from VN to a US Twilio number is fine for a one-shot, painful for Phase B iteration where I’ll be making the call dozens of times a day. So I went looking for a cheaper test path.
Twilio Dev Phone — sounded ideal, didn’t work
Twilio ships a CLI plugin called Dev Phone that spins up a browser-based softphone using their Voice SDK. The pitch is great: in-browser dialer that calls into your Twilio number for ~$0.004/min, no PSTN, no carrier fees. I burned an afternoon on it. Three failure modes, in order:
Webhook collision. Dev Phone has to bind to a Twilio number to act as outbound caller ID and inbound destination. To do that it overwrites the number’s voice webhook. If your dialtone webhook lives on that same number, you have to pass --force, and the URL gets clobbered. Ctrl+C tries to restore — but the restore is best-effort and silently failed in my run, leaving the dialtone number with an empty voiceUrl until I noticed and ran the update API by hand.
Self-loop. Even with --force, dialing the dialtone number from a dev-phone bound to that same number creates a feedback loop: dev-phone sends the call out, Twilio routes it back through the (now dev-phone-controlled) webhook, and the call connects browser-to-itself. Five “completed” calls in the history, zero greeting played. The browser is talking to itself.
Sandbox number doesn’t help either. The textbook fix is to buy a second cheap number ($1.15/mo) and bind dev-phone to the sandbox, leaving the dialtone number’s webhook intact. I did that. Calls fail instantly:
Error 13225: Phone number is blacklisted.
Source: dev-phone Functions Service (outbound-call-handler)
Twilio’s anti-fraud policy blocks Voice SDK calls from one number on an account to any other number on the same account. There is no self-serve toggle. The lesson: Dev Phone is a sandbox for testing your own number’s incoming call handling, not for poking another number you control.
Cleanup also isn’t clean. Every dev-phone session leaves four resources on the account — Functions Service, Sync Service, Conversations Service, TwiML App, all named dev-phone-XXXX — that survive past Ctrl+C and have to be deleted by hand:
twilio api:serverless:v1:services:remove --sid ZS...
twilio api:sync:v1:services:remove --sid IS...
twilio api:conversations:v1:services:remove --sid IS...
twilio api:core:applications:remove --sid AP...
Harmless, but cluttering, and Twilio’s billing console will eventually ask you why you have 30 dead Functions Services.
What actually works
After upgrading the Twilio account (a $20 deposit that becomes prepaid balance, drawn down by real usage rather than charged as a fee), the trial disclaimer disappears and any caller can dial the number. Three viable iteration loops, ranked by cost:
| Method | Cost per 30 min of iteration | What it covers |
|---|---|---|
| Mock POST with curl | $0 | TwiML response only (Phase A regression) |
| Replay captured Media Streams frames | $0, after one real-call capture | Full server logic, no live audio |
| VoIP-out app (Yolla, Localphone) from VN | ~$0.33 | Full real-PSTN, end-to-end |
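The mock POST in the first row can also live in a test script instead of a curl one-liner. Twilio sends the voice webhook as a form-encoded POST with fields such as CallSid, From, To, CallStatus, and Direction. The sketch below builds that request; the helper name, the CallSid value, and the phone numbers are all fake placeholders.

```typescript
// Builds a fake "incoming call" webhook request for local testing.
// buildMockVoiceWebhook is a hypothetical helper; the CallSid is a dummy value.
interface MockWebhook {
  url: string;
  headers: Record<string, string>;
  body: string;
}

function buildMockVoiceWebhook(baseUrl: string, from: string, to: string): MockWebhook {
  const fields: Array<[string, string]> = [
    ["CallSid", "CA00000000000000000000000000000000"], // dummy SID
    ["From", from],
    ["To", to],
    ["CallStatus", "ringing"],
    ["Direction", "inbound"],
  ];
  // encodeURIComponent turns the leading "+" of E.164 numbers into %2B,
  // which matters because a raw "+" in a form body decodes as a space.
  const body = fields
    .map(([key, value]) => `${encodeURIComponent(key)}=${encodeURIComponent(value)}`)
    .join("&");
  return {
    url: `${baseUrl}/voice`,
    headers: { "content-type": "application/x-www-form-urlencoded" },
    body,
  };
}

// Placeholder numbers; send with:
//   fetch(mock.url, { method: "POST", headers: mock.headers, body: mock.body })
const mock = buildMockVoiceWebhook("http://localhost:3001", "+15555550100", "+15555550199");
```

A handler that validates Twilio's request signature would reject this request, so the mock only fits routes that skip validation in development.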
The build pattern collapses to: capture one real call’s WebSocket frames into a fixture, replay the fixture for every code change, and only burn a real call when verifying a regression or doing a demo.
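A rough sketch of the replay side, assuming the capture was saved as JSON Lines with one Media Streams message per line (that file format is my assumption, not something Twilio prescribes). Media Streams messages carry an event field such as "start", "media", or "stop", and media messages hold base64 audio in media.payload. The function and type names are hypothetical.

```typescript
// Minimal shape of the Media Streams messages this sketch cares about.
interface StreamMessage {
  event: string;
  streamSid?: string;
  media?: { payload: string };
}

interface ReplaySummary {
  mediaFrames: number;
  audioMs: number;
  sawStop: boolean;
}

const REPLAY_FRAME_MS = 20; // each media message carries 20 ms of audio

// Feeds every captured message to the same handler the live WebSocket would
// call, as fast as possible (no real-time pacing), and reports what it saw.
function replayFixture(
  jsonLines: string,
  onMessage: (msg: StreamMessage) => void,
): ReplaySummary {
  const summary: ReplaySummary = { mediaFrames: 0, audioMs: 0, sawStop: false };
  for (const line of jsonLines.split("\n")) {
    if (line.trim() === "") continue; // tolerate blank lines in the fixture
    const msg = JSON.parse(line) as StreamMessage;
    onMessage(msg);
    if (msg.event === "media") {
      summary.mediaFrames += 1;
      summary.audioMs += REPLAY_FRAME_MS;
    }
    if (msg.event === "stop") summary.sawStop = true;
  }
  return summary;
}
```

Because the replay ignores timing, it exercises parsing and state handling but says nothing about latency; that still needs a real call.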
The lesson is broader than dialtone — when a vendor’s “developer experience” tool sits on a different abstraction layer than the production path you’re testing, it’s almost always faster to mock the production path than to bend the tool.
What Phase B changes
Same /voice route, different TwiML:
<Response>
  <Connect>
    <Stream url="wss://dialtone.leo6103.com/stream" />
  </Connect>
</Response>
Now Twilio opens a WebSocket to the server and starts shipping 20 ms μ-law frames as the caller speaks. The server stops being a static-XML responder and becomes a real-time audio relay: Twilio frames in → Deepgram → Claude → ElevenLabs → frames back out. That’s where the latency budget starts mattering, where backpressure shows up, and where the interesting bugs live.
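To make the frame size concrete: Twilio's stream audio is 8 kHz, 8-bit μ-law, so a 20 ms frame is 160 bytes before base64 encoding. Audio going back to the caller is sent as a JSON "media" message carrying the stream's SID and a base64 payload. The sketch below shows both; the helper names and the sample SID are placeholders of mine.

```typescript
const SAMPLE_RATE_HZ = 8000;   // Twilio Media Streams audio: 8 kHz
const BYTES_PER_SAMPLE = 1;    // μ-law is 8 bits per sample
const STREAM_FRAME_MS = 20;    // frame duration from the text above

// Raw bytes in one frame: 8000 samples/s * 0.020 s * 1 byte = 160.
function bytesPerFrame(): number {
  return (SAMPLE_RATE_HZ * STREAM_FRAME_MS * BYTES_PER_SAMPLE) / 1000;
}

// How many frames a given stretch of audio turns into.
function framesFor(durationMs: number): number {
  return Math.ceil(durationMs / STREAM_FRAME_MS);
}

// Serializes one outbound audio message for the stream's WebSocket.
// The payload must already be base64-encoded 8 kHz μ-law audio.
function outboundMediaMessage(streamSid: string, base64Payload: string): string {
  return JSON.stringify({
    event: "media",
    streamSid,
    media: { payload: base64Payload },
  });
}

// "MZ0000" is a placeholder SID; the real one arrives in the "start" message.
const sampleOutbound = outboundMediaMessage("MZ0000", "AAAA");
```

At 50 frames per second in each direction, per-message overhead adds up quickly, which is part of why backpressure becomes a real concern in this phase.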
…to be continued in Part 2.