Calls
Voice calls are real-time conversations through your agent’s phone numbers. Calls can be inbound (received) or outbound (initiated via API). Each call includes metadata like duration, status, and transcript. You can also stream transcripts in real time via Server-Sent Events.
How calls are handled depends on your agent’s voice mode.
Voice modes
voiceMode: "webhook" (default) — Caller speech is transcribed and sent to your webhook as agent.message events. Your server controls every response using any LLM, RAG, or custom logic.
voiceMode: "hosted" — Calls are handled end-to-end by a built-in LLM using your systemPrompt. No webhook or server needed.
Switch modes at any time via PATCH /v1/agents/:id. The backend automatically re-provisions voice infrastructure and rebinds phone numbers with no downtime.
SMS is always webhook-based regardless of voice mode.
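Switching the mode is a single PATCH against the agent. A minimal sketch using only the Python standard library; the base URL is a placeholder and Bearer-key authentication is an assumption, so adapt both to your account:

```python
import json
import urllib.request

API_BASE = "https://api.agentphone.example"  # placeholder: your API base URL
API_KEY = "YOUR_API_KEY"                     # assumption: Bearer-key auth

def build_voice_mode_patch(agent_id: str, voice_mode: str) -> urllib.request.Request:
    """Build a PATCH /v1/agents/:id request that switches the voice mode."""
    if voice_mode not in ("webhook", "hosted"):
        raise ValueError("voiceMode must be 'webhook' or 'hosted'")
    return urllib.request.Request(
        f"{API_BASE}/v1/agents/{agent_id}",
        data=json.dumps({"voiceMode": voice_mode}).encode(),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="PATCH",
    )

# Sending it (requires a valid key):
# with urllib.request.urlopen(build_voice_mode_patch("agent_123", "hosted")) as resp:
#     print(json.load(resp))
```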
Call flow (webhook mode)
When voiceMode is "webhook":
Transcript is sent to your webhook
We POST the transcript to your webhook with event: "agent.message" and channel: "voice", including recentHistory for context.
Your server responds
You process the transcript (e.g., send to your LLM) and return a response. We strongly recommend streaming NDJSON — TTS starts speaking on the first chunk.
Call flow (built-in AI mode)
When voiceMode is "hosted":
Voice capabilities
Both modes share the same low-latency voice engine.
Webhook response format
For voice webhooks, your server must return a JSON object ({...}) telling the agent what to say. Non-object responses (numbers, strings, arrays) are ignored and the caller hears silence.
Streaming response (recommended)
Return Content-Type: application/x-ndjson with newline-delimited JSON chunks. TTS starts speaking on the very first chunk while your server continues processing.
Mark interim chunks with "interim": true — the final chunk (without interim) closes the turn. Use this for tool calls, LLM token forwarding, or any time your response takes more than ~1 second.
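On the wire, a streamed turn can look like this (illustrative text; the text and interim fields are the documented ones):

```json
{"text": "Let me check on that.", "interim": true}
{"text": "Your appointment is confirmed for 3 PM tomorrow."}
```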
Simple response
Return a single JSON object for instant replies where no processing delay is expected.
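For example, an instant reply is just one object (illustrative text):

```json
{"text": "Our office is open 9 to 5, Monday through Friday."}
```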
Response fields
Webhook timeout
Voice webhook requests have a 30-second default timeout (configurable from 5–120 seconds per webhook via the timeout field). If your server doesn’t start responding in time, the request is cancelled and the caller hears silence for that turn. This is especially important when your webhook calls external APIs or runs LLM tool calls — always stream an interim chunk immediately so the caller hears something while you process.
Example: streaming handler (Python / FastAPI)
Example: streaming handler (Node.js / Express)
Example: tool-calling handler (Python / Flask)
When your agent needs to call external APIs (databases, calendars, CRM, etc.) during a voice call, always stream an interim filler response first. This prevents the caller from hearing silence while your tools run.
The pattern is: stream an interim acknowledgement immediately → run your tools → stream the final answer.
Example: tool-calling handler (Node.js / Express)
Why interim chunks matter for tool calls
Without the interim chunk, the caller hears dead silence while your LLM decides which tool to call, the external API responds, and the LLM summarises the result. With streaming, they hear “Let me check on that” within milliseconds — just like a human assistant would.
Troubleshooting voice calls
Common issues and how to fix them.
Caller hears silence after speaking
Your webhook is too slow or not responding. Voice webhooks have a 30-second default timeout (configurable per webhook from 5–120 seconds). If your server doesn’t respond in time, the turn is dropped and the caller hears nothing.
Fix: Always stream an interim NDJSON chunk immediately (e.g. {"text": "One moment.", "interim": true}) before doing any slow work. This buys you time while keeping the caller engaged.
Common causes:
- LLM tool calls that take too long (external API latency + LLM processing)
- Cold starts on serverless platforms (Lambda, Cloud Functions)
- Webhook URL is unreachable or returning errors
Caller hears silence after the greeting
Your webhook isn’t configured or isn’t returning a valid JSON object. Voice responses must be a JSON object ({...}). Non-object responses (strings, arrays, numbers) are ignored.
Fix: Verify your webhook is returning {"text": "..."}. Use POST /v1/webhooks/test to confirm your endpoint is reachable and responding correctly.
Response is cut off or sounds garbled
You’re sending the entire response as one large chunk, so TTS can’t start until the whole chunk has arrived.
Fix: Use NDJSON streaming and break responses into natural sentences. Send each sentence as an interim chunk so TTS can start speaking immediately.
Agent speaks XML or code artifacts
Your LLM is including tool-call markup in its response. Some LLMs emit <function_call> or similar tags.
Fix: Strip non-speech content from your LLM output before returning it. AgentPhone removes common patterns automatically, but your webhook should clean responses to be safe.
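A cleaning pass can be a small regex sweep. This is an illustrative sketch; the tag patterns below are examples, not an exhaustive list of what your LLM may emit:

```python
import re

# Remove tool-call markup and leftover tags before sending text to TTS.
_NON_SPEECH = re.compile(
    r"<function_call>.*?</function_call>"  # tool-call markup
    r"|<[^>]+>",                           # any other leftover tags
    re.DOTALL,
)

def clean_for_speech(text: str) -> str:
    """Strip non-speech artifacts and collapse the resulting whitespace."""
    return re.sub(r"\s+", " ", _NON_SPEECH.sub(" ", text)).strip()
```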
Webhook works for SMS but not voice
You’re returning a 200 OK with no body, or a non-JSON response for voice. SMS webhooks only need a 200 status — voice webhooks must return a JSON object with a text field.
Fix: Check the channel field in the webhook payload. For "voice", always return {"text": "..."}. For "sms", a 200 OK is sufficient.
Call recording
Call recording is an optional add-on that saves audio recordings of your voice calls. When enabled, completed calls include a recordingUrl field with a link to the audio file.
Enable recording from the Billing page in the dashboard. See Usage & Billing for pricing.
Recordings are captured automatically for all calls while the add-on is active. If you disable the add-on, existing recordings are preserved but recordingUrl will be null until you re-enable it.
List calls
List all calls for this project.
Query parameters
Example
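A minimal sketch using only the Python standard library. The base URL, Bearer-key auth, the limit parameter, and the data envelope in the response are assumptions; check the parameter table above for the real query options:

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.agentphone.example"  # placeholder: your API base URL
API_KEY = "YOUR_API_KEY"                     # assumption: Bearer-key auth

def list_calls_request(**params) -> urllib.request.Request:
    """Build a GET /v1/calls request; pass query parameters as kwargs."""
    query = f"?{urllib.parse.urlencode(params)}" if params else ""
    return urllib.request.Request(
        f"{API_BASE}/v1/calls{query}",
        headers={"Authorization": f"Bearer {API_KEY}"},
    )

# Sending it (requires a valid key; `limit` and the `data` key are assumptions):
# with urllib.request.urlopen(list_calls_request(limit=20)) as resp:
#     for call in json.load(resp)["data"]:
#         print(call["id"], call["status"])
```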
Get call
Get details of a specific call, including its full transcript.
Example
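A sketch of fetching one call. The GET /v1/calls/:id path is assumed from the list endpoint above; base URL and Bearer-key auth are placeholders:

```python
import json
import urllib.request

API_BASE = "https://api.agentphone.example"  # placeholder
API_KEY = "YOUR_API_KEY"                     # assumption: Bearer-key auth

def get_call_request(call_id: str) -> urllib.request.Request:
    """Build a GET /v1/calls/:id request for one call and its transcript."""
    return urllib.request.Request(
        f"{API_BASE}/v1/calls/{call_id}",
        headers={"Authorization": f"Bearer {API_KEY}"},
    )

# Sending it (requires a valid key):
# with urllib.request.urlopen(get_call_request("call_abc")) as resp:
#     call = json.load(resp)
#     print(call["status"])
```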
Stream transcript (SSE)
Stream a call’s transcript in real time via Server-Sent Events. On connect the server replays all existing transcript turns from the database, then streams new turns as they arrive from the voice engine. Works for both live and completed calls — same URL either way.
The response is an SSE stream (Content-Type: text/event-stream) with the following event types:
A : heartbeat comment is sent every 15 seconds to keep proxies and load balancers from closing the connection.
Event payloads
connected
turn
ended
Example: curl
Example: JavaScript (Node.js / fetch stream)
Example: Python
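A standard-library SSE client might look like the sketch below. The stream path (/v1/calls/:id/stream) is an assumption, so substitute the documented URL; the parser handles the connected, turn, and ended events plus the : heartbeat comments described above:

```python
import json
import urllib.request

API_BASE = "https://api.agentphone.example"  # placeholder
API_KEY = "YOUR_API_KEY"                     # assumption: Bearer-key auth

def parse_sse(lines):
    """Yield (event, payload) pairs from an iterable of decoded SSE lines."""
    event, data = "message", []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(":"):          # heartbeat comment, ignore
            continue
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
        elif line == "" and data:         # blank line ends one event
            yield event, json.loads("\n".join(data))
            event, data = "message", []

def stream_transcript(call_id: str):
    req = urllib.request.Request(
        f"{API_BASE}/v1/calls/{call_id}/stream",  # assumption: stream path
        headers={"Authorization": f"Bearer {API_KEY}"},
    )
    with urllib.request.urlopen(req) as resp:
        for event, payload in parse_sse(line.decode() for line in resp):
            print(event, payload)
            if event == "ended":
                break
```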
Live and completed calls use the same URL
For a live call, existing turns are replayed first, then the stream stays open and delivers new turns in real time until the call ends. For a completed call, all turns are replayed immediately followed by an ended event and the stream closes. Your client code doesn’t need to differentiate — just handle the events.
Create outbound call
Initiate an outbound voice call from one of your agent’s phone numbers. The agent’s first assigned phone number is used as the caller ID.
Request body
Example
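A sketch of initiating the call. The POST /v1/calls path and the agentId/to body fields are assumptions about the request schema; the base URL and Bearer-key auth are placeholders:

```python
import json
import urllib.request

API_BASE = "https://api.agentphone.example"  # placeholder
API_KEY = "YOUR_API_KEY"                     # assumption: Bearer-key auth

def create_outbound_call_request(agent_id: str, to: str) -> urllib.request.Request:
    """Build a POST /v1/calls request (path and field names are assumptions)."""
    return urllib.request.Request(
        f"{API_BASE}/v1/calls",
        data=json.dumps({"agentId": agent_id, "to": to}).encode(),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# with urllib.request.urlopen(create_outbound_call_request("agent_123", "+15555550123")) as resp:
#     print(json.load(resp))
```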
Web calls
Web calls let users talk to your agent directly from a browser, no phone number needed. Use the agentphone-web-sdk npm package on the frontend and mint access tokens from your backend.
How it works:
- Your backend calls POST /v1/calls/web with agentId to get an access token. The token is valid for 30 seconds, so pass it to the frontend immediately.
- The frontend uses agentphone-web-sdk to start the call with the token.
Request body
Response
Example
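A sketch of the backend half, minting the short-lived token. The POST /v1/calls/web path and agentId field come from the flow above; the base URL, Bearer-key auth, and token response key are assumptions:

```python
import json
import urllib.request

API_BASE = "https://api.agentphone.example"  # placeholder
API_KEY = "YOUR_API_KEY"                     # assumption: Bearer-key auth

def mint_web_call_token_request(agent_id: str) -> urllib.request.Request:
    """Build the POST /v1/calls/web request that mints a web-call token."""
    return urllib.request.Request(
        f"{API_BASE}/v1/calls/web",
        data=json.dumps({"agentId": agent_id}).encode(),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# The token is valid for 30 seconds, so return it to the browser immediately
# and let agentphone-web-sdk start the call with it:
# with urllib.request.urlopen(mint_web_call_token_request("agent_123")) as resp:
#     token = json.load(resp)["token"]  # assumption: response key name
```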
Calls started this way have direction "web", in addition to the usual "inbound" and "outbound".
List calls for number
List all calls associated with a specific phone number.
