Calls

Voice calls are real-time conversations through your agent’s phone numbers. Calls can be inbound (received) or outbound (initiated via API). Each call includes metadata like duration, status, and transcript. You can also stream transcripts in real time via Server-Sent Events.

How calls are handled depends on your agent’s voice mode.

Voice modes

Custom Webhook

voiceMode: "webhook" (default) — Caller speech is transcribed and sent to your webhook as agent.message events. Your server controls every response using any LLM, RAG, or custom logic.

Built-in AI

voiceMode: "hosted" — Calls are handled end-to-end by a built-in LLM using your systemPrompt. No webhook or server needed.

Switch modes at any time via PATCH /v1/agents/:id. The backend automatically re-provisions voice infrastructure and rebinds phone numbers with no downtime.
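For example, switching an agent to built-in AI mode is a one-field partial update (illustrative request body):

```json
{ "voiceMode": "hosted" }
```

Sending the same PATCH with "webhook" switches back.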

SMS is always webhook-based regardless of voice mode.

Call flow (webhook mode)

When voiceMode is "webhook":

1. Caller dials your number
   The voice engine answers and begins streaming audio.

2. Caller speaks
   Streaming STT transcribes in real time and detects end of speech.

3. Transcript is sent to your webhook
   We POST the transcript to your webhook with event: "agent.message" and channel: "voice", including recentHistory for context.

4. Your server responds
   You process the transcript (e.g., send it to your LLM) and return a response. We strongly recommend streaming NDJSON so TTS starts speaking on the first chunk.

5. TTS speaks the response
   Each NDJSON chunk is spoken with sub-second latency, with no waiting for the full response.

6. Conversation continues
   The caller can interrupt at any time (barge-in). The cycle repeats naturally.

Call flow (built-in AI mode)

When voiceMode is "hosted":

1. Caller dials your number
   The AI answers with your beginMessage (e.g., “Hello! How can I help?”).

2. Caller speaks
   Streaming STT transcribes in real time.

3. Built-in LLM generates a response
   The LLM uses your systemPrompt to generate a contextual response.

4. TTS speaks the response
   Streaming TTS speaks the response with sub-second latency.

5. Conversation continues
   No server or webhook involved; the platform handles everything.

Voice capabilities

Both modes share the same low-latency engine:

| Capability | Description |
| --- | --- |
| Streaming STT | Real-time speech-to-text transcription |
| Streaming TTS | Sub-second text-to-speech synthesis |
| Barge-in | Caller can interrupt the agent mid-sentence |
| Backchanneling | Natural conversational cues (“uh-huh”, “right”) |
| Turn detection | Smart end-of-speech detection |
| Streaming responses | Return NDJSON to start TTS on the first chunk |
| Live transcript streaming | Stream transcript turns in real time via SSE — works for live and completed calls |
| DTMF digit press | Press keypad digits to navigate IVR menus and automated phone systems |
| Call recording | Optional add-on — automatically records calls and provides audio URLs |

Webhook response format

For voice webhooks, your server must return a JSON object ({...}) telling the agent what to say. Non-object responses (numbers, strings, arrays) are ignored and the caller hears silence.

Return Content-Type: application/x-ndjson with newline-delimited JSON chunks. TTS starts speaking on the very first chunk while your server continues processing.

{"text": "Let me check that for you.", "interim": true}
{"text": "Your order #4521 shipped yesterday via FedEx."}

Mark interim chunks with "interim": true — the final chunk (without interim) closes the turn. Use this for tool calls, LLM token forwarding, or any time your response takes more than ~1 second.
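Framework aside, the shape of one streamed turn can be sketched with a small helper (hypothetical function, not part of the API):

```python
import json

def ndjson_turn(filler: str, final_text: str):
    """Yield the NDJSON lines for one voice turn: an interim filler chunk
    first, then a final chunk (no "interim" key) that closes the turn."""
    yield json.dumps({"text": filler, "interim": True}) + "\n"
    yield json.dumps({"text": final_text}) + "\n"

lines = list(ndjson_turn("One moment.", "Your order shipped yesterday."))
```

Each yielded line is spoken as soon as it arrives, so the filler plays while the final answer is still being produced.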

Simple response

Return a single JSON object for instant replies where no processing delay is expected.

{ "text": "How can I help you?" }

Response fields

| Field | Type | Description |
| --- | --- | --- |
| text | string | Text to speak to the caller |
| hangup | boolean | Set to true to end the call after speaking |
| action | string | "transfer" to cold-transfer the call (requires transferNumber on the agent), "hangup" to end it |
| digits | string | DTMF digits to press on the keypad (e.g. "1", "123", "1*#"). Used to navigate IVR menus and automated phone systems. Aliases: press_digit, dtmf |
| interim | boolean | NDJSON only — marks a chunk as interim (TTS speaks it but the turn stays open) |

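Each line below is a separate example response body, one per situation, combining text with the control fields (the transfer example assumes transferNumber is set on the agent):

```json
{ "text": "Thanks for calling. Goodbye!", "hangup": true }
{ "text": "Let me transfer you to a specialist.", "action": "transfer" }
{ "text": "Entering your extension now.", "digits": "123" }
```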
Webhook timeout

Voice webhook requests have a 30-second default timeout (configurable from 5–120 seconds per webhook via the timeout field). If your server doesn’t start responding in time, the request is cancelled and the caller hears silence for that turn. This is especially important when your webhook calls external APIs or runs LLM tool calls — always stream an interim chunk immediately so the caller hears something while you process.
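Assuming the timeout field is settable wherever the webhook is configured, raising the limit to 60 seconds would be a partial update like:

```json
{ "timeout": 60 }
```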

Example: streaming handler (Python / FastAPI)

from fastapi import FastAPI, Response
from fastapi.responses import StreamingResponse
from openai import OpenAI
import json

app = FastAPI()
client = OpenAI()

@app.post('/webhook')
async def handle_voice(payload: dict):
    if payload.get('channel') != 'voice':
        return Response(status_code=200)

    history = payload.get('recentHistory', [])
    context = "\n".join(
        f"{'Customer' if h['direction'] == 'inbound' else 'Agent'}: {h['content']}"
        for h in history
    )

    async def generate():
        # Speak a filler immediately so the caller never waits in silence
        yield json.dumps({"text": "One moment, let me check.", "interim": True}) + "\n"

        stream = client.chat.completions.create(
            model="gpt-4",
            stream=True,
            messages=[
                {"role": "system", "content": "You are a helpful phone agent."},
                {"role": "user", "content": f"Conversation:\n{context}\n\nRespond."}
            ]
        )
        # Accumulate the streamed tokens and speak the complete answer
        # as the final (non-interim) chunk that closes the turn
        full = ""
        for chunk in stream:
            full += chunk.choices[0].delta.content or ""
        yield json.dumps({"text": full}) + "\n"

    return StreamingResponse(generate(), media_type="application/x-ndjson")

Example: streaming handler (Node.js / Express)

const express = require('express');
const OpenAI = require('openai');

const app = express();
const openai = new OpenAI();

app.post('/webhook', express.json(), async (req, res) => {
  if (req.body.channel !== 'voice') return res.status(200).send('OK');

  const history = req.body.recentHistory || [];
  const context = history
    .map(h => `${h.direction === 'inbound' ? 'Customer' : 'Agent'}: ${h.content}`)
    .join('\n');

  res.setHeader('Content-Type', 'application/x-ndjson');
  res.write(JSON.stringify({ text: 'One moment, let me check.', interim: true }) + '\n');

  const stream = await openai.chat.completions.create({
    model: 'gpt-4',
    stream: true,
    messages: [
      { role: 'system', content: 'You are a helpful phone agent.' },
      { role: 'user', content: `Conversation:\n${context}\n\nRespond.` }
    ]
  });

  let full = '';
  for await (const chunk of stream) {
    full += chunk.choices[0]?.delta?.content || '';
  }
  res.write(JSON.stringify({ text: full }) + '\n');
  res.end();
});

app.listen(3000);

Example: tool-calling handler (Python / Flask)

When your agent needs to call external APIs (databases, calendars, CRM, etc.) during a voice call, always stream an interim filler response first. This prevents the caller from hearing silence while your tools run.

The pattern is: stream an interim acknowledgement immediately → run your tools → stream the final answer.

voice_tools_webhook.py
from flask import Flask, request, Response
import json, os
import anthropic

app = Flask(__name__)
client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

TOOLS = [
    {
        "name": "get_todays_calendar",
        "description": "Get the user's calendar events for today.",
        "input_schema": {"type": "object", "properties": {}, "required": []},
    },
    {
        "name": "search_orders",
        "description": "Look up a customer's recent orders.",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
]

# Placeholder handlers: wire these up to your own calendar and order systems.
TOOL_HANDLERS = {
    "get_todays_calendar": lambda args: fetch_calendar_events(),
    "search_orders": lambda args: search_order_db(args["query"]),
}


def run_tool_call(user_message: str, history: list) -> str:
    """Run Claude with tools and return the final text response."""
    messages = [{"role": "user", "content": user_message}]

    for _ in range(5):  # max tool-call iterations
        response = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=256,
            system="You are a helpful phone assistant. Keep responses to 2-3 sentences.",
            tools=TOOLS,
            messages=messages,
        )

        if response.stop_reason == "tool_use":
            messages.append({"role": "assistant", "content": response.content})
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    handler = TOOL_HANDLERS.get(block.name)
                    result = handler(block.input) if handler else "Unknown tool"
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": result,
                    })
            messages.append({"role": "user", "content": tool_results})
        else:
            return " ".join(b.text for b in response.content if hasattr(b, "text"))

    return "Sorry, I'm having trouble processing that."


@app.post("/webhook")
def webhook():
    payload = request.json
    if payload.get("channel") != "voice":
        return "OK", 200

    transcript = payload["data"].get("transcript", "")
    history = payload.get("recentHistory", [])

    def generate():
        # Immediately tell the caller we're working on it
        yield json.dumps({"text": "Let me check on that.", "interim": True}) + "\n"

        # Now run the slow tool calls (LLM + external APIs)
        try:
            answer = run_tool_call(transcript, history)
        except Exception:
            answer = "Sorry, I ran into a problem. Could you try again?"

        yield json.dumps({"text": answer}) + "\n"

    return Response(generate(), content_type="application/x-ndjson")

Example: tool-calling handler (Node.js / Express)

voice_tools_webhook.js
const express = require("express");
const Anthropic = require("@anthropic-ai/sdk");

const app = express();
app.use(express.json());

const client = new Anthropic();

const tools = [
  {
    name: "get_todays_calendar",
    description: "Get the user's calendar events for today.",
    input_schema: { type: "object", properties: {}, required: [] },
  },
  {
    name: "search_orders",
    description: "Look up a customer's recent orders.",
    input_schema: {
      type: "object",
      properties: { query: { type: "string" } },
      required: ["query"],
    },
  },
];

// Placeholder handlers: wire these up to your own calendar and order systems.
const toolHandlers = {
  get_todays_calendar: (args) => fetchCalendarEvents(),
  search_orders: (args) => searchOrderDb(args.query),
};

async function runToolCall(userMessage) {
  const messages = [{ role: "user", content: userMessage }];

  for (let i = 0; i < 5; i++) {
    const response = await client.messages.create({
      model: "claude-haiku-4-5-20251001",
      max_tokens: 256,
      system: "You are a helpful phone assistant. Keep responses to 2-3 sentences.",
      tools,
      messages,
    });

    if (response.stop_reason === "tool_use") {
      messages.push({ role: "assistant", content: response.content });
      const toolResults = [];
      for (const block of response.content) {
        if (block.type === "tool_use") {
          const handler = toolHandlers[block.name];
          const result = handler ? await handler(block.input) : "Unknown tool";
          toolResults.push({ type: "tool_result", tool_use_id: block.id, content: result });
        }
      }
      messages.push({ role: "user", content: toolResults });
    } else {
      return response.content
        .filter((b) => b.type === "text")
        .map((b) => b.text)
        .join(" ");
    }
  }
  return "Sorry, I'm having trouble processing that.";
}

app.post("/webhook", async (req, res) => {
  if (req.body.channel !== "voice") return res.status(200).send("OK");

  const transcript = req.body.data?.transcript || "";

  res.setHeader("Content-Type", "application/x-ndjson");

  // Immediately tell the caller we're working on it
  res.write(JSON.stringify({ text: "Let me check on that.", interim: true }) + "\n");

  // Now run the slow tool calls (LLM + external APIs)
  try {
    const answer = await runToolCall(transcript);
    res.write(JSON.stringify({ text: answer }) + "\n");
  } catch (err) {
    res.write(JSON.stringify({ text: "Sorry, I ran into a problem." }) + "\n");
  }
  res.end();
});

app.listen(3000);

Why interim chunks matter for tool calls

Without the interim chunk, the caller hears dead silence while your LLM decides which tool to call, the external API responds, and the LLM summarises the result. With streaming, they hear “Let me check on that” within milliseconds — just like a human assistant would.


Troubleshooting voice calls

Common issues and how to fix them.

Caller hears silence after speaking

Your webhook is too slow or not responding. Voice webhooks have a 30-second default timeout (configurable per webhook from 5–120 seconds). If your server doesn’t respond in time, the turn is dropped and the caller hears nothing.

Fix: Always stream an interim NDJSON chunk immediately (e.g. {"text": "One moment.", "interim": true}) before doing any slow work. This buys you time while keeping the caller engaged.

Common causes:

  • LLM tool calls that take too long (external API latency + LLM processing)
  • Cold starts on serverless platforms (Lambda, Cloud Functions)
  • Webhook URL is unreachable or returning errors

Caller hears silence after the greeting

Your webhook isn’t configured or isn’t returning a valid JSON object. Voice responses must be a JSON object ({...}). Non-object responses (strings, arrays, numbers) are ignored.

Fix: Verify your webhook is returning {"text": "..."}. Use POST /v1/webhooks/test to confirm your endpoint is reachable and responding correctly.

Response is cut off or sounds garbled

You’re sending the entire response as one large chunk, which delays TTS onset for long responses.

Fix: Use NDJSON streaming and break responses into natural sentences. Send each sentence as an interim chunk so TTS can start speaking immediately.

Agent speaks XML or code artifacts

Your LLM is including tool-call markup in its response. Some LLMs emit <function_call> or similar tags.

Fix: Strip non-speech content from your LLM output before returning it. AgentPhone removes common patterns automatically, but your webhook should clean responses to be safe.
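A defensive cleanup pass might look like this (illustrative regexes, not the platform's built-in filter):

```python
import re

def strip_nonspeech(text: str) -> str:
    """Remove common tool-call/markup artifacts before returning text to TTS."""
    # Drop XML-style tool-call blocks such as <function_call>...</function_call>
    text = re.sub(r"<(function_call|tool_call|invoke)\b[^>]*>.*?</\1>",
                  "", text, flags=re.DOTALL)
    # Drop any other stray tags
    text = re.sub(r"</?[a-zA-Z_][\w-]*[^>]*>", "", text)
    # Drop fenced code blocks
    text = re.sub(r"```.*?```", "", text, flags=re.DOTALL)
    # Collapse the whitespace left behind
    return re.sub(r"\s+", " ", text).strip()
```

Run this on your LLM output right before returning the text field.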

Webhook works for SMS but not voice

You’re returning a 200 OK with no body, or a non-JSON response for voice. SMS webhooks only need a 200 status — voice webhooks must return a JSON object with a text field.

Fix: Check the channel field in the webhook payload. For "voice", always return {"text": "..."}. For "sms", a 200 OK is sufficient.


Call recording

Call recording is an optional add-on that saves audio recordings of your voice calls. When enabled, completed calls include a recordingUrl field with a link to the audio file.

| Field | Type | Description |
| --- | --- | --- |
| recordingUrl | string or null | URL to the call recording audio file. Only populated when the recording add-on is enabled. |
| recordingAvailable | boolean | Whether a recording exists for this call. Can be true even when recordingUrl is null (recording exists but the add-on is not active). |

Enable recording from the Billing page in the dashboard. See Usage & Billing for pricing.

Recordings are captured automatically for all calls while the add-on is active. If you disable the add-on, existing recordings are preserved but recordingUrl will be null until you re-enable it.


List calls

List all calls for this project.

GET /v1/calls

Query parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| limit | integer | No | 20 | Number of results to return (max 100) |
| offset | integer | No | 0 | Number of results to skip (min 0) |
| status | string | No | | Filter by status: completed, in-progress, failed |
| direction | string | No | | Filter by direction: inbound, outbound, web |
| search | string | No | | Search by phone number (matches fromNumber or toNumber) |

Example

$ curl -X GET "https://api.agentphone.to/v1/calls?limit=10&offset=0" \
    -H "Authorization: Bearer YOUR_API_KEY"

{
  "data": [
    {
      "id": "call_ghi012",
      "agentId": "agt_abc123",
      "phoneNumberId": "num_xyz789",
      "phoneNumber": "+15551234567",
      "fromNumber": "+15559876543",
      "toNumber": "+15551234567",
      "direction": "inbound",
      "status": "completed",
      "startedAt": "2025-01-15T14:00:00Z",
      "endedAt": "2025-01-15T14:05:30Z",
      "durationSeconds": 330,
      "lastTranscriptSnippet": "Thank you for calling, goodbye!",
      "recordingUrl": "https://api.twilio.com/2010-04-01/.../Recordings/RE...",
      "recordingAvailable": true
    }
  ],
  "hasMore": false,
  "total": 1
}

Get call

Get details of a specific call, including its full transcript.

GET /v1/calls/{call_id}

Example

$ curl -X GET "https://api.agentphone.to/v1/calls/call_ghi012" \
    -H "Authorization: Bearer YOUR_API_KEY"

{
  "id": "call_ghi012",
  "agentId": "agt_abc123",
  "phoneNumberId": "num_xyz789",
  "phoneNumber": "+15551234567",
  "fromNumber": "+15559876543",
  "toNumber": "+15551234567",
  "direction": "inbound",
  "status": "completed",
  "startedAt": "2025-01-15T14:00:00Z",
  "endedAt": "2025-01-15T14:05:30Z",
  "durationSeconds": 330,
  "recordingUrl": "https://api.twilio.com/2010-04-01/.../Recordings/RE...",
  "recordingAvailable": true,
  "transcripts": [
    {
      "id": "tr_001",
      "transcript": "Hello! Thanks for calling Acme Corp. How can I help you today?",
      "confidence": 0.95,
      "response": "Sure! Could you please provide your order number?",
      "createdAt": "2025-01-15T14:00:05Z"
    },
    {
      "id": "tr_002",
      "transcript": "Hi, I'd like to check the status of my order.",
      "confidence": 0.92,
      "response": "Of course! Let me look that up for you.",
      "createdAt": "2025-01-15T14:00:15Z"
    }
  ]
}

Stream transcript (SSE)

Stream a call’s transcript in real time via Server-Sent Events. On connect the server replays all existing transcript turns from the database, then streams new turns as they arrive from the voice engine. Works for both live and completed calls — same URL either way.

GET /v1/calls/{call_id}/transcript/stream

The response is an SSE stream (Content-Type: text/event-stream) with the following event types:

| Event | Description |
| --- | --- |
| connected | Sent once on connect with call metadata (callId, agentId, direction, etc.) |
| turn | A transcript turn — either replayed from history or arriving in real time. Contains role ("user" or "agent"), content, and createdAt. |
| ended | The call has ended. Contains callId, status, endedAt, and durationSeconds. The stream closes after this event. |

A : heartbeat comment is sent every 15 seconds to keep proxies and load balancers from closing the connection.

Event payloads

connected

{
  "callId": "call_ghi012",
  "status": "in-progress",
  "agentId": "agt_abc123",
  "agentName": "Support Bot",
  "direction": "inbound",
  "fromNumber": "+15559876543",
  "toNumber": "+15551234567",
  "startedAt": "2025-01-15T14:00:00Z"
}

turn

{
  "role": "user",
  "content": "I need help with my order.",
  "createdAt": "2025-01-15T14:00:10Z"
}

ended

{
  "callId": "call_ghi012",
  "status": "completed",
  "endedAt": "2025-01-15T14:05:30Z",
  "durationSeconds": 330
}

Example: curl

$ curl -N "https://api.agentphone.to/v1/calls/call_ghi012/transcript/stream" \
    -H "Authorization: Bearer YOUR_API_KEY"

Example: JavaScript (Node.js / fetch stream)

const API_KEY = "YOUR_API_KEY";
const callId = "call_ghi012";
const response = await fetch(
  `https://api.agentphone.to/v1/calls/${callId}/transcript/stream`,
  { headers: { Authorization: `Bearer ${API_KEY}` } }
);

if (!response.ok) {
  throw new Error(`Failed to stream transcript: ${response.status}`);
}

const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = "";
let eventType = null;

while (true) {
  const { done, value } = await reader.read();
  if (done) break;

  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split("\n");
  buffer = lines.pop();

  for (const line of lines) {
    if (line.startsWith("event:")) {
      eventType = line.slice(6).trim();
    } else if (line.startsWith("data:")) {
      const data = JSON.parse(line.slice(5).trim());
      if (eventType === "connected") {
        console.log(`Connected to call ${data.callId} (${data.status})`);
      } else if (eventType === "turn") {
        console.log(`[${data.role}] ${data.content}`);
      } else if (eventType === "ended") {
        console.log(`Call ended - ${data.durationSeconds}s`);
      }
    }
  }
}

Example: Python

import json
import requests

API_KEY = "YOUR_API_KEY"
call_id = "call_ghi012"

url = f"https://api.agentphone.to/v1/calls/{call_id}/transcript/stream"
headers = {"Authorization": f"Bearer {API_KEY}"}

with requests.get(url, headers=headers, stream=True) as resp:
    event_type = None
    for line in resp.iter_lines(decode_unicode=True):
        if line.startswith("event:"):
            event_type = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data = json.loads(line[len("data:"):].strip())
            if event_type == "turn":
                print(f"[{data['role']}] {data['content']}")
            elif event_type == "ended":
                print(f"Call ended — {data['durationSeconds']}s")
                break

Live and completed calls use the same URL

For a live call, existing turns are replayed first, then the stream stays open and delivers new turns in real time until the call ends. For a completed call, all turns are replayed immediately followed by an ended event and the stream closes. Your client code doesn’t need to differentiate — just handle the events.


Create outbound call

Initiate an outbound voice call from one of your agent’s phone numbers. The agent’s first assigned phone number is used as the caller ID.

POST /v1/calls

Request body

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| agentId | string | Yes | The agent that will handle the call. Its first assigned phone number is used as caller ID. |
| toNumber | string | Yes | The phone number to call (E.164 format, e.g., "+15559876543") |
| initialGreeting | string or null | No | Optional greeting to speak when the recipient answers |
| voice | string | No | Voice to use for speaking (default: "Polly.Amy") |
| systemPrompt | string or null | No | When provided, uses a built-in LLM for the conversation instead of forwarding to your webhook. |
| fromNumberId | string or null | No | The ID of a specific phone number to use as the caller ID. If omitted, the system selects one of the agent’s attached numbers. |

Example

$ curl -X POST "https://api.agentphone.to/v1/calls" \
    -H "Authorization: Bearer YOUR_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "agentId": "agt_abc123",
      "toNumber": "+15559876543",
      "initialGreeting": "Hi, this is Acme Corp calling about your recent order.",
      "systemPrompt": "You are a friendly support agent from Acme Corp."
    }'

Web calls

Web calls let users talk to your agent directly from a browser, no phone number needed. Use the agentphone-web-sdk npm package on the frontend and mint access tokens from your backend.

How it works:

  1. Your backend calls POST /v1/calls/web with agentId to get an access token.
  2. The token is valid for 30 seconds, so pass it to the frontend immediately.
  3. The frontend uses agentphone-web-sdk to start the call with the token.

POST /v1/calls/web

Request body

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| agentId | string | Yes | The agent to call |
| variables | object | No | Template variables for hosted-mode agents. Referenced in the system prompt as {{var_name}}. |

Response

| Field | Type | Description |
| --- | --- | --- |
| accessToken | string | Short-lived token for the web SDK (expires in 30 seconds) |
| callId | string | The call ID for this session |

Example

$ curl -X POST "https://api.agentphone.to/v1/calls/web" \
    -H "Authorization: Bearer YOUR_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{"agentId": "agt_abc123"}'

The call direction will be web (in addition to inbound and outbound).
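On the backend, the mint step from the flow above might be proxied like this (a sketch using the documented POST /v1/calls/web endpoint; error handling omitted):

```python
import json
import urllib.request

API_BASE = "https://api.agentphone.to"
API_KEY = "YOUR_API_KEY"  # keep this server-side; never ship it to the browser

def build_web_call_request(agent_id: str) -> urllib.request.Request:
    """Build the POST /v1/calls/web request that mints a web-call token."""
    return urllib.request.Request(
        f"{API_BASE}/v1/calls/web",
        data=json.dumps({"agentId": agent_id}).encode(),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

def mint_web_call_token(agent_id: str) -> dict:
    """Return {"accessToken", "callId"}. The token expires in 30 seconds,
    so forward it to the browser immediately."""
    with urllib.request.urlopen(build_web_call_request(agent_id), timeout=10) as resp:
        return json.loads(resp.read())
```

Expose this behind your own authenticated endpoint so the browser never sees your API key, only the short-lived access token.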

List calls for number

List all calls associated with a specific phone number.

GET /v1/numbers/{number_id}/calls

Example

$ curl -X GET "https://api.agentphone.to/v1/numbers/num_xyz789/calls?limit=10" \
    -H "Authorization: Bearer YOUR_API_KEY"