Calls
Voice calls are real-time conversations through your agent’s phone numbers. Calls can be inbound (received) or outbound (initiated via API). Each call includes metadata like duration, status, and transcript. You can also stream transcripts in real time via Server-Sent Events.
How calls are handled depends on your agent’s voice mode.
Voice modes
voiceMode: "webhook" (default) — Caller speech is transcribed and sent to your webhook as agent.message events. Your server controls every response using any LLM, RAG, or custom logic.
voiceMode: "hosted" — Calls are handled end-to-end by a built-in LLM using your systemPrompt. No webhook or server needed.
Switch modes at any time via PATCH /v1/agents/:id. The backend automatically re-provisions voice infrastructure and rebinds phone numbers with no downtime.
SMS is always webhook-based regardless of voice mode.
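Switching the mode is a single PATCH against the agent. A minimal sketch using only the Python standard library; the base URL is a placeholder and Bearer-key authentication is an assumption, so adapt both to your account:

```python
import json
import urllib.request

API_BASE = "https://api.agentphone.example"  # placeholder: your API base URL
API_KEY = "YOUR_API_KEY"                     # assumption: Bearer-key auth

def build_voice_mode_patch(agent_id: str, voice_mode: str) -> urllib.request.Request:
    """Build a PATCH /v1/agents/:id request that switches the voice mode."""
    if voice_mode not in ("webhook", "hosted"):
        raise ValueError("voiceMode must be 'webhook' or 'hosted'")
    return urllib.request.Request(
        f"{API_BASE}/v1/agents/{agent_id}",
        data=json.dumps({"voiceMode": voice_mode}).encode(),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="PATCH",
    )

# Sending it (requires a valid key):
# with urllib.request.urlopen(build_voice_mode_patch("agent_123", "hosted")) as resp:
#     print(json.load(resp))
```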
Call flow (webhook mode)
When voiceMode is "webhook":
Transcript is sent to your webhook
We POST the transcript to your webhook with event: "agent.message" and channel: "voice", including recentHistory for context.
Your server responds
You process the transcript (e.g., send to your LLM) and return a response. We strongly recommend streaming NDJSON — TTS starts speaking on the first chunk.
Call flow (built-in AI mode)
When voiceMode is "hosted":
Voice capabilities
Both modes share the same low-latency voice engine.
Webhook response format
For voice webhooks, your server must return a JSON object ({...}) telling the agent what to say. Non-object responses (numbers, strings, arrays) are ignored and the caller hears silence.
Streaming response (recommended)
Return Content-Type: application/x-ndjson with newline-delimited JSON chunks. TTS starts speaking on the very first chunk while your server continues processing.
Mark interim chunks with "interim": true — the final chunk (without interim) closes the turn. Use this for tool calls, LLM token forwarding, or any time your response takes more than ~1 second.
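On the wire, a streamed turn can look like this (illustrative text; the text and interim fields are the documented ones):

```json
{"text": "Let me check on that.", "interim": true}
{"text": "Your appointment is confirmed for 3 PM tomorrow."}
```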
Simple response
Return a single JSON object for instant replies where no processing delay is expected.
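For example, an instant reply is just one object (illustrative text):

```json
{"text": "Our office is open 9 to 5, Monday through Friday."}
```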
Response fields
Webhook timeout
Voice webhook requests have a 30-second default timeout (configurable from 5–120 seconds per webhook via the timeout field). If your server doesn’t start responding in time, the request is cancelled and the caller hears silence for that turn. This is especially important when your webhook calls external APIs or runs LLM tool calls — always stream an interim chunk immediately so the caller hears something while you process.
Example: streaming handler (Python / FastAPI)
Example: streaming handler (Node.js / Express)
Example: tool-calling handler (Python / Flask)
When your agent needs to call external APIs (databases, calendars, CRM, etc.) during a voice call, always stream an interim filler response first. This prevents the caller from hearing silence while your tools run.
The pattern is: stream an interim acknowledgement immediately → run your tools → stream the final answer.
Example: tool-calling handler (Node.js / Express)
Why interim chunks matter for tool calls
Without the interim chunk, the caller hears dead silence while your LLM decides which tool to call, the external API responds, and the LLM summarises the result. With streaming, they hear “Let me check on that” within milliseconds — just like a human assistant would.
Troubleshooting voice calls
Common issues and how to fix them.
Caller hears silence after speaking
Your webhook is too slow or not responding. Voice webhooks have a 30-second default timeout (configurable per webhook from 5–120 seconds). If your server doesn’t respond in time, the turn is dropped and the caller hears nothing.
Fix: Always stream an interim NDJSON chunk immediately (e.g. {"text": "One moment.", "interim": true}) before doing any slow work. This buys you time while keeping the caller engaged.
Common causes:
- LLM tool calls that take too long (external API latency + LLM processing)
- Cold starts on serverless platforms (Lambda, Cloud Functions)
- Webhook URL is unreachable or returning errors
Caller hears silence after the greeting
Your webhook isn’t configured or isn’t returning a valid JSON object. Voice responses must be a JSON object ({...}). Non-object responses (strings, arrays, numbers) are ignored.
Fix: Verify your webhook is returning {"text": "..."}. Use POST /v1/webhooks/test to confirm your endpoint is reachable and responding correctly.
Response is cut off or sounds garbled
You’re sending the entire response as one large chunk, so TTS can’t start until the whole chunk has arrived.
Fix: Use NDJSON streaming and break responses into natural sentences. Send each sentence as an interim chunk so TTS can start speaking immediately.
Agent speaks XML or code artifacts
Your LLM is including tool-call markup in its response. Some LLMs emit <function_call> or similar tags.
Fix: Strip non-speech content from your LLM output before returning it. AgentPhone removes common patterns automatically, but your webhook should clean responses to be safe.
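A cleaning pass can be a small regex sweep. This is an illustrative sketch; the tag patterns below are examples, not an exhaustive list of what your LLM may emit:

```python
import re

# Remove tool-call markup and leftover tags before sending text to TTS.
_NON_SPEECH = re.compile(
    r"<function_call>.*?</function_call>"  # tool-call markup
    r"|<[^>]+>",                           # any other leftover tags
    re.DOTALL,
)

def clean_for_speech(text: str) -> str:
    """Strip non-speech artifacts and collapse the resulting whitespace."""
    return re.sub(r"\s+", " ", _NON_SPEECH.sub(" ", text)).strip()
```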
Webhook works for SMS but not voice
You’re returning a 200 OK with no body, or a non-JSON response for voice. SMS webhooks only need a 200 status — voice webhooks must return a JSON object with a text field.
Fix: Check the channel field in the webhook payload. For "voice", always return {"text": "..."}. For "sms", a 200 OK is sufficient.
Call recording
Call recording is an optional add-on that saves audio recordings of your voice calls. When enabled, completed calls include a recordingUrl field with a link to the audio file.
Enable recording from the Billing page in the dashboard. See Usage & Billing for pricing.
Recordings are captured automatically for all calls while the add-on is active. If you disable the add-on, existing recordings are preserved but recordingUrl will be null until you re-enable it.
List calls
List all calls for this project.
Query parameters
Example
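A minimal sketch using only the Python standard library. The base URL, Bearer-key auth, the limit parameter, and the data envelope in the response are assumptions; check the parameter table above for the real query options:

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.agentphone.example"  # placeholder: your API base URL
API_KEY = "YOUR_API_KEY"                     # assumption: Bearer-key auth

def list_calls_request(**params) -> urllib.request.Request:
    """Build a GET /v1/calls request; pass query parameters as kwargs."""
    query = f"?{urllib.parse.urlencode(params)}" if params else ""
    return urllib.request.Request(
        f"{API_BASE}/v1/calls{query}",
        headers={"Authorization": f"Bearer {API_KEY}"},
    )

# Sending it (requires a valid key; `limit` and the `data` key are assumptions):
# with urllib.request.urlopen(list_calls_request(limit=20)) as resp:
#     for call in json.load(resp)["data"]:
#         print(call["id"], call["status"])
```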
Get call
Get details of a specific call, including its full transcript.
Example
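A sketch of fetching one call. The GET /v1/calls/:id path is assumed from the list endpoint above; base URL and Bearer-key auth are placeholders:

```python
import json
import urllib.request

API_BASE = "https://api.agentphone.example"  # placeholder
API_KEY = "YOUR_API_KEY"                     # assumption: Bearer-key auth

def get_call_request(call_id: str) -> urllib.request.Request:
    """Build a GET /v1/calls/:id request for one call and its transcript."""
    return urllib.request.Request(
        f"{API_BASE}/v1/calls/{call_id}",
        headers={"Authorization": f"Bearer {API_KEY}"},
    )

# Sending it (requires a valid key):
# with urllib.request.urlopen(get_call_request("call_abc")) as resp:
#     call = json.load(resp)
#     print(call["status"])
```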
Stream transcript (SSE)
Stream a call’s transcript in real time via Server-Sent Events. On connect the server replays all existing transcript turns from the database, then streams new turns as they arrive from the voice engine. Works for both live and completed calls — same URL either way.
The response is an SSE stream (Content-Type: text/event-stream) with the following event types:
A : heartbeat comment is sent every 15 seconds to keep proxies and load balancers from closing the connection.
Event payloads
connected
turn
ended
Example: curl
Example: JavaScript (Node.js / fetch stream)
Example: Python
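A standard-library SSE client might look like the sketch below. The stream path (/v1/calls/:id/stream) is an assumption, so substitute the documented URL; the parser handles the connected, turn, and ended events plus the : heartbeat comments described above:

```python
import json
import urllib.request

API_BASE = "https://api.agentphone.example"  # placeholder
API_KEY = "YOUR_API_KEY"                     # assumption: Bearer-key auth

def parse_sse(lines):
    """Yield (event, payload) pairs from an iterable of decoded SSE lines."""
    event, data = "message", []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(":"):          # heartbeat comment, ignore
            continue
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
        elif line == "" and data:         # blank line ends one event
            yield event, json.loads("\n".join(data))
            event, data = "message", []

def stream_transcript(call_id: str):
    req = urllib.request.Request(
        f"{API_BASE}/v1/calls/{call_id}/stream",  # assumption: stream path
        headers={"Authorization": f"Bearer {API_KEY}"},
    )
    with urllib.request.urlopen(req) as resp:
        for event, payload in parse_sse(line.decode() for line in resp):
            print(event, payload)
            if event == "ended":
                break
```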
Live and completed calls use the same URL
For a live call, existing turns are replayed first, then the stream stays open and delivers new turns in real time until the call ends. For a completed call, all turns are replayed immediately followed by an ended event and the stream closes. Your client code doesn’t need to differentiate — just handle the events.
Create outbound call
Initiate an outbound voice call from one of your agent’s phone numbers. The agent’s first assigned phone number is used as the caller ID.
Request body
Example
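A sketch of initiating the call. The POST /v1/calls path and the agentId/to body fields are assumptions about the request schema; the base URL and Bearer-key auth are placeholders:

```python
import json
import urllib.request

API_BASE = "https://api.agentphone.example"  # placeholder
API_KEY = "YOUR_API_KEY"                     # assumption: Bearer-key auth

def create_outbound_call_request(agent_id: str, to: str) -> urllib.request.Request:
    """Build a POST /v1/calls request (path and field names are assumptions)."""
    return urllib.request.Request(
        f"{API_BASE}/v1/calls",
        data=json.dumps({"agentId": agent_id, "to": to}).encode(),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# with urllib.request.urlopen(create_outbound_call_request("agent_123", "+15555550123")) as resp:
#     print(json.load(resp))
```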
Web calls
Web calls let users talk to your agent directly from a browser, no phone number needed. Use the agentphone-web-sdk npm package on the frontend and mint access tokens from your backend.
How it works:
- Your backend calls POST /v1/calls/web with agentId to get an access token. The token is valid for 30 seconds, so pass it to the frontend immediately.
- The frontend uses agentphone-web-sdk to start the call with the token.
Request body
Response
Example
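A sketch of the backend half, minting the short-lived token. The POST /v1/calls/web path and agentId field come from the flow above; the base URL, Bearer-key auth, and token response key are assumptions:

```python
import json
import urllib.request

API_BASE = "https://api.agentphone.example"  # placeholder
API_KEY = "YOUR_API_KEY"                     # assumption: Bearer-key auth

def mint_web_call_token_request(agent_id: str) -> urllib.request.Request:
    """Build the POST /v1/calls/web request that mints a web-call token."""
    return urllib.request.Request(
        f"{API_BASE}/v1/calls/web",
        data=json.dumps({"agentId": agent_id}).encode(),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# The token is valid for 30 seconds, so return it to the browser immediately
# and let agentphone-web-sdk start the call with it:
# with urllib.request.urlopen(mint_web_call_token_request("agent_123")) as resp:
#     token = json.load(resp)["token"]  # assumption: response key name
```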
Calls started this way have direction "web", in addition to the usual "inbound" and "outbound".
List calls for number
List all calls associated with a specific phone number.
