vllm.v1.utils ¶
APIServerProcessManager ¶
Manages a group of API server processes.
Handles creation, monitoring, and termination of API server worker processes. Also monitors extra processes to check if they are healthy.
Source code in vllm/v1/utils.py
__init__ ¶
__init__(
listen_address: str,
sock: Any,
args: Namespace,
num_servers: int,
input_addresses: list[str],
output_addresses: list[str],
target_server_fn: Callable | None = None,
stats_update_address: str | None = None,
tensor_queue: Queue | None = None,
)
Initialize and start API server worker processes.
input_addresses/output_addresses may contain tcp://host:0 placeholders. Each child must report the actual bound endpoint over the actual_address_pipe in its client_config, and the parent collects the reported endpoints via gather_actual_addresses.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `target_server_fn` | `Callable \| None` | Override function to call for each API server process | `None` |
| `listen_address` | `str` | Address to listen for client connections | required |
| `sock` | `Any` | Socket for client connections | required |
| `args` | `Namespace` | Command line arguments | required |
| `num_servers` | `int` | Number of API server processes to start | required |
| `input_addresses` | `list[str]` | Input addresses for each API server | required |
| `output_addresses` | `list[str]` | Output addresses for each API server | required |
| `stats_update_address` | `str \| None` | Optional stats update address | `None` |
| `tensor_queue` | `Queue \| None` | Optional tensor IPC queue for sharing MM tensors | `None` |
gather_actual_addresses ¶
gather_actual_addresses(
timeout: float = VLLM_ENGINE_READY_TIMEOUT_S,
) -> tuple[list[str], list[str]]
Return (inputs, outputs) reported by each child, indexed by client_index. Raises RuntimeError on timeout or premature child exit.
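The collection step described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the vLLM implementation: `gather_addresses` and `demo` are hypothetical names, each child is assumed to send a single `(input, output)` tuple over its own one-way pipe, and the position of a pipe in the list stands in for `client_index`.

```python
import time
from multiprocessing import Pipe
from multiprocessing.connection import Connection, wait


def gather_addresses(
    readers: list[Connection], timeout: float
) -> tuple[list[str], list[str]]:
    """Collect one (input, output) pair per child, ordered by list position."""
    deadline = time.monotonic() + timeout
    results: dict[int, tuple[str, str]] = {}
    pending = {reader: index for index, reader in enumerate(readers)}
    while pending:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise RuntimeError(f"timed out waiting for {len(pending)} child(ren)")
        # wait() returns the readers that have data or have hit end-of-file.
        for reader in wait(list(pending), timeout=remaining):
            index = pending.pop(reader)
            try:
                results[index] = reader.recv()
            except EOFError:
                # The write end closed without sending: the child exited early.
                raise RuntimeError(f"child {index} exited before reporting") from None
    inputs = [results[i][0] for i in range(len(readers))]
    outputs = [results[i][1] for i in range(len(readers))]
    return inputs, outputs


def demo() -> tuple[list[str], list[str]]:
    pairs = [Pipe(duplex=False) for _ in range(2)]  # (reader, writer) per child
    readers = [reader for reader, _ in pairs]
    # Children may report in any order; here index 1 reports first.
    pairs[1][1].send(("tcp://127.0.0.1:5001", "tcp://127.0.0.1:6001"))
    pairs[0][1].send(("tcp://127.0.0.1:5000", "tcp://127.0.0.1:6000"))
    return gather_addresses(readers, timeout=1.0)
```

Tracking a single deadline, rather than giving every child the full timeout, keeps the total wait bounded by `timeout` no matter how many children are pending.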
shutdown ¶
shutdown(timeout: float | None = None) -> None
Shut down the API server processes, with a configurable timeout.
CpuGpuBuffer ¶
Buffer to easily copy tensors between CPU and GPU.
copy_to_cpu ¶
NOTE: Because this method is non-blocking, explicit synchronization is needed to ensure the data is copied to CPU.
RustFrontendProcessManager ¶
Manages a single Rust frontend subprocess.
Launches the Rust vllm-rs binary in 'frontend' mode, passing the listening socket fd and ZMQ transport addresses. Provides the same interface as APIServerProcessManager for process monitoring.
_SubprocessWrapper ¶
Wraps subprocess.Popen to provide the BaseProcess-like interface needed by wait_for_completion_or_failure.
_shutdown_subprocesses ¶
_shutdown_subprocesses(
procs: list[_SubprocessWrapper],
timeout: float | None = None,
) -> None
Shut down subprocess wrappers (mirrors the shutdown() function).
compute_iteration_details ¶
Compute the number of context/generation requests and tokens for the current iteration's scheduler output. A request is regarded as a context request if it has not yet produced any output tokens; a later chunk of a chunked prefill therefore also falls into this category.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `scheduler_output` | `SchedulerOutput` | The scheduler output for the current iteration. | required |
Returns:
| Type | Description |
|---|---|
| `IterationDetails` | An IterationDetails object containing the number of context/generation requests and tokens. |
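The classification rule can be illustrated with a small sketch. The data classes below are hypothetical stand-ins: the real `SchedulerOutput` and `IterationDetails` fields are not shown on this page, so `ScheduledRequest`, `IterationCounts`, and all field names are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ScheduledRequest:
    """Hypothetical, simplified view of one scheduled request."""
    num_output_tokens: int     # tokens generated so far
    num_scheduled_tokens: int  # tokens scheduled in this iteration


@dataclass
class IterationCounts:
    """Hypothetical stand-in for the returned details object."""
    num_ctx_requests: int = 0
    num_ctx_tokens: int = 0
    num_generation_requests: int = 0
    num_generation_tokens: int = 0


def count_iteration(requests: list[ScheduledRequest]) -> IterationCounts:
    counts = IterationCounts()
    for req in requests:
        if req.num_output_tokens == 0:
            # No output yet: still in prefill, including later prefill chunks.
            counts.num_ctx_requests += 1
            counts.num_ctx_tokens += req.num_scheduled_tokens
        else:
            counts.num_generation_requests += 1
            counts.num_generation_tokens += req.num_scheduled_tokens
    return counts


example = count_iteration([
    ScheduledRequest(num_output_tokens=0, num_scheduled_tokens=512),  # first prefill chunk
    ScheduledRequest(num_output_tokens=0, num_scheduled_tokens=256),  # later prefill chunk
    ScheduledRequest(num_output_tokens=7, num_scheduled_tokens=1),    # decoding
])
```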
copy_slice ¶
Copy the first length elements of a tensor into another tensor in a non-blocking manner.
Used to copy pinned CPU tensor data to pre-allocated GPU tensors.
Returns the sliced target tensor.
get_engine_client_zmq_addr ¶
Return an IPC path (local_only=True) or tcp://host:port.
port=0 lets the kernel assign the port at bind() time; the caller must recover it via getsockopt(zmq.LAST_ENDPOINT).
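The port-0 pattern is not specific to ZMQ. As a rough analogue using only the standard library, the sketch below binds a plain TCP socket to port 0 and reads the assigned port back with `getsockname()`, which plays the role that `getsockopt(zmq.LAST_ENDPOINT)` plays for a ZMQ socket. `bind_ephemeral` is a hypothetical helper, not part of vLLM.

```python
import socket


def bind_ephemeral(host: str = "127.0.0.1") -> tuple[socket.socket, str]:
    """Bind to a kernel-assigned port and return the socket plus its address."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))  # port 0: the kernel picks a free port at bind() time
    actual_host, actual_port = sock.getsockname()
    return sock, f"tcp://{actual_host}:{actual_port}"


sock, address = bind_ephemeral()
print(address)  # tcp://127.0.0.1:<kernel-assigned port>
sock.close()
```

Because the port is only known after `bind()`, whoever binds the socket has to publish the resolved address to its peers; the placeholder address itself is not connectable.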
report_usage_stats ¶
Report usage statistics if enabled.
run_api_server_worker_proc ¶
run_api_server_worker_proc(
listen_address,
sock,
args,
client_config=None,
**uvicorn_kwargs,
) -> None
Entrypoint for individual API server worker processes.
shutdown ¶
Shut down processes with a timeout.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `procs` | `list[BaseProcess]` | List of processes to shut down | required |
| `timeout` | `float \| None` | Maximum time in seconds to wait for graceful shutdown | `None` |
tensor_data ¶
tensor_data(tensor: Tensor) -> memoryview
Get the raw data of a tensor as a uint8 memoryview, useful for serializing and hashing.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `tensor` | `Tensor` | The input tensor. | required |
Returns:
| Type | Description |
|---|---|
| `memoryview` | A memoryview of the tensor data as uint8. |
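To show why a uint8 view is convenient for hashing and serialization, here is a standard-library analogue that uses `array.array` in place of a tensor. It does not reproduce how `tensor_data` is implemented; `raw_bytes_view` is a hypothetical helper that only demonstrates the idea of reinterpreting typed data as unsigned bytes without copying.

```python
import array
import hashlib


def raw_bytes_view(values: array.array) -> memoryview:
    """Reinterpret the array's buffer as unsigned bytes without copying."""
    return memoryview(values).cast("B")


floats = array.array("f", [1.0, 2.0, 3.0])
view = raw_bytes_view(floats)

# Any API that accepts bytes-like objects can consume the view directly.
digest = hashlib.sha256(view).hexdigest()
print(len(view), view.format)
```

The view shares memory with the underlying buffer, so it stays cheap for large inputs but reflects any later in-place modification of the source data.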
wait_for_completion_or_failure ¶
wait_for_completion_or_failure(
api_server_manager: APIServerProcessManager
| RustFrontendProcessManager,
engine_manager: Union[
CoreEngineProcManager, CoreEngineActorManager
]
| None = None,
coordinator: DPCoordinator | None = None,
) -> None
Wait for all processes to complete or detect if any fail.
Raises an exception if any process exits with a non-zero status.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `api_server_manager` | `APIServerProcessManager \| RustFrontendProcessManager` | The manager for API servers. | required |
| `engine_manager` | `Union[CoreEngineProcManager, CoreEngineActorManager] \| None` | The manager for engine processes. If CoreEngineProcManager, it manages local engines; if CoreEngineActorManager, it manages all engines. | `None` |
| `coordinator` | `DPCoordinator \| None` | The coordinator for data parallel. | `None` |
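The documented contract (return when everything finishes, raise as soon as something exits with a non-zero status) can be sketched as a simple polling loop. This is a minimal illustration with `subprocess.Popen` objects and is not how vLLM monitors its managers; `wait_all_or_fail` and `spawn_exit` are hypothetical names, and cleaning up the surviving processes after a failure is left to the caller.

```python
import subprocess
import sys
import time


def wait_all_or_fail(
    procs: dict[str, subprocess.Popen], poll_interval: float = 0.05
) -> None:
    """Return once every process exits with 0; raise on the first failure."""
    pending = dict(procs)
    while pending:
        for name, proc in list(pending.items()):
            code = proc.poll()
            if code is None:
                continue  # still running
            if code != 0:
                raise RuntimeError(f"process {name} exited with code {code}")
            del pending[name]
        if pending:
            time.sleep(poll_interval)


def spawn_exit(code: int) -> subprocess.Popen:
    """Start a child that exits immediately with the given status."""
    return subprocess.Popen([sys.executable, "-c", f"import sys; sys.exit({code})"])
```

Polling keeps the example portable; an implementation built on `multiprocessing` processes could instead block on their `sentinel` handles to avoid the sleep.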