vllm.utils ¶
Modules:
| Name | Description |
|---|---|
| argparse_utils | Argument parsing utilities for vLLM. |
| async_utils | Contains helpers related to asynchronous code. |
| cache | |
| collection_utils | Contains helpers that are applied to collections. |
| deep_gemm | Compatibility wrapper for DeepGEMM API changes. |
| flashinfer | Compatibility wrapper for FlashInfer API changes. |
| func_utils | Contains helpers that are applied to functions. |
| gc_utils | |
| hashing | |
| import_utils | Contains helpers related to importing modules. |
| jsontree | Helper functions to work with nested JSON structures. |
| math_utils | Math utility functions for vLLM. |
| mem_constants | |
| mem_utils | |
| nccl | |
| network_utils | |
| platform_utils | |
| profiling | |
| serial_utils | |
| system_utils | |
| tensor_schema | |
| torch_utils | |
MULTIMODAL_MODEL_MAX_NUM_BATCHED_TOKENS module-attribute ¶
POOLING_MODEL_MAX_NUM_BATCHED_TOKENS module-attribute ¶
_DEPRECATED_MAPPINGS module-attribute ¶
```python
_DEPRECATED_MAPPINGS = {
    "cprofile": "profiling",
    "cprofile_context": "profiling",
    "get_open_port": "network_utils",
}
```
AtomicCounter ¶
An atomic, thread-safe counter.
Source code in vllm/utils/__init__.py
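A minimal sketch of such a counter; the `inc`/`value` API shown here is an assumption for illustration, not necessarily vLLM's exact interface:

```python
import threading

class AtomicCounter:
    """A counter whose increments are serialized by a lock."""

    def __init__(self, initial: int = 0) -> None:
        self._value = initial
        self._lock = threading.Lock()

    def inc(self, num: int = 1) -> int:
        # The lock makes the read-modify-write atomic across threads.
        with self._lock:
            self._value += num
            return self._value

    @property
    def value(self) -> int:
        return self._value
```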
Counter ¶
Device ¶
LayerBlockType ¶
__dir__ ¶
__getattr__ ¶
Module-level getattr to handle deprecated utilities.
Source code in vllm/utils/__init__.py
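Together with `_DEPRECATED_MAPPINGS` above, a module-level `__getattr__` can lazily redirect old top-level names to their new submodules. A minimal sketch; the warning message and import mechanics are illustrative, not vLLM's exact code:

```python
import importlib
import warnings

_DEPRECATED_MAPPINGS = {
    "cprofile": "profiling",
    "cprofile_context": "profiling",
    "get_open_port": "network_utils",
}

def __getattr__(name: str):
    # Resolve deprecated top-level names to their new submodules, with a warning.
    if name in _DEPRECATED_MAPPINGS:
        submodule = _DEPRECATED_MAPPINGS[name]
        warnings.warn(
            f"vllm.utils.{name} has moved to vllm.utils.{submodule}.{name}",
            DeprecationWarning,
            stacklevel=2,
        )
        return getattr(importlib.import_module(f"vllm.utils.{submodule}"), name)
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
```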
_maybe_force_spawn ¶
Check if we need to force the use of the spawn multiprocessing start method.
Source code in vllm/utils/__init__.py
check_use_alibi ¶
```python
check_use_alibi(model_config: ModelConfig) -> bool
```
Source code in vllm/utils/__init__.py
enable_trace_function_call_for_thread ¶
```python
enable_trace_function_call_for_thread(
    vllm_config: VllmConfig,
) -> None
```
Set up function tracing for the current thread, if enabled via the VLLM_TRACE_FUNCTION environment variable.
Source code in vllm/utils/__init__.py
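A hedged usage sketch; treating `"1"` as the enabling value is an assumption here, so check `vllm.envs` for the exact semantics:

```python
import os

# Must be set before worker threads start; "1" as the "on" value is an assumption.
os.environ["VLLM_TRACE_FUNCTION"] = "1"

from vllm.config import VllmConfig
from vllm.utils import enable_trace_function_call_for_thread

def thread_entry(vllm_config: VllmConfig) -> None:
    # Tracing is set up per thread, so each new thread must opt in itself.
    enable_trace_function_call_for_thread(vllm_config)
    ...  # thread work runs with function tracing active
```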
get_mp_context ¶
Get a multiprocessing context with a particular method (spawn or fork). By default, we follow the value of the VLLM_WORKER_MULTIPROC_METHOD environment variable to determine the multiprocessing method (default is fork). However, under certain conditions, we may enforce spawn and override the value of VLLM_WORKER_MULTIPROC_METHOD.
Source code in vllm/utils/__init__.py
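For illustration, a typical call site might look like the following; the worker function and process wiring are hypothetical:

```python
from vllm.utils import get_mp_context

def worker() -> None:
    print("worker started")

if __name__ == "__main__":
    ctx = get_mp_context()  # spawn or fork, per VLLM_WORKER_MULTIPROC_METHOD
    proc = ctx.Process(target=worker)
    proc.start()
    proc.join()
```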
import_pynvml ¶
Historical comments:

libnvml.so is the library behind nvidia-smi, and pynvml is a Python wrapper around it. We use it to query GPU status without initializing a CUDA context in the current process.

Historically, two packages have provided a pynvml module:

- nvidia-ml-py (https://pypi.org/project/nvidia-ml-py/): the official wrapper. It is a dependency of vLLM and is installed when users install vLLM. It provides a Python module named pynvml.
- pynvml (https://pypi.org/project/pynvml/): an unofficial wrapper. Prior to version 12.0, it also provided a Python module named pynvml and therefore conflicted with the official one. Worse, its module is a Python package, which takes higher import priority than the official one (a standalone Python file), so having both installed causes errors. Starting from version 12.0, it migrated to a new module named pynvml_utils to avoid the conflict.

The situation is confusing enough that many packages in the community use the unofficial wrapper by mistake, and we have to handle this case. For example, nvcr.io/nvidia/pytorch:24.12-py3 uses the unofficial one, which causes errors; see https://github.com/vllm-project/vllm/issues/12847 for an example. After all this trouble, we decided to copy the official pynvml module into our codebase and use it directly.
Source code in vllm/utils/__init__.py
init_cached_hf_modules ¶
kill_process_tree ¶
```python
kill_process_tree(pid: int)
```
Kills all descendant processes of the given pid by sending SIGKILL.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| pid | int | Process ID of the parent process | required |
Source code in vllm/utils/__init__.py
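A plausible implementation sketch; using psutil for the process-tree walk is an assumption, though the documented behavior (SIGKILL for all descendants) is from the source:

```python
import contextlib
import os
import signal

import psutil  # assumed dependency for enumerating child processes

def kill_process_tree(pid: int) -> None:
    # Enumerate all descendants first, SIGKILL each, then kill the root pid.
    try:
        parent = psutil.Process(pid)
    except psutil.NoSuchProcess:
        return
    for child in parent.children(recursive=True):
        with contextlib.suppress(ProcessLookupError):
            os.kill(child.pid, signal.SIGKILL)
    with contextlib.suppress(ProcessLookupError):
        os.kill(pid, signal.SIGKILL)
```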
length_from_prompt_token_ids_or_embeds ¶
```python
length_from_prompt_token_ids_or_embeds(
    prompt_token_ids: list[int] | None,
    prompt_embeds: Tensor | None,
) -> int
```
Calculate the request length (in number of tokens) given either prompt_token_ids or prompt_embeds.
Source code in vllm/utils/__init__.py
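One plausible implementation under the documented contract; the precedence when both inputs are given, and the error case, are assumptions:

```python
import torch

def length_from_prompt_token_ids_or_embeds(
    prompt_token_ids: list[int] | None,
    prompt_embeds: torch.Tensor | None,
) -> int:
    # Token ids take precedence; otherwise the embeds' leading dim is the length.
    if prompt_token_ids is not None:
        return len(prompt_token_ids)
    if prompt_embeds is not None:
        return prompt_embeds.shape[0]
    raise ValueError("Either prompt_token_ids or prompt_embeds must be set")
```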
run_method ¶
```python
run_method(
    obj: Any,
    method: str | bytes | Callable,
    args: tuple[Any],
    kwargs: dict[str, Any],
) -> Any
```
Run a method of an object with the given arguments and keyword arguments. If the method is a string, it is resolved to a method on the object using getattr. If the method is serialized bytes, it is deserialized using cloudpickle. If the method is a callable, it is called directly.
Source code in vllm/utils/__init__.py
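A sketch of the dispatch described above; whether obj is prepended as the first argument in the bytes and callable cases is an assumption:

```python
from collections.abc import Callable
from typing import Any

import cloudpickle

def run_method(
    obj: Any,
    method: str | bytes | Callable,
    args: tuple[Any],
    kwargs: dict[str, Any],
) -> Any:
    if isinstance(method, str):
        # A name is looked up on the object via getattr.
        return getattr(obj, method)(*args, **kwargs)
    if isinstance(method, bytes):
        # Bytes hold a cloudpickle-serialized callable; obj is passed explicitly.
        return cloudpickle.loads(method)(obj, *args, **kwargs)
    # Already a callable: invoke directly.
    return method(obj, *args, **kwargs)
```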
set_ulimit ¶
Source code in vllm/utils/__init__.py
warn_for_unimplemented_methods ¶
A replacement for abc.ABC. With abc.ABC, subclasses fail to instantiate if they do not implement all abstract methods. Here, we only require the base class to raise NotImplementedError in its methods, and we log a warning if a method is not implemented in the subclass.
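A rough sketch of the idea, not vLLM's exact mechanism: detect base methods that raise NotImplementedError via a simple code-names heuristic, then warn at instantiation time when a subclass leaves one of them unoverridden:

```python
import logging

logger = logging.getLogger(__name__)

def _raises_not_implemented(fn) -> bool:
    # Heuristic: treat a base method as "abstract" if its body references
    # NotImplementedError (e.g. `raise NotImplementedError`).
    code = getattr(fn, "__code__", None)
    return code is not None and "NotImplementedError" in code.co_names

def warn_for_unimplemented_methods(base_cls):
    abstract = {
        name for name, attr in vars(base_cls).items()
        if callable(attr) and _raises_not_implemented(attr)
    }
    original_init = base_cls.__init__

    def wrapped_init(self, *args, **kwargs):
        original_init(self, *args, **kwargs)
        if type(self) is base_cls:
            return
        for name in abstract:
            # Overriding replaces the class attribute, so identity differs.
            if getattr(type(self), name) is getattr(base_cls, name):
                logger.warning("%s has not implemented %s",
                               type(self).__name__, name)

    base_cls.__init__ = wrapped_init
    return base_cls
```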