Deploy AI21’s Jamba models using vLLM in your own environment
This guide walks you through self-deploying AI21’s Jamba models in your own infrastructure. Choose the deployment method that best fits your needs.
We recommend using vLLM versions v0.6.5 to v0.8.5.post1 for optimal performance and compatibility.
Create a Python virtual environment and install the vLLM package (version ≥0.6.5, ≤0.8.5.post1) to ensure maximum compatibility with all Jamba models.
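A minimal sketch of this step, assuming a Unix-like shell; the environment name jamba-vllm is arbitrary:

```bash
# Create and activate an isolated Python environment
python -m venv jamba-vllm
source jamba-vllm/bin/activate

# Install vLLM within the recommended version range
pip install "vllm>=0.6.5,<=0.8.5.post1"
```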
Authenticate on the HuggingFace Hub using your access token $HF_TOKEN:
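For example, using the Hugging Face CLI (it ships with the huggingface_hub package that vLLM pulls in; otherwise install it with pip first):

```bash
# Log in to the HuggingFace Hub with your access token
huggingface-cli login --token $HF_TOKEN
```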
Launch the vLLM server for API-based inference. The steps below are identical for every Jamba variant; only the HuggingFace model identifier changes.
Start the vLLM server:
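A minimal sketch, using Jamba Mini 1.6 as an example model identifier and vLLM's default port 8000:

```bash
# Serve a Jamba model through vLLM's OpenAI-compatible API server
vllm serve ai21labs/AI21-Jamba-Mini-1.6 \
    --host 0.0.0.0 \
    --port 8000
```

Larger Jamba variants generally need more than one GPU; add --tensor-parallel-size to match your hardware.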
Test the API:
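For instance, with curl against the OpenAI-compatible chat completions endpoint (the model field must match the identifier you served):

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai21labs/AI21-Jamba-Mini-1.6",
    "messages": [{"role": "user", "content": "What are the benefits of hybrid SSM-Transformer models?"}],
    "max_tokens": 100
  }'
```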
In offline mode, vLLM loads the model directly in your Python process to run one-off, standalone batch inference without an API server.
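A minimal Python sketch of offline batch inference; the model identifier and prompts are illustrative:

```python
# Offline (batch) inference with vLLM -- no API server involved
from vllm import LLM, SamplingParams

# Example Jamba identifier; substitute the variant you want to run
llm = LLM(model="ai21labs/AI21-Jamba-Mini-1.6")
sampling_params = SamplingParams(temperature=0.7, max_tokens=100)

prompts = [
    "Summarize the advantages of long-context language models.",
    "Write a short product description for a note-taking app.",
]

# Generate completions for the whole batch in one call
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```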
For containerized deployment, use vLLM’s official Docker image to run an inference server (refer to the vLLM Docker documentation for comprehensive details).
Pull the Docker image
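For example:

```bash
# Pull the official vLLM OpenAI-compatible server image
docker pull vllm/vllm-openai:latest
```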
Run the container
Launch vLLM in server mode with your chosen model:
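A sketch of the run command, assuming an NVIDIA GPU host and Jamba Mini 1.6 as the example model; mounting the HuggingFace cache avoids re-downloading weights on restart:

```bash
docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model ai21labs/AI21-Jamba-Mini-1.6
```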
Once the container is up and in a healthy state, you can test inference using the same code samples as in the "Online Inference (Server Mode)" section.
If you prefer to use your own storage for model weights, you can download them from your self-hosted storage (e.g., AWS S3, Google Cloud Storage), mount the local path into the container using -v /path/to/model:/mnt/model/, and pass --model="/mnt/model/" instead of the HuggingFace model identifier.
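A sketch of the same run command with locally stored weights mounted into the container:

```bash
docker run --runtime nvidia --gpus all \
    -p 8000:8000 \
    --ipc=host \
    -v /path/to/model:/mnt/model/ \
    vllm/vllm-openai:latest \
    --model="/mnt/model/"
```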
Deploy on AWS, Google Cloud, or Azure for production workloads
Optimize performance and resolve common deployment issues
Learn about the complete API interface and parameters