
Quick Start


This article demonstrates, through a practical example, how to deploy a large language model on HyperAI using vLLM. We will deploy DeepSeek-R1-Distill-Qwen-1.5B, a lightweight model distilled from DeepSeek-R1 with a Qwen base.

Model Introduction

DeepSeek-R1-Distill-Qwen-1.5B is a lightweight Chinese-English bilingual dialogue model:

  • 1.5B parameters; can be deployed on a single GPU
  • Minimum VRAM requirement: 3 GB (1.5B parameters at 2 bytes each in FP16/BF16 is roughly 3 GB of weights)
  • Recommended VRAM: 4 GB or more, leaving headroom for the KV cache and runtime overhead

Development and Testing in Model Training

Create a New Model Training

  • Select RTX 4090 compute power
  • Select vLLM 0.7.2 base image
  • In data binding, select the DeepSeek-R1-Distill-Qwen-1.5B model and bind it to /openbayes/input/input0

Prepare the Startup Script start.sh

After the container starts, prepare the following start.sh script.

start.sh
#!/bin/bash

# Get GPU count
GPU_COUNT=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)

# Choose the port: model training exposes 8080 by default,
# while model deployment exposes 80
PORT=8080
if [ -n "$OPENBAYES_SERVING_PRODUCTION" ]; then
    PORT=80
fi

# Start vLLM service
echo "Starting vLLM service..."
vllm serve /openbayes/input/input0 \
    --served-model-name DeepSeek-R1-Distill-Qwen-1.5B \
    --disable-log-requests \
    --trust-remote-code \
    --host 0.0.0.0 --port $PORT \
    --gpu-memory-utilization 0.98 \
    --max-model-len 8192 --enable-prefix-caching \
    --tensor-parallel-size $GPU_COUNT
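
A few notes on the flags: --gpu-memory-utilization 0.98 lets vLLM claim almost all of the card's VRAM (aggressive, but workable when the GPU is dedicated to this service), --max-model-len 8192 caps the context length to bound KV-cache memory, --enable-prefix-caching reuses KV-cache entries across requests that share a prompt prefix, and --tensor-parallel-size shards the model across every detected GPU.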

Test the Service in the Container

bash start.sh
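
Once the startup log shows the server listening, you can first confirm the OpenAI-compatible API is reachable (the /v1/models endpoint is part of vLLM's OpenAI-compatible server and lists the served model names):

curl http://localhost:8080/v1/models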

Here is an example curl request to test model inference (the prompt asks the model to explain, in Chinese, what a large language model is):

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DeepSeek-R1-Distill-Qwen-1.5B",
    "messages": [
        {
            "role": "user",
            "content": "请用中文解释什么是大语言模型"
        }
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'

Open a new terminal in Jupyter and paste the curl command above to test.
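
A successful call returns an OpenAI-style chat completion. It looks roughly like this (a sketch only; IDs, token counts, and content will differ, and R1-distilled models typically wrap their reasoning in <think> tags):

{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "model": "DeepSeek-R1-Distill-Qwen-1.5B",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "<think>...</think>..."
      },
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": ...,
    "completion_tokens": 100,
    "total_tokens": ...
  }
}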

:::note
Port 8080 is used when testing in model training, but the service automatically switches to port 80 in model deployment, because HyperAI's model deployment serves external traffic on port 80.
:::

Deploy Model Service

After completing model training and testing, you can convert the model into an accessible deployment service through the following two methods:

Method 1: One-Click Deployment

HyperAI provides a convenient "one-click deployment" feature that quickly converts a model training into a model deployment service without repeating the configuration.

  1. On the model training details page, find the "One-Click Deployment" button in the upper right corner
  2. Confirm the deployment configuration information (the system will automatically inherit the training container's configuration)
  3. Click "Confirm Deployment", and the system will automatically create the corresponding model deployment service

Deployment Configuration Confirmation

The system will automatically inherit the following configurations from the original training container:

  • Computing resource settings
  • Base image
  • Workspace data
  • Data binding relationships

You can adjust these configurations as needed on the confirmation page.

Deployment Success

After submission, the system will automatically create the model deployment and start the service. Upon success, it will redirect to the model deployment details page, where you can immediately use the online testing tool to verify the interface.

Method 2: Manually Create Model Deployment

If you need to configure the deployment environment more flexibly, or want to create a model deployment from scratch, you can manually create it by following these steps:

Select Computing Power - Select Image - Bind Data

  • Select RTX 4090 computing power
  • Select vLLM 0.7.2 base image
  • In data binding, select the DeepSeek-R1-Distill-Qwen-1.5B model and bind it to /openbayes/input/input0
  • Bind the workspace from the previous container to /openbayes/home

Launch Deployment

Click "Deploy" and wait for the model deployment to change to running status.

Click on the running model deployment version to view the current deployment details and logs.

Online Testing

On the model deployment details page, HyperAI provides an online testing tool that lets you compose and send HTTP requests directly in the browser, so you can test model interfaces quickly without a local command line or third-party tools.

You can:

  • Select request method (such as GET, POST)
  • Fill in interface path and parameters
  • Customize request headers and request body (supports JSON and other formats)
  • Send requests with one click and view the response body and headers in real time
  • Enable streaming output to experience a large model's streaming inference

GET Request Example

Used to retrieve model information or perform a health check. Select the GET method, fill in the interface path (such as /v1/models), and click send to view the model list or status information.
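
For reference, the equivalent request from the command line (replace the placeholder with your deployment's URL, shown on the deployment page):

curl http://<model deployment url>/v1/models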

POST Request Example

Used to converse with the large language model. Select the POST method, fill in the path /v1/chat/completions, enter the conversation content in the request body (as shown below; the "model" field must match the name passed to --served-model-name), and click send to see the model's response.

{
  "model": "qwen3-32b",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "What's the weather like in Beijing?"
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The city and state, e.g., San Francisco, CA"
            },
            "unit": {
              "type": "string",
              "enum": ["celsius", "fahrenheit"],
              "description": "The temperature unit to use"
            }
          },
          "required": ["location"]
        }
      }
    }
  ],
  "tool_choice": "auto"
}
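
If the model decides to call the function, the message in the response carries a tool_calls field instead of plain content, roughly in the following shape (a sketch of the OpenAI-compatible format; whether a model actually emits tool calls depends on the model itself and on vLLM being started with tool calling enabled, e.g. --enable-auto-tool-choice plus a matching --tool-call-parser):

{
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": null,
        "tool_calls": [
          {
            "id": "call_...",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": "{\"location\": \"Beijing\", \"unit\": \"celsius\"}"
            }
          }
        ]
      },
      "finish_reason": "tool_calls"
    }
  ]
}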

Streaming Call Example

Used to experience the streaming inference of large models. Simply add "stream": true to the POST request body; after sending the request you will see the model's output appear progressively in real time, which suits scenarios that consume results while they are being generated.

{
  "model": "qwen3-32b",
  "stream": true,
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "How is the weather in Beijing?"
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The city and state, e.g., San Francisco, CA"
            },
            "unit": {
              "type": "string",
              "enum": ["celsius", "fahrenheit"],
              "description": "The temperature unit to use"
            }
          },
          "required": ["location"]
        }
      }
    }
  ],
  "tool_choice": "auto"
}

Command Line Testing

If you prefer command line tools (such as curl), you can test the interface as follows:

On the model deployment page you can see the URL that HyperAI generates for the deployment. Copy it and use the following command to check that the model is available (the sample prompt asks the model, in Chinese, to introduce itself):

curl -X POST http://<model deployment url>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DeepSeek-R1-Distill-Qwen-1.5B",
    "messages": [
        {
            "role": "user",
            "content": "你好,请介绍一下自己"
        }
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'
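
To test streaming from the command line, add "stream": true to the body and pass curl's -N flag (which disables output buffering); the server responds with server-sent events, one data: chunk at a time, terminated by data: [DONE]. A minimal sketch, with the deployment URL left as a placeholder:

curl -N -X POST http://<model deployment url>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DeepSeek-R1-Distill-Qwen-1.5B",
    "stream": true,
    "messages": [
        {
            "role": "user",
            "content": "How is the weather in Beijing?"
        }
    ]
  }'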

Next Steps