Model Deployment Management
Model Deployment Version Status
- Preparing: Starting up
- Running: Started; you can view its running logs
- Shutting Down: Shutting down and cleaning up
- Stopped: This service version has been shut down successfully
- Error: Shut down abnormally; you can check the logs for the specific cause
Model deployment versions in the Error or Stopped status are considered inactive and can be restarted or permanently deleted.
Model Deployment Monitoring Metrics
Model deployment supports two types of data monitoring.
System Metrics Monitoring
System Monitoring Metrics Description
The system monitoring interface mainly displays three categories of important resource usage metrics:
CPU Utilization
- Divided into user mode (user) and system mode (system) usage
- Displays the configured total number of CPU cores
- Shows CPU utilization trends through time series curves
Memory Usage
- Displays currently used memory and total memory capacity
- Calculates memory usage percentage
- Chart shows real-time changes in memory usage
GPU Usage
- Monitors the usage of each GPU device
- Includes two key metrics:
  - util: GPU compute unit utilization
  - vram: GPU memory usage
 
- Hovering the mouse shows specific values at a point in time:
  - Core utilization percentage
  - Memory usage
 
These metrics are displayed intuitively in chart form, helping users monitor service running status and resource usage trends, facilitating timely detection of potential performance issues.
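For a quick cross-check of the GPU numbers shown in the dashboard, you can query the same figures from inside a running instance. A minimal sketch, assuming the instance gives you shell access and has `nvidia-smi` available (the util and vram metrics in the charts correspond to the utilization and memory figures reported here):

```python
import subprocess

# Query per-GPU utilization and memory with nvidia-smi; these correspond to
# the "util" and "vram" metrics in the system monitoring charts.
# Assumption: nvidia-smi is available inside the deployment instance.
result = subprocess.run(
    [
        "nvidia-smi",
        "--query-gpu=index,utilization.gpu,memory.used,memory.total",
        "--format=csv,noheader,nounits",
    ],
    capture_output=True,
    text=True,
    check=True,
)
for line in result.stdout.strip().splitlines():
    idx, util, used, total = (field.strip() for field in line.split(","))
    print(f"GPU {idx}: util={util}%  vram={used}/{total} MiB")
```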
Request Usage Monitoring
The request monitoring interface displays service request processing performance metrics, including the following key dimensions:
Request Count Statistics
- Displays time distribution of request count in bar chart format
- Allows selection of different time intervals (5 minutes, 15 minutes, 30 minutes, etc.) to view statistical data
Request Time Statistics
Displays request processing time metrics across different dimensions through multiple curves:
- avg: Average response time
- max: Maximum response time
- Percentile statistics (see the sketch after this list):
  - p50: 50th percentile response time
  - p75: 75th percentile response time
  - p90: 90th percentile response time
  - p95: 95th percentile response time
  - p99: 99th percentile response time
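These statistics summarize the distribution of per-request response times within each interval. As an illustration only (the platform computes them server-side), a minimal sketch of how avg, max, and percentile values can be derived from a list of request durations:

```python
# Illustrative only: compute avg/max/percentile statistics from a list of
# per-request response times (in milliseconds), mirroring the curves shown
# in the request monitoring charts.
import statistics

response_times_ms = [12.4, 15.1, 9.8, 40.2, 13.3, 11.0, 95.7, 14.2, 10.5, 18.9]

def percentile(samples, p):
    """Return the p-th percentile using a simple nearest-rank rule."""
    ordered = sorted(samples)
    rank = max(0, min(len(ordered) - 1, round(p / 100 * (len(ordered) - 1))))
    return ordered[rank]

print("avg:", statistics.mean(response_times_ms))
print("max:", max(response_times_ms))
for p in (50, 75, 90, 95, 99):
    print(f"p{p}:", percentile(response_times_ms, p))
```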
 
Data Viewing Methods
- Hovering the mouse shows detailed metric data at specific time points
- Supports viewing request statistics within specified time ranges
- Can show/hide specific metrics through legend selection
- The upper-right corner displays the total request count and other statistics for the currently selected time period
These metrics help users comprehensively understand service performance, and can be used for:
- Evaluating service response time
- Identifying performance bottlenecks
- Analyzing request processing capacity
- Monitoring service quality
API Key Management
By default, model deployment services are publicly accessible and can be called without authentication. You can enable access control on the model deployment settings page to activate the API Key authentication mechanism:

After enabling access control, model deployment supports two types of API Keys for access authentication:
- User-level API Key - Can access all model deployments under the user
- Model deployment-level API Key - Can only access one or more specified model deployments
You can manage API Keys on the model deployment settings page:

:::caution Note
API Key creation, update, or deletion operations may take up to 5 minutes to take effect.
:::
For detailed management and usage instructions for API Keys, please refer to API Key Management.
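For reference, a minimal sketch of calling a deployment with an API Key from a script is shown below. The endpoint URL, request path, and JSON payload are placeholders, and the sketch assumes the key is sent as a Bearer token in the Authorization header; confirm the exact header and key format in API Key Management.

```python
import requests

# Placeholders: replace with your deployment's actual URL, path, and payload.
DEPLOYMENT_URL = "https://<your-model-deployment-endpoint>"
API_KEY = "<your-api-key>"  # user-level or deployment-level key

# Assumption: the key is passed as a Bearer token; confirm the exact scheme
# in the API Key Management documentation.
headers = {"Authorization": f"Bearer {API_KEY}"}

response = requests.post(
    f"{DEPLOYMENT_URL}/predict",  # hypothetical path for illustration
    headers=headers,
    json={"input": "hello"},
    timeout=30,
)
response.raise_for_status()
print(response.json())
```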
Online Testing
Each model deployment page has a built-in "Online Testing Tool" that supports debugging APIs directly in the browser. You can:
- Select request method (GET/POST)
- Fill in request path and parameters
- Use streaming output and multiple Content-Types
- View response content and response headers in real-time
This makes it easy to quickly validate model services and debug API parameters. See the "Version Testing" section on the deployment details page.
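If you prefer to debug from a script instead of the browser tool, a minimal sketch is shown below. The URL, path, and JSON body are placeholders; the streaming loop only applies if your service actually returns a streamed response (add the Authorization header from the previous section if access control is enabled).

```python
import requests

# Placeholders for illustration; substitute your deployment's URL and payload.
url = "https://<your-model-deployment-endpoint>/predict"
payload = {"input": "hello"}

# stream=True makes requests read the body incrementally, which is useful
# when the service is configured for streaming output.
with requests.post(url, json=payload, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    print("Status:", resp.status_code)
    print("Content-Type:", resp.headers.get("Content-Type"))
    for line in resp.iter_lines(decode_unicode=True):
        if line:
            print(line)
```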
Scaling Model Deployments
When creating a model deployment, you can select the number of replicas. The more replicas, the higher the service load capacity.
On the "Settings" page of a running model deployment version, you can update the number of replicas.
After scaling, multiple instances will be displayed.
The log view will also show the logs of each instance separately:
System metrics will likewise be displayed separately for each instance:
:::caution Note
The more replicas, the stronger the service load capacity, which also means greater consumption of "computing resources" over the same time period.
:::
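One way to see the effect of a replica change on load capacity is to send a batch of concurrent requests before and after scaling and compare throughput. A minimal sketch with a placeholder endpoint and payload (no authentication shown; add an API Key header if access control is enabled):

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

# Placeholder endpoint and payload for illustration only.
URL = "https://<your-model-deployment-endpoint>/predict"
PAYLOAD = {"input": "hello"}

def call_once(_):
    """Send one request and return its latency in seconds."""
    start = time.perf_counter()
    resp = requests.post(URL, json=PAYLOAD, timeout=60)
    resp.raise_for_status()
    return time.perf_counter() - start

# Fire 50 requests with 10 concurrent workers and report simple throughput.
started = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as pool:
    latencies = list(pool.map(call_once, range(50)))
elapsed = time.perf_counter() - started

print(f"requests: {len(latencies)}, wall time: {elapsed:.1f}s, "
      f"throughput: {len(latencies) / elapsed:.1f} req/s, "
      f"avg latency: {sum(latencies) / len(latencies):.3f}s")
```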
Updating Model Deployments
A model deployment can contain multiple versions. When another version is created or started, the currently running version is shut down; only one running version is allowed under the same model deployment.
Deleting Unnecessary Model Deployment Versions
You can delete inactive model deployment versions:
:::caution Note
Model deployment versions that are still active (i.e., not in the "Stopped" or "Error" status) cannot be deleted.
:::
Deleting an Entire Model Deployment
If the entire model deployment is no longer needed, you can delete it on the "Settings" page of that model deployment.
:::caution Note
If a model deployment has active model versions (i.e., versions not in the "Stopped" or "Error" status), the model deployment cannot be deleted.
:::