Model Deployment Management

Model Deployment Version Status

  • Preparing: The version is starting up
  • Running: Started; you can view its running logs
  • Shutting Down: Shutdown and cleanup are in progress
  • Stopped: The service version has been shut down successfully
  • Error: Shut down abnormally; check the logs for the specific cause

Model deployment versions in Error and Stopped status are considered inactive, allowing the version to be restarted or permanently deleted.

Model Deployment Monitoring Metrics

Model deployment supports two types of data monitoring.

System Metrics Monitoring

System Monitoring Metrics Description

The system monitoring interface mainly displays three categories of important resource usage metrics:

CPU Utilization

  • Divided into user mode (user) and system mode (system) usage
  • Displays total core count configuration
  • Shows CPU utilization trends through time series curves

Memory Usage

  • Displays currently used memory and total memory capacity
  • Calculates memory usage percentage
  • Chart shows real-time changes in memory usage

GPU Usage

  • Monitors the usage of each GPU device
  • Includes two key metrics:
    • util: GPU compute unit utilization
    • vram: GPU memory usage
  • Hovering the mouse shows specific values at a point in time:
    • Core utilization percentage
    • Memory usage

These metrics are displayed intuitively in chart form, helping users monitor service running status and resource usage trends, facilitating timely detection of potential performance issues.

Request Usage Monitoring

The request monitoring interface displays service request processing performance metrics, including the following key dimensions:

Request Count Statistics

  • Displays time distribution of request count in bar chart format
  • Allows selection of different time intervals (5 minutes, 15 minutes, 30 minutes, etc.) to view statistical data
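The bucketing behind such a bar chart can be sketched as follows. This is a minimal illustration of fixed-interval counting, not the platform's actual implementation:

```python
from collections import Counter

def bucket_requests(timestamps, interval_minutes=5):
    """Count requests per fixed-size time bucket.

    Takes epoch-second timestamps; returns a Counter keyed by the
    bucket's start time in epoch seconds.
    """
    size = interval_minutes * 60
    return Counter((int(ts) // size) * size for ts in timestamps)

# Requests at 00:01, 00:03, and 00:07 UTC fall into two 5-minute buckets.
counts = bucket_requests([60, 180, 420], interval_minutes=5)
print(counts)  # Counter({0: 2, 300: 1})
```

Switching the interval (5, 15, 30 minutes) only changes the bucket size; the underlying request timestamps stay the same.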

Request Time Statistics

Displays request processing time metrics across different dimensions through multiple curves:

  • avg: Average response time
  • max: Maximum response time
  • Percentile statistics:
    • p50: 50th percentile response time
    • p75: 75th percentile response time
    • p90: 90th percentile response time
    • p95: 95th percentile response time
    • p99: 99th percentile response time
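The chart's summary values can be reproduced from raw response times with the standard library. A minimal sketch (not the platform's implementation) using `statistics.quantiles`:

```python
import statistics

def latency_summary(times_ms):
    """Compute the summary metrics shown in the request-time chart."""
    # quantiles(n=100) returns the 1st..99th percentile cut points.
    q = statistics.quantiles(times_ms, n=100)
    return {
        "avg": statistics.mean(times_ms),
        "max": max(times_ms),
        "p50": q[49], "p75": q[74], "p90": q[89], "p95": q[94], "p99": q[98],
    }

summary = latency_summary([12, 15, 18, 22, 35, 40, 55, 80, 120, 400])
print(summary["p50"])  # 37.5
```

Note how the average (here 79.7 ms) sits well above the median because one slow 400 ms request skews it; this is why percentile curves are usually more informative than `avg` alone.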

Data Viewing Methods

  • Hovering the mouse shows detailed metric data at specific time points
  • Supports viewing request statistics within specified time ranges
  • Can show/hide specific metrics through legend selection
  • Upper right corner displays total requests and various statistical metrics for the currently selected time period

These metrics help users comprehensively understand service performance, and can be used for:

  • Evaluating service response time
  • Identifying performance bottlenecks
  • Analyzing request processing capacity
  • Monitoring service quality

API Key Management

By default, model deployment services are publicly accessible and can be called without authentication. You can enable access control on the model deployment settings page to activate the API Key authentication mechanism:

API Key Management Interface

After enabling access control, model deployment supports two types of API Keys for access authentication:

  1. User-level API Key - Can access all model deployments under the user
  2. Model deployment-level API Key - Can only access one or more specified model deployments
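Once access control is enabled, client requests must carry the key. A hedged sketch of an authenticated call; the endpoint URL is a placeholder and the `Authorization: Bearer` header scheme is an assumption, so consult the API Key Management documentation for the exact header your deployment expects:

```python
import urllib.request

# Placeholder endpoint and key -- substitute your deployment's real values.
API_KEY = "your-api-key"
req = urllib.request.Request(
    "https://example.com/your-deployment/v1/predict",
    data=b'{"inputs": "hello"}',
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_KEY}",  # assumed Bearer scheme
    },
    method="POST",
)
# urllib.request.urlopen(req) would send the authenticated request.
print(req.get_header("Authorization"))  # Bearer your-api-key
```

A request without this header (or with a revoked key) would be rejected once access control is active.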

You can manage API Keys on the model deployment settings page:

API Key Management Interface

:::caution Note

API Key creation, update, or deletion operations may take up to 5 minutes to take effect.

:::

For detailed management and usage instructions for API Keys, please refer to API Key Management.

Online Testing

Each model deployment page has a built-in "Online Testing Tool" that supports debugging APIs directly in the browser. You can:

  • Select request method (GET/POST)
  • Fill in request path and parameters
  • Enable streaming output and choose among multiple Content-Types
  • View response content and response headers in real-time

This is very convenient for quickly validating model services and debugging interface parameters. See the "Version Testing" section on the deployment details page.
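When streaming output is enabled, many model services deliver tokens as server-sent-events-style `data:` lines. A minimal parser sketch; the `data:` framing and the `[DONE]` sentinel are assumptions about the stream format, so check the Content-Type your service actually returns:

```python
def iter_sse_data(lines):
    """Yield payloads from a server-sent-events style stream."""
    for line in lines:
        line = line.strip()
        if line.startswith("data:"):
            payload = line[len("data:"):].strip()
            if payload == "[DONE]":  # common end-of-stream sentinel (assumed)
                return
            yield payload

chunks = list(iter_sse_data(["data: Hello", "", "data: world", "data: [DONE]"]))
print(chunks)  # ['Hello', 'world']
```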

Scaling Model Deployments

When creating a model deployment, you can select the number of replicas. The more replicas, the higher the service load capacity.

On the "Settings" page of a running model deployment version, you can update the number of replicas.

After scaling, multiple instances are displayed.

Logs are shown separately for each instance:

System metrics are likewise displayed per instance:

:::caution Note

More replicas mean stronger service load capacity, but also greater consumption of computing resources over the same time period.

:::
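The linear relationship between replica count and consumption can be made concrete. A small sketch where `rate_per_replica_hour` is a hypothetical billing rate, since actual accounting depends on the compute resources you selected:

```python
def compute_consumption(replicas, hours, rate_per_replica_hour):
    """Resource consumption grows linearly with replica count.

    `rate_per_replica_hour` is a hypothetical per-replica rate used
    only for illustration.
    """
    return replicas * hours * rate_per_replica_hour

# Tripling replicas triples consumption over the same period.
print(compute_consumption(1, 2, 10))  # 20
print(compute_consumption(3, 2, 10))  # 60
```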

Updating Model Deployments

A model deployment can contain multiple versions. When a new model deployment version is created or started, the currently running version is shut down: only one version of a model deployment can be running at a time.

Deleting Unnecessary Model Deployment Versions

You can delete inactive model deployment versions:

:::caution Note

Model deployment versions that are still active (i.e., not in "Stopped" or "Error" status) cannot be deleted.

:::

Deleting an Entire Model Deployment

If the entire model deployment is no longer needed, you can delete it on the "Settings" page of that model deployment.

:::caution Note

If a model deployment still has active versions (i.e., versions not in "Stopped" or "Error" status), the model deployment cannot be deleted.

:::