Computing Container FAQ (Frequently Asked Questions)
A collection of answers to common questions about computing containers.
What does the "Kernel Restarting" error mean during Jupyter execution?
When the kernel in Jupyter Notebook crashes, stops, or encounters an error, Jupyter Notebook will automatically attempt to restart the kernel. During this process, all variables and data in the notebook will be cleared and need to be rerun. Possible causes include code errors, memory issues, resource limitations, or other system problems. We recommend checking for syntax and logic errors in your code and ensuring that memory usage does not exceed available capacity. If the problem persists, try using a higher version of Python or upgrading related dependency libraries.
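If you suspect the kernel is being killed by memory pressure, you can check headroom from a notebook cell before running the heavy step. A minimal sketch, assuming the psutil package is available in the image (install it with pip otherwise):

```python
import psutil

# Report total and currently available memory as seen from inside the container
mem = psutil.virtual_memory()
print(f"Total: {mem.total / 1e9:.1f} GB, "
      f"available: {mem.available / 1e9:.1f} GB ({mem.percent}% used)")
# Note: inside a container this may reflect the host rather than the container's
# own limit, so also check the resource metrics shown on the container page.
```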
How do I find the path to an executable program?
You can use the `which` command to find it:

```bash
which darknet
```

Result:

```
/usr/local/bin/darknet
```

If the file is in a non-standard path, you can use the `find` command:

```bash
find / -name program
```

For more detailed usage instructions, please refer to `man which` and `man find`.
Why can't I exit Vim's edit mode in the workspace Terminal?
According to user feedback, if you encounter abnormal behavior when using Vim to edit files in the workspace Terminal (mainly being unable to leave insert mode after pressing `i`, then `Esc`, and typing `:wq`), please check whether you have Vim-related plugins/extensions such as Vimium or cVim installed and enabled in the Google Chrome browser. Certain settings of these plugins are very likely to conflict with Terminal shortcuts in the workspace.
Recommendation: Disable related plugins/extensions, or switch to a different browser
Why did my original content disappear after restarting the container using "Continue Execution"?
The section Why do we have execution records explains that each container execution is an independent environment, and Continue execution from a specified execution record explains what "Continue Execution" does: it binds the output of the previous execution to the /openbayes/home directory of the new execution. The "previous execution" is the execution on which the user clicked "Continue Execution". If that execution produced no output, the new execution will naturally have no corresponding data either. This typically happens when the execution on which "Continue Execution" was clicked was closed before it ever actually started, while the execution before that one did contain data.
For example, suppose I create a new container cifar10 and start its first execution as a Jupyter workspace, write an .ipynb file, and close it. If I then click "Continue Execution" but close the new execution before it actually starts, that second execution has no output; clicking "Continue Execution" on it will therefore open an empty working directory, even though the .ipynb file still exists in the first execution.
Similarly, because each container execution has an independent environment (see Why do we have execution records), manually installed dependencies will also disappear. If you want the relevant dependencies to be available every time a container starts, please refer to How to Add Dependencies Not in the List. Additionally, we periodically update images to include more common dependencies.
Why is my container stuck in the "Preparing" state?
A container startup being stuck in the preparing state for a long time may be due to the following reasons:
- A large number of files are bound to /openbayes/home. During the startup phase, the container spends time copying large amounts of data to /openbayes/home, especially when copying many small files, which can be very time-consuming. On the execution page, you will also see "Data Synchronization" displayed with the current data synchronization speed. If you don't need write access to the corresponding data, it's recommended to create it as a dataset and bind it to /input0-4 to avoid the copying process. See Compute Container Working Directory - Creating a Data Repository Version from Working Directory for how to create a new dataset version from the container's "working directory".
- The container is in a cold start state. The machine starting the container currently doesn't have the required image and is pulling it. Although we have added image nodes within the cluster to improve image acquisition time, in this cold start state, 3-5 minutes will still be spent on image pulling. On the execution page, you will also see this execution is in the "Pulling Image" state.
- Network issues. Due to internal network fluctuations, problems occur during the container creation process. If you've ruled out the above two possibilities (i.e., you haven't loaded large amounts of data to /openbayes/home, but the container still hasn't started after 5-10 minutes), you can try restarting the container or contact customer service through the chat window on the interface.
Why is my container stuck in the "Pulling Image" state?
When your container remains in the "Pulling Image" state for an extended period, this is usually related to cold start. Cold start means that the compute node you requested hasn't cached the container image you need, and the system needs to download the complete image from the image repository.
This situation is normal, especially for:
- First-time use of a specific type of environment or framework
- Using newer versions of images
- Using larger special images
Under normal circumstances, the image pulling process typically takes 3-5 minutes to complete, with the specific time depending on the image size and current network conditions. During this time, please be patient while the system completes image pulling and container initialization.
Abnormal situations: If the container remains in the "Pulling Image" state for more than 10 minutes, it may be due to the following reasons:
- The image repository is temporarily unavailable
- System resource scheduling issues
In this case, we recommend you:
- Refresh the page to view the latest status
- Try restarting the container creation process
- If the problem persists, please contact the platform administrator or customer service through the interface chat window, providing your container ID and detailed information
Why is my container stuck in the "Synchronizing Data and Closing" state?
This situation usually has the same cause as slow container opening - a large number of files are bound to /openbayes/home, and when the container closes, it needs to synchronize large amounts of data. Synchronizing large amounts of data, especially many small files, is very time-consuming. If you don't need write access to the corresponding data, it's recommended to create it as a dataset and bind it to /input0-4 to avoid the copying process.
See Compute Container Working Directory - Creating a Data Repository Version from Working Directory for how to create a new dataset version from the container's "working directory."
If there aren't many files in the container and this situation still occurs, please contact customer service through the chat window on the interface.
How do I upload, view, and preview .ipynb files?
You can upload through two methods:
- In computing container mode, create a new computing container, select Workspace as the access method, open the workspace after the container starts, and upload the .ipynb file there.
- If you are working with a pre-trained model, you can package the .ipynb file into a Zip archive and upload it directly. After uploading, you can access the notebook in the file list of the pre-trained model. Later, you can also create a new container, bind the pre-trained model to it, and work with the notebook interactively in the workspace.
Why is the storage space in the container different from what I see in "Resource Usage"? What is their relationship?
The storage space in the container is the workspace size of the corresponding "working directory" (/openbayes/home directory) after opening a container. For specific computing resources, this space is fixed. For example, "T4" type computing resources have a workspace of 50GB. The "Resource Usage" reflects the user's global storage space. The container's storage space is used when the user opens a container. After the container is closed, the data in its "working directory" (/openbayes/home directory) will be synchronized to the user's global storage space and occupy the corresponding quota. On the other hand, when users upload datasets, they are directly counted toward the user's global storage space quota.
When the user's global storage space exceeds the limit, the user will no longer be able to create containers or upload datasets. Space can be freed by deleting executions, deleting datasets, etc.
Why is my container frozen, and I can't even open the Jupyter workspace page?
The computing resources (CPU, GPU, memory) of containers are fixed. If a program occupies all CPU resources, it will severely affect other processes. For example, some programs will occupy all CPU resources at once, which will make the Jupyter workspace program itself very sluggish, and may even prevent the Jupyter workspace page from opening. So if you find that your Jupyter workspace page cannot be opened, first check whether the container's CPU is running at full capacity (important metrics such as CPU and memory are displayed on the container page).
If the container is very sluggish or the Jupyter workspace page cannot be opened, you can try logging into the container via SSH. If you still cannot log in, you can try restarting the container.
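If you can still reach the container through SSH or the Terminal, you can check which processes are saturating the CPU. A minimal sketch, assuming psutil is installed (pip install psutil if it is not):

```python
import time
import psutil

# Overall per-core load, sampled over one second
print("Per-core usage (%):", psutil.cpu_percent(interval=1.0, percpu=True))

# The five busiest processes over a one-second window
procs = list(psutil.process_iter(["pid", "name"]))
for p in procs:
    try:
        p.cpu_percent(None)  # prime the per-process counters
    except psutil.NoSuchProcess:
        pass
time.sleep(1.0)
usage = []
for p in procs:
    try:
        usage.append((p.cpu_percent(None), p.pid, p.info["name"]))
    except psutil.NoSuchProcess:
        pass
for cpu, pid, name in sorted(usage, reverse=True)[:5]:
    print(f"{cpu:6.1f}%  pid={pid}  {name}")
```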
Why doesn't my "Workspace" automatically close?
The automatic closure of "Workspace" requires two conditions to be met:
- The Jupyter page is closed
- CPU usage remains close to zero for a sustained period
Note that if the Jupyter page is not closed in the browser, the system will consider the "Workspace" to still be in use and will not trigger idle shutdown.
Why is my GPU utilization very low, and how can I improve it?
Low GPU utilization in deep learning training and inference is a common problem. Based on our analysis of numerous production environments, over 80% of GPU inefficiency issues stem from design flaws in CPU-GPU collaborative pipelines. Below are optimization strategies from three dimensions:
Data Loading Optimization
Storage Reading Strategy
When training data is stored in distributed file systems, cross-region access latency can reach 50-200ms. Taking the ImageNet dataset as an example, the SM utilization of an A100 GPU may drop from 95% to 32%. Optimization solutions:
- Local caching: Cache data in compute node memory and merge small files into TFRecord format for reading
- Parallel prefetching: Set `num_workers` to roughly 0.8 × the number of CPU cores, and increase `prefetch_factor` appropriately
Memory Transfer Acceleration
```python
from torch.utils.data import DataLoader

loader = DataLoader(dataset,
                    num_workers=8,
                    pin_memory=True,  # Key parameter
                    persistent_workers=True,
                    prefetch_factor=2)
```

This configuration can reduce data loading latency by 75% and significantly decrease GPU idle time.
Computation Scheduling Optimization
Asynchronous Operation Design
CPU operations such as model saving and log printing can block CUDA Stream. Optimization measures:
```python
from concurrent.futures import ThreadPoolExecutor

# Asynchronously save the model: create the executor once (e.g. outside the
# training loop) and submit the save so the main thread is not blocked
executor = ThreadPoolExecutor(max_workers=1)
executor.submit(torch.save, model.state_dict(), "checkpoint.pt")
```

Batch Size Adjustment
A batch size that is too small cannot fully utilize GPU parallel capabilities, while one that is too large may cause memory overflow. Best practices:
- Start with small batches and gradually increase until memory usage reaches 80%
- Use gradient accumulation to increase the effective batch size (see the sketch below)
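A minimal gradient accumulation sketch, assuming `model`, `optimizer`, `loss_fn`, and `dataloader` are already defined (hypothetical names):

```python
accumulation = 4  # effective batch size = per-step batch size x 4

optimizer.zero_grad()
for i, (inputs, targets) in enumerate(dataloader):
    outputs = model(inputs)
    loss = loss_fn(outputs, targets) / accumulation  # scale so gradients average correctly
    loss.backward()                                  # gradients accumulate across iterations
    if (i + 1) % accumulation == 0:
        optimizer.step()
        optimizer.zero_grad()
```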
System Parameter Tuning
The following environment variables have significant impact on performance:
```bash
# Basic CUDA parameter optimization
export CUDA_DEVICE_MAX_CONNECTIONS=1     # Reduce context switching overhead
export CUDA_CACHE_DISABLE=0              # Enable CUDA kernel cache to improve repeated computation speed
export CUDA_VISIBLE_DEVICES=0,1,2,3      # Specify GPU device IDs to use

# NCCL communication optimization (multi-GPU training)
export NCCL_P2P_DISABLE=0                # Enable peer-to-peer communication
export NCCL_IB_DISABLE=0                 # Enable InfiniBand transport
export NCCL_DEBUG=INFO                   # Can be enabled during debugging; disable in production
export NCCL_SOCKET_IFNAME=eth0           # Specify the network interface to avoid selecting the wrong one

# Parallel computation optimization
export OMP_NUM_THREADS=8                 # Limit OpenMP threads to avoid contention with DataLoader workers
export MKL_NUM_THREADS=8                 # Limit MKL threads, similarly to avoid resource contention
```

When using PyTorch, the following parameters can also be set through code:
```python
import torch

# Optimize CUDA memory allocation strategy
torch.backends.cudnn.benchmark = True  # Significant acceleration for fixed input sizes
torch.backends.cudnn.deterministic = False  # Non-deterministic mode is usually faster
torch.cuda.set_per_process_memory_fraction(0.9)  # Reserve some memory to prevent OOM
```

Performance Diagnostic Tools
Use PyTorch Profiler to locate bottlenecks:
```python
import torch

with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU,
                torch.profiler.ProfilerActivity.CUDA],
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=3)
) as prof:
    for step, (inputs, _) in enumerate(dataloader):  # assumes (inputs, labels) batches
        outputs = model(inputs)
        prof.step()  # advance the profiler schedule each iteration
```

Key points to focus on during analysis:
- Difference between `CPU Exec` and `GPU Exec` time
- Frequency of `cudaMemcpyAsync` calls
- Fluctuations in GPU core utilization
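After the profiled steps complete, you can print an aggregated summary to see which operators dominate. A minimal sketch, assuming `prof` is the profiler object from the block above:

```python
# Aggregate recorded events and sort by total GPU time
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```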
Optimal GPU utilization should maintain above 85%, with valley duration less than 50ms, indicating a well-functioning data processing pipeline.
Can my two containers access each other?
Yes, your containers can access each other.
1. Get the IP address of the target container. Log in to the target container (the container you want to connect to) and execute in the Terminal:

```bash
ip a
```

If the command is not found, install the network tools first:

```bash
apt install iproute2 -y
ip a
```

In the output, find the IP address corresponding to the `eth0` interface (e.g., `10.38.6.87`).

2. Establish a connection from the source container. In the other container (the one initiating the connection), use SSH to connect to the target container:

```bash
ssh 10.38.6.87
```

:::info
Containers created by the same user have pre-configured SSH keys and can connect directly without a password.
:::

3. Transfer files between containers:

```bash
# Copy a file from the current container to the target container
scp /path/to/local/file 10.38.6.87:/path/to/remote/

# Copy a file from the target container to the current container
scp 10.38.6.87:/path/to/remote/file /path/to/local/

# Synchronize directories using rsync
rsync -avz /path/to/local/dir/ 10.38.6.87:/path/to/remote/dir/
```
For more details, please refer to the Inter-Container Communication documentation.
Can I use my own image?
Currently, regular users on the platform cannot directly use custom images. We have pre-installed images for various commonly used frameworks and versions, including:
- Deep Learning Frameworks: TensorFlow, PyTorch, MXNet and other mainstream versions
- Large Language Model Frameworks: vLLM, SGLang, etc.
- Other Frameworks: PaddlePaddle, Darknet, etc.
For custom image requirements, we provide the following solutions:
- For regular users:
  - Use `pip install --user` to install Python dependencies, which will be saved in the user directory
  - Create custom conda environments and place them in the /openbayes/home directory
  - Add the necessary dependency installation commands to startup scripts
- For dedicated deployment users:
  - Full custom image customization service is supported
  - All custom images must be built based on the Ubuntu operating system
  - Environments, dependencies, and configurations can be customized to specific enterprise requirements
- For enterprise users:
  - Our enterprise private deployment service provides customized image support
  - Enterprise users can build custom Docker images based on our base images
  - Custom images can be directly specified during model deployment
If you have special image requirements, we recommend contacting customer service to discuss specific solutions.
What should I do if I encounter issues downloading models from HuggingFace?
If you encounter connection errors when using HuggingFace-related features, you can resolve this by setting up a mirror:
- Install the dependency:

```bash
pip install -U huggingface_hub
```

- Set the environment variable:

```bash
export HF_ENDPOINT=https://hf-mirror.com
```

- Usage example:

```bash
huggingface-cli download --resume-download gpt2 --local-dir gpt2
```

For dataset downloads:

```bash
huggingface-cli download --repo-type dataset --resume-download wikitext --local-dir wikitext
```
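The same downloads can also be done from Python through the huggingface_hub API. A minimal sketch; note that HF_ENDPOINT must be set before huggingface_hub is imported for the mirror to take effect:

```python
import os

# Point huggingface_hub at the mirror before importing it
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

from huggingface_hub import snapshot_download

# Download a model repository
snapshot_download(repo_id="gpt2", local_dir="gpt2")

# Download a dataset repository
snapshot_download(repo_id="wikitext", repo_type="dataset", local_dir="wikitext")
```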