AI GPU Distributed Training Cluster with Ray

A few pitfalls

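1. The CUDA version and PyTorch version inside the ray-ml image must match the GPU driver installed on the node. Otherwise, when a training job starts, the driver reports a CUDA version mismatch and the job fails to launch. I did not see this on Tesla-class cards such as the A6000 and A100, but I did hit it on a 3090. For the underlying cause, see this write-up: https://zhuanlan.zhihu.com/p/361545761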

2. I deployed with KubeRay, which has some drawbacks: apart from changing the image tag, every other change required me to restart the pods by hand.

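3. I submit jobs through the client SDK (submit_job). The use case is a job scheduler that submits jobs to the Ray cluster and also needs to fetch the remote logs. The code is as follows: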

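import time

from ray.job_submission import JobSubmissionClient, JobStatus

# Replace raycluster-ip with the head node's address (8265 is the port used here).
client = JobSubmissionClient("http://raycluster-ip:8265")

# Shell command executed on the cluster. Uncomment the clone step if the
# Ray repository is not already present in the job's working directory.
kick_off_pytorch_benchmark = (
    # "git clone https://github.com/ray-project/ray || true;"
    "python ray/release/air_tests/air_benchmarks/workloads/pytorch_training_e2e.py"
    " --data-size-gb=1 --num-epochs=2 --num-workers=1"
)

job_id = client.submit_job(
    entrypoint=kick_off_pytorch_benchmark,
)
print(job_id)


def wait_until_status(job_id, status_to_wait_for, timeout_seconds=5):
    start = time.time()
    # The listing is cut off after "time.time() - start"; the rest of this
    # loop is an assumed completion that polls the status until the timeout.
    while time.time() - start <= timeout_seconds:
        status = client.get_job_status(job_id)
        print(f"status: {status}")
        if status in status_to_wait_for:
            break
        time.sleep(1)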
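
Since the scheduler also needs the remote logs, here is a minimal sketch of how waiting and log retrieval could be combined. `wait_and_fetch_logs` and its parameters are hypothetical names of my own; `get_job_status` and `get_job_logs` are methods of `JobSubmissionClient`. The helper accepts any object exposing those two methods, so it can be tried without a cluster.

```python
import time

# Terminal job states, compared by their string value.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED"}


def wait_and_fetch_logs(client, job_id, timeout_seconds=600, poll_seconds=2.0):
    """Poll until the job ends or the timeout expires, then return (status, logs).

    Hypothetical helper: `client` is expected to behave like
    ray.job_submission.JobSubmissionClient.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = client.get_job_status(job_id)
        # Accept both enum members (with .value) and plain strings.
        state = getattr(status, "value", status)
        if state in TERMINAL_STATES or time.monotonic() >= deadline:
            # Logs are fetched once, at the end; on timeout they may be partial.
            return status, client.get_job_logs(job_id)
        time.sleep(poll_seconds)
```

Against a real cluster this would be called as `status, logs = wait_and_fetch_logs(client, job_id)` with the `client` and `job_id` from the listing above. If the scheduler needs the logs while the job is still running, streaming them is probably a better fit than fetching once at the end.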

4. It is best to pin the GPU driver and CUDA versions at the node level from the very beginning, which saves a lot of trouble later.
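
To catch a mismatch like the one in pitfall 1 before a training job fails, a small check can be run inside the ray-ml container on each GPU node. This is a rough sketch, not a definitive diagnosis: `is_arch_supported` and `report` are hypothetical helper names, and the rule used (a build supports a GPU if it ships binaries for the same major compute capability with an equal or lower minor version) is a simplifying assumption. Whether a missing architecture was the actual cause on my 3090 is also an assumption; the linked write-up has the details.

```python
def is_arch_supported(capability, arch_list):
    """Return True if arch_list has a binary usable on a GPU with this capability.

    capability: (major, minor), e.g. (8, 6).
    arch_list:  strings such as 'sm_80', as returned by torch.cuda.get_arch_list().
    """
    major, minor = capability
    for arch in arch_list:
        if not arch.startswith("sm_"):
            continue  # skip PTX-only entries such as 'compute_80'
        digits = arch[3:]
        if not digits.isdigit():
            continue  # skip suffixed variants this sketch does not parse
        arch_major, arch_minor = int(digits[:-1]), int(digits[-1])
        if arch_major == major and arch_minor <= minor:
            return True
    return False


def report():
    """Print what the container sees; run inside the image on a GPU node."""
    import torch  # imported lazily so the helper above works without PyTorch

    print("torch", torch.__version__, "built for CUDA", torch.version.cuda)
    if not torch.cuda.is_available():
        print("CUDA not usable: compare the node driver with the image's CUDA version")
        return
    archs = torch.cuda.get_arch_list()
    for i in range(torch.cuda.device_count()):
        cap = torch.cuda.get_device_capability(i)
        verdict = "supported" if is_arch_supported(cap, archs) else "NOT supported"
        print(torch.cuda.get_device_name(i), f"sm_{cap[0]}{cap[1]}", verdict)
```

Calling `report()` once per node type when building the image makes it easier to settle on one driver and CUDA combination up front.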
