
Multi-Node Multi-GPU Parallelism

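GPUs other than models such as the A100 have no InfiniBand (IB) networking or NVLink hardware support, so multi-node parallelism is less efficient than single-node parallelism. For this reason, enabling intranet IPs for multi-node multi-GPU parallelism is no longer supported.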

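If single-node multi-GPU parallelism can meet your compute needs, it is the recommended choice: multi-node parallelism carries high network overhead and is far less efficient. To use single-node multi-GPU, rent multiple GPUs within the same instance. For an instance that is already running, shut it down and then change the number of GPUs by upgrading or downgrading its configuration.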

Multi-node multi-GPU

Check the network interface and IP

If the ifconfig command is not available, install it with apt-get update && apt-get install -y net-tools

[Screenshot: ifconfig output listing the network interfaces and their IPs]

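Interface names can differ between instances, so check each instance individually. In the example above, the interface is eth1 and its IP is 10.0.0.34.

(There may be several network interfaces. Choose the dedicated IP enabled for the instance and the interface it belongs to; that interface is usually named eth1.)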
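
When several instances have to be checked, reading the interface-to-IP mapping programmatically can be less error-prone than scanning the output by eye. The following is a minimal sketch, not part of the platform's tooling. It assumes the newer net-tools output layout (a header line such as "eth1: flags=..." followed by an indented "inet <address>" line), and the function name parse_ifconfig is hypothetical.

```python
import re
import subprocess


def parse_ifconfig(text):
    """Map interface name -> IPv4 address from net-tools style `ifconfig` output.

    Assumes the newer layout: "eth1: flags=..." headers at column 0 and
    indented "inet 10.0.0.34 ..." lines. Older layouts are not handled.
    """
    result = {}
    current = None
    for line in text.splitlines():
        # An interface block starts at column 0, e.g. "eth1: flags=4163<...>"
        header = re.match(r"^(\S+?):\s", line)
        if header:
            current = header.group(1)
            continue
        # The IPv4 line is indented; "inet6" lines do not match this pattern
        addr = re.match(r"^\s+inet\s+(\d+\.\d+\.\d+\.\d+)", line)
        if addr and current is not None:
            result[current] = addr.group(1)
    return result


if __name__ == "__main__":
    # Requires the ifconfig command (net-tools) to be installed
    output = subprocess.run(
        ["ifconfig"], capture_output=True, text=True, check=True
    ).stdout
    for name, ip in parse_ifconfig(output).items():
        print(f"{name}\t{ip}")
```

Run it on every instance and note the interface that carries the instance's dedicated IP. That name is the value to use for NCCL_SOCKET_IFNAME.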

Testing

Download the test script:

wget http://autodl-public.ks3-cn-beijing.ksyun.com/debug/ddp.py

Run on the master node:

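export NCCL_SOCKET_IFNAME=eth1   # replace with your own interface name
# nnodes: total number of nodes (i.e. instances)
# nproc_per_node: number of processes to run on this node (i.e. number of GPUs)
# node_rank: index of this node (the master node is 0)
# master_addr: IP address of the network interface on the master instance
# Adjust the four values above to match your setup. master_port can be left unchanged.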
python -m torch.distributed.launch \
    --nproc_per_node=1 \
    --nnodes=2 \
    --node_rank=0 \
    --master_addr="10.0.0.2" \
    --master_port=55568 \
    ddp.py

Run on the worker node:

export NCCL_SOCKET_IFNAME=eth1
python -m torch.distributed.launch \
    --nproc_per_node=1 \
    --nnodes=2 \
    --node_rank=1 \
    --master_addr="10.0.0.2" \
    --master_port=55568 \
    ddp.py
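
The two commands above differ only in node_rank, so with more nodes it can help to generate them rather than edit each one by hand. The sketch below is an illustration only. The helper name build_launch_command is hypothetical, and its defaults simply mirror the example values used above (2 nodes, 1 process per node, master at 10.0.0.2, port 55568).

```python
import shlex


def build_launch_command(node_rank, nnodes=2, nproc_per_node=1,
                         master_addr="10.0.0.2", master_port=55568,
                         script="ddp.py"):
    """Return the torch.distributed.launch argument list for one node."""
    # node_rank is zero-based: 0 is the master, 1..nnodes-1 are workers
    if not 0 <= node_rank < nnodes:
        raise ValueError(f"node_rank must be in [0, {nnodes}), got {node_rank}")
    return [
        "python", "-m", "torch.distributed.launch",
        f"--nproc_per_node={nproc_per_node}",
        f"--nnodes={nnodes}",
        f"--node_rank={node_rank}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        script,
    ]


if __name__ == "__main__":
    # Print one command per node; run each on its own instance
    for rank in range(2):
        print(f"# node_rank={rank}")
        print(shlex.join(build_launch_command(rank)))
```

Every node must use the same nnodes, master_addr, and master_port. Only node_rank changes from one instance to the next.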

Common issues

If you hit an error or the job hangs, first run export NCCL_DEBUG=INFO, then rerun the training command and inspect the NCCL debug log.

If the log shows connection-refused messages like the following:

container-xxx:152731:152731 [1] NCCL INFO Bootstrap : Using eth0:172.17.0.4<0>
container-xxx:152731:152731 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
container-xxx:152731:152731 [1] NCCL INFO NET/IB : No device found.
container-xxx:152731:152731 [1] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.4<0> [1]eth1:172.18.0.16<0>
container-xxx:152731:152731 [1] NCCL INFO Using network Socket
container-xxx:152730:152730 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.4<0>
container-xxx:152730:152730 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
container-xxx:152730:152730 [0] NCCL INFO NET/IB : No device found.
container-xxx:152730:152730 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.4<0> [1]eth1:172.18.0.16<0>
container-3e581195ae-be8c3f28:152730:152730 [0] NCCL INFO Using network Socket
container-3e581195ae-be8c3f28:152731:152775 [1] NCCL INFO Call to connect returned Connection refused, retrying
container-3e581195ae-be8c3f28:152730:152783 [0] NCCL INFO Call to connect returned Connection refused, retrying
container-3e581195ae-be8c3f28:152731:152775 [1] NCCL INFO Call to connect returned Connection refused, retrying
container-3e581195ae-be8c3f28:152730:152783 [0] NCCL INFO Call to connect returned Connection refused, retrying

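then the NCCL_SOCKET_IFNAME environment variable most likely did not take effect (note that the log shows NCCL bootstrapping on eth0 rather than eth1). In that case, write the following into the /etc/nccl.conf configuration file. After that, the environment variable no longer needs to be set separately.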

# Replace the interface name with your own
NCCL_SOCKET_IFNAME=eth1
NCCL_DEBUG=WARN
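
The diagnosis above (wrong bootstrap interface plus repeated connection refusals) can also be checked mechanically on a saved NCCL_DEBUG=INFO log. This is a rough sketch. It assumes the log lines look like the excerpt shown earlier ("NCCL INFO Bootstrap : Using <ifname>:<ip>"), which may vary between NCCL versions, and the function name check_nccl_log is hypothetical.

```python
import re

# Matches e.g. "NCCL INFO Bootstrap : Using eth0:172.17.0.4<0>"
BOOTSTRAP_RE = re.compile(
    r"NCCL INFO Bootstrap : Using ([^\s:]+):(\d+\.\d+\.\d+\.\d+)"
)


def check_nccl_log(log_text, expected_ifname):
    """Summarize the bootstrap interface(s) and connection refusals in a log."""
    used = sorted({m.group(1) for m in BOOTSTRAP_RE.finditer(log_text)})
    refused = log_text.count("Connection refused")
    return {
        "bootstrap_interfaces": used,
        "connection_refused": refused,
        # True only if every bootstrap line used the expected interface;
        # False when no bootstrap line was found at all
        "ifname_effective": used == [expected_ifname],
    }


if __name__ == "__main__":
    import sys

    # Usage: python check_nccl_log.py <log file> <expected interface>
    with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
        print(check_nccl_log(f.read(), sys.argv[2]))
```

If ifname_effective is False while connection_refused is greater than zero, the log matches the situation described above, and the /etc/nccl.conf setting is worth trying.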

See the official documentation: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html