The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run
WARNING:torch.distributed.run:--use_env is deprecated and will be removed in future releases. Please read local_rank from `os.environ('LOCAL_RANK')` instead.
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
  entrypoint : pretrain_gpt.py
  min_nodes : 4
  max_nodes : 4
  nproc_per_node : 8
  run_id : none
  rdzv_backend : static
  rdzv_endpoint : 198.18.8.34:6000
  rdzv_configs : {'rank': 3, 'timeout': 900}
  max_restarts : 3
  monitor_interval : 5
  log_dir : None
  metrics_cfg : {}
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_tu364wid/none_fzurdyll
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future.
  warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=198.18.8.34
  master_port=6000
  group_rank=3
  group_world_size=4
  local_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
  role_ranks=[24, 25, 26, 27, 28, 29, 30, 31]
  global_ranks=[24, 25, 26, 27, 28, 29, 30, 31]
  role_world_sizes=[32, 32, 32, 32, 32, 32, 32, 32]
  global_world_sizes=[32, 32, 32, 32, 32, 32, 32, 32]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_tu364wid/none_fzurdyll/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_tu364wid/none_fzurdyll/attempt_0/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_tu364wid/none_fzurdyll/attempt_0/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_tu364wid/none_fzurdyll/attempt_0/3/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker4 reply file to: /tmp/torchelastic_tu364wid/none_fzurdyll/attempt_0/4/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker5 reply file to: /tmp/torchelastic_tu364wid/none_fzurdyll/attempt_0/5/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker6 reply file to: /tmp/torchelastic_tu364wid/none_fzurdyll/attempt_0/6/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker7 reply file to: /tmp/torchelastic_tu364wid/none_fzurdyll/attempt_0/7/error.json
[W ProcessGroupNCCL.cpp:1671] Rank 29 using best-guess GPU 5 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1671] Rank 26 using best-guess GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1671] Rank 28 using best-guess GPU 4 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1671] Rank 27 using best-guess GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1671] Rank 30 using best-guess GPU 6 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1671] Rank 25 using best-guess GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1671] Rank 24 using best-guess GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1671] Rank 31 using best-guess GPU 7 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
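
Note on the warnings above: the launcher asks for the local rank to be read from the `LOCAL_RANK` environment variable, and the NCCL barrier warning asks for `device_ids` to be passed to `barrier()`. The following is a minimal sketch of how a training script might do both. It is not taken from `pretrain_gpt.py`; the helper names `resolve_local_rank` and `bind_device_and_barrier` are hypothetical, and the sketch assumes the process group has already been initialized with the NCCL backend. Also note that `os.environ` is a mapping, so it is indexed rather than called as the warning text suggests.

```python
import os


def resolve_local_rank(env=os.environ, default=0):
    """Return the per-node rank that the launcher exports as LOCAL_RANK."""
    # os.environ is a mapping: use indexing or .get(), not a call.
    return int(env.get("LOCAL_RANK", default))


def bind_device_and_barrier(local_rank):
    """Pin this process to its GPU, then synchronize on that device."""
    # Imported lazily so resolve_local_rank() works without PyTorch installed.
    import torch
    import torch.distributed as dist

    torch.cuda.set_device(local_rank)
    # Assumption: dist.init_process_group(backend="nccl") was called earlier.
    # Passing device_ids tells NCCL which GPU to use instead of guessing.
    dist.barrier(device_ids=[local_rank])


if __name__ == "__main__":
    # Demonstrates the lookup with an explicit mapping instead of the real env.
    print(resolve_local_rank({"LOCAL_RANK": "5"}))
```

Calling `torch.cuda.set_device` before the first collective is usually enough on its own to make the rank-to-GPU mapping known, so the explicit `device_ids` argument is a belt-and-braces choice here.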
iv-ebgyvncucvdxd0xrfapj:3380:3380 [2] NCCL INFO Bootstrap : Using eth0:192.168.10.251<0> iv-ebgyvncucvdxd0xrfapj:3385:3385 [7] NCCL INFO Bootstrap : Using eth0:192.168.10.251<0> iv-ebgyvncucvdxd0xrfapj:3380:3380 [2] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so iv-ebgyvncucvdxd0xrfapj:3380:3380 [2] NCCL INFO P2P plugin IBext iv-ebgyvncucvdxd0xrfapj:3380:3380 [2] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1. iv-ebgyvncucvdxd0xrfapj:3385:3385 [7] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so iv-ebgyvncucvdxd0xrfapj:3385:3385 [7] NCCL INFO P2P plugin IBext iv-ebgyvncucvdxd0xrfapj:3385:3385 [7] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1. iv-ebgyvncucvdxd0xrfapj:3384:3384 [6] NCCL INFO Bootstrap : Using eth0:192.168.10.251<0> iv-ebgyvncucvdxd0xrfapj:3381:3381 [3] NCCL INFO Bootstrap : Using eth0:192.168.10.251<0> iv-ebgyvncucvdxd0xrfapj:3383:3383 [5] NCCL INFO Bootstrap : Using eth0:192.168.10.251<0> iv-ebgyvncucvdxd0xrfapj:3384:3384 [6] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so iv-ebgyvncucvdxd0xrfapj:3384:3384 [6] NCCL INFO P2P plugin IBext iv-ebgyvncucvdxd0xrfapj:3384:3384 [6] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1. iv-ebgyvncucvdxd0xrfapj:3378:3378 [0] NCCL INFO Bootstrap : Using eth0:192.168.10.251<0> iv-ebgyvncucvdxd0xrfapj:3383:3383 [5] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so iv-ebgyvncucvdxd0xrfapj:3383:3383 [5] NCCL INFO P2P plugin IBext iv-ebgyvncucvdxd0xrfapj:3383:3383 [5] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1. iv-ebgyvncucvdxd0xrfapj:3381:3381 [3] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so iv-ebgyvncucvdxd0xrfapj:3381:3381 [3] NCCL INFO P2P plugin IBext iv-ebgyvncucvdxd0xrfapj:3381:3381 [3] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1. 
iv-ebgyvncucvdxd0xrfapj:3382:3382 [4] NCCL INFO Bootstrap : Using eth0:192.168.10.251<0> iv-ebgyvncucvdxd0xrfapj:3379:3379 [1] NCCL INFO Bootstrap : Using eth0:192.168.10.251<0> iv-ebgyvncucvdxd0xrfapj:3378:3378 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so iv-ebgyvncucvdxd0xrfapj:3378:3378 [0] NCCL INFO P2P plugin IBext iv-ebgyvncucvdxd0xrfapj:3378:3378 [0] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1. iv-ebgyvncucvdxd0xrfapj:3382:3382 [4] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so iv-ebgyvncucvdxd0xrfapj:3382:3382 [4] NCCL INFO P2P plugin IBext iv-ebgyvncucvdxd0xrfapj:3382:3382 [4] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1. iv-ebgyvncucvdxd0xrfapj:3379:3379 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so iv-ebgyvncucvdxd0xrfapj:3379:3379 [1] NCCL INFO P2P plugin IBext iv-ebgyvncucvdxd0xrfapj:3379:3379 [1] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1. 
iv-ebgyvncucvdxd0xrfapj:3380:3380 [2] NCCL INFO NET/IB : Using [0]mlx5_1:1/RoCE ; OOB eth0:192.168.10.251<0> iv-ebgyvncucvdxd0xrfapj:3380:3380 [2] NCCL INFO Using network IBext iv-ebgyvncucvdxd0xrfapj:3384:3384 [6] NCCL INFO NET/IB : Using [0]mlx5_1:1/RoCE ; OOB eth0:192.168.10.251<0> iv-ebgyvncucvdxd0xrfapj:3384:3384 [6] NCCL INFO Using network IBext iv-ebgyvncucvdxd0xrfapj:3385:3385 [7] NCCL INFO NET/IB : Using [0]mlx5_1:1/RoCE ; OOB eth0:192.168.10.251<0> iv-ebgyvncucvdxd0xrfapj:3385:3385 [7] NCCL INFO Using network IBext iv-ebgyvncucvdxd0xrfapj:3383:3383 [5] NCCL INFO NET/IB : Using [0]mlx5_1:1/RoCE ; OOB eth0:192.168.10.251<0> iv-ebgyvncucvdxd0xrfapj:3383:3383 [5] NCCL INFO Using network IBext iv-ebgyvncucvdxd0xrfapj:3381:3381 [3] NCCL INFO NET/IB : Using [0]mlx5_1:1/RoCE ; OOB eth0:192.168.10.251<0> iv-ebgyvncucvdxd0xrfapj:3381:3381 [3] NCCL INFO Using network IBext iv-ebgyvncucvdxd0xrfapj:3378:3378 [0] NCCL INFO NET/IB : Using [0]mlx5_1:1/RoCE ; OOB eth0:192.168.10.251<0> iv-ebgyvncucvdxd0xrfapj:3378:3378 [0] NCCL INFO Using network IBext iv-ebgyvncucvdxd0xrfapj:3379:3379 [1] NCCL INFO NET/IB : Using [0]mlx5_1:1/RoCE ; OOB eth0:192.168.10.251<0> iv-ebgyvncucvdxd0xrfapj:3379:3379 [1] NCCL INFO Using network IBext iv-ebgyvncucvdxd0xrfapj:3382:3382 [4] NCCL INFO NET/IB : Using [0]mlx5_1:1/RoCE ; OOB eth0:192.168.10.251<0> iv-ebgyvncucvdxd0xrfapj:3382:3382 [4] NCCL INFO Using network IBext iv-ebgyvncucvdxd0xrfapj:3378:3580 [0] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3. iv-ebgyvncucvdxd0xrfapj:3379:3581 [1] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3. iv-ebgyvncucvdxd0xrfapj:3385:3577 [7] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3. iv-ebgyvncucvdxd0xrfapj:3381:3579 [3] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3. iv-ebgyvncucvdxd0xrfapj:3380:3572 [2] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3. iv-ebgyvncucvdxd0xrfapj:3382:3582 [4] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3. 
iv-ebgyvncucvdxd0xrfapj:3384:3574 [6] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3. iv-ebgyvncucvdxd0xrfapj:3383:3578 [5] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3. iv-ebgyvncucvdxd0xrfapj:3385:3577 [7] NCCL INFO NCCL_IB_TIMEOUT set by environment to 23. iv-ebgyvncucvdxd0xrfapj:3385:3577 [7] NCCL INFO NCCL_IB_RETRY_CNT set by environment to 7. iv-ebgyvncucvdxd0xrfapj:3378:3580 [0] NCCL INFO NCCL_IB_TIMEOUT set by environment to 23. iv-ebgyvncucvdxd0xrfapj:3382:3582 [4] NCCL INFO NCCL_IB_TIMEOUT set by environment to 23. iv-ebgyvncucvdxd0xrfapj:3378:3580 [0] NCCL INFO NCCL_IB_RETRY_CNT set by environment to 7. iv-ebgyvncucvdxd0xrfapj:3382:3582 [4] NCCL INFO NCCL_IB_RETRY_CNT set by environment to 7. iv-ebgyvncucvdxd0xrfapj:3379:3581 [1] NCCL INFO NCCL_IB_TIMEOUT set by environment to 23. iv-ebgyvncucvdxd0xrfapj:3381:3579 [3] NCCL INFO NCCL_IB_TIMEOUT set by environment to 23. iv-ebgyvncucvdxd0xrfapj:3379:3581 [1] NCCL INFO NCCL_IB_RETRY_CNT set by environment to 7. iv-ebgyvncucvdxd0xrfapj:3381:3579 [3] NCCL INFO NCCL_IB_RETRY_CNT set by environment to 7. iv-ebgyvncucvdxd0xrfapj:3383:3578 [5] NCCL INFO NCCL_IB_TIMEOUT set by environment to 23. iv-ebgyvncucvdxd0xrfapj:3383:3578 [5] NCCL INFO NCCL_IB_RETRY_CNT set by environment to 7. iv-ebgyvncucvdxd0xrfapj:3380:3572 [2] NCCL INFO NCCL_IB_TIMEOUT set by environment to 23. iv-ebgyvncucvdxd0xrfapj:3380:3572 [2] NCCL INFO NCCL_IB_RETRY_CNT set by environment to 7. iv-ebgyvncucvdxd0xrfapj:3384:3574 [6] NCCL INFO NCCL_IB_TIMEOUT set by environment to 23. iv-ebgyvncucvdxd0xrfapj:3384:3574 [6] NCCL INFO NCCL_IB_RETRY_CNT set by environment to 7. 
iv-ebgyvncucvdxd0xrfapj:3380:3572 [2] NCCL INFO Trees [0] 27/-1/-1->26->24 [1] 27/-1/-1->26->24 iv-ebgyvncucvdxd0xrfapj:3379:3581 [1] NCCL INFO Trees [0] 30/-1/-1->25->27 [1] 30/-1/-1->25->27 iv-ebgyvncucvdxd0xrfapj:3378:3580 [0] NCCL INFO Trees [0] 26/-1/-1->24->31 [1] 26/-1/-1->24->31 iv-ebgyvncucvdxd0xrfapj:3382:3582 [4] NCCL INFO Trees [0] 29/-1/-1->28->20 [1] 29/12/-1->28->-1 iv-ebgyvncucvdxd0xrfapj:3378:3580 [0] NCCL INFO Setting affinity for GPU 0 to 03ff,ffffffff iv-ebgyvncucvdxd0xrfapj:3383:3578 [5] NCCL INFO Trees [0] 31/-1/-1->29->28 [1] 31/-1/-1->29->28 iv-ebgyvncucvdxd0xrfapj:3380:3572 [2] NCCL INFO Setting affinity for GPU 2 to 03ff,ffffffff iv-ebgyvncucvdxd0xrfapj:3379:3581 [1] NCCL INFO Setting affinity for GPU 1 to 03ff,ffffffff iv-ebgyvncucvdxd0xrfapj:3381:3579 [3] NCCL INFO Trees [0] 25/-1/-1->27->26 [1] 25/-1/-1->27->26 iv-ebgyvncucvdxd0xrfapj:3384:3574 [6] NCCL INFO Trees [0] -1/-1/-1->30->25 [1] -1/-1/-1->30->25 iv-ebgyvncucvdxd0xrfapj:3382:3582 [4] NCCL INFO Setting affinity for GPU 4 to 0fffff,fffffc00,00000000 iv-ebgyvncucvdxd0xrfapj:3383:3578 [5] NCCL INFO Setting affinity for GPU 5 to 0fffff,fffffc00,00000000 iv-ebgyvncucvdxd0xrfapj:3381:3579 [3] NCCL INFO Setting affinity for GPU 3 to 03ff,ffffffff iv-ebgyvncucvdxd0xrfapj:3384:3574 [6] NCCL INFO Setting affinity for GPU 6 to 0fffff,fffffc00,00000000 iv-ebgyvncucvdxd0xrfapj:3385:3577 [7] NCCL INFO Trees [0] 24/-1/-1->31->29 [1] 24/-1/-1->31->29 iv-ebgyvncucvdxd0xrfapj:3385:3577 [7] NCCL INFO Setting affinity for GPU 7 to 0fffff,fffffc00,00000000 iv-ebgyvncucvdxd0xrfapj:3384:3574 [6] NCCL INFO Channel 00 : 30[6b010] -> 31[6b020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3382:3582 [4] NCCL INFO Channel 00 : 28[69010] -> 30[6b010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3380:3572 [2] NCCL INFO Channel 00 : 26[67010] -> 29[69020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3378:3580 [0] NCCL INFO Channel 00 : 24[65010] -> 25[65020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3384:3574 [6] NCCL INFO Channel 01 : 
30[6b010] -> 31[6b020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3382:3582 [4] NCCL INFO Channel 01 : 28[69010] -> 30[6b010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3380:3572 [2] NCCL INFO Channel 01 : 26[67010] -> 29[69020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3378:3580 [0] NCCL INFO Channel 01 : 24[65010] -> 25[65020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3385:3577 [7] NCCL INFO Channel 00 : 31[6b020] -> 24[65010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3383:3578 [5] NCCL INFO Channel 00 : 29[69020] -> 4[69010] [send] via NET/IBext/0/GDRDMA iv-ebgyvncucvdxd0xrfapj:3379:3581 [1] NCCL INFO Channel 00 : 25[65020] -> 27[67020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3385:3577 [7] NCCL INFO Channel 01 : 31[6b020] -> 24[65010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3383:3578 [5] NCCL INFO Channel 01 : 29[69020] -> 4[69010] [send] via NET/IBext/0/GDRDMA iv-ebgyvncucvdxd0xrfapj:3379:3581 [1] NCCL INFO Channel 01 : 25[65020] -> 27[67020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3384:3574 [6] NCCL INFO Connected all rings iv-ebgyvncucvdxd0xrfapj:3378:3580 [0] NCCL INFO Connected all rings iv-ebgyvncucvdxd0xrfapj:3385:3577 [7] NCCL INFO Connected all rings iv-ebgyvncucvdxd0xrfapj:3379:3581 [1] NCCL INFO Connected all rings iv-ebgyvncucvdxd0xrfapj:3382:3582 [4] NCCL INFO Channel 00 : 21[69020] -> 28[69010] [receive] via NET/IBext/0/GDRDMA iv-ebgyvncucvdxd0xrfapj:3378:3580 [0] NCCL INFO Channel 00 : 24[65010] -> 26[67010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3379:3581 [1] NCCL INFO Channel 00 : 25[65020] -> 30[6b010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3382:3582 [4] NCCL INFO Channel 01 : 21[69020] -> 28[69010] [receive] via NET/IBext/0/GDRDMA iv-ebgyvncucvdxd0xrfapj:3381:3579 [3] NCCL INFO Channel 00 : 27[67020] -> 26[67010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3378:3580 [0] NCCL INFO Channel 01 : 24[65010] -> 26[67010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3379:3581 [1] NCCL INFO Channel 01 : 25[65020] -> 30[6b010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3381:3579 [3] NCCL INFO Channel 01 : 27[67020] -> 
26[67010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3380:3572 [2] NCCL INFO Connected all rings iv-ebgyvncucvdxd0xrfapj:3381:3579 [3] NCCL INFO Connected all rings iv-ebgyvncucvdxd0xrfapj:3384:3574 [6] NCCL INFO Channel 00 : 30[6b010] -> 25[65020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3380:3572 [2] NCCL INFO Channel 00 : 26[67010] -> 27[67020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3383:3578 [5] NCCL INFO Connected all rings iv-ebgyvncucvdxd0xrfapj:3384:3574 [6] NCCL INFO Channel 01 : 30[6b010] -> 25[65020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3380:3572 [2] NCCL INFO Channel 01 : 26[67010] -> 27[67020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3382:3582 [4] NCCL INFO Connected all rings iv-ebgyvncucvdxd0xrfapj:3382:3582 [4] NCCL INFO Channel 00 : 28[69010] -> 29[69020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3384:3574 [6] NCCL INFO Connected all trees iv-ebgyvncucvdxd0xrfapj:3384:3574 [6] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 8/8/512 iv-ebgyvncucvdxd0xrfapj:3384:3574 [6] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer iv-ebgyvncucvdxd0xrfapj:3382:3582 [4] NCCL INFO Channel 01 : 28[69010] -> 29[69020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3381:3579 [3] NCCL INFO Channel 00 : 27[67020] -> 25[65020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3381:3579 [3] NCCL INFO Channel 01 : 27[67020] -> 25[65020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3382:3582 [4] NCCL INFO Channel 00 : 20[69010] -> 28[69010] [receive] via NET/IBext/0/GDRDMA iv-ebgyvncucvdxd0xrfapj:3383:3578 [5] NCCL INFO Channel 00 : 29[69020] -> 31[6b020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3380:3572 [2] NCCL INFO Channel 00 : 26[67010] -> 24[65010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3383:3578 [5] NCCL INFO Channel 01 : 29[69020] -> 31[6b020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3378:3580 [0] NCCL INFO Channel 00 : 24[65010] -> 31[6b020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3379:3581 [1] NCCL INFO Connected all trees iv-ebgyvncucvdxd0xrfapj:3379:3581 [1] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 8/8/512 
iv-ebgyvncucvdxd0xrfapj:3379:3581 [1] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer iv-ebgyvncucvdxd0xrfapj:3381:3579 [3] NCCL INFO Connected all trees iv-ebgyvncucvdxd0xrfapj:3381:3579 [3] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 8/8/512 iv-ebgyvncucvdxd0xrfapj:3381:3579 [3] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer iv-ebgyvncucvdxd0xrfapj:3380:3572 [2] NCCL INFO Channel 01 : 26[67010] -> 24[65010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3378:3580 [0] NCCL INFO Channel 01 : 24[65010] -> 31[6b020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3379:3581 [1] NCCL INFO Channel 01 : 25[65020] -> 28[69010] via P2P/indirect/30[6b010] iv-ebgyvncucvdxd0xrfapj:3381:3579 [3] NCCL INFO Channel 00 : 27[67020] -> 29[69020] via P2P/indirect/28[69010] iv-ebgyvncucvdxd0xrfapj:3385:3577 [7] NCCL INFO Channel 00 : 31[6b020] -> 29[69020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3382:3582 [4] NCCL INFO Channel 01 : 12[69010] -> 28[69010] [receive] via NET/IBext/0/GDRDMA iv-ebgyvncucvdxd0xrfapj:3385:3577 [7] NCCL INFO Channel 01 : 31[6b020] -> 29[69020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3382:3582 [4] NCCL INFO Channel 01 : 28[69010] -> 12[69010] [send] via NET/IBext/0/GDRDMA iv-ebgyvncucvdxd0xrfapj:3378:3580 [0] NCCL INFO Connected all trees iv-ebgyvncucvdxd0xrfapj:3378:3580 [0] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 8/8/512 iv-ebgyvncucvdxd0xrfapj:3378:3580 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer iv-ebgyvncucvdxd0xrfapj:3380:3572 [2] NCCL INFO Connected all trees iv-ebgyvncucvdxd0xrfapj:3380:3572 [2] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 8/8/512 iv-ebgyvncucvdxd0xrfapj:3380:3572 [2] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer iv-ebgyvncucvdxd0xrfapj:3385:3577 [7] NCCL INFO Connected all trees iv-ebgyvncucvdxd0xrfapj:3385:3577 [7] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 8/8/512 iv-ebgyvncucvdxd0xrfapj:3385:3577 [7] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels 
per peer iv-ebgyvncucvdxd0xrfapj:3378:3580 [0] NCCL INFO Channel 00 : 24[65010] -> 28[69010] via P2P/indirect/27[67020] iv-ebgyvncucvdxd0xrfapj:3383:3578 [5] NCCL INFO Channel 00 : 29[69020] -> 28[69010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3380:3572 [2] NCCL INFO Channel 00 : 26[67010] -> 28[69010] via P2P/indirect/29[69020] iv-ebgyvncucvdxd0xrfapj:3383:3578 [5] NCCL INFO Channel 01 : 29[69020] -> 28[69010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3382:3582 [4] NCCL INFO Channel 00 : 28[69010] -> 20[69010] [send] via NET/IBext/0/GDRDMA iv-ebgyvncucvdxd0xrfapj:3382:3582 [4] NCCL INFO Connected all trees iv-ebgyvncucvdxd0xrfapj:3382:3582 [4] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 8/8/512 iv-ebgyvncucvdxd0xrfapj:3382:3582 [4] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer iv-ebgyvncucvdxd0xrfapj:3383:3578 [5] NCCL INFO Connected all trees iv-ebgyvncucvdxd0xrfapj:3383:3578 [5] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 8/8/512 iv-ebgyvncucvdxd0xrfapj:3383:3578 [5] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer iv-ebgyvncucvdxd0xrfapj:3381:3579 [3] NCCL INFO Channel 01 : 27[67020] -> 30[6b010] via P2P/indirect/25[65020] iv-ebgyvncucvdxd0xrfapj:3380:3572 [2] NCCL INFO Channel 00 : 26[67010] -> 30[6b010] via P2P/indirect/25[65020] iv-ebgyvncucvdxd0xrfapj:3381:3579 [3] NCCL INFO Channel 00 : 27[67020] -> 31[6b020] via P2P/indirect/24[65010] iv-ebgyvncucvdxd0xrfapj:3379:3581 [1] NCCL INFO Channel 00 : 25[65020] -> 29[69020] via P2P/indirect/30[6b010] iv-ebgyvncucvdxd0xrfapj:3380:3572 [2] NCCL INFO Channel 01 : 26[67010] -> 31[6b020] via P2P/indirect/24[65010] iv-ebgyvncucvdxd0xrfapj:3382:3582 [4] NCCL INFO Channel 00 : 28[69010] -> 24[65010] via P2P/indirect/31[6b020] iv-ebgyvncucvdxd0xrfapj:3379:3581 [1] NCCL INFO Channel 00 : 25[65020] -> 31[6b020] via P2P/indirect/24[65010] iv-ebgyvncucvdxd0xrfapj:3378:3580 [0] NCCL INFO Channel 01 : 24[65010] -> 29[69020] via P2P/indirect/31[6b020] iv-ebgyvncucvdxd0xrfapj:3378:3580 [0] NCCL 
INFO Channel 00 : 24[65010] -> 30[6b010] via P2P/indirect/25[65020] iv-ebgyvncucvdxd0xrfapj:3383:3578 [5] NCCL INFO Channel 01 : 29[69020] -> 24[65010] via P2P/indirect/31[6b020] iv-ebgyvncucvdxd0xrfapj:3385:3577 [7] NCCL INFO Channel 00 : 31[6b020] -> 25[65020] via P2P/indirect/30[6b010] iv-ebgyvncucvdxd0xrfapj:3384:3574 [6] NCCL INFO Channel 00 : 30[6b010] -> 24[65010] via P2P/indirect/31[6b020] iv-ebgyvncucvdxd0xrfapj:3384:3574 [6] NCCL INFO Channel 00 : 30[6b010] -> 26[67010] via P2P/indirect/29[69020] iv-ebgyvncucvdxd0xrfapj:3385:3577 [7] NCCL INFO Channel 01 : 31[6b020] -> 26[67010] via P2P/indirect/24[65010] iv-ebgyvncucvdxd0xrfapj:3385:3577 [7] NCCL INFO Channel 00 : 31[6b020] -> 27[67020] via P2P/indirect/24[65010] iv-ebgyvncucvdxd0xrfapj:3383:3578 [5] NCCL INFO Channel 00 : 29[69020] -> 25[65020] via P2P/indirect/30[6b010] iv-ebgyvncucvdxd0xrfapj:3384:3574 [6] NCCL INFO Channel 01 : 30[6b010] -> 27[67020] via P2P/indirect/25[65020] iv-ebgyvncucvdxd0xrfapj:3383:3578 [5] NCCL INFO Channel 00 : 29[69020] -> 27[67020] via P2P/indirect/26[67010] iv-ebgyvncucvdxd0xrfapj:3382:3582 [4] NCCL INFO Channel 01 : 28[69010] -> 25[65020] via P2P/indirect/30[6b010] iv-ebgyvncucvdxd0xrfapj:3382:3582 [4] NCCL INFO Channel 00 : 28[69010] -> 26[67010] via P2P/indirect/27[67020] iv-ebgyvncucvdxd0xrfapj:3380:3572 [2] NCCL INFO comm 0x7f8378008fb0 rank 26 nranks 32 cudaDev 2 busId 67010 - Init COMPLETE iv-ebgyvncucvdxd0xrfapj:3383:3578 [5] NCCL INFO comm 0x7fda6c008fb0 rank 29 nranks 32 cudaDev 5 busId 69020 - Init COMPLETE iv-ebgyvncucvdxd0xrfapj:3385:3577 [7] NCCL INFO comm 0x7f65b4008fb0 rank 31 nranks 32 cudaDev 7 busId 6b020 - Init COMPLETE iv-ebgyvncucvdxd0xrfapj:3378:3580 [0] NCCL INFO comm 0x7f3e18008fb0 rank 24 nranks 32 cudaDev 0 busId 65010 - Init COMPLETE iv-ebgyvncucvdxd0xrfapj:3379:3581 [1] NCCL INFO comm 0x7f8338008fb0 rank 25 nranks 32 cudaDev 1 busId 65020 - Init COMPLETE iv-ebgyvncucvdxd0xrfapj:3384:3574 [6] NCCL INFO comm 0x7f79b4008fb0 rank 30 nranks 32 
cudaDev 6 busId 6b010 - Init COMPLETE iv-ebgyvncucvdxd0xrfapj:3382:3582 [4] NCCL INFO comm 0x7f21d4008fb0 rank 28 nranks 32 cudaDev 4 busId 69010 - Init COMPLETE iv-ebgyvncucvdxd0xrfapj:3381:3579 [3] NCCL INFO comm 0x7f97bc008fb0 rank 27 nranks 32 cudaDev 3 busId 67020 - Init COMPLETE iv-ebgyvncucvdxd0xrfapj:3378:3622 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 iv-ebgyvncucvdxd0xrfapj:3378:3622 [0] NCCL INFO Setting affinity for GPU 0 to 03ff,ffffffff iv-ebgyvncucvdxd0xrfapj:3378:3622 [0] NCCL INFO Channel 00 : 0[65010] -> 1[65010] [receive] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3378:3622 [0] NCCL INFO Channel 01 : 0[65010] -> 1[65010] [receive] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3381:3623 [3] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 iv-ebgyvncucvdxd0xrfapj:3381:3623 [3] NCCL INFO Setting affinity for GPU 3 to 03ff,ffffffff iv-ebgyvncucvdxd0xrfapj:3384:3624 [6] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 iv-ebgyvncucvdxd0xrfapj:3384:3624 [6] NCCL INFO Setting affinity for GPU 6 to 0fffff,fffffc00,00000000 iv-ebgyvncucvdxd0xrfapj:3378:3622 [0] NCCL INFO Channel 00 : 1[65010] -> 0[65010] [send] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3381:3623 [3] NCCL INFO Channel 00 : 0[67020] -> 1[67020] [receive] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3379:3627 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 iv-ebgyvncucvdxd0xrfapj:3379:3627 [1] NCCL INFO Setting affinity for GPU 1 to 03ff,ffffffff iv-ebgyvncucvdxd0xrfapj:3383:3628 [5] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 iv-ebgyvncucvdxd0xrfapj:3383:3628 [5] NCCL INFO Setting affinity for GPU 5 to 0fffff,fffffc00,00000000 iv-ebgyvncucvdxd0xrfapj:3382:3626 [4] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 iv-ebgyvncucvdxd0xrfapj:3382:3626 [4] NCCL INFO Setting affinity for GPU 4 to 0fffff,fffffc00,00000000 iv-ebgyvncucvdxd0xrfapj:3384:3624 [6] NCCL INFO Channel 00 : 0[6b010] -> 1[6b010] [receive] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3378:3622 [0] 
NCCL INFO Channel 01 : 1[65010] -> 0[65010] [send] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3380:3634 [2] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 iv-ebgyvncucvdxd0xrfapj:3380:3634 [2] NCCL INFO Setting affinity for GPU 2 to 03ff,ffffffff iv-ebgyvncucvdxd0xrfapj:3383:3628 [5] NCCL INFO Channel 00 : 0[69020] -> 1[69020] [receive] via NET/IBext/0/GDRDMA iv-ebgyvncucvdxd0xrfapj:3381:3623 [3] NCCL INFO Channel 01 : 0[67020] -> 1[67020] [receive] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3383:3628 [5] NCCL INFO Channel 01 : 0[69020] -> 1[69020] [receive] via NET/IBext/0/GDRDMA iv-ebgyvncucvdxd0xrfapj:3382:3626 [4] NCCL INFO Channel 00 : 0[69010] -> 1[69010] [receive] via NET/IBext/0/GDRDMA iv-ebgyvncucvdxd0xrfapj:3385:3635 [7] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 iv-ebgyvncucvdxd0xrfapj:3385:3635 [7] NCCL INFO Setting affinity for GPU 7 to 0fffff,fffffc00,00000000 iv-ebgyvncucvdxd0xrfapj:3383:3628 [5] NCCL INFO Channel 00 : 1[69020] -> 0[69020] [send] via NET/IBext/0/GDRDMA iv-ebgyvncucvdxd0xrfapj:3382:3626 [4] NCCL INFO Channel 01 : 0[69010] -> 1[69010] [receive] via NET/IBext/0/GDRDMA iv-ebgyvncucvdxd0xrfapj:3384:3624 [6] NCCL INFO Channel 01 : 0[6b010] -> 1[6b010] [receive] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3379:3627 [1] NCCL INFO Channel 00 : 0[65020] -> 1[65020] [receive] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3382:3626 [4] NCCL INFO Channel 00 : 1[69010] -> 0[69010] [send] via NET/IBext/0/GDRDMA iv-ebgyvncucvdxd0xrfapj:3383:3628 [5] NCCL INFO Channel 01 : 1[69020] -> 0[69020] [send] via NET/IBext/0/GDRDMA iv-ebgyvncucvdxd0xrfapj:3382:3626 [4] NCCL INFO Channel 01 : 1[69010] -> 0[69010] [send] via NET/IBext/0/GDRDMA iv-ebgyvncucvdxd0xrfapj:3381:3623 [3] NCCL INFO Channel 00 : 1[67020] -> 0[67020] [send] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3380:3634 [2] NCCL INFO Channel 00 : 0[67010] -> 1[67010] [receive] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3378:3622 [0] NCCL INFO Connected all rings iv-ebgyvncucvdxd0xrfapj:3378:3622 [0] 
NCCL INFO Connected all trees iv-ebgyvncucvdxd0xrfapj:3378:3622 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512 iv-ebgyvncucvdxd0xrfapj:3378:3622 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer iv-ebgyvncucvdxd0xrfapj:3378:3622 [0] NCCL INFO comm 0x7f3df8008fb0 rank 1 nranks 2 cudaDev 0 busId 65010 - Init COMPLETE > number of parameters on (tensor, pipeline) model parallel rank (0, 3): 31900160 iv-ebgyvncucvdxd0xrfapj:3384:3624 [6] NCCL INFO Channel 00 : 1[6b010] -> 0[6b010] [send] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3385:3635 [7] NCCL INFO Channel 00 : 0[6b020] -> 1[6b020] [receive] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3381:3623 [3] NCCL INFO Channel 01 : 1[67020] -> 0[67020] [send] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3380:3634 [2] NCCL INFO Channel 01 : 0[67010] -> 1[67010] [receive] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3379:3627 [1] NCCL INFO Channel 01 : 0[65020] -> 1[65020] [receive] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3385:3635 [7] NCCL INFO Channel 01 : 0[6b020] -> 1[6b020] [receive] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3384:3624 [6] NCCL INFO Channel 01 : 1[6b010] -> 0[6b010] [send] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3380:3634 [2] NCCL INFO Channel 00 : 1[67010] -> 0[67010] [send] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3379:3627 [1] NCCL INFO Channel 00 : 1[65020] -> 0[65020] [send] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3381:3623 [3] NCCL INFO Connected all rings iv-ebgyvncucvdxd0xrfapj:3381:3623 [3] NCCL INFO Connected all trees iv-ebgyvncucvdxd0xrfapj:3381:3623 [3] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512 iv-ebgyvncucvdxd0xrfapj:3381:3623 [3] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer iv-ebgyvncucvdxd0xrfapj:3381:3623 [3] NCCL INFO comm 0x7f97a4008fb0 rank 1 nranks 2 cudaDev 3 busId 67020 - Init COMPLETE > number of parameters on (tensor, pipeline) model parallel rank (3, 3): 31900160 iv-ebgyvncucvdxd0xrfapj:3385:3635 [7] NCCL INFO Channel 00 : 1[6b020] -> 
0[6b020] [send] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3379:3627 [1] NCCL INFO Channel 01 : 1[65020] -> 0[65020] [send] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3380:3634 [2] NCCL INFO Channel 01 : 1[67010] -> 0[67010] [send] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3384:3624 [6] NCCL INFO Connected all rings iv-ebgyvncucvdxd0xrfapj:3384:3624 [6] NCCL INFO Connected all trees iv-ebgyvncucvdxd0xrfapj:3384:3624 [6] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512 iv-ebgyvncucvdxd0xrfapj:3384:3624 [6] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer iv-ebgyvncucvdxd0xrfapj:3384:3624 [6] NCCL INFO comm 0x7f79a0008fb0 rank 1 nranks 2 cudaDev 6 busId 6b010 - Init COMPLETE iv-ebgyvncucvdxd0xrfapj:3379:3627 [1] NCCL INFO Connected all rings iv-ebgyvncucvdxd0xrfapj:3379:3627 [1] NCCL INFO Connected all trees iv-ebgyvncucvdxd0xrfapj:3379:3627 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512 iv-ebgyvncucvdxd0xrfapj:3379:3627 [1] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer iv-ebgyvncucvdxd0xrfapj:3379:3627 [1] NCCL INFO comm 0x7f8320008fb0 rank 1 nranks 2 cudaDev 1 busId 65020 - Init COMPLETE iv-ebgyvncucvdxd0xrfapj:3380:3634 [2] NCCL INFO Connected all rings iv-ebgyvncucvdxd0xrfapj:3380:3634 [2] NCCL INFO Connected all trees iv-ebgyvncucvdxd0xrfapj:3380:3634 [2] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512 iv-ebgyvncucvdxd0xrfapj:3380:3634 [2] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer iv-ebgyvncucvdxd0xrfapj:3380:3634 [2] NCCL INFO comm 0x7f836c008fb0 rank 1 nranks 2 cudaDev 2 busId 67010 - Init COMPLETE > number of parameters on (tensor, pipeline) model parallel rank (1, 3): 31900160 > number of parameters on (tensor, pipeline) model parallel rank (2, 3): 31900160 iv-ebgyvncucvdxd0xrfapj:3385:3635 [7] NCCL INFO Channel 01 : 1[6b020] -> 0[6b020] [send] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3385:3635 [7] NCCL INFO Connected all rings iv-ebgyvncucvdxd0xrfapj:3385:3635 [7] NCCL INFO 
Connected all trees iv-ebgyvncucvdxd0xrfapj:3385:3635 [7] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512 iv-ebgyvncucvdxd0xrfapj:3385:3635 [7] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer iv-ebgyvncucvdxd0xrfapj:3385:3635 [7] NCCL INFO comm 0x7f6594008fb0 rank 1 nranks 2 cudaDev 7 busId 6b020 - Init COMPLETE iv-ebgyvncucvdxd0xrfapj:3383:3628 [5] NCCL INFO Connected all rings iv-ebgyvncucvdxd0xrfapj:3383:3628 [5] NCCL INFO Connected all trees iv-ebgyvncucvdxd0xrfapj:3383:3628 [5] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512 iv-ebgyvncucvdxd0xrfapj:3383:3628 [5] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer iv-ebgyvncucvdxd0xrfapj:3383:3628 [5] NCCL INFO comm 0x7fda44008fb0 rank 1 nranks 2 cudaDev 5 busId 69020 - Init COMPLETE iv-ebgyvncucvdxd0xrfapj:3382:3626 [4] NCCL INFO Connected all rings iv-ebgyvncucvdxd0xrfapj:3382:3626 [4] NCCL INFO Connected all trees iv-ebgyvncucvdxd0xrfapj:3382:3626 [4] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512 iv-ebgyvncucvdxd0xrfapj:3382:3626 [4] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer iv-ebgyvncucvdxd0xrfapj:3382:3626 [4] NCCL INFO comm 0x7f21b4008fb0 rank 1 nranks 2 cudaDev 4 busId 69010 - Init COMPLETE NCCL version 2.10.3+cuda11.4 iv-ebgyvncucvdxd0xrfapj:3378:3647 [0] NCCL INFO Channel 00/02 : 0 1 iv-ebgyvncucvdxd0xrfapj:3382:3648 [4] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 iv-ebgyvncucvdxd0xrfapj:3378:3647 [0] NCCL INFO Channel 01/02 : 0 1 iv-ebgyvncucvdxd0xrfapj:3382:3648 [4] NCCL INFO Setting affinity for GPU 4 to 0fffff,fffffc00,00000000 iv-ebgyvncucvdxd0xrfapj:3378:3647 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 iv-ebgyvncucvdxd0xrfapj:3378:3647 [0] NCCL INFO Setting affinity for GPU 0 to 03ff,ffffffff iv-ebgyvncucvdxd0xrfapj:3378:3647 [0] NCCL INFO Channel 00 : 0[65010] -> 1[69010] via direct shared memory iv-ebgyvncucvdxd0xrfapj:3382:3648 [4] NCCL INFO Channel 00 : 1[69010] -> 0[65010] via direct shared 
memory iv-ebgyvncucvdxd0xrfapj:3378:3647 [0] NCCL INFO Channel 01 : 0[65010] -> 1[69010] via direct shared memory iv-ebgyvncucvdxd0xrfapj:3382:3648 [4] NCCL INFO Channel 01 : 1[69010] -> 0[65010] via direct shared memory iv-ebgyvncucvdxd0xrfapj:3382:3648 [4] NCCL INFO Connected all rings iv-ebgyvncucvdxd0xrfapj:3382:3648 [4] NCCL INFO Connected all trees iv-ebgyvncucvdxd0xrfapj:3382:3648 [4] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512 iv-ebgyvncucvdxd0xrfapj:3382:3648 [4] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer iv-ebgyvncucvdxd0xrfapj:3378:3647 [0] NCCL INFO Connected all rings iv-ebgyvncucvdxd0xrfapj:3378:3647 [0] NCCL INFO Connected all trees iv-ebgyvncucvdxd0xrfapj:3378:3647 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512 iv-ebgyvncucvdxd0xrfapj:3378:3647 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer iv-ebgyvncucvdxd0xrfapj:3382:3648 [4] NCCL INFO comm 0x7f21b40be010 rank 1 nranks 2 cudaDev 4 busId 69010 - Init COMPLETE iv-ebgyvncucvdxd0xrfapj:3378:3647 [0] NCCL INFO comm 0x7f3dd0008fb0 rank 0 nranks 2 cudaDev 0 busId 65010 - Init COMPLETE iv-ebgyvncucvdxd0xrfapj:3378:3378 [0] NCCL INFO Launch mode Parallel iv-ebgyvncucvdxd0xrfapj:3378:3653 [0] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] 1/-1/-1->3->-1 iv-ebgyvncucvdxd0xrfapj:3378:3653 [0] NCCL INFO Setting affinity for GPU 0 to 03ff,ffffffff iv-ebgyvncucvdxd0xrfapj:3378:3653 [0] NCCL INFO Channel 00 : 2[65010] -> 3[65010] [receive] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3382:3654 [4] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] 1/-1/-1->3->-1 iv-ebgyvncucvdxd0xrfapj:3382:3654 [4] NCCL INFO Setting affinity for GPU 4 to 0fffff,fffffc00,00000000 iv-ebgyvncucvdxd0xrfapj:3382:3654 [4] NCCL INFO Channel 00 : 2[69010] -> 3[69010] [receive] via NET/IBext/0/GDRDMA iv-ebgyvncucvdxd0xrfapj:3382:3654 [4] NCCL INFO Channel 01 : 2[69010] -> 3[69010] [receive] via NET/IBext/0/GDRDMA iv-ebgyvncucvdxd0xrfapj:3378:3653 [0] NCCL INFO Channel 01 : 2[65010] -> 3[65010] 
[receive] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3382:3654 [4] NCCL INFO Channel 00 : 3[69010] -> 0[69010] [send] via NET/IBext/0/GDRDMA iv-ebgyvncucvdxd0xrfapj:3382:3654 [4] NCCL INFO Channel 01 : 3[69010] -> 0[69010] [send] via NET/IBext/0/GDRDMA iv-ebgyvncucvdxd0xrfapj:3378:3653 [0] NCCL INFO Channel 00 : 3[65010] -> 0[65010] [send] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3378:3653 [0] NCCL INFO Channel 01 : 3[65010] -> 0[65010] [send] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3378:3653 [0] NCCL INFO Connected all rings iv-ebgyvncucvdxd0xrfapj:3378:3653 [0] NCCL INFO Channel 01 : 1[65010] -> 3[65010] [receive] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3378:3653 [0] NCCL INFO Channel 01 : 3[65010] -> 1[65010] [send] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3382:3654 [4] NCCL INFO Connected all rings iv-ebgyvncucvdxd0xrfapj:3378:3653 [0] NCCL INFO Channel 00 : 3[65010] -> 2[65010] [send] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3382:3654 [4] NCCL INFO Channel 01 : 1[69010] -> 3[69010] [receive] via NET/IBext/0/GDRDMA iv-ebgyvncucvdxd0xrfapj:3382:3654 [4] NCCL INFO Channel 01 : 3[69010] -> 1[69010] [send] via NET/IBext/0/GDRDMA iv-ebgyvncucvdxd0xrfapj:3378:3653 [0] NCCL INFO Connected all trees iv-ebgyvncucvdxd0xrfapj:3378:3653 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512 iv-ebgyvncucvdxd0xrfapj:3378:3653 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer iv-ebgyvncucvdxd0xrfapj:3378:3653 [0] NCCL INFO comm 0x7f3dd00cd430 rank 3 nranks 4 cudaDev 0 busId 65010 - Init COMPLETE iv-ebgyvncucvdxd0xrfapj:3382:3654 [4] NCCL INFO Channel 00 : 3[69010] -> 2[69010] [send] via NET/IBext/0/GDRDMA iv-ebgyvncucvdxd0xrfapj:3382:3654 [4] NCCL INFO Connected all trees iv-ebgyvncucvdxd0xrfapj:3382:3654 [4] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512 iv-ebgyvncucvdxd0xrfapj:3382:3654 [4] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer iv-ebgyvncucvdxd0xrfapj:3382:3654 [4] NCCL INFO comm 0x7f21b410efc0 rank 3 nranks 4 cudaDev 4 
busId 69010 - Init COMPLETE NCCL version 2.10.3+cuda11.4 iv-ebgyvncucvdxd0xrfapj:3382:3660 [4] NCCL INFO Channel 00/08 : 0 1 3 2 iv-ebgyvncucvdxd0xrfapj:3382:3660 [4] NCCL INFO Channel 01/08 : 0 3 1 2 iv-ebgyvncucvdxd0xrfapj:3382:3660 [4] NCCL INFO Channel 02/08 : 0 2 3 1 iv-ebgyvncucvdxd0xrfapj:3382:3660 [4] NCCL INFO Channel 03/08 : 0 2 1 3 iv-ebgyvncucvdxd0xrfapj:3382:3660 [4] NCCL INFO Channel 04/08 : 0 1 3 2 iv-ebgyvncucvdxd0xrfapj:3382:3660 [4] NCCL INFO Channel 05/08 : 0 3 1 2 iv-ebgyvncucvdxd0xrfapj:3382:3660 [4] NCCL INFO Channel 06/08 : 0 2 3 1 iv-ebgyvncucvdxd0xrfapj:3383:3662 [5] NCCL INFO Trees [0] 3/-1/-1->1->0 [1] 0/-1/-1->1->3 [2] 3/-1/-1->1->0 [3] 0/-1/-1->1->3 [4] 3/-1/-1->1->0 [5] 0/-1/-1->1->3 [6] 3/-1/-1->1->0 [7] 0/-1/-1->1->3 iv-ebgyvncucvdxd0xrfapj:3382:3660 [4] NCCL INFO Channel 07/08 : 0 2 1 3 iv-ebgyvncucvdxd0xrfapj:3383:3662 [5] NCCL INFO Setting affinity for GPU 5 to 0fffff,fffffc00,00000000 iv-ebgyvncucvdxd0xrfapj:3384:3663 [6] NCCL INFO Trees [0] 0/-1/-1->2->-1 [1] -1/-1/-1->2->0 [2] 0/-1/-1->2->-1 [3] -1/-1/-1->2->0 [4] 0/-1/-1->2->-1 [5] -1/-1/-1->2->0 [6] 0/-1/-1->2->-1 [7] -1/-1/-1->2->0 iv-ebgyvncucvdxd0xrfapj:3382:3660 [4] NCCL INFO Trees [0] 1/-1/-1->0->2 [1] 2/-1/-1->0->1 [2] 1/-1/-1->0->2 [3] 2/-1/-1->0->1 [4] 1/-1/-1->0->2 [5] 2/-1/-1->0->1 [6] 1/-1/-1->0->2 [7] 2/-1/-1->0->1 iv-ebgyvncucvdxd0xrfapj:3384:3663 [6] NCCL INFO Setting affinity for GPU 6 to 0fffff,fffffc00,00000000 iv-ebgyvncucvdxd0xrfapj:3382:3660 [4] NCCL INFO Setting affinity for GPU 4 to 0fffff,fffffc00,00000000 iv-ebgyvncucvdxd0xrfapj:3385:3661 [7] NCCL INFO Trees [0] -1/-1/-1->3->1 [1] 1/-1/-1->3->-1 [2] -1/-1/-1->3->1 [3] 1/-1/-1->3->-1 [4] -1/-1/-1->3->1 [5] 1/-1/-1->3->-1 [6] -1/-1/-1->3->1 [7] 1/-1/-1->3->-1 iv-ebgyvncucvdxd0xrfapj:3385:3661 [7] NCCL INFO Setting affinity for GPU 7 to 0fffff,fffffc00,00000000 iv-ebgyvncucvdxd0xrfapj:3382:3660 [4] NCCL INFO Channel 00 : 0[69010] -> 1[69020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3383:3662 [5] NCCL INFO 
Channel 01 : 1[69020] -> 2[6b010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3382:3660 [4] NCCL INFO Channel 04 : 0[69010] -> 1[69020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3383:3662 [5] NCCL INFO Channel 05 : 1[69020] -> 2[6b010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3385:3661 [7] NCCL INFO Channel 03 : 3[6b020] -> 0[69010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3385:3661 [7] NCCL INFO Channel 07 : 3[6b020] -> 0[69010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3384:3663 [6] NCCL INFO Channel 02 : 2[6b010] -> 3[6b020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3384:3663 [6] NCCL INFO Channel 06 : 2[6b010] -> 3[6b020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3382:3660 [4] NCCL INFO Channel 02 : 0[69010] -> 2[6b010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3382:3660 [4] NCCL INFO Channel 03 : 0[69010] -> 2[6b010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3383:3662 [5] NCCL INFO Channel 00 : 1[69020] -> 3[6b020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3384:3663 [6] NCCL INFO Channel 00 : 2[6b010] -> 0[69010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3382:3660 [4] NCCL INFO Channel 06 : 0[69010] -> 2[6b010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3385:3661 [7] NCCL INFO Channel 01 : 3[6b020] -> 1[69020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3383:3662 [5] NCCL INFO Channel 03 : 1[69020] -> 3[6b020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3382:3660 [4] NCCL INFO Channel 07 : 0[69010] -> 2[6b010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3384:3663 [6] NCCL INFO Channel 01 : 2[6b010] -> 0[69010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3385:3661 [7] NCCL INFO Channel 02 : 3[6b020] -> 1[69020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3383:3662 [5] NCCL INFO Channel 04 : 1[69020] -> 3[6b020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3384:3663 [6] NCCL INFO Channel 04 : 2[6b010] -> 0[69010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3385:3661 [7] NCCL INFO Channel 05 : 3[6b020] -> 1[69020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3383:3662 [5] NCCL INFO Channel 07 : 1[69020] -> 3[6b020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3384:3663 [6] NCCL INFO Channel 05 : 2[6b010] -> 
0[69010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3385:3661 [7] NCCL INFO Channel 06 : 3[6b020] -> 1[69020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3382:3660 [4] NCCL INFO Channel 01 : 0[69010] -> 3[6b020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3384:3663 [6] NCCL INFO Channel 03 : 2[6b010] -> 1[69020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3383:3662 [5] NCCL INFO Channel 02 : 1[69020] -> 0[69010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3385:3661 [7] NCCL INFO Channel 00 : 3[6b020] -> 2[6b010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3382:3660 [4] NCCL INFO Channel 05 : 0[69010] -> 3[6b020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3384:3663 [6] NCCL INFO Channel 07 : 2[6b010] -> 1[69020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3383:3662 [5] NCCL INFO Channel 06 : 1[69020] -> 0[69010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3385:3661 [7] NCCL INFO Channel 04 : 3[6b020] -> 2[6b010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3382:3660 [4] NCCL INFO Connected all rings iv-ebgyvncucvdxd0xrfapj:3383:3662 [5] NCCL INFO Connected all rings iv-ebgyvncucvdxd0xrfapj:3385:3661 [7] NCCL INFO Connected all rings iv-ebgyvncucvdxd0xrfapj:3382:3660 [4] NCCL INFO Channel 01 : 0[69010] -> 1[69020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3382:3660 [4] NCCL INFO Channel 02 : 0[69010] -> 1[69020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3384:3663 [6] NCCL INFO Connected all rings iv-ebgyvncucvdxd0xrfapj:3382:3660 [4] NCCL INFO Channel 03 : 0[69010] -> 1[69020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3382:3660 [4] NCCL INFO Channel 05 : 0[69010] -> 1[69020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3382:3660 [4] NCCL INFO Channel 06 : 0[69010] -> 1[69020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3385:3661 [7] NCCL INFO Channel 00 : 3[6b020] -> 1[69020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3382:3660 [4] NCCL INFO Channel 07 : 0[69010] -> 1[69020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3385:3661 [7] NCCL INFO Channel 03 : 3[6b020] -> 1[69020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3385:3661 [7] NCCL INFO Channel 04 : 3[6b020] -> 1[69020] via P2P/IPC 
iv-ebgyvncucvdxd0xrfapj:3384:3663 [6] NCCL INFO Channel 02 : 2[6b010] -> 0[69010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3385:3661 [7] NCCL INFO Channel 07 : 3[6b020] -> 1[69020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3384:3663 [6] NCCL INFO Channel 03 : 2[6b010] -> 0[69010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3378:3665 [0] NCCL INFO Channel 00/08 : 0 2 3 1 iv-ebgyvncucvdxd0xrfapj:3378:3665 [0] NCCL INFO Channel 01/08 : 0 2 1 3 iv-ebgyvncucvdxd0xrfapj:3378:3665 [0] NCCL INFO Channel 02/08 : 0 1 3 2 iv-ebgyvncucvdxd0xrfapj:3378:3665 [0] NCCL INFO Channel 03/08 : 0 3 1 2 iv-ebgyvncucvdxd0xrfapj:3378:3665 [0] NCCL INFO Channel 04/08 : 0 2 3 1 iv-ebgyvncucvdxd0xrfapj:3378:3665 [0] NCCL INFO Channel 05/08 : 0 2 1 3 iv-ebgyvncucvdxd0xrfapj:3378:3665 [0] NCCL INFO Channel 06/08 : 0 1 3 2 iv-ebgyvncucvdxd0xrfapj:3378:3665 [0] NCCL INFO Channel 07/08 : 0 3 1 2 iv-ebgyvncucvdxd0xrfapj:3384:3663 [6] NCCL INFO Channel 06 : 2[6b010] -> 0[69010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3378:3665 [0] NCCL INFO Trees [0] 2/-1/-1->0->-1 [1] -1/-1/-1->0->2 [2] 2/-1/-1->0->-1 [3] -1/-1/-1->0->2 [4] 2/-1/-1->0->-1 [5] -1/-1/-1->0->2 [6] 2/-1/-1->0->-1 [7] -1/-1/-1->0->2 iv-ebgyvncucvdxd0xrfapj:3378:3665 [0] NCCL INFO Setting affinity for GPU 0 to 03ff,ffffffff iv-ebgyvncucvdxd0xrfapj:3379:3668 [1] NCCL INFO Trees [0] -1/-1/-1->1->3 [1] 3/-1/-1->1->-1 [2] -1/-1/-1->1->3 [3] 3/-1/-1->1->-1 [4] -1/-1/-1->1->3 [5] 3/-1/-1->1->-1 [6] -1/-1/-1->1->3 [7] 3/-1/-1->1->-1 iv-ebgyvncucvdxd0xrfapj:3381:3666 [3] NCCL INFO Trees [0] 1/-1/-1->3->2 [1] 2/-1/-1->3->1 [2] 1/-1/-1->3->2 [3] 2/-1/-1->3->1 [4] 1/-1/-1->3->2 [5] 2/-1/-1->3->1 [6] 1/-1/-1->3->2 [7] 2/-1/-1->3->1 iv-ebgyvncucvdxd0xrfapj:3379:3668 [1] NCCL INFO Setting affinity for GPU 1 to 03ff,ffffffff iv-ebgyvncucvdxd0xrfapj:3381:3666 [3] NCCL INFO Setting affinity for GPU 3 to 03ff,ffffffff iv-ebgyvncucvdxd0xrfapj:3380:3667 [2] NCCL INFO Trees [0] 3/-1/-1->2->0 [1] 0/-1/-1->2->3 [2] 3/-1/-1->2->0 [3] 0/-1/-1->2->3 [4] 3/-1/-1->2->0 [5] 0/-1/-1->2->3 
[6] 3/-1/-1->2->0 [7] 0/-1/-1->2->3 iv-ebgyvncucvdxd0xrfapj:3380:3667 [2] NCCL INFO Setting affinity for GPU 2 to 03ff,ffffffff iv-ebgyvncucvdxd0xrfapj:3384:3663 [6] NCCL INFO Channel 07 : 2[6b010] -> 0[69010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3382:3660 [4] NCCL INFO Channel 00 : 0[69010] -> 2[6b010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3383:3662 [5] NCCL INFO Channel 01 : 1[69020] -> 3[6b020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3382:3660 [4] NCCL INFO Channel 01 : 0[69010] -> 2[6b010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3383:3662 [5] NCCL INFO Channel 02 : 1[69020] -> 3[6b020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3382:3660 [4] NCCL INFO Channel 04 : 0[69010] -> 2[6b010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3383:3662 [5] NCCL INFO Channel 05 : 1[69020] -> 3[6b020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3382:3660 [4] NCCL INFO Channel 05 : 0[69010] -> 2[6b010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3383:3662 [5] NCCL INFO Channel 06 : 1[69020] -> 3[6b020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3381:3666 [3] NCCL INFO Channel 01 : 3[67020] -> 0[65010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3378:3665 [0] NCCL INFO Channel 02 : 0[65010] -> 1[65020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3380:3667 [2] NCCL INFO Channel 00 : 2[67010] -> 3[67020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3379:3668 [1] NCCL INFO Channel 03 : 1[65020] -> 2[67010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3381:3666 [3] NCCL INFO Channel 05 : 3[67020] -> 0[65010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3378:3665 [0] NCCL INFO Channel 06 : 0[65010] -> 1[65020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3380:3667 [2] NCCL INFO Channel 04 : 2[67010] -> 3[67020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3379:3668 [1] NCCL INFO Channel 07 : 1[65020] -> 2[67010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3384:3663 [6] NCCL INFO Connected all trees iv-ebgyvncucvdxd0xrfapj:3384:3663 [6] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512 iv-ebgyvncucvdxd0xrfapj:3384:3663 [6] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer 
iv-ebgyvncucvdxd0xrfapj:3385:3661 [7] NCCL INFO Connected all trees iv-ebgyvncucvdxd0xrfapj:3385:3661 [7] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512 iv-ebgyvncucvdxd0xrfapj:3385:3661 [7] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer iv-ebgyvncucvdxd0xrfapj:3383:3662 [5] NCCL INFO Channel 00 : 1[69020] -> 0[69010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3383:3662 [5] NCCL INFO Channel 01 : 1[69020] -> 0[69010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3383:3662 [5] NCCL INFO Channel 03 : 1[69020] -> 0[69010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3383:3662 [5] NCCL INFO Channel 04 : 1[69020] -> 0[69010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3383:3662 [5] NCCL INFO Channel 05 : 1[69020] -> 0[69010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3383:3662 [5] NCCL INFO Channel 07 : 1[69020] -> 0[69010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3378:3665 [0] NCCL INFO Channel 00 : 0[65010] -> 2[67010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3380:3667 [2] NCCL INFO Channel 02 : 2[67010] -> 0[65010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3381:3666 [3] NCCL INFO Channel 00 : 3[67020] -> 1[65020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3379:3668 [1] NCCL INFO Channel 01 : 1[65020] -> 3[67020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3378:3665 [0] NCCL INFO Channel 01 : 0[65010] -> 2[67010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3380:3667 [2] NCCL INFO Channel 03 : 2[67010] -> 0[65010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3381:3666 [3] NCCL INFO Channel 03 : 3[67020] -> 1[65020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3379:3668 [1] NCCL INFO Channel 02 : 1[65020] -> 3[67020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3378:3665 [0] NCCL INFO Channel 04 : 0[65010] -> 2[67010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3380:3667 [2] NCCL INFO Channel 06 : 2[67010] -> 0[65010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3381:3666 [3] NCCL INFO Channel 04 : 3[67020] -> 1[65020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3382:3660 [4] NCCL INFO Connected all trees iv-ebgyvncucvdxd0xrfapj:3382:3660 [4] NCCL INFO threadThresholds 
8/8/64 | 32/8/64 | 8/8/512 iv-ebgyvncucvdxd0xrfapj:3382:3660 [4] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer iv-ebgyvncucvdxd0xrfapj:3383:3662 [5] NCCL INFO Connected all trees iv-ebgyvncucvdxd0xrfapj:3383:3662 [5] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512 iv-ebgyvncucvdxd0xrfapj:3383:3662 [5] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer iv-ebgyvncucvdxd0xrfapj:3379:3668 [1] NCCL INFO Channel 05 : 1[65020] -> 3[67020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3382:3660 [4] NCCL INFO comm 0x7f2178008fb0 rank 0 nranks 4 cudaDev 4 busId 69010 - Init COMPLETE iv-ebgyvncucvdxd0xrfapj:3383:3662 [5] NCCL INFO comm 0x7fda28008fb0 rank 1 nranks 4 cudaDev 5 busId 69020 - Init COMPLETE iv-ebgyvncucvdxd0xrfapj:3385:3661 [7] NCCL INFO comm 0x7f65940d3010 rank 3 nranks 4 cudaDev 7 busId 6b020 - Init COMPLETE iv-ebgyvncucvdxd0xrfapj:3384:3663 [6] NCCL INFO comm 0x7f7980008fb0 rank 2 nranks 4 cudaDev 6 busId 6b010 - Init COMPLETE iv-ebgyvncucvdxd0xrfapj:3378:3665 [0] NCCL INFO Channel 05 : 0[65010] -> 2[67010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3382:3382 [4] NCCL INFO Launch mode Parallel iv-ebgyvncucvdxd0xrfapj:3380:3667 [2] NCCL INFO Channel 07 : 2[67010] -> 0[65010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3381:3666 [3] NCCL INFO Channel 07 : 3[67020] -> 1[65020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3379:3668 [1] NCCL INFO Channel 06 : 1[65020] -> 3[67020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3378:3665 [0] NCCL INFO Channel 03 : 0[65010] -> 3[67020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3380:3667 [2] NCCL INFO Channel 01 : 2[67010] -> 1[65020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3381:3666 [3] NCCL INFO Channel 02 : 3[67020] -> 2[67010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3379:3668 [1] NCCL INFO Channel 00 : 1[65020] -> 0[65010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3380:3667 [2] NCCL INFO Channel 05 : 2[67010] -> 1[65020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3378:3665 [0] NCCL INFO Channel 07 : 0[65010] -> 3[67020] via P2P/IPC 
iv-ebgyvncucvdxd0xrfapj:3381:3666 [3] NCCL INFO Channel 06 : 3[67020] -> 2[67010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3379:3668 [1] NCCL INFO Channel 04 : 1[65020] -> 0[65010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3381:3666 [3] NCCL INFO Connected all rings iv-ebgyvncucvdxd0xrfapj:3380:3667 [2] NCCL INFO Connected all rings iv-ebgyvncucvdxd0xrfapj:3378:3665 [0] NCCL INFO Connected all rings iv-ebgyvncucvdxd0xrfapj:3380:3667 [2] NCCL INFO Channel 01 : 2[67010] -> 3[67020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3379:3668 [1] NCCL INFO Connected all rings iv-ebgyvncucvdxd0xrfapj:3380:3667 [2] NCCL INFO Channel 02 : 2[67010] -> 3[67020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3380:3667 [2] NCCL INFO Channel 03 : 2[67010] -> 3[67020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3380:3667 [2] NCCL INFO Channel 05 : 2[67010] -> 3[67020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3378:3665 [0] NCCL INFO Channel 02 : 0[65010] -> 2[67010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3380:3667 [2] NCCL INFO Channel 06 : 2[67010] -> 3[67020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3380:3667 [2] NCCL INFO Channel 07 : 2[67010] -> 3[67020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3378:3665 [0] NCCL INFO Channel 03 : 0[65010] -> 2[67010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3379:3668 [1] NCCL INFO Channel 00 : 1[65020] -> 3[67020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3378:3665 [0] NCCL INFO Channel 06 : 0[65010] -> 2[67010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3379:3668 [1] NCCL INFO Channel 03 : 1[65020] -> 3[67020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3378:3665 [0] NCCL INFO Channel 07 : 0[65010] -> 2[67010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3379:3668 [1] NCCL INFO Channel 04 : 1[65020] -> 3[67020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3379:3668 [1] NCCL INFO Channel 07 : 1[65020] -> 3[67020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3381:3666 [3] NCCL INFO Channel 01 : 3[67020] -> 1[65020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3380:3667 [2] NCCL INFO Channel 00 : 2[67010] -> 0[65010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3381:3666 
[3] NCCL INFO Channel 02 : 3[67020] -> 1[65020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3380:3667 [2] NCCL INFO Channel 01 : 2[67010] -> 0[65010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3381:3666 [3] NCCL INFO Channel 05 : 3[67020] -> 1[65020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3380:3667 [2] NCCL INFO Channel 04 : 2[67010] -> 0[65010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3381:3666 [3] NCCL INFO Channel 06 : 3[67020] -> 1[65020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3380:3667 [2] NCCL INFO Channel 05 : 2[67010] -> 0[65010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3379:3668 [1] NCCL INFO Connected all trees iv-ebgyvncucvdxd0xrfapj:3379:3668 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512 iv-ebgyvncucvdxd0xrfapj:3379:3668 [1] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer iv-ebgyvncucvdxd0xrfapj:3378:3665 [0] NCCL INFO Connected all trees iv-ebgyvncucvdxd0xrfapj:3378:3665 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512 iv-ebgyvncucvdxd0xrfapj:3378:3665 [0] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer iv-ebgyvncucvdxd0xrfapj:3381:3666 [3] NCCL INFO Channel 00 : 3[67020] -> 2[67010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3381:3666 [3] NCCL INFO Channel 01 : 3[67020] -> 2[67010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3381:3666 [3] NCCL INFO Channel 03 : 3[67020] -> 2[67010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3381:3666 [3] NCCL INFO Channel 04 : 3[67020] -> 2[67010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3381:3666 [3] NCCL INFO Channel 05 : 3[67020] -> 2[67010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3381:3666 [3] NCCL INFO Channel 07 : 3[67020] -> 2[67010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3381:3666 [3] NCCL INFO Connected all trees iv-ebgyvncucvdxd0xrfapj:3381:3666 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512 iv-ebgyvncucvdxd0xrfapj:3381:3666 [3] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer iv-ebgyvncucvdxd0xrfapj:3380:3667 [2] NCCL INFO Connected all trees iv-ebgyvncucvdxd0xrfapj:3380:3667 [2] NCCL 
INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
iv-ebgyvncucvdxd0xrfapj:3380:3667 [2] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer
iv-ebgyvncucvdxd0xrfapj:3378:3665 [0] NCCL INFO comm 0x7f3dc0008fb0 rank 0 nranks 4 cudaDev 0 busId 65010 - Init COMPLETE
iv-ebgyvncucvdxd0xrfapj:3380:3667 [2] NCCL INFO comm 0x7f834c008fb0 rank 2 nranks 4 cudaDev 2 busId 67010 - Init COMPLETE
iv-ebgyvncucvdxd0xrfapj:3381:3666 [3] NCCL INFO comm 0x7f9784008fb0 rank 3 nranks 4 cudaDev 3 busId 67020 - Init COMPLETE
iv-ebgyvncucvdxd0xrfapj:3379:3668 [1] NCCL INFO comm 0x7f8300008fb0 rank 1 nranks 4 cudaDev 1 busId 65020 - Init COMPLETE
iv-ebgyvncucvdxd0xrfapj:3378:3378 [0] NCCL INFO Launch mode Parallel
[W pthreadpool-cpp.cc:99] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:99] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:99] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:99] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:99] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:99] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:99] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:99] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
time (ms) | model-and-optimizer-setup: 568.96 | train/valid/test-data-iterators-setup: 1061.03
iv-ebgyvncucvdxd0xrfapj:3381:4250 [3] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
iv-ebgyvncucvdxd0xrfapj:3381:4250 [3] NCCL INFO Setting affinity for GPU 3 to 03ff,ffffffff
iv-ebgyvncucvdxd0xrfapj:3378:4252 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
iv-ebgyvncucvdxd0xrfapj:3378:4252 [0] NCCL INFO Setting affinity for GPU 0 to 03ff,ffffffff
iv-ebgyvncucvdxd0xrfapj:3380:4253 [2] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
iv-ebgyvncucvdxd0xrfapj:3380:4253 [2] NCCL INFO Setting affinity for GPU 2 to 03ff,ffffffff
iv-ebgyvncucvdxd0xrfapj:3381:4250 [3] NCCL INFO Channel 00 : 0[67020] -> 1[67020] [receive] via NET/IBext/0
iv-ebgyvncucvdxd0xrfapj:3378:4252 [0] NCCL INFO Channel 00 : 0[65010] -> 1[65010] [receive] via NET/IBext/0
iv-ebgyvncucvdxd0xrfapj:3379:4251 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
iv-ebgyvncucvdxd0xrfapj:3379:4251 [1] NCCL INFO Setting affinity for GPU 1 to 03ff,ffffffff
iv-ebgyvncucvdxd0xrfapj:3385:4258 [7] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
iv-ebgyvncucvdxd0xrfapj:3385:4258 [7] NCCL INFO Setting affinity for GPU 7 to 0fffff,fffffc00,00000000
iv-ebgyvncucvdxd0xrfapj:3380:4253 [2] NCCL INFO Channel 00 : 0[67010] -> 1[67010] [receive] via NET/IBext/0
iv-ebgyvncucvdxd0xrfapj:3383:4261 [5] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
iv-ebgyvncucvdxd0xrfapj:3383:4261 [5] NCCL INFO Setting affinity for GPU 5 to 0fffff,fffffc00,00000000
iv-ebgyvncucvdxd0xrfapj:3381:4250 [3] NCCL INFO Channel 01 : 0[67020] -> 1[67020] [receive] via NET/IBext/0
iv-ebgyvncucvdxd0xrfapj:3378:4252 [0] NCCL INFO Channel 01 : 0[65010] -> 1[65010] [receive] via NET/IBext/0
iv-ebgyvncucvdxd0xrfapj:3383:4261 [5] NCCL INFO Channel 00 : 0[69020] -> 1[69020] [receive] via NET/IBext/0/GDRDMA
iv-ebgyvncucvdxd0xrfapj:3382:4260 [4] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
iv-ebgyvncucvdxd0xrfapj:3382:4260 [4] NCCL INFO Setting affinity for GPU 4 to 0fffff,fffffc00,00000000 iv-ebgyvncucvdxd0xrfapj:3383:4261 [5] NCCL INFO Channel 01 : 0[69020] -> 1[69020] [receive] via NET/IBext/0/GDRDMA iv-ebgyvncucvdxd0xrfapj:3383:4261 [5] NCCL INFO Channel 00 : 1[69020] -> 0[69020] [send] via NET/IBext/0/GDRDMA iv-ebgyvncucvdxd0xrfapj:3380:4253 [2] NCCL INFO Channel 01 : 0[67010] -> 1[67010] [receive] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3384:4259 [6] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 iv-ebgyvncucvdxd0xrfapj:3384:4259 [6] NCCL INFO Setting affinity for GPU 6 to 0fffff,fffffc00,00000000 iv-ebgyvncucvdxd0xrfapj:3383:4261 [5] NCCL INFO Channel 01 : 1[69020] -> 0[69020] [send] via NET/IBext/0/GDRDMA iv-ebgyvncucvdxd0xrfapj:3385:4258 [7] NCCL INFO Channel 00 : 0[6b020] -> 1[6b020] [receive] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3379:4251 [1] NCCL INFO Channel 00 : 0[65020] -> 1[65020] [receive] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3381:4250 [3] NCCL INFO Channel 00 : 1[67020] -> 0[67020] [send] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3382:4260 [4] NCCL INFO Channel 00 : 0[69010] -> 1[69010] [receive] via NET/IBext/0/GDRDMA iv-ebgyvncucvdxd0xrfapj:3382:4260 [4] NCCL INFO Channel 01 : 0[69010] -> 1[69010] [receive] via NET/IBext/0/GDRDMA iv-ebgyvncucvdxd0xrfapj:3382:4260 [4] NCCL INFO Channel 00 : 1[69010] -> 0[69010] [send] via NET/IBext/0/GDRDMA iv-ebgyvncucvdxd0xrfapj:3378:4252 [0] NCCL INFO Channel 00 : 1[65010] -> 0[65010] [send] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3382:4260 [4] NCCL INFO Channel 01 : 1[69010] -> 0[69010] [send] via NET/IBext/0/GDRDMA iv-ebgyvncucvdxd0xrfapj:3380:4253 [2] NCCL INFO Channel 00 : 1[67010] -> 0[67010] [send] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3385:4258 [7] NCCL INFO Channel 01 : 0[6b020] -> 1[6b020] [receive] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3384:4259 [6] NCCL INFO Channel 00 : 0[6b010] -> 1[6b010] [receive] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3379:4251 [1] NCCL INFO 
Channel 01 : 0[65020] -> 1[65020] [receive] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3381:4250 [3] NCCL INFO Channel 01 : 1[67020] -> 0[67020] [send] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3380:4253 [2] NCCL INFO Channel 01 : 1[67010] -> 0[67010] [send] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3378:4252 [0] NCCL INFO Channel 01 : 1[65010] -> 0[65010] [send] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3385:4258 [7] NCCL INFO Channel 00 : 1[6b020] -> 0[6b020] [send] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3381:4250 [3] NCCL INFO Connected all rings iv-ebgyvncucvdxd0xrfapj:3381:4250 [3] NCCL INFO Connected all trees iv-ebgyvncucvdxd0xrfapj:3381:4250 [3] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512 iv-ebgyvncucvdxd0xrfapj:3381:4250 [3] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer iv-ebgyvncucvdxd0xrfapj:3381:4250 [3] NCCL INFO comm 0x7f9760008fb0 rank 1 nranks 2 cudaDev 3 busId 67020 - Init COMPLETE iv-ebgyvncucvdxd0xrfapj:3378:4252 [0] NCCL INFO Connected all rings iv-ebgyvncucvdxd0xrfapj:3378:4252 [0] NCCL INFO Connected all trees iv-ebgyvncucvdxd0xrfapj:3378:4252 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512 iv-ebgyvncucvdxd0xrfapj:3378:4252 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer iv-ebgyvncucvdxd0xrfapj:3378:4252 [0] NCCL INFO comm 0x7f3d74008fb0 rank 1 nranks 2 cudaDev 0 busId 65010 - Init COMPLETE iv-ebgyvncucvdxd0xrfapj:3379:4251 [1] NCCL INFO Channel 00 : 1[65020] -> 0[65020] [send] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3384:4259 [6] NCCL INFO Channel 01 : 0[6b010] -> 1[6b010] [receive] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3380:4253 [2] NCCL INFO Connected all rings iv-ebgyvncucvdxd0xrfapj:3380:4253 [2] NCCL INFO Connected all trees iv-ebgyvncucvdxd0xrfapj:3380:4253 [2] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512 iv-ebgyvncucvdxd0xrfapj:3380:4253 [2] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer iv-ebgyvncucvdxd0xrfapj:3380:4253 [2] NCCL INFO comm 
0x7f8328008fb0 rank 1 nranks 2 cudaDev 2 busId 67010 - Init COMPLETE iv-ebgyvncucvdxd0xrfapj:3379:4251 [1] NCCL INFO Channel 01 : 1[65020] -> 0[65020] [send] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3384:4259 [6] NCCL INFO Channel 00 : 1[6b010] -> 0[6b010] [send] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3385:4258 [7] NCCL INFO Channel 01 : 1[6b020] -> 0[6b020] [send] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3385:4258 [7] NCCL INFO Connected all rings iv-ebgyvncucvdxd0xrfapj:3385:4258 [7] NCCL INFO Connected all trees iv-ebgyvncucvdxd0xrfapj:3385:4258 [7] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512 iv-ebgyvncucvdxd0xrfapj:3385:4258 [7] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer iv-ebgyvncucvdxd0xrfapj:3385:4258 [7] NCCL INFO comm 0x7f6558008fb0 rank 1 nranks 2 cudaDev 7 busId 6b020 - Init COMPLETE iv-ebgyvncucvdxd0xrfapj:3384:4259 [6] NCCL INFO Channel 01 : 1[6b010] -> 0[6b010] [send] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3379:4251 [1] NCCL INFO Connected all rings iv-ebgyvncucvdxd0xrfapj:3379:4251 [1] NCCL INFO Connected all trees iv-ebgyvncucvdxd0xrfapj:3379:4251 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512 iv-ebgyvncucvdxd0xrfapj:3379:4251 [1] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer iv-ebgyvncucvdxd0xrfapj:3379:4251 [1] NCCL INFO comm 0x7f82e4008fb0 rank 1 nranks 2 cudaDev 1 busId 65020 - Init COMPLETE iv-ebgyvncucvdxd0xrfapj:3384:4259 [6] NCCL INFO Connected all rings iv-ebgyvncucvdxd0xrfapj:3384:4259 [6] NCCL INFO Connected all trees iv-ebgyvncucvdxd0xrfapj:3384:4259 [6] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512 iv-ebgyvncucvdxd0xrfapj:3384:4259 [6] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer iv-ebgyvncucvdxd0xrfapj:3384:4259 [6] NCCL INFO comm 0x7f7964008fb0 rank 1 nranks 2 cudaDev 6 busId 6b010 - Init COMPLETE iv-ebgyvncucvdxd0xrfapj:3383:4261 [5] NCCL INFO Connected all rings iv-ebgyvncucvdxd0xrfapj:3383:4261 [5] NCCL INFO Connected all trees 
iv-ebgyvncucvdxd0xrfapj:3383:4261 [5] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
iv-ebgyvncucvdxd0xrfapj:3383:4261 [5] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
iv-ebgyvncucvdxd0xrfapj:3383:4261 [5] NCCL INFO comm 0x7fda04008fb0 rank 1 nranks 2 cudaDev 5 busId 69020 - Init COMPLETE
iv-ebgyvncucvdxd0xrfapj:3382:4260 [4] NCCL INFO Connected all rings
iv-ebgyvncucvdxd0xrfapj:3382:4260 [4] NCCL INFO Connected all trees
iv-ebgyvncucvdxd0xrfapj:3382:4260 [4] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
iv-ebgyvncucvdxd0xrfapj:3382:4260 [4] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
iv-ebgyvncucvdxd0xrfapj:3382:4260 [4] NCCL INFO comm 0x7f2128008fb0 rank 1 nranks 2 cudaDev 4 busId 69010 - Init COMPLETE
/dataset/workspace/Megatron-LM/megatron/model/transformer.py:536: UserWarning: AutoNonVariableTypeMode is deprecated and will be removed in 1.10 release. For kernel implementations please use AutoDispatchBelowADInplaceOrView instead, If you are looking for a user facing API to enable running your inference-only workload, please use c10::InferenceMode. Using AutoDispatchBelowADInplaceOrView in user code is under risk of producing silent wrong result in some edge cases. See Note [AutoDispatchBelowAutograd] for more details. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/core/LegacyTypeDispatch.h:74.)
  output = bias_dropout_add_func(
/dataset/workspace/Megatron-LM/megatron/model/transformer.py:536: UserWarning: AutoNonVariableTypeMode is deprecated and will be removed in 1.10 release. For kernel implementations please use AutoDispatchBelowADInplaceOrView instead, If you are looking for a user facing API to enable running your inference-only workload, please use c10::InferenceMode. Using AutoDispatchBelowADInplaceOrView in user code is under risk of producing silent wrong result in some edge cases. See Note [AutoDispatchBelowAutograd] for more details. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/core/LegacyTypeDispatch.h:74.)
  output = bias_dropout_add_func(
/dataset/workspace/Megatron-LM/megatron/model/transformer.py:536: UserWarning: AutoNonVariableTypeMode is deprecated and will be removed in 1.10 release. For kernel implementations please use AutoDispatchBelowADInplaceOrView instead, If you are looking for a user facing API to enable running your inference-only workload, please use c10::InferenceMode. Using AutoDispatchBelowADInplaceOrView in user code is under risk of producing silent wrong result in some edge cases. See Note [AutoDispatchBelowAutograd] for more details. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/core/LegacyTypeDispatch.h:74.)
  output = bias_dropout_add_func(
/dataset/workspace/Megatron-LM/megatron/model/transformer.py:536: UserWarning: AutoNonVariableTypeMode is deprecated and will be removed in 1.10 release. For kernel implementations please use AutoDispatchBelowADInplaceOrView instead, If you are looking for a user facing API to enable running your inference-only workload, please use c10::InferenceMode. Using AutoDispatchBelowADInplaceOrView in user code is under risk of producing silent wrong result in some edge cases. See Note [AutoDispatchBelowAutograd] for more details. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/core/LegacyTypeDispatch.h:74.)
  output = bias_dropout_add_func(
/dataset/workspace/Megatron-LM/megatron/model/transformer.py:536: UserWarning: AutoNonVariableTypeMode is deprecated and will be removed in 1.10 release. For kernel implementations please use AutoDispatchBelowADInplaceOrView instead, If you are looking for a user facing API to enable running your inference-only workload, please use c10::InferenceMode. Using AutoDispatchBelowADInplaceOrView in user code is under risk of producing silent wrong result in some edge cases. See Note [AutoDispatchBelowAutograd] for more details. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/core/LegacyTypeDispatch.h:74.)
  output = bias_dropout_add_func(
/dataset/workspace/Megatron-LM/megatron/model/transformer.py:536: UserWarning: AutoNonVariableTypeMode is deprecated and will be removed in 1.10 release. For kernel implementations please use AutoDispatchBelowADInplaceOrView instead, If you are looking for a user facing API to enable running your inference-only workload, please use c10::InferenceMode. Using AutoDispatchBelowADInplaceOrView in user code is under risk of producing silent wrong result in some edge cases. See Note [AutoDispatchBelowAutograd] for more details. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/core/LegacyTypeDispatch.h:74.)
  output = bias_dropout_add_func(
/dataset/workspace/Megatron-LM/megatron/model/transformer.py:536: UserWarning: AutoNonVariableTypeMode is deprecated and will be removed in 1.10 release. For kernel implementations please use AutoDispatchBelowADInplaceOrView instead, If you are looking for a user facing API to enable running your inference-only workload, please use c10::InferenceMode. Using AutoDispatchBelowADInplaceOrView in user code is under risk of producing silent wrong result in some edge cases. See Note [AutoDispatchBelowAutograd] for more details. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/core/LegacyTypeDispatch.h:74.)
  output = bias_dropout_add_func(
/dataset/workspace/Megatron-LM/megatron/model/transformer.py:536: UserWarning: AutoNonVariableTypeMode is deprecated and will be removed in 1.10 release. For kernel implementations please use AutoDispatchBelowADInplaceOrView instead, If you are looking for a user facing API to enable running your inference-only workload, please use c10::InferenceMode. Using AutoDispatchBelowADInplaceOrView in user code is under risk of producing silent wrong result in some edge cases. See Note [AutoDispatchBelowAutograd] for more details.
(Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/core/LegacyTypeDispatch.h:74.) output = bias_dropout_add_func( NCCL version 2.10.3+cuda11.4 NCCL version 2.10.3+cuda11.4 NCCL version 2.10.3+cuda11.4 iv-ebgyvncucvdxd0xrfapj:3380:4288 [2] NCCL INFO Channel 00/02 : 0 1 iv-ebgyvncucvdxd0xrfapj:3384:4294 [6] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 iv-ebgyvncucvdxd0xrfapj:3380:4288 [2] NCCL INFO Channel 01/02 : 0 1 iv-ebgyvncucvdxd0xrfapj:3384:4294 [6] NCCL INFO Setting affinity for GPU 6 to 0fffff,fffffc00,00000000 iv-ebgyvncucvdxd0xrfapj:3380:4288 [2] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 iv-ebgyvncucvdxd0xrfapj:3380:4288 [2] NCCL INFO Setting affinity for GPU 2 to 03ff,ffffffff iv-ebgyvncucvdxd0xrfapj:3381:4292 [3] NCCL INFO Channel 00/02 : 0 1 iv-ebgyvncucvdxd0xrfapj:3385:4293 [7] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 iv-ebgyvncucvdxd0xrfapj:3381:4292 [3] NCCL INFO Channel 01/02 : 0 1 iv-ebgyvncucvdxd0xrfapj:3385:4293 [7] NCCL INFO Setting affinity for GPU 7 to 0fffff,fffffc00,00000000 iv-ebgyvncucvdxd0xrfapj:3381:4292 [3] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 iv-ebgyvncucvdxd0xrfapj:3381:4292 [3] NCCL INFO Setting affinity for GPU 3 to 03ff,ffffffff iv-ebgyvncucvdxd0xrfapj:3383:4295 [5] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 iv-ebgyvncucvdxd0xrfapj:3379:4291 [1] NCCL INFO Channel 00/02 : 0 1 iv-ebgyvncucvdxd0xrfapj:3379:4291 [1] NCCL INFO Channel 01/02 : 0 1 iv-ebgyvncucvdxd0xrfapj:3383:4295 [5] NCCL INFO Setting affinity for GPU 5 to 0fffff,fffffc00,00000000 iv-ebgyvncucvdxd0xrfapj:3379:4291 [1] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 iv-ebgyvncucvdxd0xrfapj:3379:4291 [1] NCCL INFO Setting affinity for GPU 1 to 03ff,ffffffff iv-ebgyvncucvdxd0xrfapj:3385:4293 [7] NCCL INFO Channel 00 : 1[6b020] -> 0[67020] via direct shared memory iv-ebgyvncucvdxd0xrfapj:3385:4293 [7] NCCL INFO Channel 01 : 1[6b020] -> 0[67020] via direct shared memory 
iv-ebgyvncucvdxd0xrfapj:3381:4292 [3] NCCL INFO Channel 00 : 0[67020] -> 1[6b020] via direct shared memory iv-ebgyvncucvdxd0xrfapj:3381:4292 [3] NCCL INFO Channel 01 : 0[67020] -> 1[6b020] via direct shared memory iv-ebgyvncucvdxd0xrfapj:3383:4295 [5] NCCL INFO Channel 00 : 1[69020] -> 0[65020] via direct shared memory iv-ebgyvncucvdxd0xrfapj:3384:4294 [6] NCCL INFO Channel 00 : 1[6b010] -> 0[67010] via direct shared memory iv-ebgyvncucvdxd0xrfapj:3383:4295 [5] NCCL INFO Channel 01 : 1[69020] -> 0[65020] via direct shared memory iv-ebgyvncucvdxd0xrfapj:3380:4288 [2] NCCL INFO Channel 00 : 0[67010] -> 1[6b010] via direct shared memory iv-ebgyvncucvdxd0xrfapj:3384:4294 [6] NCCL INFO Channel 01 : 1[6b010] -> 0[67010] via direct shared memory iv-ebgyvncucvdxd0xrfapj:3380:4288 [2] NCCL INFO Channel 01 : 0[67010] -> 1[6b010] via direct shared memory iv-ebgyvncucvdxd0xrfapj:3385:4293 [7] NCCL INFO Connected all rings iv-ebgyvncucvdxd0xrfapj:3385:4293 [7] NCCL INFO Connected all trees iv-ebgyvncucvdxd0xrfapj:3385:4293 [7] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512 iv-ebgyvncucvdxd0xrfapj:3385:4293 [7] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer iv-ebgyvncucvdxd0xrfapj:3381:4292 [3] NCCL INFO Connected all rings iv-ebgyvncucvdxd0xrfapj:3381:4292 [3] NCCL INFO Connected all trees iv-ebgyvncucvdxd0xrfapj:3381:4292 [3] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512 iv-ebgyvncucvdxd0xrfapj:3381:4292 [3] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer iv-ebgyvncucvdxd0xrfapj:3385:4293 [7] NCCL INFO comm 0x7f65580c9010 rank 1 nranks 2 cudaDev 7 busId 6b020 - Init COMPLETE iv-ebgyvncucvdxd0xrfapj:3381:4292 [3] NCCL INFO comm 0x7f9530008fb0 rank 0 nranks 2 cudaDev 3 busId 67020 - Init COMPLETE iv-ebgyvncucvdxd0xrfapj:3381:3381 [3] NCCL INFO Launch mode Parallel iv-ebgyvncucvdxd0xrfapj:3379:4291 [1] NCCL INFO Channel 00 : 0[65020] -> 1[69020] via direct shared memory iv-ebgyvncucvdxd0xrfapj:3379:4291 [1] NCCL INFO Channel 
01 : 0[65020] -> 1[69020] via direct shared memory iv-ebgyvncucvdxd0xrfapj:3380:4288 [2] NCCL INFO Connected all rings iv-ebgyvncucvdxd0xrfapj:3380:4288 [2] NCCL INFO Connected all trees iv-ebgyvncucvdxd0xrfapj:3380:4288 [2] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512 iv-ebgyvncucvdxd0xrfapj:3380:4288 [2] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer iv-ebgyvncucvdxd0xrfapj:3384:4294 [6] NCCL INFO Connected all rings iv-ebgyvncucvdxd0xrfapj:3384:4294 [6] NCCL INFO Connected all trees iv-ebgyvncucvdxd0xrfapj:3384:4294 [6] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512 iv-ebgyvncucvdxd0xrfapj:3384:4294 [6] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer iv-ebgyvncucvdxd0xrfapj:3384:4294 [6] NCCL INFO comm 0x7f79640c9010 rank 1 nranks 2 cudaDev 6 busId 6b010 - Init COMPLETE iv-ebgyvncucvdxd0xrfapj:3380:4288 [2] NCCL INFO comm 0x7f80f8008fb0 rank 0 nranks 2 cudaDev 2 busId 67010 - Init COMPLETE iv-ebgyvncucvdxd0xrfapj:3380:3380 [2] NCCL INFO Launch mode Parallel iv-ebgyvncucvdxd0xrfapj:3383:4295 [5] NCCL INFO Connected all rings iv-ebgyvncucvdxd0xrfapj:3383:4295 [5] NCCL INFO Connected all trees iv-ebgyvncucvdxd0xrfapj:3383:4295 [5] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512 iv-ebgyvncucvdxd0xrfapj:3383:4295 [5] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer iv-ebgyvncucvdxd0xrfapj:3379:4291 [1] NCCL INFO Connected all rings iv-ebgyvncucvdxd0xrfapj:3379:4291 [1] NCCL INFO Connected all trees iv-ebgyvncucvdxd0xrfapj:3379:4291 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512 iv-ebgyvncucvdxd0xrfapj:3379:4291 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer iv-ebgyvncucvdxd0xrfapj:3383:4295 [5] NCCL INFO comm 0x7fda040be010 rank 1 nranks 2 cudaDev 5 busId 69020 - Init COMPLETE iv-ebgyvncucvdxd0xrfapj:3379:4291 [1] NCCL INFO comm 0x7f80b4008fb0 rank 0 nranks 2 cudaDev 1 busId 65020 - Init COMPLETE iv-ebgyvncucvdxd0xrfapj:3379:3379 [1] NCCL INFO Launch mode 
Parallel iv-ebgyvncucvdxd0xrfapj:3378:4386 [0] NCCL INFO Trees [0] 13/-1/-1->12->8 [1] 13/4/-1->12->-1 iv-ebgyvncucvdxd0xrfapj:3378:4386 [0] NCCL INFO Setting affinity for GPU 0 to 03ff,ffffffff iv-ebgyvncucvdxd0xrfapj:3379:4388 [1] NCCL INFO Trees [0] 14/-1/-1->13->12 [1] 14/-1/-1->13->12 iv-ebgyvncucvdxd0xrfapj:3379:4388 [1] NCCL INFO Setting affinity for GPU 1 to 03ff,ffffffff iv-ebgyvncucvdxd0xrfapj:3381:4392 [3] NCCL INFO Trees [0] -1/-1/-1->15->14 [1] -1/-1/-1->15->14 iv-ebgyvncucvdxd0xrfapj:3380:4390 [2] NCCL INFO Trees [0] 15/-1/-1->14->13 [1] 15/-1/-1->14->13 iv-ebgyvncucvdxd0xrfapj:3381:4392 [3] NCCL INFO Setting affinity for GPU 3 to 03ff,ffffffff iv-ebgyvncucvdxd0xrfapj:3380:4390 [2] NCCL INFO Setting affinity for GPU 2 to 03ff,ffffffff iv-ebgyvncucvdxd0xrfapj:3382:4385 [4] NCCL INFO Trees [0] 13/-1/-1->12->8 [1] 13/4/-1->12->-1 iv-ebgyvncucvdxd0xrfapj:3382:4385 [4] NCCL INFO Setting affinity for GPU 4 to 0fffff,fffffc00,00000000 iv-ebgyvncucvdxd0xrfapj:3383:4387 [5] NCCL INFO Trees [0] 15/-1/-1->13->12 [1] 15/-1/-1->13->12 iv-ebgyvncucvdxd0xrfapj:3383:4387 [5] NCCL INFO Setting affinity for GPU 5 to 0fffff,fffffc00,00000000 iv-ebgyvncucvdxd0xrfapj:3385:4391 [7] NCCL INFO Trees [0] 14/-1/-1->15->13 [1] 14/-1/-1->15->13 iv-ebgyvncucvdxd0xrfapj:3384:4389 [6] NCCL INFO Trees [0] -1/-1/-1->14->15 [1] -1/-1/-1->14->15 iv-ebgyvncucvdxd0xrfapj:3384:4389 [6] NCCL INFO Setting affinity for GPU 6 to 0fffff,fffffc00,00000000 iv-ebgyvncucvdxd0xrfapj:3385:4391 [7] NCCL INFO Setting affinity for GPU 7 to 0fffff,fffffc00,00000000 iv-ebgyvncucvdxd0xrfapj:3380:4390 [2] NCCL INFO Channel 00 : 14[67010] -> 15[67020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3379:4388 [1] NCCL INFO Channel 00 : 13[65020] -> 14[67010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3378:4386 [0] NCCL INFO Channel 00 : 11[67020] -> 12[65010] [receive] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3380:4390 [2] NCCL INFO Channel 01 : 14[67010] -> 15[67020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3383:4387 [5] NCCL INFO 
Channel 00 : 13[69020] -> 0[69010] [send] via NET/IBext/0/GDRDMA iv-ebgyvncucvdxd0xrfapj:3379:4388 [1] NCCL INFO Channel 01 : 13[65020] -> 14[67010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3382:4385 [4] NCCL INFO Channel 00 : 12[69010] -> 14[6b010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3381:4392 [3] NCCL INFO Channel 00 : 15[67020] -> 0[65010] [send] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3382:4385 [4] NCCL INFO Channel 01 : 12[69010] -> 14[6b010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3384:4389 [6] NCCL INFO Channel 00 : 14[6b010] -> 15[6b020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3383:4387 [5] NCCL INFO Channel 01 : 13[69020] -> 0[69010] [send] via NET/IBext/0/GDRDMA iv-ebgyvncucvdxd0xrfapj:3384:4389 [6] NCCL INFO Channel 01 : 14[6b010] -> 15[6b020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3378:4386 [0] NCCL INFO Channel 01 : 11[67020] -> 12[65010] [receive] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3378:4386 [0] NCCL INFO Channel 00 : 12[65010] -> 13[65020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3381:4392 [3] NCCL INFO Channel 01 : 15[67020] -> 0[65010] [send] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3378:4386 [0] NCCL INFO Channel 01 : 12[65010] -> 13[65020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3385:4391 [7] NCCL INFO Channel 00 : 15[6b020] -> 13[69020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3385:4391 [7] NCCL INFO Channel 01 : 15[6b020] -> 13[69020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3380:4390 [2] NCCL INFO Connected all rings iv-ebgyvncucvdxd0xrfapj:3384:4389 [6] NCCL INFO Connected all rings iv-ebgyvncucvdxd0xrfapj:3379:4388 [1] NCCL INFO Connected all rings iv-ebgyvncucvdxd0xrfapj:3382:4385 [4] NCCL INFO Channel 00 : 9[69020] -> 12[69010] [receive] via NET/IBext/0/GDRDMA iv-ebgyvncucvdxd0xrfapj:3378:4386 [0] NCCL INFO Connected all rings iv-ebgyvncucvdxd0xrfapj:3381:4392 [3] NCCL INFO Connected all rings iv-ebgyvncucvdxd0xrfapj:3382:4385 [4] NCCL INFO Channel 01 : 9[69020] -> 12[69010] [receive] via NET/IBext/0/GDRDMA iv-ebgyvncucvdxd0xrfapj:3378:4386 [0] NCCL INFO Channel 00 : 
8[65010] -> 12[65010] [receive] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3381:4392 [3] NCCL INFO Channel 00 : 15[67020] -> 14[67010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3380:4390 [2] NCCL INFO Channel 00 : 14[67010] -> 13[65020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3379:4388 [1] NCCL INFO Channel 00 : 13[65020] -> 12[65010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3381:4392 [3] NCCL INFO Channel 01 : 15[67020] -> 14[67010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3380:4390 [2] NCCL INFO Channel 01 : 14[67010] -> 13[65020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3378:4386 [0] NCCL INFO Channel 01 : 4[65010] -> 12[65010] [receive] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3379:4388 [1] NCCL INFO Channel 01 : 13[65020] -> 12[65010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3382:4385 [4] NCCL INFO Connected all rings iv-ebgyvncucvdxd0xrfapj:3378:4386 [0] NCCL INFO Channel 01 : 12[65010] -> 4[65010] [send] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3381:4392 [3] NCCL INFO Connected all trees iv-ebgyvncucvdxd0xrfapj:3381:4392 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512 iv-ebgyvncucvdxd0xrfapj:3381:4392 [3] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer iv-ebgyvncucvdxd0xrfapj:3382:4385 [4] NCCL INFO Channel 00 : 12[69010] -> 13[69020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3382:4385 [4] NCCL INFO Channel 01 : 12[69010] -> 13[69020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3380:4390 [2] NCCL INFO Connected all trees iv-ebgyvncucvdxd0xrfapj:3380:4390 [2] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512 iv-ebgyvncucvdxd0xrfapj:3380:4390 [2] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer iv-ebgyvncucvdxd0xrfapj:3383:4387 [5] NCCL INFO Connected all rings iv-ebgyvncucvdxd0xrfapj:3385:4391 [7] NCCL INFO Connected all rings iv-ebgyvncucvdxd0xrfapj:3383:4387 [5] NCCL INFO Channel 00 : 13[69020] -> 15[6b020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3382:4385 [4] NCCL INFO Channel 00 : 8[69010] -> 12[69010] [receive] via NET/IBext/0/GDRDMA 
iv-ebgyvncucvdxd0xrfapj:3383:4387 [5] NCCL INFO Channel 01 : 13[69020] -> 15[6b020] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3385:4391 [7] NCCL INFO Channel 00 : 15[6b020] -> 14[6b010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3383:4387 [5] NCCL INFO Channel 00 : 13[69020] -> 12[69010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3385:4391 [7] NCCL INFO Channel 01 : 15[6b020] -> 14[6b010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3383:4387 [5] NCCL INFO Channel 01 : 13[69020] -> 12[69010] via P2P/IPC iv-ebgyvncucvdxd0xrfapj:3385:4391 [7] NCCL INFO Connected all trees iv-ebgyvncucvdxd0xrfapj:3385:4391 [7] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512 iv-ebgyvncucvdxd0xrfapj:3385:4391 [7] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer iv-ebgyvncucvdxd0xrfapj:3384:4389 [6] NCCL INFO Connected all trees iv-ebgyvncucvdxd0xrfapj:3384:4389 [6] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512 iv-ebgyvncucvdxd0xrfapj:3384:4389 [6] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer iv-ebgyvncucvdxd0xrfapj:3378:4386 [0] NCCL INFO Channel 00 : 12[65010] -> 8[65010] [send] via NET/IBext/0 iv-ebgyvncucvdxd0xrfapj:3382:4385 [4] NCCL INFO Channel 01 : 4[69010] -> 12[69010] [receive] via NET/IBext/0/GDRDMA iv-ebgyvncucvdxd0xrfapj:3378:4386 [0] NCCL INFO Connected all trees iv-ebgyvncucvdxd0xrfapj:3378:4386 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512 iv-ebgyvncucvdxd0xrfapj:3378:4386 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer iv-ebgyvncucvdxd0xrfapj:3379:4388 [1] NCCL INFO Connected all trees iv-ebgyvncucvdxd0xrfapj:3379:4388 [1] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512 iv-ebgyvncucvdxd0xrfapj:3379:4388 [1] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer iv-ebgyvncucvdxd0xrfapj:3382:4385 [4] NCCL INFO Channel 01 : 12[69010] -> 4[69010] [send] via NET/IBext/0/GDRDMA iv-ebgyvncucvdxd0xrfapj:3378:4386 [0] NCCL INFO comm 0x7f3a74008fb0 rank 12 nranks 16 cudaDev 0 busId 65010 - Init 
COMPLETE iv-ebgyvncucvdxd0xrfapj:3381:4392 [3] NCCL INFO comm 0x7f9444008fb0 rank 15 nranks 16 cudaDev 3 busId 67020 - Init COMPLETE iv-ebgyvncucvdxd0xrfapj:3380:4390 [2] NCCL INFO comm 0x7f8008008fb0 rank 14 nranks 16 cudaDev 2 busId 67010 - Init COMPLETE iv-ebgyvncucvdxd0xrfapj:3379:4388 [1] NCCL INFO comm 0x7f8090008fb0 rank 13 nranks 16 cudaDev 1 busId 65020 - Init COMPLETE iv-ebgyvncucvdxd0xrfapj:3382:4385 [4] NCCL INFO Channel 00 : 12[69010] -> 8[69010] [send] via NET/IBext/0/GDRDMA iv-ebgyvncucvdxd0xrfapj:3382:4385 [4] NCCL INFO Connected all trees iv-ebgyvncucvdxd0xrfapj:3382:4385 [4] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512 iv-ebgyvncucvdxd0xrfapj:3382:4385 [4] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer iv-ebgyvncucvdxd0xrfapj:3383:4387 [5] NCCL INFO Connected all trees iv-ebgyvncucvdxd0xrfapj:3383:4387 [5] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512 iv-ebgyvncucvdxd0xrfapj:3383:4387 [5] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer iv-ebgyvncucvdxd0xrfapj:3383:4387 [5] NCCL INFO comm 0x7fd7ac008fb0 rank 13 nranks 16 cudaDev 5 busId 69020 - Init COMPLETE iv-ebgyvncucvdxd0xrfapj:3385:4391 [7] NCCL INFO comm 0x7f623c008fb0 rank 15 nranks 16 cudaDev 7 busId 6b020 - Init COMPLETE iv-ebgyvncucvdxd0xrfapj:3382:4385 [4] NCCL INFO comm 0x7f1e24008fb0 rank 12 nranks 16 cudaDev 4 busId 69010 - Init COMPLETE iv-ebgyvncucvdxd0xrfapj:3384:4389 [6] NCCL INFO comm 0x7f7648008fb0 rank 14 nranks 16 cudaDev 6 busId 6b010 - Init COMPLETE iteration 100/ 210 | consumed samples: 102400 | elapsed time per iteration (ms): 5846.3 | tpt: 175.2 samples/s | global batch size: 1024 | lm loss: 1.000906E+01 | loss scale: 262144.0 | grad norm: 1.449 | number of skipped iterations: 15 | number of nan iterations: 0 | time (ms) | forward-compute: 1449.33 | forward-recv: 491.34 | backward-compute: 2749.45 | backward-send: 21.45 | backward-send-forward-recv: 182.16 | backward-params-all-reduce: 16.54 | 
backward-embedding-all-reduce: 920.04 | optimizer-copy-to-main-grad: 0.99 | optimizer-unscale-and-check-inf: 9.28 | optimizer-clip-main-grad: 1.05 | optimizer-copy-main-to-model-params: 0.62 | optimizer: 13.56 | batch-generator: 8.41 [Rank 25] (after 100 iterations) memory (MB) | allocated: 867.9521484375 | max allocated: 5951.7744140625 | reserved: 12040.0 | max reserved: 12040.0 [Rank 26] (after 100 iterations) memory (MB) | allocated: 867.9521484375 | max allocated: 5951.7744140625 | reserved: 12040.0 | max reserved: 12040.0 [Rank 27] (after 100 iterations) memory (MB) | allocated: 867.9521484375 | max allocated: 5951.7744140625 | reserved: 12040.0 | max reserved: 12040.0 [Rank 24] (after 100 iterations) memory (MB) | allocated: 867.9521484375 | max allocated: 5951.7744140625 | reserved: 11784.0 | max reserved: 11784.0 timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB] timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB] timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB] 2022/06/16 01:39:45.004, Tesla V100-SXM2-32GB, 470.57.02, 76 %, 2 %, 32510 MiB, 19318 MiB, 13192 MiB 2022/06/16 01:39:45.005, Tesla V100-SXM2-32GB, 470.57.02, 76 %, 2 %, 32510 MiB, 19318 MiB, 13192 MiB 2022/06/16 01:39:45.005, Tesla V100-SXM2-32GB, 470.57.02, 76 %, 2 %, 32510 MiB, 19318 MiB, 13192 MiB timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB] 2022/06/16 01:39:45.009, Tesla V100-SXM2-32GB, 470.57.02, 50 %, 2 %, 32510 MiB, 19032 MiB, 13478 MiB 2022/06/16 01:39:45.009, Tesla V100-SXM2-32GB, 470.57.02, 50 %, 2 %, 32510 MiB, 19032 MiB, 13478 MiB 2022/06/16 01:39:45.009, Tesla V100-SXM2-32GB, 470.57.02, 50 %, 2 %, 32510 MiB, 19032 MiB, 13478 MiB 2022/06/16 
01:39:45.011, Tesla V100-SXM2-32GB, 470.57.02, 76 %, 2 %, 32510 MiB, 19318 MiB, 13192 MiB timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB] 2022/06/16 01:39:45.012, Tesla V100-SXM2-32GB, 470.57.02, 75 %, 2 %, 32510 MiB, 19014 MiB, 13496 MiB 2022/06/16 01:39:45.012, Tesla V100-SXM2-32GB, 470.57.02, 75 %, 2 %, 32510 MiB, 19014 MiB, 13496 MiB 2022/06/16 01:39:45.013, Tesla V100-SXM2-32GB, 470.57.02, 75 %, 2 %, 32510 MiB, 19014 MiB, 13496 MiB timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB] 2022/06/16 01:39:45.014, Tesla V100-SXM2-32GB, 470.57.02, 50 %, 2 %, 32510 MiB, 19032 MiB, 13478 MiB timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB] 2022/06/16 01:39:45.015, Tesla V100-SXM2-32GB, 470.57.02, 58 %, 2 %, 32510 MiB, 19030 MiB, 13480 MiB 2022/06/16 01:39:45.015, Tesla V100-SXM2-32GB, 470.57.02, 76 %, 2 %, 32510 MiB, 19318 MiB, 13192 MiB 2022/06/16 01:39:45.015, Tesla V100-SXM2-32GB, 470.57.02, 58 %, 2 %, 32510 MiB, 19030 MiB, 13480 MiB 2022/06/16 01:39:45.016, Tesla V100-SXM2-32GB, 470.57.02, 58 %, 2 %, 32510 MiB, 19030 MiB, 13480 MiB 2022/06/16 01:39:45.016, Tesla V100-SXM2-32GB, 470.57.02, 76 %, 2 %, 32510 MiB, 19318 MiB, 13192 MiB 2022/06/16 01:39:45.018, Tesla V100-SXM2-32GB, 470.57.02, 75 %, 2 %, 32510 MiB, 19014 MiB, 13496 MiB 2022/06/16 01:39:45.019, Tesla V100-SXM2-32GB, 470.57.02, 76 %, 2 %, 32510 MiB, 19318 MiB, 13192 MiB 2022/06/16 01:39:45.020, Tesla V100-SXM2-32GB, 470.57.02, 26 %, 2 %, 32510 MiB, 19044 MiB, 13466 MiB 2022/06/16 01:39:45.020, Tesla V100-SXM2-32GB, 470.57.02, 50 %, 2 %, 32510 MiB, 19032 MiB, 13478 MiB 2022/06/16 01:39:45.021, Tesla V100-SXM2-32GB, 470.57.02, 26 %, 2 %, 32510 MiB, 19044 MiB, 13466 MiB 2022/06/16 01:39:45.021, Tesla V100-SXM2-32GB, 470.57.02, 26 %, 2 %, 32510 MiB, 19044 MiB, 
13466 MiB 2022/06/16 01:39:45.022, Tesla V100-SXM2-32GB, 470.57.02, 50 %, 2 %, 32510 MiB, 19032 MiB, 13478 MiB 2022/06/16 01:39:45.025, Tesla V100-SXM2-32GB, 470.57.02, 58 %, 2 %, 32510 MiB, 19030 MiB, 13480 MiB timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB] 2022/06/16 01:39:45.025, Tesla V100-SXM2-32GB, 470.57.02, 50 %, 2 %, 32510 MiB, 19032 MiB, 13478 MiB 2022/06/16 01:39:45.026, Tesla V100-SXM2-32GB, 470.57.02, 81 %, 2 %, 32510 MiB, 18864 MiB, 13646 MiB 2022/06/16 01:39:45.027, Tesla V100-SXM2-32GB, 470.57.02, 75 %, 2 %, 32510 MiB, 19014 MiB, 13496 MiB 2022/06/16 01:39:45.027, Tesla V100-SXM2-32GB, 470.57.02, 81 %, 2 %, 32510 MiB, 18864 MiB, 13646 MiB 2022/06/16 01:39:45.028, Tesla V100-SXM2-32GB, 470.57.02, 81 %, 2 %, 32510 MiB, 18864 MiB, 13646 MiB 2022/06/16 01:39:45.028, Tesla V100-SXM2-32GB, 470.57.02, 75 %, 2 %, 32510 MiB, 19014 MiB, 13496 MiB 2022/06/16 01:39:45.030, Tesla V100-SXM2-32GB, 470.57.02, 26 %, 2 %, 32510 MiB, 19044 MiB, 13466 MiB 2022/06/16 01:39:45.031, Tesla V100-SXM2-32GB, 470.57.02, 76 %, 2 %, 32510 MiB, 19318 MiB, 13192 MiB 2022/06/16 01:39:45.031, Tesla V100-SXM2-32GB, 470.57.02, 75 %, 2 %, 32510 MiB, 19014 MiB, 13496 MiB 2022/06/16 01:39:45.032, Tesla V100-SXM2-32GB, 470.57.02, 60 %, 2 %, 32510 MiB, 19046 MiB, 13464 MiB 2022/06/16 01:39:45.033, Tesla V100-SXM2-32GB, 470.57.02, 58 %, 2 %, 32510 MiB, 19030 MiB, 13480 MiB 2022/06/16 01:39:45.033, Tesla V100-SXM2-32GB, 470.57.02, 60 %, 2 %, 32510 MiB, 19046 MiB, 13464 MiB 2022/06/16 01:39:45.034, Tesla V100-SXM2-32GB, 470.57.02, 60 %, 2 %, 32510 MiB, 19046 MiB, 13464 MiB 2022/06/16 01:39:45.034, Tesla V100-SXM2-32GB, 470.57.02, 58 %, 2 %, 32510 MiB, 19030 MiB, 13480 MiB 2022/06/16 01:39:45.037, Tesla V100-SXM2-32GB, 470.57.02, 81 %, 2 %, 32510 MiB, 18864 MiB, 13646 MiB 2022/06/16 01:39:45.038, Tesla V100-SXM2-32GB, 470.57.02, 50 %, 2 %, 32510 MiB, 19032 MiB, 13478 MiB 2022/06/16 01:39:45.038, Tesla 
V100-SXM2-32GB, 470.57.02, 58 %, 2 %, 32510 MiB, 19030 MiB, 13480 MiB 2022/06/16 01:39:45.039, Tesla V100-SXM2-32GB, 470.57.02, 29 %, 2 %, 32510 MiB, 19034 MiB, 13476 MiB 2022/06/16 01:39:45.039, Tesla V100-SXM2-32GB, 470.57.02, 26 %, 2 %, 32510 MiB, 19044 MiB, 13466 MiB 2022/06/16 01:39:45.039, Tesla V100-SXM2-32GB, 470.57.02, 29 %, 2 %, 32510 MiB, 19034 MiB, 13476 MiB 2022/06/16 01:39:45.040, Tesla V100-SXM2-32GB, 470.57.02, 29 %, 2 %, 32510 MiB, 19034 MiB, 13476 MiB 2022/06/16 01:39:45.041, Tesla V100-SXM2-32GB, 470.57.02, 26 %, 2 %, 32510 MiB, 19044 MiB, 13466 MiB 2022/06/16 01:39:45.044, Tesla V100-SXM2-32GB, 470.57.02, 60 %, 2 %, 32510 MiB, 19046 MiB, 13464 MiB 2022/06/16 01:39:45.044, Tesla V100-SXM2-32GB, 470.57.02, 75 %, 2 %, 32510 MiB, 19014 MiB, 13496 MiB 2022/06/16 01:39:45.045, Tesla V100-SXM2-32GB, 470.57.02, 26 %, 2 %, 32510 MiB, 19044 MiB, 13466 MiB 2022/06/16 01:39:45.047, Tesla V100-SXM2-32GB, 470.57.02, 81 %, 2 %, 32510 MiB, 18864 MiB, 13646 MiB 2022/06/16 01:39:45.049, Tesla V100-SXM2-32GB, 470.57.02, 81 %, 2 %, 32510 MiB, 18864 MiB, 13646 MiB 2022/06/16 01:39:45.051, Tesla V100-SXM2-32GB, 470.57.02, 29 %, 2 %, 32510 MiB, 19034 MiB, 13476 MiB 2022/06/16 01:39:45.052, Tesla V100-SXM2-32GB, 470.57.02, 58 %, 2 %, 32510 MiB, 19030 MiB, 13480 MiB 2022/06/16 01:39:45.052, Tesla V100-SXM2-32GB, 470.57.02, 81 %, 2 %, 32510 MiB, 18864 MiB, 13646 MiB 2022/06/16 01:39:45.053, Tesla V100-SXM2-32GB, 470.57.02, 60 %, 2 %, 32510 MiB, 19046 MiB, 13464 MiB 2022/06/16 01:39:45.054, Tesla V100-SXM2-32GB, 470.57.02, 60 %, 2 %, 32510 MiB, 19046 MiB, 13464 MiB 2022/06/16 01:39:45.057, Tesla V100-SXM2-32GB, 470.57.02, 26 %, 2 %, 32510 MiB, 19044 MiB, 13466 MiB 2022/06/16 01:39:45.057, Tesla V100-SXM2-32GB, 470.57.02, 60 %, 2 %, 32510 MiB, 19046 MiB, 13464 MiB 2022/06/16 01:39:45.059, Tesla V100-SXM2-32GB, 470.57.02, 29 %, 2 %, 32510 MiB, 19034 MiB, 13476 MiB 2022/06/16 01:39:45.060, Tesla V100-SXM2-32GB, 470.57.02, 29 %, 2 %, 32510 MiB, 19034 MiB, 13476 MiB 2022/06/16 
01:39:45.063, Tesla V100-SXM2-32GB, 470.57.02, 81 %, 2 %, 32510 MiB, 18864 MiB, 13646 MiB
2022/06/16 01:39:45.063, Tesla V100-SXM2-32GB, 470.57.02, 29 %, 2 %, 32510 MiB, 19034 MiB, 13476 MiB
2022/06/16 01:39:45.069, Tesla V100-SXM2-32GB, 470.57.02, 60 %, 2 %, 32510 MiB, 19046 MiB, 13464 MiB
2022/06/16 01:39:45.075, Tesla V100-SXM2-32GB, 470.57.02, 29 %, 2 %, 32510 MiB, 19034 MiB, 13476 MiB
iteration 200/ 210 | consumed samples: 204800 | elapsed time per iteration (ms): 5745.9 | tpt: 178.2 samples/s | global batch size: 1024 | lm loss: 8.684506E+00 | loss scale: 262144.0 | grad norm: 1.725 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms) | forward-compute: 1433.63 | forward-recv: 427.85 | backward-compute: 2749.21 | backward-send: 21.57 | backward-send-forward-recv: 169.38 | backward-params-all-reduce: 16.55 | backward-embedding-all-reduce: 917.99 | optimizer-copy-to-main-grad: 1.37 | optimizer-unscale-and-check-inf: 0.78 | optimizer-clip-main-grad: 1.16 | optimizer-copy-main-to-model-params: 0.72 | optimizer: 5.55 | batch-generator: 8.05
------------------------------------------------------------------------------------------------------------------
validation loss at the end of training for val data | lm loss value: 7.821787E+00 | lm loss PPL: 2.494360E+03 |
------------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------------------------
validation loss at the end of training for test data | lm loss value: 7.652918E+00 | lm loss PPL: 2.106785E+03 |
-------------------------------------------------------------------------------------------------------------------
INFO:torch.distributed.elastic.agent.server.api:[default] worker group successfully finished. Waiting 300 seconds for other agents to finish.
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (SUCCEEDED). Waiting 300 seconds for other agents to finish
/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:70: FutureWarning: This is an experimental API and will be changed in future.
  warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0011889934539794922 seconds
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 24, "group_rank": 3, "worker_id": "3378", "role": "default", "hostname": "iv-ebgyvncucvdxd0xrfapj", "state": "SUCCEEDED", "total_run_time": 1267, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 4, \"entry_point\": \"python\", \"local_rank\": [0], \"role_rank\": [24], \"role_world_size\": [32]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 25, "group_rank": 3, "worker_id": "3379", "role": "default", "hostname": "iv-ebgyvncucvdxd0xrfapj", "state": "SUCCEEDED", "total_run_time": 1267, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 4, \"entry_point\": \"python\", \"local_rank\": [1], \"role_rank\": [25], \"role_world_size\": [32]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 26, "group_rank": 3, "worker_id": "3380", "role": "default", "hostname": "iv-ebgyvncucvdxd0xrfapj", "state": "SUCCEEDED", "total_run_time": 1267, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 4, \"entry_point\": \"python\", \"local_rank\": [2], \"role_rank\": [26], \"role_world_size\": [32]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 27, "group_rank": 3, "worker_id": "3381", "role": "default", "hostname": "iv-ebgyvncucvdxd0xrfapj", "state": "SUCCEEDED", "total_run_time": 1267, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 4, \"entry_point\": \"python\", \"local_rank\": [3], \"role_rank\": [27], \"role_world_size\": [32]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 28, "group_rank": 3, "worker_id": "3382", "role": "default", "hostname": "iv-ebgyvncucvdxd0xrfapj", "state": "SUCCEEDED", "total_run_time": 1267, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 4, \"entry_point\": \"python\", \"local_rank\": [4], \"role_rank\": [28], \"role_world_size\": [32]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 29, "group_rank": 3, "worker_id": "3383", "role": "default", "hostname": "iv-ebgyvncucvdxd0xrfapj", "state": "SUCCEEDED", "total_run_time": 1267, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 4, \"entry_point\": \"python\", \"local_rank\": [5], \"role_rank\": [29], \"role_world_size\": [32]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 30, "group_rank": 3, "worker_id": "3384", "role": "default", "hostname": "iv-ebgyvncucvdxd0xrfapj", "state": "SUCCEEDED", "total_run_time": 1267, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 4, \"entry_point\": \"python\", \"local_rank\": [6], \"role_rank\": [30], \"role_world_size\": [32]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 31, "group_rank": 3, "worker_id": "3385", "role": "default", "hostname": "iv-ebgyvncucvdxd0xrfapj", "state": "SUCCEEDED", "total_run_time": 1267, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 4, \"entry_point\": \"python\", \"local_rank\": [7], \"role_rank\": [31], \"role_world_size\": [32]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 3, "worker_id": null, "role": "default", "hostname": "iv-ebgyvncucvdxd0xrfapj", "state": "SUCCEEDED", "total_run_time": 1267, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 4, \"entry_point\": \"python\"}", "agent_restarts": 0}}
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************