loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
W20220506 21:44:20.928035 433 rpc_client.cpp:190] LoadServer 10.7.128.208 Failed at 0 times error_code 14 error_message failed to connect to all addresses
------------------------ arguments ------------------------
batches_per_epoch ............................... 312
channel_last .................................... True
ddp ............................................. False
exit_num ........................................ 300
fuse_bn_add_relu ................................ True
fuse_bn_relu .................................... True
gpu_stat_file ................................... None
grad_clipping ................................... 0.0
graph ........................................... True
label_smoothing ................................. 0.1
learning_rate ................................... 4.096
legacy_init ..................................... False
load_path ....................................... None
lr_decay_type ................................... cosine
metric_local .................................... True
metric_train_acc ................................ True
momentum ........................................ 0.875
nccl_fusion_max_ops ............................. 24
nccl_fusion_threshold_mb ........................ 16
num_classes ..................................... 1000
num_devices_per_node ............................ 8
num_epochs ...................................... 1
num_nodes ....................................... 1
ofrecord_part_num ............................... 256
ofrecord_path ................................... /dataset/79846248
print_interval .................................. 100
print_timestamp ................................. False
samples_per_epoch ............................... 1281167
save_init ....................................... False
save_path ....................................... None
scale_grad ...................................... True
skip_eval ....................................... True
synthetic_data .................................. False
total_batches ................................... -1
train_batch_size ................................ 512
train_global_batch_size ......................... 4096
use_fp16 ........................................ True
use_gpu_decode .................................. True
val_batch_size .................................. 50
val_batches_per_epoch ........................... 125
val_global_batch_size ........................... 400
val_samples_per_epoch ........................... 50000
warmup_epochs ................................... 5
weight_decay .................................... 3.0517578125e-05
zero_init_residual .............................. True
-------------------- end of arguments ---------------------
***** Model Init *****
***** Model Init Finish, time escapled: 3.13603 s *****
[rank:6] [train], epoch: 0/1, iter: 100/312, loss: 0.86787, top1: 0.00105, throughput: 424.53 | 2022-05-06 21:46:37.128
[rank:3] [train], epoch: 0/1, iter: 100/312, loss: 0.86819, top1: 0.00076, throughput: 424.51 | 2022-05-06 21:46:37.129
[rank:2] [train], epoch: 0/1, iter: 100/312, loss: 0.86780, top1: 0.00092, throughput: 424.53 | 2022-05-06 21:46:37.129
[rank:0] [train], epoch: 0/1, iter: 100/312, loss: 0.86801, top1: 0.00086, throughput: 424.54 [rank:4] [train], epoch: 0/1, iter: 100/312, loss: 0.86789, top1: 0.00090, throughput: 424.53| 2022-05-06 21:46:37.128 | 2022-05-06 21:46:37.129
[rank:7] [train], epoch: 0/1, iter: 100/312, loss: 0.86783, top1: 0.00084, throughput: 424.51 | 2022-05-06 21:46:37.128
[rank:1] [train], epoch: 0/1, iter: 100/312, loss: 0.86778, top1: 0.00100, throughput: 424.54[rank:5] [train], epoch: 0/1, iter: 100/312, loss: 0.86785, top1: 0.00100, throughput: 424.51 | 2022-05-06 21:46:37.129 | 2022-05-06 21:46:37.128
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2022/05/06 21:46:37.362, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 70 %, 32510 MiB, 7805 MiB, 24705 MiB
2022/05/06 21:46:37.368, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 70 %, 32510 MiB, 7805 MiB, 24705 MiB
2022/05/06 21:46:37.370, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 70 %, 32510 MiB, 7805 MiB, 24705 MiB
2022/05/06 21:46:37.370, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 70 %, 32510 MiB, 7805 MiB, 24705 MiB
2022/05/06 21:46:37.377, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 70 %, 32510 MiB, 7805 MiB, 24705 MiB
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2022/05/06 21:46:37.379, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 88 %, 32510 MiB, 7792 MiB, 24718 MiB
2022/05/06 21:46:37.380, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 88 %, 32510 MiB, 7792 MiB, 24718 MiB
2022/05/06 21:46:37.381, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 88 %, 32510 MiB, 7792 MiB, 24718 MiB
2022/05/06 21:46:37.382, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 88 %, 32510 MiB, 7792 MiB, 24718 MiB
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2022/05/06 21:46:37.389, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 88 %, 32510 MiB, 7792 MiB, 24718 MiB
2022/05/06 21:46:37.389, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 70 %, 32510 MiB, 7805 MiB, 24705 MiB
2022/05/06 21:46:37.390, Tesla V100-SXM2-32GB, 470.57.02, 94 %, 83 %, 32510 MiB, 7966 MiB, 24544 MiB
2022/05/06 21:46:37.390, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 70 %, 32510 MiB, 7805 MiB, 24705 MiB
2022/05/06 21:46:37.391, Tesla V100-SXM2-32GB, 470.57.02, 94 %, 83 %, 32510 MiB, 7966 MiB, 24544 MiB
2022/05/06 21:46:37.392, Tesla V100-SXM2-32GB, 470.57.02, 94 %, 83 %, 32510 MiB, 7966 MiB, 24544 MiB
2022/05/06 21:46:37.392, Tesla V100-SXM2-32GB, 470.57.02, 94 %, 83 %, 32510 MiB, 7966 MiB, 24544 MiB
2022/05/06 21:46:37.395, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 70 %, 32510 MiB, 7805 MiB, 24705 MiB
2022/05/06 21:46:37.396, Tesla V100-SXM2-32GB, 470.57.02, 94 %, 83 %, 32510 MiB, 7966 MiB, 24544 MiB
2022/05/06 21:46:37.397, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 88 %, 32510 MiB, 7792 MiB, 24718 MiB
2022/05/06 21:46:37.398, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 86 %, 32510 MiB, 7918 MiB, 24592 MiB
2022/05/06 21:46:37.398, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 88 %, 32510 MiB, 7792 MiB, 24718 MiB
2022/05/06 21:46:37.399, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 86 %, 32510 MiB, 7918 MiB, 24592 MiB
2022/05/06 21:46:37.400, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 86 %, 32510 MiB, 7918 MiB, 24592 MiB
2022/05/06 21:46:37.400, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 86 %, 32510 MiB, 7918 MiB, 24592 MiB
2022/05/06 21:46:37.404, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 88 %, 32510 MiB, 7792 MiB, 24718 MiB
2022/05/06 21:46:37.405, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 86 %, 32510 MiB, 7918 MiB, 24592 MiB
2022/05/06 21:46:37.406, Tesla V100-SXM2-32GB, 470.57.02, 94 %, 83 %, 32510 MiB, 7966 MiB, 24544 MiB
2022/05/06 21:46:37.406, Tesla V100-SXM2-32GB, 470.57.02, 89 %, 80 %, 32510 MiB, 7924 MiB, 24586 MiB
2022/05/06 21:46:37.407, Tesla V100-SXM2-32GB, 470.57.02, 94 %, 83 %, 32510 MiB, 7966 MiB, 24544 MiB
2022/05/06 21:46:37.407, Tesla V100-SXM2-32GB, 470.57.02, 89 %, 80 %, 32510 MiB, 7924 MiB, 24586 MiB
2022/05/06 21:46:37.408, Tesla V100-SXM2-32GB, 470.57.02, 89 %, 80 %, 32510 MiB, 7924 MiB, 24586 MiB
2022/05/06 21:46:37.409, Tesla V100-SXM2-32GB, 470.57.02, 89 %, 80 %, 32510 MiB, 7924 MiB, 24586 MiB
2022/05/06 21:46:37.412, Tesla V100-SXM2-32GB, 470.57.02, 94 %, 83 %, 32510 MiB, 7966 MiB, 24544 MiB
2022/05/06 21:46:37.413, Tesla V100-SXM2-32GB, 470.57.02, 89 %, 80 %, 32510 MiB, 7924 MiB, 24586 MiB
2022/05/06 21:46:37.414, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 86 %, 32510 MiB, 7918 MiB, 24592 MiB
2022/05/06 21:46:37.415, Tesla V100-SXM2-32GB, 470.57.02, 75 %, 67 %, 32510 MiB, 7796 MiB, 24714 MiB
2022/05/06 21:46:37.415, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 86 %, 32510 MiB, 7918 MiB, 24592 MiB
2022/05/06 21:46:37.415, Tesla V100-SXM2-32GB, 470.57.02, 75 %, 67 %, 32510 MiB, 7796 MiB, 24714 MiB
2022/05/06 21:46:37.417, Tesla V100-SXM2-32GB, 470.57.02, 75 %, 67 %, 32510 MiB, 7796 MiB, 24714 MiB
2022/05/06 21:46:37.417, Tesla V100-SXM2-32GB, 470.57.02, 75 %, 67 %, 32510 MiB, 7796 MiB, 24714 MiB
2022/05/06 21:46:37.421, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 86 %, 32510 MiB, 7918 MiB, 24592 MiB
2022/05/06 21:46:37.421, Tesla V100-SXM2-32GB, 470.57.02, 75 %, 67 %, 32510 MiB, 7796 MiB, 24714 MiB
2022/05/06 21:46:37.423, Tesla V100-SXM2-32GB, 470.57.02, 89 %, 80 %, 32510 MiB, 7924 MiB, 24586 MiB
2022/05/06 21:46:37.423, Tesla V100-SXM2-32GB, 470.57.02, 93 %, 83 %, 32510 MiB, 7700 MiB, 24810 MiB
2022/05/06 21:46:37.423, Tesla V100-SXM2-32GB, 470.57.02, 89 %, 80 %, 32510 MiB, 7924 MiB, 24586 MiB
2022/05/06 21:46:37.424, Tesla V100-SXM2-32GB, 470.57.02, 93 %, 83 %, 32510 MiB, 7700 MiB, 24810 MiB
2022/05/06 21:46:37.425, Tesla V100-SXM2-32GB, 470.57.02, 93 %, 83 %, 32510 MiB, 7700 MiB, 24810 MiB
2022/05/06 21:46:37.425, Tesla V100-SXM2-32GB, 470.57.02, 93 %, 83 %, 32510 MiB, 7700 MiB, 24810 MiB
2022/05/06 21:46:37.428, Tesla V100-SXM2-32GB, 470.57.02, 89 %, 80 %, 32510 MiB, 7924 MiB, 24586 MiB
2022/05/06 21:46:37.429, Tesla V100-SXM2-32GB, 470.57.02, 93 %, 83 %, 32510 MiB, 7700 MiB, 24810 MiB
2022/05/06 21:46:37.430, Tesla V100-SXM2-32GB, 470.57.02, 75 %, 67 %, 32510 MiB, 7796 MiB, 24714 MiB
2022/05/06 21:46:37.431, Tesla V100-SXM2-32GB, 470.57.02, 82 %, 74 %, 32510 MiB, 7854 MiB, 24656 MiB
2022/05/06 21:46:37.431, Tesla V100-SXM2-32GB, 470.57.02, 75 %, 67 %, 32510 MiB, 7796 MiB, 24714 MiB
2022/05/06 21:46:37.431, Tesla V100-SXM2-32GB, 470.57.02, 82 %, 74 %, 32510 MiB, 7854 MiB, 24656 MiB
2022/05/06 21:46:37.432, Tesla V100-SXM2-32GB, 470.57.02, 82 %, 74 %, 32510 MiB, 7854 MiB, 24656 MiB
2022/05/06 21:46:37.433, Tesla V100-SXM2-32GB, 470.57.02, 82 %, 74 %, 32510 MiB, 7854 MiB, 24656 MiB
2022/05/06 21:46:37.436, Tesla V100-SXM2-32GB, 470.57.02, 75 %, 67 %, 32510 MiB, 7796 MiB, 24714 MiB
2022/05/06 21:46:37.437, Tesla V100-SXM2-32GB, 470.57.02, 82 %, 74 %, 32510 MiB, 7854 MiB, 24656 MiB
2022/05/06 21:46:37.438, Tesla V100-SXM2-32GB, 470.57.02, 93 %, 83 %, 32510 MiB, 7700 MiB, 24810 MiB
2022/05/06 21:46:37.438, Tesla V100-SXM2-32GB, 470.57.02, 93 %, 83 %, 32510 MiB, 7700 MiB, 24810 MiB
2022/05/06 21:46:37.443, Tesla V100-SXM2-32GB, 470.57.02, 93 %, 83 %, 32510 MiB, 7700 MiB, 24810 MiB
2022/05/06 21:46:37.444, Tesla V100-SXM2-32GB, 470.57.02, 82 %, 74 %, 32510 MiB, 7854 MiB, 24656 MiB
2022/05/06 21:46:37.445, Tesla V100-SXM2-32GB, 470.57.02, 82 %, 74 %, 32510 MiB, 7854 MiB, 24656 MiB
2022/05/06 21:46:37.448, Tesla V100-SXM2-32GB, 470.57.02, 82 %, 74 %, 32510 MiB, 7854 MiB, 24656 MiB
[rank:2] [train], epoch: 0/1, iter: 200/312, loss: 0.86780, top1: 0.00088, throughput: 1331.31 | 2022-05-06 21:47:15.588
[rank:6] [train], epoch: 0/1, iter: 200/312, loss: 0.86775, top1: 0.00096, throughput: 1331.21 | 2022-05-06 21:47:15.589
[rank:7] [train], epoch: 0/1, iter: 200/312, loss: 0.86781, top1: 0.00070, throughput: 1331.22 | 2022-05-06 21:47:15.589
[rank:5] [train], epoch: 0/1, iter: 200/312, loss: 0.86803, top1: 0.00066, throughput: 1331.23 | 2022-05-06 21:47:15.590
[rank:1] [train], epoch: 0/1, iter: 200/312, loss: 0.86771, top1: 0.00086, throughput: 1331.20 | 2022-05-06 21:47:15.589
[rank:4] [train], epoch: 0/1, iter: 200/312, loss: 0.86770, top1: 0.00090, throughput: 1331.23 | 2022-05-06 21:47:15.590
[rank:3] [train], epoch: 0/1, iter: 200/312, loss: 0.86790, top1: 0.00102, throughput: 1331.20 | 2022-05-06 21:47:15.591
[rank:0] [train], epoch: 0/1, iter: 200/312, loss: 0.86792, top1: 0.00078, throughput: 1331.17 | 2022-05-06 21:47:15.590
[rank:2] [train], epoch: 0/1, iter: 300/312, loss: 0.86780, top1: 0.00086, throughput: 1353.69 | 2022-05-06 21:47:53.410
[rank:3] [train], epoch: 0/1, iter: 300/312, loss: 0.86788, top1: 0.00096, throughput: 1353.80 | 2022-05-06 21:47:53.410
[rank:1] [train], epoch: 0/1, iter: 300/312, loss: 0.86769, top1: 0.00064, throughput: 1353.75 | 2022-05-06 21:47:53.410
[rank:6] [train], epoch: 0/1, iter: 300/312, loss: 0.86773, top1: 0.00078, throughput: 1353.72 | 2022-05-06 21:47:53.411
[rank:5] [train], epoch: 0/1, iter: 300/312, loss: 0.86783, top1: 0.00107, throughput: 1353.75 | 2022-05-06 21:47:53.410
[rank:7] [train], epoch: 0/1, iter: 300/312, loss: 0.86751, top1: 0.00084, throughput: 1353.74 | 2022-05-06 21:47:53.410
[rank:0] [train], epoch: 0/1, iter: 300/312, loss: 0.86761, top1: 0.00092, throughput: 1353.76 | 2022-05-06 21:47:53.411
[rank:4] [train], epoch: 0/1, iter: 300/312, loss: 0.86796, top1: 0.00074, throughput: 1353.13 | 2022-05-06 21:47:53.428