loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
W20220410 22:52:30.552606 6934 rpc_client.cpp:190] LoadServer 10.7.202.19 Failed at 0 times error_code 14 error_message failed to connect to all addresses
------------------------ arguments ------------------------
batches_per_epoch ............................... 625
channel_last .................................... False
ddp ............................................. True
exit_num ........................................ 300
fuse_bn_add_relu ................................ False
fuse_bn_relu .................................... False
gpu_stat_file ................................... None
grad_clipping ................................... 0.0
graph ........................................... False
label_smoothing ................................. 0.1
learning_rate ................................... 2.048
legacy_init ..................................... False
load_path ....................................... None
lr_decay_type ................................... cosine
metric_local .................................... True
metric_train_acc ................................ True
momentum ........................................ 0.875
nccl_fusion_max_ops ............................. 24
nccl_fusion_threshold_mb ........................ 16
num_classes ..................................... 1000
num_devices_per_node ............................ 8
num_epochs ...................................... 1
num_nodes ....................................... 1
ofrecord_part_num ............................... 256
ofrecord_path ................................... /dataset/79846248
print_interval .................................. 100
print_timestamp ................................. False
samples_per_epoch ............................... 1281167
save_init ....................................... False
save_path ....................................... None
scale_grad ...................................... False
skip_eval ....................................... True
synthetic_data .................................. False
total_batches ................................... -1
train_batch_size ................................ 256
train_global_batch_size ......................... 2048
use_fp16 ........................................ False
use_gpu_decode .................................. False
val_batch_size .................................. 50
val_batches_per_epoch ........................... 125
val_global_batch_size ........................... 400
val_samples_per_epoch ........................... 50000
warmup_epochs ................................... 5
weight_decay .................................... 3.0517578125e-05
zero_init_residual .............................. True
-------------------- end of arguments ---------------------
***** Model Init *****
***** Model Init Finish, time escapled: 2.52429 s *****
W20220410 22:52:52.138615 7832 cudnn_conv_util.cpp:102] Currently available alogrithm (algo=0, require memory=0, idx=1) meeting requirments (max_workspace_size=1073741824, determinism=0) is not fastest. Fastest algorithm (3) requires memory 1520566288
W20220410 22:52:52.144766 7628 cudnn_conv_util.cpp:102] Currently available alogrithm (algo=0, require memory=0, idx=1) meeting requirments (max_workspace_size=1073741824, determinism=0) is not fastest. Fastest algorithm (3) requires memory 1520566288
W20220410 22:52:52.152626 7118 cudnn_conv_util.cpp:102] Currently available alogrithm (algo=0, require memory=0, idx=1) meeting requirments (max_workspace_size=1073741824, determinism=0) is not fastest. Fastest algorithm (3) requires memory 1520566288
W20220410 22:52:52.148108 7526 cudnn_conv_util.cpp:102] Currently available alogrithm (algo=0, require memory=0, idx=1) meeting requirments (max_workspace_size=1073741824, determinism=0) is not fastest. Fastest algorithm (3) requires memory 1520566288
W20220410 22:52:52.149274 7322 cudnn_conv_util.cpp:102] Currently available alogrithm (algo=0, require memory=0, idx=1) meeting requirments (max_workspace_size=1073741824, determinism=0) is not fastest. Fastest algorithm (3) requires memory 1520566288
W20220410 22:52:52.155409 7220 cudnn_conv_util.cpp:102] Currently available alogrithm (algo=0, require memory=0, idx=1) meeting requirments (max_workspace_size=1073741824, determinism=0) is not fastest. Fastest algorithm (3) requires memory 1520566288
W20220410 22:52:52.156186 7424 cudnn_conv_util.cpp:102] Currently available alogrithm (algo=0, require memory=0, idx=1) meeting requirments (max_workspace_size=1073741824, determinism=0) is not fastest. Fastest algorithm (3) requires memory 1520566288
W20220410 22:52:52.152506 7730 cudnn_conv_util.cpp:102] Currently available alogrithm (algo=0, require memory=0, idx=1) meeting requirments (max_workspace_size=1073741824, determinism=0) is not fastest. Fastest algorithm (3) requires memory 1520566288
[rank:1] [train], epoch: 0/1, iter: 100/625, loss: 0.86727, lr: 0.000000, top1: 0.00148, throughput: 293.63 | 2022-04-10 22:54:12.586
[rank:6] [train], epoch: 0/1, iter: 100/625, loss: 0.86722, lr: 0.000000, top1: 0.00109, throughput: 293.70 | 2022-04-10 22:54:12.611
[rank:2] [train], epoch: 0/1, iter: 100/625, loss: 0.86748, lr: 0.000000, top1: 0.00098, throughput: 293.58 | 2022-04-10 22:54:12.610
[rank:4] [train], epoch: 0/1, iter: 100/625, loss: 0.86701, lr: 0.000000, top1: 0.00094, throughput: 293.73 | 2022-04-10 22:54:12.628
[rank:5] [train], epoch: 0/1, iter: 100/625, loss: 0.86737, lr: 0.000000, top1: 0.00113, throughput: 293.73 | 2022-04-10 22:54:12.638
[rank:3] [train], epoch: 0/1, iter: 100/625, loss: 0.86743, lr: 0.000000, top1: 0.00121, throughput: 293.62 | 2022-04-10 22:54:12.648
[rank:0] [train], epoch: 0/1, iter: 100/625, loss: 0.86729, lr: 0.000000, top1: 0.00098, throughput: 293.62 | 2022-04-10 22:54:12.650
[rank:7] [train], epoch: 0/1, iter: 100/625, loss: 0.86753, lr: 0.000000, top1: 0.00102, throughput: 293.63 | 2022-04-10 22:54:12.667
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2022/04/10 22:54:12.795, Tesla V100-SXM2-32GB, 470.57.02, 33 %, 20 %, 32510 MiB, 5184 MiB, 27326 MiB
2022/04/10 22:54:12.802, Tesla V100-SXM2-32GB, 470.57.02, 71 %, 30 %, 32510 MiB, 5204 MiB, 27306 MiB
2022/04/10 22:54:12.807, Tesla V100-SXM2-32GB, 470.57.02, 49 %, 8 %, 32510 MiB, 5280 MiB, 27230 MiB
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2022/04/10 22:54:12.811, Tesla V100-SXM2-32GB, 470.57.02, 34 %, 23 %, 32510 MiB, 5256 MiB, 27254 MiB
2022/04/10 22:54:12.815, Tesla V100-SXM2-32GB, 470.57.02, 33 %, 20 %, 32510 MiB, 5184 MiB, 27326 MiB
2022/04/10 22:54:12.818, Tesla V100-SXM2-32GB, 470.57.02, 57 %, 28 %, 32510 MiB, 5264 MiB, 27246 MiB
2022/04/10 22:54:12.823, Tesla V100-SXM2-32GB, 470.57.02, 48 %, 25 %, 32510 MiB, 5204 MiB, 27306 MiB
2022/04/10 22:54:12.825, Tesla V100-SXM2-32GB, 470.57.02, 33 %, 10 %, 32510 MiB, 5196 MiB, 27314 MiB
2022/04/10 22:54:12.836, Tesla V100-SXM2-32GB, 470.57.02, 49 %, 8 %, 32510 MiB, 5280 MiB, 27230 MiB
2022/04/10 22:54:12.838, Tesla V100-SXM2-32GB, 470.57.02, 89 %, 45 %, 32510 MiB, 5140 MiB, 27370 MiB
2022/04/10 22:54:12.843, Tesla V100-SXM2-32GB, 470.57.02, 77 %, 56 %, 32510 MiB, 5256 MiB, 27254 MiB
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2022/04/10 22:54:12.844, Tesla V100-SXM2-32GB, 470.57.02, 65 %, 54 %, 32510 MiB, 5220 MiB, 27290 MiB
2022/04/10 22:54:12.855, Tesla V100-SXM2-32GB, 470.57.02, 57 %, 28 %, 32510 MiB, 5264 MiB, 27246 MiB
2022/04/10 22:54:12.857, Tesla V100-SXM2-32GB, 470.57.02, 69 %, 54 %, 32510 MiB, 5184 MiB, 27326 MiB
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2022/04/10 22:54:12.862, Tesla V100-SXM2-32GB, 470.57.02, 84 %, 69 %, 32510 MiB, 5196 MiB, 27314 MiB
2022/04/10 22:54:12.864, Tesla V100-SXM2-32GB, 470.57.02, 48 %, 25 %, 32510 MiB, 5204 MiB, 27306 MiB
2022/04/10 22:54:12.867, Tesla V100-SXM2-32GB, 470.57.02, 69 %, 53 %, 32510 MiB, 5184 MiB, 27326 MiB
2022/04/10 22:54:12.868, Tesla V100-SXM2-32GB, 470.57.02, 89 %, 45 %, 32510 MiB, 5140 MiB, 27370 MiB
2022/04/10 22:54:12.883, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 84 %, 32510 MiB, 5280 MiB, 27230 MiB
2022/04/10 22:54:12.927, Tesla V100-SXM2-32GB, 470.57.02, 48 %, 24 %, 32510 MiB, 5204 MiB, 27306 MiB
2022/04/10 22:54:12.928, Tesla V100-SXM2-32GB, 470.57.02, 65 %, 53 %, 32510 MiB, 5220 MiB, 27290 MiB
2022/04/10 22:54:12.940, Tesla V100-SXM2-32GB, 470.57.02, 77 %, 55 %, 32510 MiB, 5256 MiB, 27254 MiB
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2022/04/10 22:54:12.967, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 84 %, 32510 MiB, 5280 MiB, 27230 MiB
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2022/04/10 22:54:12.982, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 75 %, 32510 MiB, 5264 MiB, 27246 MiB
2022/04/10 22:54:12.988, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 67 %, 32510 MiB, 5184 MiB, 27326 MiB
2022/04/10 22:54:12.988, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 67 %, 32510 MiB, 5256 MiB, 27254 MiB
2022/04/10 22:54:12.990, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 67 %, 32510 MiB, 5184 MiB, 27326 MiB
2022/04/10 22:54:13.006, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 67 %, 32510 MiB, 5184 MiB, 27326 MiB
2022/04/10 22:54:13.007, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 64 %, 32510 MiB, 5196 MiB, 27314 MiB
2022/04/10 22:54:13.006, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 67 %, 32510 MiB, 5184 MiB, 27326 MiB
2022/04/10 22:54:13.018, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 76 %, 32510 MiB, 5204 MiB, 27306 MiB
2022/04/10 22:54:13.018, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 75 %, 32510 MiB, 5264 MiB, 27246 MiB
2022/04/10 22:54:13.034, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 76 %, 32510 MiB, 5204 MiB, 27306 MiB
2022/04/10 22:54:13.038, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 76 %, 32510 MiB, 5204 MiB, 27306 MiB
2022/04/10 22:54:13.039, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 70 %, 32510 MiB, 5140 MiB, 27370 MiB
2022/04/10 22:54:13.039, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 76 %, 32510 MiB, 5204 MiB, 27306 MiB
2022/04/10 22:54:13.042, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 59 %, 32510 MiB, 5280 MiB, 27230 MiB
2022/04/10 22:54:13.043, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 64 %, 32510 MiB, 5196 MiB, 27314 MiB
2022/04/10 22:54:13.044, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 59 %, 32510 MiB, 5280 MiB, 27230 MiB
2022/04/10 22:54:13.046, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 59 %, 32510 MiB, 5280 MiB, 27230 MiB
2022/04/10 22:54:13.046, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 67 %, 32510 MiB, 5220 MiB, 27290 MiB
2022/04/10 22:54:13.047, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 59 %, 32510 MiB, 5280 MiB, 27230 MiB
2022/04/10 22:54:13.062, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 67 %, 32510 MiB, 5256 MiB, 27254 MiB
2022/04/10 22:54:13.062, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 70 %, 32510 MiB, 5140 MiB, 27370 MiB
2022/04/10 22:54:13.064, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 67 %, 32510 MiB, 5256 MiB, 27254 MiB
2022/04/10 22:54:13.066, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 67 %, 32510 MiB, 5256 MiB, 27254 MiB
2022/04/10 22:54:13.066, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 67 %, 32510 MiB, 5256 MiB, 27254 MiB
2022/04/10 22:54:13.072, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 75 %, 32510 MiB, 5264 MiB, 27246 MiB
2022/04/10 22:54:13.073, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 67 %, 32510 MiB, 5220 MiB, 27290 MiB
2022/04/10 22:54:13.078, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 75 %, 32510 MiB, 5264 MiB, 27246 MiB
2022/04/10 22:54:13.080, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 75 %, 32510 MiB, 5264 MiB, 27246 MiB
2022/04/10 22:54:13.081, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 75 %, 32510 MiB, 5264 MiB, 27246 MiB
2022/04/10 22:54:13.084, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 64 %, 32510 MiB, 5196 MiB, 27314 MiB
2022/04/10 22:54:13.085, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 64 %, 32510 MiB, 5196 MiB, 27314 MiB
2022/04/10 22:54:13.087, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 64 %, 32510 MiB, 5196 MiB, 27314 MiB
2022/04/10 22:54:13.089, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 64 %, 32510 MiB, 5196 MiB, 27314 MiB
2022/04/10 22:54:13.093, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 70 %, 32510 MiB, 5140 MiB, 27370 MiB
2022/04/10 22:54:13.097, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 70 %, 32510 MiB, 5140 MiB, 27370 MiB
2022/04/10 22:54:13.099, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 70 %, 32510 MiB, 5140 MiB, 27370 MiB
2022/04/10 22:54:13.099, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 70 %, 32510 MiB, 5140 MiB, 27370 MiB
2022/04/10 22:54:13.102, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 67 %, 32510 MiB, 5220 MiB, 27290 MiB
2022/04/10 22:54:13.104, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 67 %, 32510 MiB, 5220 MiB, 27290 MiB
2022/04/10 22:54:13.109, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 67 %, 32510 MiB, 5220 MiB, 27290 MiB
2022/04/10 22:54:13.109, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 67 %, 32510 MiB, 5220 MiB, 27290 MiB
[rank:0] [train], epoch: 0/1, iter: 200/625, loss: 0.86749, lr: 0.000000, top1: 0.00105, throughput: 319.97 | 2022-04-10 22:55:32.658
[rank:7] [train], epoch: 0/1, iter: 200/625, loss: 0.86722, lr: 0.000000, top1: 0.00102, throughput: 320.03 | 2022-04-10 22:55:32.658
[rank:1] [train], epoch: 0/1, iter: 200/625, loss: 0.86709, lr: 0.000000, top1: 0.00152, throughput: 319.69 | 2022-04-10 22:55:32.665
[rank:3] [train], epoch: 0/1, iter: 200/625, loss: 0.86731, lr: 0.000000, top1: 0.00125, throughput: 319.83 | 2022-04-10 22:55:32.692
[rank:6] [train], epoch: 0/1, iter: 200/625, loss: 0.86732, lr: 0.000000, top1: 0.00121, throughput: 319.67 | 2022-04-10 22:55:32.693
[rank:5] [train], epoch: 0/1, iter: 200/625, loss: 0.86731, lr: 0.000000, top1: 0.00105, throughput: 319.75 | 2022-04-10 22:55:32.699
[rank:2] [train], epoch: 0/1, iter: 200/625, loss: 0.86754, lr: 0.000000, top1: 0.00117, throughput: 319.57 | 2022-04-10 22:55:32.718
[rank:4] [train], epoch: 0/1, iter: 200/625, loss: 0.86746, lr: 0.000000, top1: 0.00141, throughput: 319.60 | 2022-04-10 22:55:32.727
[rank:5] [train], epoch: 0/1, iter: 300/625, loss: 0.86732, lr: 0.000000, top1: 0.00160, throughput: 320.64 | 2022-04-10 22:56:52.540
[rank:3] [train], epoch: 0/1, iter: 300/625, loss: 0.86746, lr: 0.000000, top1: 0.00137, throughput: 320.55 | 2022-04-10 22:56:52.555
[rank:6] [train], epoch: 0/1, iter: 300/625, loss: 0.86735, lr: 0.000000, top1: 0.00109, throughput: 320.53 | 2022-04-10 22:56:52.562
[rank:0] [train], epoch: 0/1, iter: 300/625, loss: 0.86711, lr: 0.000000, top1: 0.00109, throughput: 320.38 | 2022-04-10 22:56:52.564
[rank:7] [train], epoch: 0/1, iter: 300/625, loss: 0.86749, lr: 0.000000, top1: 0.00117, throughput: 320.32 | 2022-04-10 22:56:52.579
[rank:4] [train], epoch: 0/1, iter: 300/625, loss: 0.86759, lr: 0.000000, top1: 0.00082, throughput: 320.57 | 2022-04-10 22:56:52.585
[rank:2] [train], epoch: 0/1, iter: 300/625, loss: 0.86756, lr: 0.000000, top1: 0.00105, throughput: 320.52 | 2022-04-10 22:56:52.587
[rank:1] [train], epoch: 0/1, iter: 300/625, loss: 0.86712, lr: 0.000000, top1: 0.00133, throughput: 320.24 | 2022-04-10 22:56:52.605