loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
W20220513 09:59:16.472806 404 rpc_client.cpp:190] LoadServer 10.7.101.119 Failed at 0 times error_code 14 error_message failed to connect to all addresses
W20220513 09:59:16.474586 405 rpc_client.cpp:190] LoadServer 10.7.101.119 Failed at 0 times error_code 14 error_message failed to connect to all addresses
W20220513 09:59:16.477113 406 rpc_client.cpp:190] LoadServer 10.7.101.119 Failed at 0 times error_code 14 error_message failed to connect to all addresses
W20220513 09:59:16.478163 407 rpc_client.cpp:190] LoadServer 10.7.101.119 Failed at 0 times error_code 14 error_message failed to connect to all addresses
------------------------ arguments ------------------------
batches_per_epoch ............................... 312
channel_last .................................... True
ddp ............................................. False
exit_num ........................................ 300
fuse_bn_add_relu ................................ True
fuse_bn_relu .................................... True
gpu_stat_file ................................... None
grad_clipping ................................... 0.0
graph ........................................... True
label_smoothing ................................. 0.1
learning_rate ................................... 4.096
legacy_init ..................................... False
load_path ....................................... None
lr_decay_type ................................... cosine
metric_local .................................... True
metric_train_acc ................................ True
momentum ........................................ 0.875
nccl_fusion_max_ops ............................. 24
nccl_fusion_threshold_mb ........................ 16
num_classes ..................................... 1000
num_devices_per_node ............................ 8
num_epochs ...................................... 1
num_nodes ....................................... 1
ofrecord_part_num ............................... 256
ofrecord_path ................................... /dataset/79846248
print_interval .................................. 100
print_timestamp ................................. False
samples_per_epoch ............................... 1281167
save_init ....................................... False
save_path ....................................... None
scale_grad ...................................... True
skip_eval ....................................... True
synthetic_data .................................. False
total_batches ................................... -1
train_batch_size ................................ 512
train_global_batch_size ......................... 4096
use_fp16 ........................................ True
use_gpu_decode .................................. True
val_batch_size .................................. 50
val_batches_per_epoch ........................... 125
val_global_batch_size ........................... 400
val_samples_per_epoch ........................... 50000
warmup_epochs ................................... 5
weight_decay .................................... 3.0517578125e-05
zero_init_residual .............................. True
-------------------- end of arguments ---------------------
***** Model Init *****
***** Model Init Finish, time escapled: 2.80703 s *****
[rank:3] [train], epoch: 0/1, iter: 100/312, loss: 0.86783, top1: 0.00092, throughput: 421.68 | 2022-05-13 10:01:33.000
[rank:5] [train], epoch: 0/1, iter: 100/312, loss: 0.86738, top1: 0.00102, throughput: 421.70 | 2022-05-13 10:01:33.000
[rank:7] [train], epoch: 0/1, iter: 100/312, loss: 0.86785, top1: 0.00068, throughput: 421.65 | 2022-05-13 10:01:33.000
[rank:4] [train], epoch: 0/1, iter: 100/312, loss: 0.86755, top1: 0.00104, throughput: 421.66 | 2022-05-13 10:01:33.001
[rank:0] [train], epoch: 0/1, iter: 100/312, loss: 0.86763, top1: 0.00125, throughput: 421.66 | 2022-05-13 10:01:33.001
[rank:2] [train], epoch: 0/1, iter: 100/312, loss: 0.86753, top1: 0.00076, throughput: 421.66 | 2022-05-13 10:01:33.003
[rank:1] [train], epoch: 0/1, iter: 100/312, loss: 0.86788, top1: 0.00113, throughput: 421.64 | 2022-05-13 10:01:33.000
[rank:6] [train], epoch: 0/1, iter: 100/312, loss: 0.86781, top1: 0.00113, throughput: 421.55 | 2022-05-13 10:01:33.034
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2022/05/13 10:01:33.166, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 56 %, 32510 MiB, 7749 MiB, 24761 MiB
2022/05/13 10:01:33.186, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 59 %, 32510 MiB, 7790 MiB, 24720 MiB
2022/05/13 10:01:33.192, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 50 %, 32510 MiB, 7942 MiB, 24568 MiB
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2022/05/13 10:01:33.198, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 47 %, 32510 MiB, 7916 MiB, 24594 MiB
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2022/05/13 10:01:33.201, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 56 %, 32510 MiB, 7749 MiB, 24761 MiB
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2022/05/13 10:01:33.203, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 59 %, 32510 MiB, 7910 MiB, 24600 MiB
2022/05/13 10:01:33.202, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 56 %, 32510 MiB, 7749 MiB, 24761 MiB
2022/05/13 10:01:33.203, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 56 %, 32510 MiB, 7749 MiB, 24761 MiB
2022/05/13 10:01:33.204, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 56 %, 32510 MiB, 7749 MiB, 24761 MiB
2022/05/13 10:01:33.208, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 59 %, 32510 MiB, 7790 MiB, 24720 MiB
2022/05/13 10:01:33.210, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 57 %, 32510 MiB, 7808 MiB, 24702 MiB
2022/05/13 10:01:33.210, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 59 %, 32510 MiB, 7790 MiB, 24720 MiB
2022/05/13 10:01:33.210, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 56 %, 32510 MiB, 7749 MiB, 24761 MiB
2022/05/13 10:01:33.211, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 59 %, 32510 MiB, 7790 MiB, 24720 MiB
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2022/05/13 10:01:33.212, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 83 %, 32510 MiB, 7790 MiB, 24720 MiB
2022/05/13 10:01:33.216, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 50 %, 32510 MiB, 7942 MiB, 24568 MiB
2022/05/13 10:01:33.217, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 64 %, 32510 MiB, 7652 MiB, 24858 MiB
2022/05/13 10:01:33.217, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 50 %, 32510 MiB, 7942 MiB, 24568 MiB
2022/05/13 10:01:33.218, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 83 %, 32510 MiB, 7790 MiB, 24720 MiB
2022/05/13 10:01:33.218, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 50 %, 32510 MiB, 7942 MiB, 24568 MiB
2022/05/13 10:01:33.219, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 85 %, 32510 MiB, 7749 MiB, 24761 MiB
2022/05/13 10:01:33.220, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 50 %, 32510 MiB, 7942 MiB, 24568 MiB
2022/05/13 10:01:33.224, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 47 %, 32510 MiB, 7916 MiB, 24594 MiB
2022/05/13 10:01:33.226, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 85 %, 32510 MiB, 7812 MiB, 24698 MiB
2022/05/13 10:01:33.226, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 47 %, 32510 MiB, 7916 MiB, 24594 MiB
2022/05/13 10:01:33.231, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 50 %, 32510 MiB, 7942 MiB, 24568 MiB
2022/05/13 10:01:33.232, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 47 %, 32510 MiB, 7916 MiB, 24594 MiB
2022/05/13 10:01:33.233, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 83 %, 32510 MiB, 7790 MiB, 24720 MiB
2022/05/13 10:01:33.233, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 47 %, 32510 MiB, 7916 MiB, 24594 MiB
2022/05/13 10:01:33.237, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 59 %, 32510 MiB, 7910 MiB, 24600 MiB
2022/05/13 10:01:33.239, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 59 %, 32510 MiB, 7910 MiB, 24600 MiB
2022/05/13 10:01:33.239, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 47 %, 32510 MiB, 7916 MiB, 24594 MiB
2022/05/13 10:01:33.240, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 59 %, 32510 MiB, 7910 MiB, 24600 MiB
2022/05/13 10:01:33.241, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 50 %, 32510 MiB, 7942 MiB, 24568 MiB
2022/05/13 10:01:33.241, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 59 %, 32510 MiB, 7910 MiB, 24600 MiB
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2022/05/13 10:01:33.245, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 57 %, 32510 MiB, 7808 MiB, 24702 MiB
2022/05/13 10:01:33.246, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 57 %, 32510 MiB, 7808 MiB, 24702 MiB
2022/05/13 10:01:33.247, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 59 %, 32510 MiB, 7910 MiB, 24600 MiB
2022/05/13 10:01:33.247, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 57 %, 32510 MiB, 7808 MiB, 24702 MiB
2022/05/13 10:01:33.248, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 47 %, 32510 MiB, 7916 MiB, 24594 MiB
2022/05/13 10:01:33.249, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 57 %, 32510 MiB, 7808 MiB, 24702 MiB
2022/05/13 10:01:33.253, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 64 %, 32510 MiB, 7652 MiB, 24858 MiB
2022/05/13 10:01:33.252, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 85 %, 32510 MiB, 7749 MiB, 24761 MiB
2022/05/13 10:01:33.254, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 64 %, 32510 MiB, 7652 MiB, 24858 MiB
2022/05/13 10:01:33.255, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 57 %, 32510 MiB, 7808 MiB, 24702 MiB
2022/05/13 10:01:33.255, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 64 %, 32510 MiB, 7652 MiB, 24858 MiB
2022/05/13 10:01:33.256, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 59 %, 32510 MiB, 7910 MiB, 24600 MiB
2022/05/13 10:01:33.257, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 64 %, 32510 MiB, 7652 MiB, 24858 MiB
2022/05/13 10:01:33.264, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 85 %, 32510 MiB, 7812 MiB, 24698 MiB
2022/05/13 10:01:33.264, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 83 %, 32510 MiB, 7790 MiB, 24720 MiB
2022/05/13 10:01:33.266, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 85 %, 32510 MiB, 7812 MiB, 24698 MiB
2022/05/13 10:01:33.266, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 64 %, 32510 MiB, 7652 MiB, 24858 MiB
2022/05/13 10:01:33.266, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 85 %, 32510 MiB, 7812 MiB, 24698 MiB
2022/05/13 10:01:33.268, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 57 %, 32510 MiB, 7808 MiB, 24702 MiB
2022/05/13 10:01:33.268, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 85 %, 32510 MiB, 7812 MiB, 24698 MiB
2022/05/13 10:01:33.276, Tesla V100-SXM2-32GB, 470.57.02, 96 %, 86 %, 32510 MiB, 7942 MiB, 24568 MiB
2022/05/13 10:01:33.277, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 85 %, 32510 MiB, 7812 MiB, 24698 MiB
2022/05/13 10:01:33.279, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 64 %, 32510 MiB, 7652 MiB, 24858 MiB
2022/05/13 10:01:33.284, Tesla V100-SXM2-32GB, 470.57.02, 79 %, 73 %, 32510 MiB, 7916 MiB, 24594 MiB
2022/05/13 10:01:33.286, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 85 %, 32510 MiB, 7812 MiB, 24698 MiB
2022/05/13 10:01:33.289, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 59 %, 32510 MiB, 7910 MiB, 24600 MiB
2022/05/13 10:01:33.295, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 57 %, 32510 MiB, 7808 MiB, 24702 MiB
2022/05/13 10:01:33.311, Tesla V100-SXM2-32GB, 470.57.02, 74 %, 67 %, 32510 MiB, 7652 MiB, 24858 MiB
2022/05/13 10:01:33.326, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 85 %, 32510 MiB, 7812 MiB, 24698 MiB
[rank:3] [train], epoch: 0/1, iter: 200/312, loss: 0.86793, top1: 0.00090, throughput: 1346.48 | 2022-05-13 10:02:11.025
[rank:2] [train], epoch: 0/1, iter: 200/312, loss: 0.86768, top1: 0.00092, throughput: 1346.58 | 2022-05-13 10:02:11.025
[rank:7] [train], epoch: 0/1, iter: 200/312, loss: 0.86773, top1: 0.00096, throughput: 1346.46 | 2022-05-13 10:02:11.026
[rank:5] [train], epoch: 0/1, iter: 200/312, loss: 0.86756, top1: 0.00082, throughput: 1346.49 | 2022-05-13 10:02:11.025
[rank:6] [train], epoch: 0/1, iter: 200/312, loss: 0.86761, top1: 0.00092, throughput: 1347.64 | 2022-05-13 10:02:11.026
[rank:1] [train], epoch: 0/1, iter: 200/312, loss: 0.86772, top1: 0.00100, throughput: 1346.53 | 2022-05-13 10:02:11.024
[rank:0] [train], epoch: 0/1, iter: 200/312, loss: 0.86752, top1: 0.00135, throughput: 1346.46 | 2022-05-13 10:02:11.027
[rank:4] [train], epoch: 0/1, iter: 200/312, loss: 0.86755, top1: 0.00105, throughput: 1346.38 | 2022-05-13 10:02:11.028
[rank:1] [train], epoch: 0/1, iter: 300/312, loss: 0.86760, top1: 0.00074, throughput: 1374.56 | 2022-05-13 10:02:48.273
[rank:3] [train], epoch: 0/1, iter: 300/312, loss: 0.86760, top1: 0.00092, throughput: 1374.58 | 2022-05-13 10:02:48.273
[rank:2] [train], epoch: 0/1, iter: 300/312, loss: 0.86784, top1: 0.00096, throughput: 1374.58 | 2022-05-13 10:02:48.273
[rank:5] [train], epoch: 0/1, iter: 300/312, loss: 0.86764, top1: 0.00088, throughput: 1374.58 | 2022-05-13 10:02:48.273
[rank:7] [train], epoch: 0/1, iter: 300/312, loss: 0.86765, top1: 0.00102, throughput: 1374.61 | 2022-05-13 10:02:48.273
[rank:6] [train], epoch: 0/1, iter: 300/312, loss: 0.86767, top1: 0.00113, throughput: 1374.56 | 2022-05-13 10:02:48.274
[rank:0] [train], epoch: 0/1, iter: 300/312, loss: 0.86779, top1: 0.00102, throughput: 1374.59 | 2022-05-13 10:02:48.274
[rank:4] [train], epoch: 0/1, iter: 300/312, loss: 0.86778, top1: 0.00109, throughput: 1374.18 | 2022-05-13 10:02:48.287