loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
W20220428 10:31:54.974186 2037 rpc_client.cpp:190] LoadServer 10.7.252.45 Failed at 0 times error_code 14 error_message failed to connect to all addresses
W20220428 10:31:54.975073 2039 rpc_client.cpp:190] LoadServer 10.7.252.45 Failed at 0 times error_code 14 error_message failed to connect to all addresses
W20220428 10:31:54.975169 2042 rpc_client.cpp:190] LoadServer 10.7.252.45 Failed at 0 times error_code 14 error_message failed to connect to all addresses
------------------------ arguments ------------------------
  batches_per_epoch ............................... 625
  channel_last .................................... False
  ddp ............................................. False
  exit_num ........................................ 300
  fuse_bn_add_relu ................................ True
  fuse_bn_relu .................................... True
  gpu_stat_file ................................... None
  grad_clipping ................................... 0.0
  graph ........................................... True
  label_smoothing ................................. 0.1
  learning_rate ................................... 2.048
  legacy_init ..................................... False
  load_path ....................................... None
  lr_decay_type ................................... cosine
  metric_local .................................... True
  metric_train_acc ................................ True
  momentum ........................................ 0.875
  nccl_fusion_max_ops ............................. 24
  nccl_fusion_threshold_mb ........................ 16
  num_classes ..................................... 1000
  num_devices_per_node ............................ 8
  num_epochs ...................................... 1
  num_nodes ....................................... 1
  ofrecord_part_num ............................... 256
  ofrecord_path ................................... /dataset/79846248
  print_interval .................................. 100
  print_timestamp ................................. False
  samples_per_epoch ............................... 1281167
  save_init ....................................... False
  save_path ....................................... None
  scale_grad ...................................... True
  skip_eval ....................................... True
  synthetic_data .................................. False
  total_batches ................................... -1
  train_batch_size ................................ 256
  train_global_batch_size ......................... 2048
  use_fp16 ........................................ False
  use_gpu_decode .................................. True
  val_batch_size .................................. 50
  val_batches_per_epoch ........................... 125
  val_global_batch_size ........................... 400
  val_samples_per_epoch ........................... 50000
  warmup_epochs ................................... 5
  weight_decay .................................... 3.0517578125e-05
  zero_init_residual .............................. True
-------------------- end of arguments ---------------------
***** Model Init *****
***** Model Init Finish, time escapled: 3.02669 s *****
[rank:0] [train], epoch: 0/1, iter: 100/625, loss: 0.86769, top1: 0.00129, throughput: 263.49 | 2022-04-28 10:33:47.363
[rank:3] [train], epoch: 0/1, iter: 100/625, loss: 0.86779, top1: 0.00109, throughput: 263.49 | 2022-04-28 10:33:47.362
[rank:2] [train], epoch: 0/1, iter: 100/625, loss: 0.86769, top1: 0.00121, throughput: 263.50 | 2022-04-28 10:33:47.364
[rank:6] [train], epoch: 0/1, iter: 100/625, loss: 0.86757, top1: 0.00137, throughput: 263.49 | 2022-04-28 10:33:47.362
[rank:1] [train], epoch: 0/1, iter: 100/625, loss: 0.86790, top1: 0.00082, throughput: 263.50 | 2022-04-28 10:33:47.363
[rank:5] [train], epoch: 0/1, iter: 100/625, loss: 0.86769, top1: 0.00102, throughput: 263.49 | 2022-04-28 10:33:47.364
[rank:4] [train], epoch: 0/1, iter: 100/625, loss: 0.86759, top1: 0.00137, throughput: 263.49 | 2022-04-28 10:33:47.365
[rank:7] [train], epoch: 0/1, iter: 100/625, loss: 0.86756, top1: 0.00113, throughput: 263.52 | 2022-04-28 10:33:47.362
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2022/04/28 10:33:47.640, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 60 %, 32510 MiB, 7811 MiB, 24699 MiB
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2022/04/28 10:33:47.646, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 60 %, 32510 MiB, 7811 MiB, 24699 MiB
2022/04/28 10:33:47.647, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 61 %, 32510 MiB, 7708 MiB, 24802 MiB
2022/04/28 10:33:47.648, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 60 %, 32510 MiB, 7811 MiB, 24699 MiB
2022/04/28 10:33:47.662, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 60 %, 32510 MiB, 7811 MiB, 24699 MiB
2022/04/28 10:33:47.666, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 61 %, 32510 MiB, 7708 MiB, 24802 MiB
2022/04/28 10:33:47.666, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 58 %, 32510 MiB, 7886 MiB, 24624 MiB
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2022/04/28 10:33:47.668, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 61 %, 32510 MiB, 7708 MiB, 24802 MiB
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2022/04/28 10:33:47.671, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 61 %, 32510 MiB, 7708 MiB, 24802 MiB
2022/04/28 10:33:47.672, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 58 %, 32510 MiB, 7886 MiB, 24624 MiB
2022/04/28 10:33:47.673, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 60 %, 32510 MiB, 7914 MiB, 24596 MiB
2022/04/28 10:33:47.674, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 58 %, 32510 MiB, 7886 MiB, 24624 MiB
2022/04/28 10:33:47.674, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 60 %, 32510 MiB, 7811 MiB, 24699 MiB
2022/04/28 10:33:47.675, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 60 %, 32510 MiB, 7811 MiB, 24699 MiB
2022/04/28 10:33:47.676, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 60 %, 32510 MiB, 7811 MiB, 24699 MiB
2022/04/28 10:33:47.677, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 60 %, 32510 MiB, 7811 MiB, 24699 MiB
2022/04/28 10:33:47.679, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 58 %, 32510 MiB, 7886 MiB, 24624 MiB
2022/04/28 10:33:47.682, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 60 %, 32510 MiB, 7914 MiB, 24596 MiB
2022/04/28 10:33:47.682, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 62 %, 32510 MiB, 7882 MiB, 24628 MiB
2022/04/28 10:33:47.684, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 60 %, 32510 MiB, 7914 MiB, 24596 MiB
2022/04/28 10:33:47.684, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 61 %, 32510 MiB, 7708 MiB, 24802 MiB
2022/04/28 10:33:47.685, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 61 %, 32510 MiB, 7708 MiB, 24802 MiB
2022/04/28 10:33:47.687, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 61 %, 32510 MiB, 7708 MiB, 24802 MiB
2022/04/28 10:33:47.688, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 61 %, 32510 MiB, 7708 MiB, 24802 MiB
2022/04/28 10:33:47.688, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 60 %, 32510 MiB, 7914 MiB, 24596 MiB
2022/04/28 10:33:47.690, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 62 %, 32510 MiB, 7882 MiB, 24628 MiB
2022/04/28 10:33:47.690, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 59 %, 32510 MiB, 7776 MiB, 24734 MiB
2022/04/28 10:33:47.692, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 62 %, 32510 MiB, 7882 MiB, 24628 MiB
2022/04/28 10:33:47.693, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 58 %, 32510 MiB, 7886 MiB, 24624 MiB
2022/04/28 10:33:47.693, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 58 %, 32510 MiB, 7886 MiB, 24624 MiB
2022/04/28 10:33:47.695, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 58 %, 32510 MiB, 7886 MiB, 24624 MiB
2022/04/28 10:33:47.696, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 58 %, 32510 MiB, 7886 MiB, 24624 MiB
2022/04/28 10:33:47.696, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 62 %, 32510 MiB, 7882 MiB, 24628 MiB
2022/04/28 10:33:47.699, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 59 %, 32510 MiB, 7776 MiB, 24734 MiB
2022/04/28 10:33:47.699, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 61 %, 32510 MiB, 7592 MiB, 24918 MiB
2022/04/28 10:33:47.701, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 59 %, 32510 MiB, 7776 MiB, 24734 MiB
2022/04/28 10:33:47.701, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 60 %, 32510 MiB, 7914 MiB, 24596 MiB
2022/04/28 10:33:47.702, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 60 %, 32510 MiB, 7914 MiB, 24596 MiB
2022/04/28 10:33:47.704, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 60 %, 32510 MiB, 7914 MiB, 24596 MiB
2022/04/28 10:33:47.705, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 60 %, 32510 MiB, 7914 MiB, 24596 MiB
2022/04/28 10:33:47.705, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 59 %, 32510 MiB, 7776 MiB, 24734 MiB
2022/04/28 10:33:47.707, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 61 %, 32510 MiB, 7592 MiB, 24918 MiB
2022/04/28 10:33:47.707, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 62 %, 32510 MiB, 7734 MiB, 24776 MiB
2022/04/28 10:33:47.709, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 61 %, 32510 MiB, 7592 MiB, 24918 MiB
2022/04/28 10:33:47.710, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 62 %, 32510 MiB, 7882 MiB, 24628 MiB
2022/04/28 10:33:47.710, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 62 %, 32510 MiB, 7882 MiB, 24628 MiB
2022/04/28 10:33:47.712, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 62 %, 32510 MiB, 7882 MiB, 24628 MiB
2022/04/28 10:33:47.713, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 62 %, 32510 MiB, 7882 MiB, 24628 MiB
2022/04/28 10:33:47.713, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 61 %, 32510 MiB, 7592 MiB, 24918 MiB
2022/04/28 10:33:47.716, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 62 %, 32510 MiB, 7734 MiB, 24776 MiB
2022/04/28 10:33:47.718, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 62 %, 32510 MiB, 7734 MiB, 24776 MiB
2022/04/28 10:33:47.718, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 59 %, 32510 MiB, 7776 MiB, 24734 MiB
2022/04/28 10:33:47.719, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 59 %, 32510 MiB, 7776 MiB, 24734 MiB
2022/04/28 10:33:47.721, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 59 %, 32510 MiB, 7776 MiB, 24734 MiB
2022/04/28 10:33:47.722, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 59 %, 32510 MiB, 7776 MiB, 24734 MiB
2022/04/28 10:33:47.722, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 62 %, 32510 MiB, 7734 MiB, 24776 MiB
2022/04/28 10:33:47.727, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 61 %, 32510 MiB, 7592 MiB, 24918 MiB
2022/04/28 10:33:47.727, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 61 %, 32510 MiB, 7592 MiB, 24918 MiB
2022/04/28 10:33:47.729, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 61 %, 32510 MiB, 7592 MiB, 24918 MiB
2022/04/28 10:33:47.730, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 61 %, 32510 MiB, 7592 MiB, 24918 MiB
2022/04/28 10:33:47.734, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 62 %, 32510 MiB, 7734 MiB, 24776 MiB
2022/04/28 10:33:47.735, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 62 %, 32510 MiB, 7734 MiB, 24776 MiB
2022/04/28 10:33:47.737, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 62 %, 32510 MiB, 7734 MiB, 24776 MiB
2022/04/28 10:33:47.737, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 62 %, 32510 MiB, 7734 MiB, 24776 MiB
[rank:5] [train], epoch: 0/1, iter: 200/625, loss: 0.86757, top1: 0.00109, throughput: 371.33 | 2022-04-28 10:34:56.305
[rank:3] [train], epoch: 0/1, iter: 200/625, loss: 0.86768, top1: 0.00129, throughput: 371.32 | 2022-04-28 10:34:56.305
[rank:1] [train], epoch: 0/1, iter: 200/625, loss: 0.86785, top1: 0.00102, throughput: 371.33 | 2022-04-28 10:34:56.305
[rank:0] [train], epoch: 0/1, iter: 200/625, loss: 0.86769, top1: 0.00105, throughput: 371.32 | 2022-04-28 10:34:56.306
[rank:6] [train], epoch: 0/1, iter: 200/625, loss: 0.86762, top1: 0.00129, throughput: 371.32 | 2022-04-28 10:34:56.306
[rank:7] [train], epoch: 0/1, iter: 200/625, loss: 0.86764, top1: 0.00109, throughput: 371.32 | 2022-04-28 10:34:56.306
[rank:2] [train], epoch: 0/1, iter: 200/625, loss: 0.86774, top1: 0.00133, throughput: 371.33 | 2022-04-28 10:34:56.306
[rank:4] [train], epoch: 0/1, iter: 200/625, loss: 0.86769, top1: 0.00098, throughput: 371.32 | 2022-04-28 10:34:56.309
[rank:0] [train], epoch: 0/1, iter: 300/625, loss: 0.86773, top1: 0.00105, throughput: 373.81 | 2022-04-28 10:36:04.791
[rank:1] [train], epoch: 0/1, iter: 300/625, loss: 0.86787, top1: 0.00152, throughput: 373.80 | 2022-04-28 10:36:04.791
[rank:6] [train], epoch: 0/1, iter: 300/625, loss: 0.86769, top1: 0.00121, throughput: 373.81 | 2022-04-28 10:36:04.791
[rank:2] [train], epoch: 0/1, iter: 300/625, loss: 0.86769, top1: 0.00098, throughput: 373.80 | 2022-04-28 10:36:04.791
[rank:5] [train], epoch: 0/1, iter: 300/625, loss: 0.86782, top1: 0.00105, throughput: 373.80 | 2022-04-28 10:36:04.791
[rank:4] [train], epoch: 0/1, iter: 300/625, loss: 0.86733, top1: 0.00117, throughput: 373.81 | 2022-04-28 10:36:04.793
[rank:3] [train], epoch: 0/1, iter: 300/625, loss: 0.86772, top1: 0.00137, throughput: 373.80 | 2022-04-28 10:36:04.791
[rank:7] [train], epoch: 0/1, iter: 300/625, loss: 0.86778, top1: 0.00109, throughput: 373.80 | 2022-04-28 10:36:04.792