loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
W20220426 10:16:50.935801 403 rpc_client.cpp:190] LoadServer 10.7.23.113 Failed at 0 times error_code 14 error_message failed to connect to all addresses
------------------------ arguments ------------------------
batches_per_epoch ............................... 312
channel_last .................................... True
ddp ............................................. False
exit_num ........................................ 300
fuse_bn_add_relu ................................ True
fuse_bn_relu .................................... True
gpu_stat_file ................................... None
grad_clipping ................................... 0.0
graph ........................................... True
label_smoothing ................................. 0.1
learning_rate ................................... 4.096
legacy_init ..................................... False
load_path ....................................... None
lr_decay_type ................................... cosine
metric_local .................................... True
metric_train_acc ................................ True
momentum ........................................ 0.875
nccl_fusion_max_ops ............................. 24
nccl_fusion_threshold_mb ........................ 16
num_classes ..................................... 1000
num_devices_per_node ............................ 8
num_epochs ...................................... 1
num_nodes ....................................... 1
ofrecord_part_num ............................... 256
ofrecord_path ................................... /dataset/79846248
print_interval .................................. 100
print_timestamp ................................. False
samples_per_epoch ............................... 1281167
save_init ....................................... False
save_path ....................................... None
scale_grad ...................................... True
skip_eval ....................................... True
synthetic_data .................................. False
total_batches ................................... -1
train_batch_size ................................ 512
train_global_batch_size ......................... 4096
use_fp16 ........................................ True
use_gpu_decode .................................. True
val_batch_size .................................. 50
val_batches_per_epoch ........................... 125
val_global_batch_size ........................... 400
val_samples_per_epoch ........................... 50000
warmup_epochs ................................... 5
weight_decay .................................... 3.0517578125e-05
zero_init_residual .............................. True
-------------------- end of arguments ---------------------
***** Model Init *****
***** Model Init Finish, time escapled: 2.89061 s *****
[rank:5] [train], epoch: 0/1, iter: 100/312, loss: 0.86782, top1: 0.00070, throughput: 423.93 | 2022-04-26 10:19:06.839
[rank:0] [train], epoch: 0/1, iter: 100/312, loss: 0.86773, top1: 0.00119, throughput: 423.89 | 2022-04-26 10:19:06.842
[rank:3] [train], epoch: 0/1, iter: 100/312, loss: 0.86754, top1: 0.00137, throughput: 423.88 | 2022-04-26 10:19:06.844
[rank:6] [train], epoch: 0/1, iter: 100/312, loss: 0.86746, top1: 0.00111, throughput: 423.91 | 2022-04-26 10:19:06.842
[rank:2] [train], epoch: 0/1, iter: 100/312, loss: 0.86764, top1: 0.00096, throughput: 423.88 | 2022-04-26 10:19:06.844
[rank:1] [train], epoch: 0/1, iter: 100/312, loss: 0.86780, top1: 0.00160, throughput: 423.89 | 2022-04-26 10:19:06.843
[rank:7] [train], epoch: 0/1, iter: 100/312, loss: 0.86761, top1: 0.00111, throughput: 423.94 | 2022-04-26 10:19:06.844
[rank:4] [train], epoch: 0/1, iter: 100/312, loss: 0.86762, top1: 0.00133, throughput: 423.80 | 2022-04-26 10:19:06.870
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2022/04/26 10:19:07.106, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 49 %, 32510 MiB, 7791 MiB, 24719 MiB
2022/04/26 10:19:07.110, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 49 %, 32510 MiB, 7791 MiB, 24719 MiB
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2022/04/26 10:19:07.113, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 86 %, 32510 MiB, 7774 MiB, 24736 MiB
2022/04/26 10:19:07.116, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 86 %, 32510 MiB, 7774 MiB, 24736 MiB
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2022/04/26 10:19:07.118, Tesla V100-SXM2-32GB, 470.57.02, 80 %, 73 %, 32510 MiB, 7791 MiB, 24719 MiB
2022/04/26 10:19:07.118, Tesla V100-SXM2-32GB, 470.57.02, 80 %, 73 %, 32510 MiB, 7791 MiB, 24719 MiB
2022/04/26 10:19:07.119, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 86 %, 32510 MiB, 7928 MiB, 24582 MiB
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2022/04/26 10:19:07.123, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 86 %, 32510 MiB, 7928 MiB, 24582 MiB
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2022/04/26 10:19:07.123, Tesla V100-SXM2-32GB, 470.57.02, 80 %, 73 %, 32510 MiB, 7791 MiB, 24719 MiB
2022/04/26 10:19:07.124, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 86 %, 32510 MiB, 7774 MiB, 24736 MiB
2022/04/26 10:19:07.125, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 86 %, 32510 MiB, 7774 MiB, 24736 MiB
2022/04/26 10:19:07.125, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 77 %, 32510 MiB, 7892 MiB, 24618 MiB
2022/04/26 10:19:07.127, Tesla V100-SXM2-32GB, 470.57.02, 80 %, 73 %, 32510 MiB, 7791 MiB, 24719 MiB
2022/04/26 10:19:07.128, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 77 %, 32510 MiB, 7892 MiB, 24618 MiB
2022/04/26 10:19:07.128, Tesla V100-SXM2-32GB, 470.57.02, 80 %, 73 %, 32510 MiB, 7791 MiB, 24719 MiB
2022/04/26 10:19:07.129, Tesla V100-SXM2-32GB, 470.57.02, 80 %, 73 %, 32510 MiB, 7791 MiB, 24719 MiB
2022/04/26 10:19:07.130, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 86 %, 32510 MiB, 7774 MiB, 24736 MiB
2022/04/26 10:19:07.130, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 86 %, 32510 MiB, 7928 MiB, 24582 MiB
2022/04/26 10:19:07.132, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 86 %, 32510 MiB, 7928 MiB, 24582 MiB
2022/04/26 10:19:07.132, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 88 %, 32510 MiB, 7906 MiB, 24604 MiB
2022/04/26 10:19:07.136, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 86 %, 32510 MiB, 7774 MiB, 24736 MiB
2022/04/26 10:19:07.137, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 88 %, 32510 MiB, 7906 MiB, 24604 MiB
2022/04/26 10:19:07.138, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 86 %, 32510 MiB, 7774 MiB, 24736 MiB
2022/04/26 10:19:07.138, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 86 %, 32510 MiB, 7774 MiB, 24736 MiB
2022/04/26 10:19:07.139, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 86 %, 32510 MiB, 7928 MiB, 24582 MiB
2022/04/26 10:19:07.139, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 77 %, 32510 MiB, 7892 MiB, 24618 MiB
2022/04/26 10:19:07.141, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 77 %, 32510 MiB, 7892 MiB, 24618 MiB
2022/04/26 10:19:07.141, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 65 %, 32510 MiB, 7802 MiB, 24708 MiB
2022/04/26 10:19:07.145, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 86 %, 32510 MiB, 7928 MiB, 24582 MiB
2022/04/26 10:19:07.146, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 65 %, 32510 MiB, 7802 MiB, 24708 MiB
2022/04/26 10:19:07.147, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 86 %, 32510 MiB, 7928 MiB, 24582 MiB
2022/04/26 10:19:07.147, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 86 %, 32510 MiB, 7928 MiB, 24582 MiB
2022/04/26 10:19:07.148, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 77 %, 32510 MiB, 7892 MiB, 24618 MiB
2022/04/26 10:19:07.148, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 88 %, 32510 MiB, 7906 MiB, 24604 MiB
2022/04/26 10:19:07.150, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 88 %, 32510 MiB, 7906 MiB, 24604 MiB
2022/04/26 10:19:07.150, Tesla V100-SXM2-32GB, 470.57.02, 92 %, 84 %, 32510 MiB, 7684 MiB, 24826 MiB
2022/04/26 10:19:07.154, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 77 %, 32510 MiB, 7892 MiB, 24618 MiB
2022/04/26 10:19:07.154, Tesla V100-SXM2-32GB, 470.57.02, 92 %, 84 %, 32510 MiB, 7684 MiB, 24826 MiB
2022/04/26 10:19:07.156, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 77 %, 32510 MiB, 7892 MiB, 24618 MiB
2022/04/26 10:19:07.156, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 77 %, 32510 MiB, 7892 MiB, 24618 MiB
2022/04/26 10:19:07.157, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 88 %, 32510 MiB, 7906 MiB, 24604 MiB
2022/04/26 10:19:07.157, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 65 %, 32510 MiB, 7802 MiB, 24708 MiB
2022/04/26 10:19:07.159, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 65 %, 32510 MiB, 7802 MiB, 24708 MiB
2022/04/26 10:19:07.159, Tesla V100-SXM2-32GB, 470.57.02, 84 %, 77 %, 32510 MiB, 7808 MiB, 24702 MiB
2022/04/26 10:19:07.163, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 88 %, 32510 MiB, 7906 MiB, 24604 MiB
2022/04/26 10:19:07.164, Tesla V100-SXM2-32GB, 470.57.02, 84 %, 77 %, 32510 MiB, 7808 MiB, 24702 MiB
2022/04/26 10:19:07.165, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 88 %, 32510 MiB, 7906 MiB, 24604 MiB
2022/04/26 10:19:07.166, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 88 %, 32510 MiB, 7906 MiB, 24604 MiB
2022/04/26 10:19:07.166, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 65 %, 32510 MiB, 7802 MiB, 24708 MiB
2022/04/26 10:19:07.166, Tesla V100-SXM2-32GB, 470.57.02, 92 %, 84 %, 32510 MiB, 7684 MiB, 24826 MiB
2022/04/26 10:19:07.168, Tesla V100-SXM2-32GB, 470.57.02, 92 %, 84 %, 32510 MiB, 7684 MiB, 24826 MiB
2022/04/26 10:19:07.172, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 65 %, 32510 MiB, 7802 MiB, 24708 MiB
2022/04/26 10:19:07.173, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 65 %, 32510 MiB, 7802 MiB, 24708 MiB
2022/04/26 10:19:07.174, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 65 %, 32510 MiB, 7802 MiB, 24708 MiB
2022/04/26 10:19:07.175, Tesla V100-SXM2-32GB, 470.57.02, 92 %, 84 %, 32510 MiB, 7684 MiB, 24826 MiB
2022/04/26 10:19:07.175, Tesla V100-SXM2-32GB, 470.57.02, 84 %, 77 %, 32510 MiB, 7808 MiB, 24702 MiB
2022/04/26 10:19:07.176, Tesla V100-SXM2-32GB, 470.57.02, 84 %, 77 %, 32510 MiB, 7808 MiB, 24702 MiB
2022/04/26 10:19:07.180, Tesla V100-SXM2-32GB, 470.57.02, 92 %, 84 %, 32510 MiB, 7684 MiB, 24826 MiB
2022/04/26 10:19:07.181, Tesla V100-SXM2-32GB, 470.57.02, 92 %, 84 %, 32510 MiB, 7684 MiB, 24826 MiB
2022/04/26 10:19:07.182, Tesla V100-SXM2-32GB, 470.57.02, 92 %, 84 %, 32510 MiB, 7684 MiB, 24826 MiB
2022/04/26 10:19:07.182, Tesla V100-SXM2-32GB, 470.57.02, 84 %, 77 %, 32510 MiB, 7808 MiB, 24702 MiB
2022/04/26 10:19:07.188, Tesla V100-SXM2-32GB, 470.57.02, 84 %, 77 %, 32510 MiB, 7808 MiB, 24702 MiB
2022/04/26 10:19:07.189, Tesla V100-SXM2-32GB, 470.57.02, 84 %, 77 %, 32510 MiB, 7808 MiB, 24702 MiB
2022/04/26 10:19:07.190, Tesla V100-SXM2-32GB, 470.57.02, 84 %, 77 %, 32510 MiB, 7808 MiB, 24702 MiB
[rank:6] [train], epoch: 0/1, iter: 200/312, loss: 0.86776, top1: 0.00088, throughput: 1322.84 | 2022-04-26 10:19:45.547
[rank:5] [train], epoch: 0/1, iter: 200/312, loss: 0.86777, top1: 0.00137, throughput: 1322.72 | 2022-04-26 10:19:45.547
[rank:2] [train], epoch: 0/1, iter: 200/312, loss: 0.86755, top1: 0.00125, throughput: 1322.86 | 2022-04-26 10:19:45.548
[rank:1] [train], epoch: 0/1, iter: 200/312, loss: 0.86751, top1: 0.00111, throughput: 1322.77 | 2022-04-26 10:19:45.549
[rank:3] [train], epoch: 0/1, iter: 200/312, loss: 0.86760, top1: 0.00111, throughput: 1322.77 | 2022-04-26 10:19:45.550
[rank:7] [train], epoch: 0/1, iter: 200/312, loss: 0.86759, top1: 0.00098, throughput: 1322.76[rank:4] [train], epoch: 0/1, iter: 200/312, loss: 0.86765, top1: 0.00111, throughput: 1323.77 | 2022-04-26 10:19:45.547 | 2022-04-26 10:19:45.551
[rank:0] [train], epoch: 0/1, iter: 200/312, loss: 0.86766, top1: 0.00094, throughput: 1322.71 | 2022-04-26 10:19:45.550
[rank:0] [train], epoch: 0/1, iter: 300/312, loss: 0.86758, top1: 0.00143, throughput: 1369.65 | 2022-04-26 10:20:22.932
[rank:2] [train], epoch: 0/1, iter: 300/312, loss: 0.86739, top1: 0.00131, throughput: 1369.62 | 2022-04-26 10:20:22.931
[rank:6] [train], epoch: 0/1, iter: 300/312, loss: 0.86757, top1: 0.00113, throughput: 1369.59 | 2022-04-26 10:20:22.930
[rank:3] [train], epoch: 0/1, iter: 300/312, loss: 0.86745, top1: 0.00146, throughput: 1369.61 | 2022-04-26 10:20:22.933
[rank:1] [train], epoch: 0/1, iter: 300/312, loss: 0.86752, top1: 0.00139, throughput: 1369.62 | 2022-04-26 10:20:22.932
[rank:5] [train], epoch: 0/1, iter: 300/312, loss: 0.86759, top1: 0.00092, throughput: 1369.63 | 2022-04-26 10:20:22.930
[rank:7] [train], epoch: 0/1, iter: 300/312, loss: 0.86770, top1: 0.00141, throughput: 1369.63 | 2022-04-26 10:20:22.933
[rank:4] [train], epoch: 0/1, iter: 300/312, loss: 0.86750, top1: 0.00107, throughput: 1369.54 | 2022-04-26 10:20:22.932