loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
loaded library: loaded library: loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1/usr/lib/x86_64-linux-gnu/libibverbs.so.1 /usr/lib/x86_64-linux-gnu/libibverbs.so.1
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
loaded library: loaded library: loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1/usr/lib/x86_64-linux-gnu/libibverbs.so.1/usr/lib/x86_64-linux-gnu/libibverbs.so.1
W20220517 13:06:21.024636 7454 rpc_client.cpp:190] LoadServer 10.7.127.9 Failed at 0 times error_code 14 error_message failed to connect to all addresses
------------------------ arguments ------------------------
batches_per_epoch ............................... 625
channel_last .................................... False
ddp ............................................. False
exit_num ........................................ 300
fuse_bn_add_relu ................................ True
fuse_bn_relu .................................... True
gpu_stat_file ................................... None
grad_clipping ................................... 0.0
graph ........................................... True
label_smoothing ................................. 0.1
learning_rate ................................... 2.048
legacy_init ..................................... False
load_path ....................................... None
lr_decay_type ................................... cosine
metric_local .................................... True
metric_train_acc ................................ True
momentum ........................................ 0.875
nccl_fusion_max_ops ............................. 24
nccl_fusion_threshold_mb ........................ 16
num_classes ..................................... 1000
num_devices_per_node ............................ 8
num_epochs ...................................... 1
num_nodes ....................................... 1
ofrecord_part_num ............................... 256
ofrecord_path ................................... /dataset/79846248
print_interval .................................. 100
print_timestamp ................................. False
samples_per_epoch ............................... 1281167
save_init ....................................... False
save_path ....................................... None
scale_grad ...................................... True
skip_eval ....................................... True
synthetic_data .................................. False
total_batches ................................... -1
train_batch_size ................................ 256
train_global_batch_size ......................... 2048
use_fp16 ........................................ False
use_gpu_decode .................................. True
val_batch_size .................................. 50
val_batches_per_epoch ........................... 125
val_global_batch_size ........................... 400
val_samples_per_epoch ........................... 50000
warmup_epochs ................................... 5
weight_decay .................................... 3.0517578125e-05
zero_init_residual .............................. True
-------------------- end of arguments ---------------------
Killing subprocess 7450
Killing subprocess 7451
Killing subprocess 7452
Killing subprocess 7453
Killing subprocess 7454
Killing subprocess 7455
Killing subprocess 7456
Killing subprocess 7459
Traceback (most recent call last):
  File "/usr/local/miniconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/miniconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/miniconda3/lib/python3.7/site-packages/oneflow/distributed/launch.py", line 231, in
    main()
  File "/usr/local/miniconda3/lib/python3.7/site-packages/oneflow/distributed/launch.py", line 219, in main
    sigkill_handler(signal.SIGTERM, None)
  File "/usr/local/miniconda3/lib/python3.7/site-packages/oneflow/distributed/launch.py", line 188, in sigkill_handler
    returncode=last_return_code, cmd=cmd
subprocess.CalledProcessError: Command '['/usr/local/miniconda3/bin/python3', '-u', '/dataset/e1a63606/onebench/scripts/models_nsys/Vision/classification/image/resnet50/train.py', '--ofrecord-path', '/dataset/79846248', '--ofrecord-part-num', '256', '--num-devices-per-node', '8', '--lr', '2.048', '--momentum', '0.875', '--num-epochs', '1', '--train-batch-size', '256', '--val-batch-size', '50', '--print-interval', '100', '--exit-num', '300', '--skip-eval', '--scale-grad', '--graph', '--fuse-bn-relu', '--fuse-bn-add-relu', '--use-gpu-decode']' died with .