loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
------------------------ arguments ------------------------
  batches_per_epoch ............................... 312
  channel_last .................................... True
  ddp ............................................. False
  exit_num ........................................ 300
  fuse_bn_add_relu ................................ True
  fuse_bn_relu .................................... True
  gpu_stat_file ................................... None
  grad_clipping ................................... 0.0
  graph ........................................... True
  label_smoothing ................................. 0.1
  learning_rate ................................... 4.096
  legacy_init ..................................... False
  load_path ....................................... None
  lr_decay_type ................................... cosine
  metric_local .................................... True
  metric_train_acc ................................ True
  momentum ........................................ 0.875
  nccl_fusion_max_ops ............................. 24
  nccl_fusion_threshold_mb ........................ 16
  num_classes ..................................... 1000
  num_devices_per_node ............................ 8
  num_epochs ...................................... 1
  num_nodes ....................................... 1
  ofrecord_part_num ............................... 256
  ofrecord_path ................................... /dataset/79846248
  print_interval .................................. 100
  print_timestamp ................................. False
  samples_per_epoch ............................... 1281167
  save_init ....................................... False
  save_path ....................................... None
  scale_grad ...................................... True
  skip_eval ....................................... True
  synthetic_data .................................. False
  total_batches ................................... -1
  train_batch_size ................................ 512
  train_global_batch_size ......................... 4096
  use_fp16 ........................................ True
  use_gpu_decode .................................. True
  val_batch_size .................................. 50
  val_batches_per_epoch ........................... 125
  val_global_batch_size ........................... 400
  val_samples_per_epoch ........................... 50000
  warmup_epochs ................................... 5
  weight_decay .................................... 3.0517578125e-05
  zero_init_residual .............................. True
-------------------- end of arguments ---------------------
Killing subprocess 6321
Killing subprocess 6322
Killing subprocess 6323
Killing subprocess 6324
Killing subprocess 6325
Killing subprocess 6326
Killing subprocess 6327
Killing subprocess 6330
Traceback (most recent call last):
  File "/usr/local/miniconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/miniconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/miniconda3/lib/python3.7/site-packages/oneflow/distributed/launch.py", line 231, in <module>
    main()
  File "/usr/local/miniconda3/lib/python3.7/site-packages/oneflow/distributed/launch.py", line 219, in main
    sigkill_handler(signal.SIGTERM, None)
  File "/usr/local/miniconda3/lib/python3.7/site-packages/oneflow/distributed/launch.py", line 188, in sigkill_handler
    returncode=last_return_code, cmd=cmd
subprocess.CalledProcessError: Command '['/usr/local/miniconda3/bin/python3', '-u', '/dataset/e1a63606/onebench/scripts/models_nsys/Vision/classification/image/resnet50/train.py', '--ofrecord-path', '/dataset/79846248', '--ofrecord-part-num', '256', '--num-devices-per-node', '8', '--lr', '4.096', '--momentum', '0.875', '--num-epochs', '1', '--train-batch-size', '512', '--val-batch-size', '50', '--print-interval', '100', '--exit-num', '300', '--use-fp16', '--channel-last', '--skip-eval', '--scale-grad', '--graph', '--fuse-bn-relu', '--fuse-bn-add-relu', '--use-gpu-decode']' died with .