The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run
WARNING:torch.distributed.run:--use_env is deprecated and will be removed in future releases.
 Please read local_rank from `os.environ('LOCAL_RANK')` instead.
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
  entrypoint       : pretrain_gpt.py
  min_nodes        : 1
  max_nodes        : 1
  nproc_per_node   : 4
  run_id           : none
  rdzv_backend     : static
  rdzv_endpoint    : 127.0.0.1:6000
  rdzv_configs     : {'rank': 0, 'timeout': 900}
  max_restarts     : 3
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_phen2ni2/none_m870rk7s
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future.
  warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=127.0.0.1
  master_port=6000
  group_rank=0
  group_world_size=1
  local_ranks=[0, 1, 2, 3]
  role_ranks=[0, 1, 2, 3]
  global_ranks=[0, 1, 2, 3]
  role_world_sizes=[4, 4, 4, 4]
  global_world_sizes=[4, 4, 4, 4]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_phen2ni2/none_m870rk7s/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_phen2ni2/none_m870rk7s/attempt_0/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_phen2ni2/none_m870rk7s/attempt_0/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_phen2ni2/none_m870rk7s/attempt_0/3/error.json
using world size: 4, data-parallel-size: 2, tensor-model-parallel size: 2, pipeline-model-parallel size: 1 
using torch.float16 for parameters ...
Persistent fused layer norm kernel is supported from pytorch v1.11 (nvidia pytorch container paired with v1.11). Defaulting to no_persist_layer_norm=True
------------------------ arguments ------------------------
  accumulate_allreduce_grads_in_fp32 .............. False
  activations_checkpoint_method ................... uniform
  activations_checkpoint_num_layers ............... 1
  adam_beta1 ...................................... 0.9
  adam_beta2 ...................................... 0.999
  adam_eps ........................................ 1e-08
  adlr_autoresume ................................. False
  adlr_autoresume_interval ........................ 1000
  apply_query_key_layer_scaling ................... True
  apply_residual_connection_post_layernorm ........ False
  attention_dropout ............................... 0.1
  attention_softmax_in_fp32 ....................... False
  bert_binary_head ................................ True
  bert_load ....................................... None
  bf16 ............................................ False
  bias_dropout_fusion ............................. True
  bias_gelu_fusion ................................ True
  biencoder_projection_dim ........................ 0
  biencoder_shared_query_context_model ............ False
  block_data_path ................................. None
  classes_fraction ................................ 1.0
  clip_grad ....................................... 1.0
  consumed_train_samples .......................... 0
  consumed_valid_samples .......................... 0
  data_impl ....................................... mmap
  data_parallel_random_init ....................... False
  data_parallel_size .............................. 2
  data_path ....................................... ['/dataset/source/dataset/loss_compara_content_sentence']
  data_per_class_fraction ......................... 1.0
  data_sharding ................................... True
  dataloader_type ................................. single
  DDP_impl ........................................ local
  decoder_seq_length .............................. None
  distribute_checkpointed_activations ............. True
  distributed_backend ............................. nccl
  embedding_path .................................. None
  empty_unused_memory_level ....................... 0
  encoder_seq_length .............................. 1024
  eod_mask_loss ................................... False
  eval_interval ................................... 1000
  eval_iters ...................................... 10
  evidence_data_path .............................. None
  exit_duration_in_mins ........................... None
  exit_interval ................................... None
  exit_signal_handler ............................. False
  ffn_hidden_size ................................. 4096
  finetune ........................................ False
  fp16 ............................................ True
  fp16_lm_cross_entropy ........................... False
  fp32_residual_connection ........................ False
  global_batch_size ............................... 512
  hidden_dropout .................................. 0.1
  hidden_size ..................................... 1024
  hysteresis ...................................... 2
  ict_head_size ................................... None
  ict_load ........................................ None
  img_h ........................................... 224
  img_w ........................................... 224
  indexer_batch_size .............................. 128
  indexer_log_interval ............................ 1000
  inference_batch_times_seqlen_threshold .......... 512
  init_method_std ................................. 0.02
  init_method_xavier_uniform ...................... False
  initial_loss_scale .............................. 4294967296
  kv_channels ..................................... 64
  layernorm_epsilon ............................... 1e-05
  lazy_mpu_init ................................... None
  load ............................................ None
  local_rank ...................................... 0
  log_batch_size_to_tensorboard ................... False
  log_interval .................................... 100
  log_learning_rate_to_tensorboard ................ True
  log_loss_scale_to_tensorboard ................... True
  log_memory_to_tensorboard ....................... False
  log_num_zeros_in_grad ........................... False
  log_params_norm ................................. False
  log_timers_to_tensorboard ....................... False
  log_validation_ppl_to_tensorboard ............... False
  log_world_size_to_tensorboard ................... False
  loss_scale ...................................... None
  loss_scale_window ............................... 1000
  lr .............................................. 0.00015
  lr_decay_iters .................................. 320000
  lr_decay_samples ................................ None
  lr_decay_style .................................. cosine
  lr_warmup_fraction .............................. 0.01
  lr_warmup_iters ................................. 0
  lr_warmup_samples ............................... 0
  make_vocab_size_divisible_by .................... 128
  mask_prob ....................................... 0.15
  masked_softmax_fusion ........................... True
  max_position_embeddings ......................... 1024
  merge_file ...................................... /dataset/source/dataset/gpt2-merges.txt
  micro_batch_size ................................ 32
  min_loss_scale .................................. 1.0
  min_lr .......................................... 1e-05
  mmap_warmup ..................................... False
  no_async_tensor_model_parallel_allreduce ........ False
  no_load_optim ................................... None
  no_load_rng ..................................... None
  no_persist_layer_norm ........................... True
  no_save_optim ................................... None
  no_save_rng ..................................... None
  num_attention_heads ............................. 16
  num_channels .................................... 3
  num_classes ..................................... 1000
  num_layers ...................................... 24
  num_layers_per_virtual_pipeline_stage ........... None
  num_workers ..................................... 2
  onnx_safe ....................................... None
  openai_gelu ..................................... False
  optimizer ....................................... adam
  override_lr_scheduler ........................... False
  params_dtype .................................... torch.float16
  patch_dim ....................................... 16
  pipeline_model_parallel_size .................... 1
  pipeline_model_parallel_split_rank .............. None
  query_in_block_prob ............................. 0.1
  rampup_batch_size ............................... None
  rank ............................................ 0
  reset_attention_mask ............................ False
  reset_position_ids .............................. False
  retriever_report_topk_accuracies ................ []
  retriever_score_scaling ......................... False
  retriever_seq_length ............................ 256
  sample_rate ..................................... 1.0
  save ............................................ None
  save_interval ................................... 10000
  scatter_gather_tensors_in_pipeline .............. True
  seed ............................................ 1234
  seq_length ...................................... 1024
  sgd_momentum .................................... 0.9
  short_seq_prob .................................. 0.1
  split ........................................... 949,50,1
  tensor_model_parallel_size ...................... 2
  tensorboard_dir ................................. None
  tensorboard_log_interval ........................ 1
  tensorboard_queue_size .......................... 1000
  titles_data_path ................................ None
  tokenizer_type .................................. GPT2BPETokenizer
  train_iters ..................................... 210
  train_samples ................................... None
  use_checkpoint_lr_scheduler ..................... False
  use_contiguous_buffers_in_local_ddp ............. True
  use_cpu_initialization .......................... None
  use_one_sent_docs ............................... False
  virtual_pipeline_model_parallel_size ............ None
  vocab_extra_ids ................................. 0
  vocab_file ...................................... /dataset/source/dataset/gpt2-vocab.json
  weight_decay .................................... 0.01
  world_size ...................................... 4
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 8
> building GPT2BPETokenizer tokenizer ...
 > padded vocab (size: 50257) with 175 dummy tokens (new size: 50432)
> initializing torch distributed ...
> initializing tensor model parallel with size 2
> initializing pipeline model parallel with size 1
[W ProcessGroupNCCL.cpp:1671] Rank 3 using best-guess GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1671] Rank 2 using best-guess GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
> compiling dataset index builder ...
[W ProcessGroupNCCL.cpp:1671] Rank 1 using best-guess GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
make: Entering directory '/dataset/workspace/Megatron-LM/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/dataset/workspace/Megatron-LM/megatron/data'
>>> done with dataset index builder. Compilation time: 0.037 seconds
> compiling and loading fused kernels ...
Detected CUDA files, patching ldflags
Emitting ninja build file /dataset/workspace/Megatron-LM/megatron/fused_kernels/build/build.ninja...
Building extension module scaled_upper_triang_masked_softmax_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module scaled_upper_triang_masked_softmax_cuda...
Detected CUDA files, patching ldflags
Emitting ninja build file /dataset/workspace/Megatron-LM/megatron/fused_kernels/build/build.ninja...
Building extension module scaled_masked_softmax_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module scaled_masked_softmax_cuda...
Detected CUDA files, patching ldflags
Emitting ninja build file /dataset/workspace/Megatron-LM/megatron/fused_kernels/build/build.ninja...
Building extension module scaled_softmax_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module scaled_softmax_cuda...
Detected CUDA files, patching ldflags
Emitting ninja build file /dataset/workspace/Megatron-LM/megatron/fused_kernels/build/build.ninja...
Building extension module fused_mix_prec_layer_norm_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_mix_prec_layer_norm_cuda...
[W ProcessGroupNCCL.cpp:1671] Rank 0 using best-guess GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
iv-2udaavw4l02thdv8lcrl:44177:44177 [0] NCCL INFO Bootstrap : Using eth0:192.168.11.42<0>
iv-2udaavw4l02thdv8lcrl:44177:44177 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
iv-2udaavw4l02thdv8lcrl:44177:44177 [0] NCCL INFO P2P plugin IBext
iv-2udaavw4l02thdv8lcrl:44177:44177 [0] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1.
iv-2udaavw4l02thdv8lcrl:44177:44177 [0] NCCL INFO NET/IB : Using [0]mlx5_1:1/RoCE ; OOB eth0:192.168.11.42<0>
iv-2udaavw4l02thdv8lcrl:44177:44177 [0] NCCL INFO Using network IBext
NCCL version 2.10.3+cuda11.4
iv-2udaavw4l02thdv8lcrl:44179:44179 [2] NCCL INFO Bootstrap : Using eth0:192.168.11.42<0>
iv-2udaavw4l02thdv8lcrl:44178:44178 [1] NCCL INFO Bootstrap : Using eth0:192.168.11.42<0>
iv-2udaavw4l02thdv8lcrl:44179:44179 [2] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
iv-2udaavw4l02thdv8lcrl:44179:44179 [2] NCCL INFO P2P plugin IBext
iv-2udaavw4l02thdv8lcrl:44179:44179 [2] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1.
iv-2udaavw4l02thdv8lcrl:44178:44178 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
iv-2udaavw4l02thdv8lcrl:44178:44178 [1] NCCL INFO P2P plugin IBext
iv-2udaavw4l02thdv8lcrl:44178:44178 [1] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1.
iv-2udaavw4l02thdv8lcrl:44180:44180 [3] NCCL INFO Bootstrap : Using eth0:192.168.11.42<0>
iv-2udaavw4l02thdv8lcrl:44179:44179 [2] NCCL INFO NET/IB : Using [0]mlx5_1:1/RoCE ; OOB eth0:192.168.11.42<0>
iv-2udaavw4l02thdv8lcrl:44179:44179 [2] NCCL INFO Using network IBext
iv-2udaavw4l02thdv8lcrl:44178:44178 [1] NCCL INFO NET/IB : Using [0]mlx5_1:1/RoCE ; OOB eth0:192.168.11.42<0>
iv-2udaavw4l02thdv8lcrl:44178:44178 [1] NCCL INFO Using network IBext
iv-2udaavw4l02thdv8lcrl:44180:44180 [3] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
iv-2udaavw4l02thdv8lcrl:44180:44180 [3] NCCL INFO P2P plugin IBext
iv-2udaavw4l02thdv8lcrl:44180:44180 [3] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1.
iv-2udaavw4l02thdv8lcrl:44180:44180 [3] NCCL INFO NET/IB : Using [0]mlx5_1:1/RoCE ; OOB eth0:192.168.11.42<0>
iv-2udaavw4l02thdv8lcrl:44180:44180 [3] NCCL INFO Using network IBext
iv-2udaavw4l02thdv8lcrl:44180:44344 [3] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3.
iv-2udaavw4l02thdv8lcrl:44178:44342 [1] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3.
iv-2udaavw4l02thdv8lcrl:44179:44341 [2] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3.
iv-2udaavw4l02thdv8lcrl:44177:44325 [0] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3.
iv-2udaavw4l02thdv8lcrl:44180:44344 [3] NCCL INFO NCCL_IB_TIMEOUT set by environment to 23.
iv-2udaavw4l02thdv8lcrl:44178:44342 [1] NCCL INFO NCCL_IB_TIMEOUT set by environment to 23.
iv-2udaavw4l02thdv8lcrl:44180:44344 [3] NCCL INFO NCCL_IB_RETRY_CNT set by environment to 7.
iv-2udaavw4l02thdv8lcrl:44178:44342 [1] NCCL INFO NCCL_IB_RETRY_CNT set by environment to 7.
iv-2udaavw4l02thdv8lcrl:44179:44341 [2] NCCL INFO NCCL_IB_TIMEOUT set by environment to 23.
iv-2udaavw4l02thdv8lcrl:44179:44341 [2] NCCL INFO NCCL_IB_RETRY_CNT set by environment to 7.
iv-2udaavw4l02thdv8lcrl:44177:44325 [0] NCCL INFO NCCL_IB_TIMEOUT set by environment to 23.
iv-2udaavw4l02thdv8lcrl:44177:44325 [0] NCCL INFO NCCL_IB_RETRY_CNT set by environment to 7.
iv-2udaavw4l02thdv8lcrl:44179:44341 [2] NCCL INFO Trees [0] 3/-1/-1->2->0 [1] 0/-1/-1->2->3 [2] 3/-1/-1->2->0 [3] 0/-1/-1->2->3 [4] 3/-1/-1->2->0 [5] 0/-1/-1->2->3 [6] 3/-1/-1->2->0 [7] 0/-1/-1->2->3
iv-2udaavw4l02thdv8lcrl:44179:44341 [2] NCCL INFO Setting affinity for GPU 2 to 03ff,ffffffff
iv-2udaavw4l02thdv8lcrl:44180:44344 [3] NCCL INFO Trees [0] 1/-1/-1->3->2 [1] 2/-1/-1->3->1 [2] 1/-1/-1->3->2 [3] 2/-1/-1->3->1 [4] 1/-1/-1->3->2 [5] 2/-1/-1->3->1 [6] 1/-1/-1->3->2 [7] 2/-1/-1->3->1
iv-2udaavw4l02thdv8lcrl:44180:44344 [3] NCCL INFO Setting affinity for GPU 3 to 03ff,ffffffff
iv-2udaavw4l02thdv8lcrl:44177:44325 [0] NCCL INFO Channel 00/08 :    0   2   3   1
iv-2udaavw4l02thdv8lcrl:44177:44325 [0] NCCL INFO Channel 01/08 :    0   2   1   3
iv-2udaavw4l02thdv8lcrl:44178:44342 [1] NCCL INFO Trees [0] -1/-1/-1->1->3 [1] 3/-1/-1->1->-1 [2] -1/-1/-1->1->3 [3] 3/-1/-1->1->-1 [4] -1/-1/-1->1->3 [5] 3/-1/-1->1->-1 [6] -1/-1/-1->1->3 [7] 3/-1/-1->1->-1
iv-2udaavw4l02thdv8lcrl:44177:44325 [0] NCCL INFO Channel 02/08 :    0   1   3   2
iv-2udaavw4l02thdv8lcrl:44177:44325 [0] NCCL INFO Channel 03/08 :    0   3   1   2
iv-2udaavw4l02thdv8lcrl:44177:44325 [0] NCCL INFO Channel 04/08 :    0   2   3   1
iv-2udaavw4l02thdv8lcrl:44178:44342 [1] NCCL INFO Setting affinity for GPU 1 to 03ff,ffffffff
iv-2udaavw4l02thdv8lcrl:44177:44325 [0] NCCL INFO Channel 05/08 :    0   2   1   3
iv-2udaavw4l02thdv8lcrl:44177:44325 [0] NCCL INFO Channel 06/08 :    0   1   3   2
iv-2udaavw4l02thdv8lcrl:44177:44325 [0] NCCL INFO Channel 07/08 :    0   3   1   2
iv-2udaavw4l02thdv8lcrl:44177:44325 [0] NCCL INFO Trees [0] 2/-1/-1->0->-1 [1] -1/-1/-1->0->2 [2] 2/-1/-1->0->-1 [3] -1/-1/-1->0->2 [4] 2/-1/-1->0->-1 [5] -1/-1/-1->0->2 [6] 2/-1/-1->0->-1 [7] -1/-1/-1->0->2
iv-2udaavw4l02thdv8lcrl:44177:44325 [0] NCCL INFO Setting affinity for GPU 0 to 03ff,ffffffff
iv-2udaavw4l02thdv8lcrl:44178:44342 [1] NCCL INFO Channel 03 : 1[65020] -> 2[67010] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44178:44342 [1] NCCL INFO Channel 07 : 1[65020] -> 2[67010] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44177:44325 [0] NCCL INFO Channel 02 : 0[65010] -> 1[65020] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44177:44325 [0] NCCL INFO Channel 06 : 0[65010] -> 1[65020] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44180:44344 [3] NCCL INFO Channel 01 : 3[67020] -> 0[65010] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44179:44341 [2] NCCL INFO Channel 00 : 2[67010] -> 3[67020] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44180:44344 [3] NCCL INFO Channel 05 : 3[67020] -> 0[65010] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44179:44341 [2] NCCL INFO Channel 04 : 2[67010] -> 3[67020] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44178:44342 [1] NCCL INFO Channel 01 : 1[65020] -> 3[67020] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44177:44325 [0] NCCL INFO Channel 00 : 0[65010] -> 2[67010] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44179:44341 [2] NCCL INFO Channel 02 : 2[67010] -> 0[65010] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44180:44344 [3] NCCL INFO Channel 00 : 3[67020] -> 1[65020] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44177:44325 [0] NCCL INFO Channel 01 : 0[65010] -> 2[67010] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44178:44342 [1] NCCL INFO Channel 02 : 1[65020] -> 3[67020] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44179:44341 [2] NCCL INFO Channel 03 : 2[67010] -> 0[65010] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44180:44344 [3] NCCL INFO Channel 03 : 3[67020] -> 1[65020] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44177:44325 [0] NCCL INFO Channel 04 : 0[65010] -> 2[67010] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44178:44342 [1] NCCL INFO Channel 05 : 1[65020] -> 3[67020] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44179:44341 [2] NCCL INFO Channel 06 : 2[67010] -> 0[65010] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44180:44344 [3] NCCL INFO Channel 04 : 3[67020] -> 1[65020] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44177:44325 [0] NCCL INFO Channel 05 : 0[65010] -> 2[67010] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44178:44342 [1] NCCL INFO Channel 06 : 1[65020] -> 3[67020] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44179:44341 [2] NCCL INFO Channel 07 : 2[67010] -> 0[65010] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44180:44344 [3] NCCL INFO Channel 07 : 3[67020] -> 1[65020] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44177:44325 [0] NCCL INFO Channel 03 : 0[65010] -> 3[67020] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44179:44341 [2] NCCL INFO Channel 01 : 2[67010] -> 1[65020] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44178:44342 [1] NCCL INFO Channel 00 : 1[65020] -> 0[65010] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44180:44344 [3] NCCL INFO Channel 02 : 3[67020] -> 2[67010] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44177:44325 [0] NCCL INFO Channel 07 : 0[65010] -> 3[67020] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44178:44342 [1] NCCL INFO Channel 04 : 1[65020] -> 0[65010] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44179:44341 [2] NCCL INFO Channel 05 : 2[67010] -> 1[65020] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44180:44344 [3] NCCL INFO Channel 06 : 3[67020] -> 2[67010] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44178:44342 [1] NCCL INFO Connected all rings
iv-2udaavw4l02thdv8lcrl:44177:44325 [0] NCCL INFO Connected all rings
iv-2udaavw4l02thdv8lcrl:44179:44341 [2] NCCL INFO Connected all rings
iv-2udaavw4l02thdv8lcrl:44180:44344 [3] NCCL INFO Connected all rings
iv-2udaavw4l02thdv8lcrl:44179:44341 [2] NCCL INFO Channel 01 : 2[67010] -> 3[67020] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44179:44341 [2] NCCL INFO Channel 02 : 2[67010] -> 3[67020] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44179:44341 [2] NCCL INFO Channel 03 : 2[67010] -> 3[67020] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44179:44341 [2] NCCL INFO Channel 05 : 2[67010] -> 3[67020] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44177:44325 [0] NCCL INFO Channel 02 : 0[65010] -> 2[67010] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44179:44341 [2] NCCL INFO Channel 06 : 2[67010] -> 3[67020] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44177:44325 [0] NCCL INFO Channel 03 : 0[65010] -> 2[67010] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44179:44341 [2] NCCL INFO Channel 07 : 2[67010] -> 3[67020] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44177:44325 [0] NCCL INFO Channel 06 : 0[65010] -> 2[67010] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44177:44325 [0] NCCL INFO Channel 07 : 0[65010] -> 2[67010] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44178:44342 [1] NCCL INFO Channel 00 : 1[65020] -> 3[67020] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44178:44342 [1] NCCL INFO Channel 03 : 1[65020] -> 3[67020] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44178:44342 [1] NCCL INFO Channel 04 : 1[65020] -> 3[67020] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44178:44342 [1] NCCL INFO Channel 07 : 1[65020] -> 3[67020] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44180:44344 [3] NCCL INFO Channel 01 : 3[67020] -> 1[65020] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44179:44341 [2] NCCL INFO Channel 00 : 2[67010] -> 0[65010] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44180:44344 [3] NCCL INFO Channel 02 : 3[67020] -> 1[65020] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44179:44341 [2] NCCL INFO Channel 01 : 2[67010] -> 0[65010] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44180:44344 [3] NCCL INFO Channel 05 : 3[67020] -> 1[65020] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44179:44341 [2] NCCL INFO Channel 04 : 2[67010] -> 0[65010] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44180:44344 [3] NCCL INFO Channel 06 : 3[67020] -> 1[65020] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44179:44341 [2] NCCL INFO Channel 05 : 2[67010] -> 0[65010] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44178:44342 [1] NCCL INFO Connected all trees
iv-2udaavw4l02thdv8lcrl:44178:44342 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
iv-2udaavw4l02thdv8lcrl:44178:44342 [1] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer
iv-2udaavw4l02thdv8lcrl:44177:44325 [0] NCCL INFO Connected all trees
iv-2udaavw4l02thdv8lcrl:44177:44325 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
iv-2udaavw4l02thdv8lcrl:44177:44325 [0] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer
iv-2udaavw4l02thdv8lcrl:44180:44344 [3] NCCL INFO Channel 00 : 3[67020] -> 2[67010] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44180:44344 [3] NCCL INFO Channel 01 : 3[67020] -> 2[67010] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44180:44344 [3] NCCL INFO Channel 03 : 3[67020] -> 2[67010] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44180:44344 [3] NCCL INFO Channel 04 : 3[67020] -> 2[67010] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44180:44344 [3] NCCL INFO Channel 05 : 3[67020] -> 2[67010] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44180:44344 [3] NCCL INFO Channel 07 : 3[67020] -> 2[67010] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44179:44341 [2] NCCL INFO Connected all trees
iv-2udaavw4l02thdv8lcrl:44179:44341 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
iv-2udaavw4l02thdv8lcrl:44179:44341 [2] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer
iv-2udaavw4l02thdv8lcrl:44180:44344 [3] NCCL INFO Connected all trees
iv-2udaavw4l02thdv8lcrl:44180:44344 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
iv-2udaavw4l02thdv8lcrl:44180:44344 [3] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer
iv-2udaavw4l02thdv8lcrl:44177:44325 [0] NCCL INFO comm 0x7f6710008fb0 rank 0 nranks 4 cudaDev 0 busId 65010 - Init COMPLETE
iv-2udaavw4l02thdv8lcrl:44178:44342 [1] NCCL INFO comm 0x7fadc0008fb0 rank 1 nranks 4 cudaDev 1 busId 65020 - Init COMPLETE
iv-2udaavw4l02thdv8lcrl:44179:44341 [2] NCCL INFO comm 0x7fb7b4008fb0 rank 2 nranks 4 cudaDev 2 busId 67010 - Init COMPLETE
iv-2udaavw4l02thdv8lcrl:44180:44344 [3] NCCL INFO comm 0x7f5c40008fb0 rank 3 nranks 4 cudaDev 3 busId 67020 - Init COMPLETE
iv-2udaavw4l02thdv8lcrl:44177:44177 [0] NCCL INFO Launch mode Parallel
>>> done with compiling and loading fused kernels. Compilation time: 4.829 seconds
time to initialize megatron (seconds): 5.197
[after megatron is initialized] datetime: 2022-06-15 11:45:06 
building GPT model ...
 > number of parameters on (tensor, pipeline) model parallel rank (1, 0): 178100224
 > number of parameters on (tensor, pipeline) model parallel rank (0, 0): 178100224
> learning rate decay style: cosine
[after model, optimizer, and learning rate scheduler are built] datetime: 2022-06-15 11:45:06 
> building train, validation, and test datasets ...
 > datasets target sizes (minimum size):
    train:      107520
    validation: 5120
    test:       5120
> building train, validation, and test datasets for GPT ...
 > building dataset index ...
    reading sizes...
    reading pointers...
    reading document index...
    creating numpy buffer of mmap...
    creating memory view of numpy buffer...
 > finished creating indexed dataset in 0.003293 seconds
    number of documents: 1249934
 > dataset split:
    train:
     document indices in [0, 1186187) total of 1186187 documents
    validation:
     document indices in [1186187, 1248684) total of 62497 documents
    test:
     document indices in [1248684, 1249934) total of 1250 documents
iv-2udaavw4l02thdv8lcrl:44177:44373 [0] NCCL INFO Channel 00/04 :    0   1
iv-2udaavw4l02thdv8lcrl:44179:44374 [2] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0
iv-2udaavw4l02thdv8lcrl:44177:44373 [0] NCCL INFO Channel 01/04 :    0   1
iv-2udaavw4l02thdv8lcrl:44179:44374 [2] NCCL INFO Setting affinity for GPU 2 to 03ff,ffffffff
iv-2udaavw4l02thdv8lcrl:44177:44373 [0] NCCL INFO Channel 02/04 :    0   1
iv-2udaavw4l02thdv8lcrl:44177:44373 [0] NCCL INFO Channel 03/04 :    0   1
iv-2udaavw4l02thdv8lcrl:44177:44373 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1
iv-2udaavw4l02thdv8lcrl:44177:44373 [0] NCCL INFO Setting affinity for GPU 0 to 03ff,ffffffff
iv-2udaavw4l02thdv8lcrl:44179:44374 [2] NCCL INFO Channel 00 : 1[67010] -> 0[65010] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44179:44374 [2] NCCL INFO Channel 01 : 1[67010] -> 0[65010] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44179:44374 [2] NCCL INFO Channel 02 : 1[67010] -> 0[65010] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44179:44374 [2] NCCL INFO Channel 03 : 1[67010] -> 0[65010] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44177:44373 [0] NCCL INFO Channel 00 : 0[65010] -> 1[67010] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44177:44373 [0] NCCL INFO Channel 01 : 0[65010] -> 1[67010] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44177:44373 [0] NCCL INFO Channel 02 : 0[65010] -> 1[67010] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44177:44373 [0] NCCL INFO Channel 03 : 0[65010] -> 1[67010] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44179:44374 [2] NCCL INFO Connected all rings
iv-2udaavw4l02thdv8lcrl:44177:44373 [0] NCCL INFO Connected all rings
iv-2udaavw4l02thdv8lcrl:44179:44374 [2] NCCL INFO Connected all trees
iv-2udaavw4l02thdv8lcrl:44177:44373 [0] NCCL INFO Connected all trees
iv-2udaavw4l02thdv8lcrl:44179:44374 [2] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
iv-2udaavw4l02thdv8lcrl:44179:44374 [2] NCCL INFO 4 coll channels, 4 p2p channels, 4 p2p channels per peer
iv-2udaavw4l02thdv8lcrl:44177:44373 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
iv-2udaavw4l02thdv8lcrl:44177:44373 [0] NCCL INFO 4 coll channels, 4 p2p channels, 4 p2p channels per peer
iv-2udaavw4l02thdv8lcrl:44179:44374 [2] NCCL INFO comm 0x7fb71c008fb0 rank 1 nranks 2 cudaDev 2 busId 67010 - Init COMPLETE
iv-2udaavw4l02thdv8lcrl:44177:44373 [0] NCCL INFO comm 0x7f6670008fb0 rank 0 nranks 2 cudaDev 0 busId 65010 - Init COMPLETE
iv-2udaavw4l02thdv8lcrl:44177:44177 [0] NCCL INFO Launch mode Parallel
NCCL version 2.10.3+cuda11.4
iv-2udaavw4l02thdv8lcrl:44179:44382 [2] NCCL INFO Channel 00/32 :    0
iv-2udaavw4l02thdv8lcrl:44179:44382 [2] NCCL INFO Channel 01/32 :    0
iv-2udaavw4l02thdv8lcrl:44179:44382 [2] NCCL INFO Channel 02/32 :    0
iv-2udaavw4l02thdv8lcrl:44179:44382 [2] NCCL INFO Channel 03/32 :    0
iv-2udaavw4l02thdv8lcrl:44179:44382 [2] NCCL INFO Channel 04/32 :    0
iv-2udaavw4l02thdv8lcrl:44179:44382 [2] NCCL INFO Channel 05/32 :    0
iv-2udaavw4l02thdv8lcrl:44179:44382 [2] NCCL INFO Channel 06/32 :    0
iv-2udaavw4l02thdv8lcrl:44179:44382 [2] NCCL INFO Channel 07/32 :    0
iv-2udaavw4l02thdv8lcrl:44179:44382 [2] NCCL INFO Channel 08/32 :    0
iv-2udaavw4l02thdv8lcrl:44179:44382 [2] NCCL INFO Channel 09/32 :    0
iv-2udaavw4l02thdv8lcrl:44179:44382 [2] NCCL INFO Channel 10/32 :    0
iv-2udaavw4l02thdv8lcrl:44179:44382 [2] NCCL INFO Channel 11/32 :    0
iv-2udaavw4l02thdv8lcrl:44179:44382 [2] NCCL INFO Channel 12/32 :    0
iv-2udaavw4l02thdv8lcrl:44179:44382 [2] NCCL INFO Channel 13/32 :    0
iv-2udaavw4l02thdv8lcrl:44179:44382 [2] NCCL INFO Channel 14/32 :    0
iv-2udaavw4l02thdv8lcrl:44179:44382 [2] NCCL INFO Channel 15/32 :    0
iv-2udaavw4l02thdv8lcrl:44179:44382 [2] NCCL INFO Channel 16/32 :    0
iv-2udaavw4l02thdv8lcrl:44179:44382 [2] NCCL INFO Channel 17/32 :    0
iv-2udaavw4l02thdv8lcrl:44179:44382 [2] NCCL INFO Channel 18/32 :    0
iv-2udaavw4l02thdv8lcrl:44179:44382 [2] NCCL INFO Channel 19/32 :    0
iv-2udaavw4l02thdv8lcrl:44179:44382 [2] NCCL INFO Channel 20/32 :    0
iv-2udaavw4l02thdv8lcrl:44179:44382 [2] NCCL INFO Channel 21/32 :    0
iv-2udaavw4l02thdv8lcrl:44179:44382 [2] NCCL INFO Channel 22/32 :    0
iv-2udaavw4l02thdv8lcrl:44179:44382 [2] NCCL INFO Channel 23/32 :    0
iv-2udaavw4l02thdv8lcrl:44179:44382 [2] NCCL INFO Channel 24/32 :    0
iv-2udaavw4l02thdv8lcrl:44179:44382 [2] NCCL INFO Channel 25/32 :    0
iv-2udaavw4l02thdv8lcrl:44179:44382 [2] NCCL INFO Channel 26/32 :    0
iv-2udaavw4l02thdv8lcrl:44179:44382 [2] NCCL INFO Channel 27/32 :    0
iv-2udaavw4l02thdv8lcrl:44179:44382 [2] NCCL INFO Channel 28/32 :    0
iv-2udaavw4l02thdv8lcrl:44179:44382 [2] NCCL INFO Channel 29/32 :    0
iv-2udaavw4l02thdv8lcrl:44179:44382 [2] NCCL INFO Channel 30/32 :    0
iv-2udaavw4l02thdv8lcrl:44179:44382 [2] NCCL INFO Channel 31/32 :    0
iv-2udaavw4l02thdv8lcrl:44179:44382 [2] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1
iv-2udaavw4l02thdv8lcrl:44179:44382 [2] NCCL INFO Setting affinity for GPU 2 to 03ff,ffffffff
iv-2udaavw4l02thdv8lcrl:44177:44381 [0] NCCL INFO Channel 00/32 :    0
iv-2udaavw4l02thdv8lcrl:44177:44381 [0] NCCL INFO Channel 01/32 :    0
iv-2udaavw4l02thdv8lcrl:44177:44381 [0] NCCL INFO Channel 02/32 :    0
iv-2udaavw4l02thdv8lcrl:44177:44381 [0] NCCL INFO Channel 03/32 :    0
iv-2udaavw4l02thdv8lcrl:44177:44381 [0] NCCL INFO Channel 04/32 :    0
iv-2udaavw4l02thdv8lcrl:44177:44381 [0] NCCL INFO Channel 05/32 :    0
iv-2udaavw4l02thdv8lcrl:44177:44381 [0] NCCL INFO Channel 06/32 :    0
iv-2udaavw4l02thdv8lcrl:44177:44381 [0] NCCL INFO Channel 07/32 :    0
iv-2udaavw4l02thdv8lcrl:44177:44381 [0] NCCL INFO Channel 08/32 :    0
iv-2udaavw4l02thdv8lcrl:44177:44381 [0] NCCL INFO Channel 09/32 :    0
iv-2udaavw4l02thdv8lcrl:44177:44381 [0] NCCL INFO Channel 10/32 :    0
iv-2udaavw4l02thdv8lcrl:44177:44381 [0] NCCL INFO Channel 11/32 :    0
iv-2udaavw4l02thdv8lcrl:44177:44381 [0] NCCL INFO Channel 12/32 :    0
iv-2udaavw4l02thdv8lcrl:44177:44381 [0] NCCL INFO Channel 13/32 :    0
iv-2udaavw4l02thdv8lcrl:44177:44381 [0] NCCL INFO Channel 14/32 :    0
iv-2udaavw4l02thdv8lcrl:44177:44381 [0] NCCL INFO Channel 15/32 :    0
iv-2udaavw4l02thdv8lcrl:44177:44381 [0] NCCL INFO Channel 16/32 :    0
iv-2udaavw4l02thdv8lcrl:44177:44381 [0] NCCL INFO Channel 17/32 :    0
iv-2udaavw4l02thdv8lcrl:44177:44381 [0] NCCL INFO Channel 18/32 :    0
iv-2udaavw4l02thdv8lcrl:44177:44381 [0] NCCL INFO Channel 19/32 :    0
iv-2udaavw4l02thdv8lcrl:44177:44381 [0] NCCL INFO Channel 20/32 :    0
iv-2udaavw4l02thdv8lcrl:44177:44381 [0] NCCL INFO Channel 21/32 :    0
iv-2udaavw4l02thdv8lcrl:44177:44381 [0] NCCL INFO Channel 22/32 :    0
iv-2udaavw4l02thdv8lcrl:44177:44381 [0] NCCL INFO Channel 23/32 :    0
iv-2udaavw4l02thdv8lcrl:44177:44381 [0] NCCL INFO Channel 24/32 :    0
iv-2udaavw4l02thdv8lcrl:44177:44381 [0] NCCL INFO Channel 25/32 :    0
iv-2udaavw4l02thdv8lcrl:44177:44381 [0] NCCL INFO Channel 26/32 :    0
iv-2udaavw4l02thdv8lcrl:44177:44381 [0] NCCL INFO Channel 27/32 :    0
iv-2udaavw4l02thdv8lcrl:44177:44381 [0] NCCL INFO Channel 28/32 :    0
iv-2udaavw4l02thdv8lcrl:44177:44381 [0] NCCL INFO Channel 29/32 :    0
iv-2udaavw4l02thdv8lcrl:44177:44381 [0] NCCL INFO Channel 30/32 :    0
iv-2udaavw4l02thdv8lcrl:44177:44381 [0] NCCL INFO Channel 31/32 :    0
iv-2udaavw4l02thdv8lcrl:44177:44381 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1
iv-2udaavw4l02thdv8lcrl:44177:44381 [0] NCCL INFO Setting affinity for GPU 0 to 03ff,ffffffff
iv-2udaavw4l02thdv8lcrl:44179:44382 [2] NCCL INFO Connected all rings
iv-2udaavw4l02thdv8lcrl:44179:44382 [2] NCCL INFO Connected all trees
iv-2udaavw4l02thdv8lcrl:44179:44382 [2] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
iv-2udaavw4l02thdv8lcrl:44179:44382 [2] NCCL INFO comm 0x7fb718008fb0 rank 0 nranks 1 cudaDev 2 busId 67010 - Init COMPLETE
iv-2udaavw4l02thdv8lcrl:44177:44381 [0] NCCL INFO Connected all rings
iv-2udaavw4l02thdv8lcrl:44177:44381 [0] NCCL INFO Connected all trees
iv-2udaavw4l02thdv8lcrl:44177:44381 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
iv-2udaavw4l02thdv8lcrl:44177:44381 [0] NCCL INFO comm 0x7f6674008fb0 rank 0 nranks 1 cudaDev 0 busId 65010 - Init COMPLETE
 > loading doc-idx mapping from /dataset/source/dataset/loss_compara_content_sentence_train_indexmap_107520ns_1024sl_1234s_doc_idx.npy
 > loading sample-idx mapping from /dataset/source/dataset/loss_compara_content_sentence_train_indexmap_107520ns_1024sl_1234s_sample_idx.npy
 > loading shuffle-idx mapping from /dataset/source/dataset/loss_compara_content_sentence_train_indexmap_107520ns_1024sl_1234s_shuffle_idx.npy
    loaded indexed file in 0.004 seconds
    total number of samples: 108847
    total number of epochs: 2
 > loading doc-idx mapping from /dataset/source/dataset/loss_compara_content_sentence_valid_indexmap_5120ns_1024sl_1234s_doc_idx.npy
 > loading sample-idx mapping from /dataset/source/dataset/loss_compara_content_sentence_valid_indexmap_5120ns_1024sl_1234s_sample_idx.npy
 > loading shuffle-idx mapping from /dataset/source/dataset/loss_compara_content_sentence_valid_indexmap_5120ns_1024sl_1234s_shuffle_idx.npy
    loaded indexed file in 0.002 seconds
    total number of samples: 5718
    total number of epochs: 2
 > loading doc-idx mapping from /dataset/source/dataset/loss_compara_content_sentence_test_indexmap_5120ns_1024sl_1234s_doc_idx.npy
 > loading sample-idx mapping from /dataset/source/dataset/loss_compara_content_sentence_test_indexmap_5120ns_1024sl_1234s_sample_idx.npy
 > loading shuffle-idx mapping from /dataset/source/dataset/loss_compara_content_sentence_test_indexmap_5120ns_1024sl_1234s_shuffle_idx.npy
    loaded indexed file in 0.002 seconds
    total number of samples: 5128
    total number of epochs: 102
> finished creating GPT datasets ...
iv-2udaavw4l02thdv8lcrl:44179:44388 [2] NCCL INFO Channel 00/04 :    0   1
iv-2udaavw4l02thdv8lcrl:44180:44389 [3] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0
iv-2udaavw4l02thdv8lcrl:44179:44388 [2] NCCL INFO Channel 01/04 :    0   1
iv-2udaavw4l02thdv8lcrl:44180:44389 [3] NCCL INFO Setting affinity for GPU 3 to 03ff,ffffffff
iv-2udaavw4l02thdv8lcrl:44179:44388 [2] NCCL INFO Channel 02/04 :    0   1
iv-2udaavw4l02thdv8lcrl:44179:44388 [2] NCCL INFO Channel 03/04 :    0   1
iv-2udaavw4l02thdv8lcrl:44179:44388 [2] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1
iv-2udaavw4l02thdv8lcrl:44179:44388 [2] NCCL INFO Setting affinity for GPU 2 to 03ff,ffffffff
iv-2udaavw4l02thdv8lcrl:44177:44391 [0] NCCL INFO Channel 00/02 :    0   1
iv-2udaavw4l02thdv8lcrl:44178:44392 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
iv-2udaavw4l02thdv8lcrl:44178:44392 [1] NCCL INFO Setting affinity for GPU 1 to 03ff,ffffffff
iv-2udaavw4l02thdv8lcrl:44177:44391 [0] NCCL INFO Channel 01/02 :    0   1
iv-2udaavw4l02thdv8lcrl:44177:44391 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
iv-2udaavw4l02thdv8lcrl:44177:44391 [0] NCCL INFO Setting affinity for GPU 0 to 03ff,ffffffff
iv-2udaavw4l02thdv8lcrl:44179:44388 [2] NCCL INFO Channel 00 : 0[67010] -> 1[67020] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44180:44389 [3] NCCL INFO Channel 00 : 1[67020] -> 0[67010] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44179:44388 [2] NCCL INFO Channel 01 : 0[67010] -> 1[67020] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44177:44391 [0] NCCL INFO Channel 00 : 0[65010] -> 1[65020] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44178:44392 [1] NCCL INFO Channel 00 : 1[65020] -> 0[65010] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44180:44389 [3] NCCL INFO Channel 01 : 1[67020] -> 0[67010] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44179:44388 [2] NCCL INFO Channel 02 : 0[67010] -> 1[67020] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44178:44392 [1] NCCL INFO Channel 01 : 1[65020] -> 0[65010] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44177:44391 [0] NCCL INFO Channel 01 : 0[65010] -> 1[65020] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44180:44389 [3] NCCL INFO Channel 02 : 1[67020] -> 0[67010] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44179:44388 [2] NCCL INFO Channel 03 : 0[67010] -> 1[67020] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44180:44389 [3] NCCL INFO Channel 03 : 1[67020] -> 0[67010] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44177:44391 [0] NCCL INFO Connected all rings
iv-2udaavw4l02thdv8lcrl:44177:44391 [0] NCCL INFO Connected all trees
iv-2udaavw4l02thdv8lcrl:44177:44391 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
iv-2udaavw4l02thdv8lcrl:44177:44391 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
iv-2udaavw4l02thdv8lcrl:44178:44392 [1] NCCL INFO Connected all rings
iv-2udaavw4l02thdv8lcrl:44178:44392 [1] NCCL INFO Connected all trees
iv-2udaavw4l02thdv8lcrl:44178:44392 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
iv-2udaavw4l02thdv8lcrl:44178:44392 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
iv-2udaavw4l02thdv8lcrl:44178:44392 [1] NCCL INFO comm 0x7fad2c008fb0 rank 1 nranks 2 cudaDev 1 busId 65020 - Init COMPLETE
iv-2udaavw4l02thdv8lcrl:44177:44391 [0] NCCL INFO comm 0x7f6668008fb0 rank 0 nranks 2 cudaDev 0 busId 65010 - Init COMPLETE
iv-2udaavw4l02thdv8lcrl:44177:44177 [0] NCCL INFO Launch mode Parallel
iv-2udaavw4l02thdv8lcrl:44179:44388 [2] NCCL INFO Connected all rings
iv-2udaavw4l02thdv8lcrl:44179:44388 [2] NCCL INFO Connected all trees
iv-2udaavw4l02thdv8lcrl:44179:44388 [2] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
iv-2udaavw4l02thdv8lcrl:44179:44388 [2] NCCL INFO 4 coll channels, 4 p2p channels, 4 p2p channels per peer
iv-2udaavw4l02thdv8lcrl:44180:44389 [3] NCCL INFO Connected all rings
iv-2udaavw4l02thdv8lcrl:44180:44389 [3] NCCL INFO Connected all trees
iv-2udaavw4l02thdv8lcrl:44180:44389 [3] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
iv-2udaavw4l02thdv8lcrl:44180:44389 [3] NCCL INFO 4 coll channels, 4 p2p channels, 4 p2p channels per peer
iv-2udaavw4l02thdv8lcrl:44179:44388 [2] NCCL INFO comm 0x7fb70c008fb0 rank 0 nranks 2 cudaDev 2 busId 67010 - Init COMPLETE
iv-2udaavw4l02thdv8lcrl:44180:44389 [3] NCCL INFO comm 0x7f5ba8008fb0 rank 1 nranks 2 cudaDev 3 busId 67020 - Init COMPLETE
iv-2udaavw4l02thdv8lcrl:44179:44179 [2] NCCL INFO Launch mode Parallel
[W pthreadpool-cpp.cc:99] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:99] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:99] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:99] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:99] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:99] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:99] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:99] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[after dataloaders are built] datetime: 2022-06-15 11:45:06 
done with setup ...
time (ms) | model-and-optimizer-setup: 113.63 | train/valid/test-data-iterators-setup: 346.11
training ...
[before the start of training step] datetime: 2022-06-15 11:45:06 
/dataset/workspace/Megatron-LM/megatron/model/transformer.py:536: UserWarning: AutoNonVariableTypeMode is deprecated and will be removed in 1.10 release. For kernel implementations please use AutoDispatchBelowADInplaceOrView instead, If you are looking for a user facing API to enable running your inference-only workload, please use c10::InferenceMode. Using AutoDispatchBelowADInplaceOrView in user code is under risk of producing silent wrong result in some edge cases. See Note [AutoDispatchBelowAutograd] for more details. (Triggered internally at  /opt/pytorch/pytorch/aten/src/ATen/core/LegacyTypeDispatch.h:74.)
  output = bias_dropout_add_func(
/dataset/workspace/Megatron-LM/megatron/model/transformer.py:536: UserWarning: AutoNonVariableTypeMode is deprecated and will be removed in 1.10 release. For kernel implementations please use AutoDispatchBelowADInplaceOrView instead, If you are looking for a user facing API to enable running your inference-only workload, please use c10::InferenceMode. Using AutoDispatchBelowADInplaceOrView in user code is under risk of producing silent wrong result in some edge cases. See Note [AutoDispatchBelowAutograd] for more details. (Triggered internally at  /opt/pytorch/pytorch/aten/src/ATen/core/LegacyTypeDispatch.h:74.)
  output = bias_dropout_add_func(
/dataset/workspace/Megatron-LM/megatron/model/transformer.py:536: UserWarning: AutoNonVariableTypeMode is deprecated and will be removed in 1.10 release. For kernel implementations please use AutoDispatchBelowADInplaceOrView instead, If you are looking for a user facing API to enable running your inference-only workload, please use c10::InferenceMode. Using AutoDispatchBelowADInplaceOrView in user code is under risk of producing silent wrong result in some edge cases. See Note [AutoDispatchBelowAutograd] for more details. (Triggered internally at  /opt/pytorch/pytorch/aten/src/ATen/core/LegacyTypeDispatch.h:74.)
  output = bias_dropout_add_func(
/dataset/workspace/Megatron-LM/megatron/model/transformer.py:536: UserWarning: AutoNonVariableTypeMode is deprecated and will be removed in 1.10 release. For kernel implementations please use AutoDispatchBelowADInplaceOrView instead, If you are looking for a user facing API to enable running your inference-only workload, please use c10::InferenceMode. Using AutoDispatchBelowADInplaceOrView in user code is under risk of producing silent wrong result in some edge cases. See Note [AutoDispatchBelowAutograd] for more details. (Triggered internally at  /opt/pytorch/pytorch/aten/src/ATen/core/LegacyTypeDispatch.h:74.)
  output = bias_dropout_add_func(
NCCL version 2.10.3+cuda11.4
iv-2udaavw4l02thdv8lcrl:44178:44954 [1] NCCL INFO Channel 00/04 :    0   1
iv-2udaavw4l02thdv8lcrl:44180:44955 [3] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0
iv-2udaavw4l02thdv8lcrl:44178:44954 [1] NCCL INFO Channel 01/04 :    0   1
iv-2udaavw4l02thdv8lcrl:44180:44955 [3] NCCL INFO Setting affinity for GPU 3 to 03ff,ffffffff
iv-2udaavw4l02thdv8lcrl:44178:44954 [1] NCCL INFO Channel 02/04 :    0   1
iv-2udaavw4l02thdv8lcrl:44178:44954 [1] NCCL INFO Channel 03/04 :    0   1
iv-2udaavw4l02thdv8lcrl:44178:44954 [1] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1
iv-2udaavw4l02thdv8lcrl:44178:44954 [1] NCCL INFO Setting affinity for GPU 1 to 03ff,ffffffff
iv-2udaavw4l02thdv8lcrl:44180:44955 [3] NCCL INFO Channel 00 : 1[67020] -> 0[65020] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44178:44954 [1] NCCL INFO Channel 00 : 0[65020] -> 1[67020] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44180:44955 [3] NCCL INFO Channel 01 : 1[67020] -> 0[65020] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44178:44954 [1] NCCL INFO Channel 01 : 0[65020] -> 1[67020] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44180:44955 [3] NCCL INFO Channel 02 : 1[67020] -> 0[65020] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44178:44954 [1] NCCL INFO Channel 02 : 0[65020] -> 1[67020] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44180:44955 [3] NCCL INFO Channel 03 : 1[67020] -> 0[65020] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44178:44954 [1] NCCL INFO Channel 03 : 0[65020] -> 1[67020] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44180:44955 [3] NCCL INFO Connected all rings
iv-2udaavw4l02thdv8lcrl:44180:44955 [3] NCCL INFO Connected all trees
iv-2udaavw4l02thdv8lcrl:44180:44955 [3] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
iv-2udaavw4l02thdv8lcrl:44180:44955 [3] NCCL INFO 4 coll channels, 4 p2p channels, 4 p2p channels per peer
iv-2udaavw4l02thdv8lcrl:44178:44954 [1] NCCL INFO Connected all rings
iv-2udaavw4l02thdv8lcrl:44178:44954 [1] NCCL INFO Connected all trees
iv-2udaavw4l02thdv8lcrl:44178:44954 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
iv-2udaavw4l02thdv8lcrl:44178:44954 [1] NCCL INFO 4 coll channels, 4 p2p channels, 4 p2p channels per peer
iv-2udaavw4l02thdv8lcrl:44178:44954 [1] NCCL INFO comm 0x7faa8c008fb0 rank 0 nranks 2 cudaDev 1 busId 65020 - Init COMPLETE
iv-2udaavw4l02thdv8lcrl:44180:44955 [3] NCCL INFO comm 0x7f5900008fb0 rank 1 nranks 2 cudaDev 3 busId 67020 - Init COMPLETE
iv-2udaavw4l02thdv8lcrl:44178:44178 [1] NCCL INFO Launch mode Parallel
iv-2udaavw4l02thdv8lcrl:44177:45034 [0] NCCL INFO Channel 00/02 :    0   1
iv-2udaavw4l02thdv8lcrl:44178:45035 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
iv-2udaavw4l02thdv8lcrl:44177:45034 [0] NCCL INFO Channel 01/02 :    0   1
iv-2udaavw4l02thdv8lcrl:44178:45035 [1] NCCL INFO Setting affinity for GPU 1 to 03ff,ffffffff
iv-2udaavw4l02thdv8lcrl:44177:45034 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
iv-2udaavw4l02thdv8lcrl:44177:45034 [0] NCCL INFO Setting affinity for GPU 0 to 03ff,ffffffff
iv-2udaavw4l02thdv8lcrl:44178:45035 [1] NCCL INFO Channel 00 : 1[65020] -> 0[65010] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44177:45034 [0] NCCL INFO Channel 00 : 0[65010] -> 1[65020] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44179:45033 [2] NCCL INFO Channel 00/04 :    0   1
iv-2udaavw4l02thdv8lcrl:44180:45036 [3] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0
iv-2udaavw4l02thdv8lcrl:44179:45033 [2] NCCL INFO Channel 01/04 :    0   1
iv-2udaavw4l02thdv8lcrl:44179:45033 [2] NCCL INFO Channel 02/04 :    0   1
iv-2udaavw4l02thdv8lcrl:44180:45036 [3] NCCL INFO Setting affinity for GPU 3 to 03ff,ffffffff
iv-2udaavw4l02thdv8lcrl:44179:45033 [2] NCCL INFO Channel 03/04 :    0   1
iv-2udaavw4l02thdv8lcrl:44179:45033 [2] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1
iv-2udaavw4l02thdv8lcrl:44179:45033 [2] NCCL INFO Setting affinity for GPU 2 to 03ff,ffffffff
iv-2udaavw4l02thdv8lcrl:44178:45035 [1] NCCL INFO Channel 01 : 1[65020] -> 0[65010] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44177:45034 [0] NCCL INFO Channel 01 : 0[65010] -> 1[65020] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44178:45035 [1] NCCL INFO Connected all rings
iv-2udaavw4l02thdv8lcrl:44178:45035 [1] NCCL INFO Connected all trees
iv-2udaavw4l02thdv8lcrl:44178:45035 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
iv-2udaavw4l02thdv8lcrl:44178:45035 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
iv-2udaavw4l02thdv8lcrl:44177:45034 [0] NCCL INFO Connected all rings
iv-2udaavw4l02thdv8lcrl:44177:45034 [0] NCCL INFO Connected all trees
iv-2udaavw4l02thdv8lcrl:44177:45034 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
iv-2udaavw4l02thdv8lcrl:44177:45034 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
iv-2udaavw4l02thdv8lcrl:44178:45035 [1] NCCL INFO comm 0x7faa64008fb0 rank 1 nranks 2 cudaDev 1 busId 65020 - Init COMPLETE
iv-2udaavw4l02thdv8lcrl:44177:45034 [0] NCCL INFO comm 0x7f6384008fb0 rank 0 nranks 2 cudaDev 0 busId 65010 - Init COMPLETE
iv-2udaavw4l02thdv8lcrl:44177:44177 [0] NCCL INFO Launch mode Parallel
iv-2udaavw4l02thdv8lcrl:44179:45033 [2] NCCL INFO Channel 00 : 0[67010] -> 1[67020] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44180:45036 [3] NCCL INFO Channel 00 : 1[67020] -> 0[67010] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44179:45033 [2] NCCL INFO Channel 01 : 0[67010] -> 1[67020] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44180:45036 [3] NCCL INFO Channel 01 : 1[67020] -> 0[67010] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44179:45033 [2] NCCL INFO Channel 02 : 0[67010] -> 1[67020] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44180:45036 [3] NCCL INFO Channel 02 : 1[67020] -> 0[67010] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44179:45033 [2] NCCL INFO Channel 03 : 0[67010] -> 1[67020] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44180:45036 [3] NCCL INFO Channel 03 : 1[67020] -> 0[67010] via P2P/IPC
iv-2udaavw4l02thdv8lcrl:44180:45036 [3] NCCL INFO Connected all rings
iv-2udaavw4l02thdv8lcrl:44180:45036 [3] NCCL INFO Connected all trees
iv-2udaavw4l02thdv8lcrl:44180:45036 [3] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
iv-2udaavw4l02thdv8lcrl:44180:45036 [3] NCCL INFO 4 coll channels, 4 p2p channels, 4 p2p channels per peer
iv-2udaavw4l02thdv8lcrl:44179:45033 [2] NCCL INFO Connected all rings
iv-2udaavw4l02thdv8lcrl:44179:45033 [2] NCCL INFO Connected all trees
iv-2udaavw4l02thdv8lcrl:44179:45033 [2] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
iv-2udaavw4l02thdv8lcrl:44179:45033 [2] NCCL INFO 4 coll channels, 4 p2p channels, 4 p2p channels per peer
iv-2udaavw4l02thdv8lcrl:44180:45036 [3] NCCL INFO comm 0x7f58dc008fb0 rank 1 nranks 2 cudaDev 3 busId 67020 - Init COMPLETE
iv-2udaavw4l02thdv8lcrl:44179:45033 [2] NCCL INFO comm 0x7fb420008fb0 rank 0 nranks 2 cudaDev 2 busId 67010 - Init COMPLETE
iv-2udaavw4l02thdv8lcrl:44179:44179 [2] NCCL INFO Launch mode Parallel
 iteration      100/     210 | consumed samples:        51200 | elapsed time per iteration (ms): 13875.0 | tpt: 36.9 samples/s | global batch size:   512 | lm loss: 1.000429E+01 | loss scale: 524288.0 | grad norm: 1.443 | number of skipped iterations:  14 | number of nan iterations:   0 |
[Rank 1] (after 100 iterations) memory (MB) | allocated: 3403.0478515625 | max allocated: 8384.1923828125 | reserved: 12764.0 | max reserved: 12764.0
[Rank 0] (after 100 iterations) memory (MB) | allocated: 3403.0478515625 | max allocated: 8384.1923828125 | reserved: 12508.0 | max reserved: 12508.0
time (ms) | forward-compute: 5507.15 | backward-compute: 8164.66 | backward-params-all-reduce: 179.91 | backward-embedding-all-reduce: 0.03 | optimizer-copy-to-main-grad: 3.08 | optimizer-unscale-and-check-inf: 4.37 | optimizer-clip-main-grad: 3.64 | optimizer-copy-main-to-model-params: 2.64 | optimizer: 20.12 | batch-generator: 5.41
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2022/06/15 12:08:28.162, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 56 %, 32510 MiB, 18634 MiB, 13876 MiB
2022/06/15 12:08:28.163, Tesla V100-SXM2-32GB, 470.57.02, 58 %, 36 %, 32510 MiB, 18378 MiB, 14132 MiB
2022/06/15 12:08:28.163, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 56 %, 32510 MiB, 18634 MiB, 13876 MiB
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2022/06/15 12:08:28.165, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 11 %, 32510 MiB, 18258 MiB, 14252 MiB
2022/06/15 12:08:28.165, Tesla V100-SXM2-32GB, 470.57.02, 58 %, 36 %, 32510 MiB, 18378 MiB, 14132 MiB
2022/06/15 12:08:28.167, Tesla V100-SXM2-32GB, 470.57.02, 80 %, 11 %, 32510 MiB, 18434 MiB, 14076 MiB
2022/06/15 12:08:28.166, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 56 %, 32510 MiB, 18634 MiB, 13876 MiB
2022/06/15 12:08:28.167, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 11 %, 32510 MiB, 18258 MiB, 14252 MiB
2022/06/15 12:08:28.169, Tesla V100-SXM2-32GB, 470.57.02, 0 %, 0 %, 32510 MiB, 32507 MiB, 3 MiB
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2022/06/15 12:08:28.169, Tesla V100-SXM2-32GB, 470.57.02, 58 %, 36 %, 32510 MiB, 18378 MiB, 14132 MiB
2022/06/15 12:08:28.169, Tesla V100-SXM2-32GB, 470.57.02, 80 %, 11 %, 32510 MiB, 18434 MiB, 14076 MiB
2022/06/15 12:08:28.171, Tesla V100-SXM2-32GB, 470.57.02, 0 %, 0 %, 32510 MiB, 32507 MiB, 3 MiB
2022/06/15 12:08:28.171, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 11 %, 32510 MiB, 18258 MiB, 14252 MiB
2022/06/15 12:08:28.171, Tesla V100-SXM2-32GB, 470.57.02, 0 %, 0 %, 32510 MiB, 32507 MiB, 3 MiB
2022/06/15 12:08:28.171, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 56 %, 32510 MiB, 18634 MiB, 13876 MiB
2022/06/15 12:08:28.174, Tesla V100-SXM2-32GB, 470.57.02, 0 %, 0 %, 32510 MiB, 32507 MiB, 3 MiB
2022/06/15 12:08:28.174, Tesla V100-SXM2-32GB, 470.57.02, 80 %, 11 %, 32510 MiB, 18434 MiB, 14076 MiB
2022/06/15 12:08:28.174, Tesla V100-SXM2-32GB, 470.57.02, 0 %, 0 %, 32510 MiB, 32507 MiB, 3 MiB
2022/06/15 12:08:28.174, Tesla V100-SXM2-32GB, 470.57.02, 58 %, 36 %, 32510 MiB, 18378 MiB, 14132 MiB
2022/06/15 12:08:28.176, Tesla V100-SXM2-32GB, 470.57.02, 0 %, 0 %, 32510 MiB, 32507 MiB, 3 MiB
2022/06/15 12:08:28.177, Tesla V100-SXM2-32GB, 470.57.02, 0 %, 0 %, 32510 MiB, 32507 MiB, 3 MiB
2022/06/15 12:08:28.177, Tesla V100-SXM2-32GB, 470.57.02, 0 %, 0 %, 32510 MiB, 32507 MiB, 3 MiB
2022/06/15 12:08:28.177, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 11 %, 32510 MiB, 18258 MiB, 14252 MiB
2022/06/15 12:08:28.179, Tesla V100-SXM2-32GB, 470.57.02, 0 %, 0 %, 32510 MiB, 32507 MiB, 3 MiB
2022/06/15 12:08:28.180, Tesla V100-SXM2-32GB, 470.57.02, 0 %, 0 %, 32510 MiB, 32507 MiB, 3 MiB
2022/06/15 12:08:28.180, Tesla V100-SXM2-32GB, 470.57.02, 80 %, 11 %, 32510 MiB, 18434 MiB, 14076 MiB
2022/06/15 12:08:28.182, Tesla V100-SXM2-32GB, 470.57.02, 0 %, 0 %, 32510 MiB, 32507 MiB, 3 MiB
2022/06/15 12:08:28.182, Tesla V100-SXM2-32GB, 470.57.02, 0 %, 0 %, 32510 MiB, 32507 MiB, 3 MiB
2022/06/15 12:08:28.184, Tesla V100-SXM2-32GB, 470.57.02, 0 %, 0 %, 32510 MiB, 32507 MiB, 3 MiB
2022/06/15 12:08:28.185, Tesla V100-SXM2-32GB, 470.57.02, 0 %, 0 %, 32510 MiB, 32507 MiB, 3 MiB
2022/06/15 12:08:28.187, Tesla V100-SXM2-32GB, 470.57.02, 0 %, 0 %, 32510 MiB, 32507 MiB, 3 MiB
2022/06/15 12:08:28.190, Tesla V100-SXM2-32GB, 470.57.02, 0 %, 0 %, 32510 MiB, 32507 MiB, 3 MiB
 iteration      200/     210 | consumed samples:       102400 | elapsed time per iteration (ms): 13858.2 | tpt: 36.9 samples/s | global batch size:   512 | lm loss: 8.673329E+00 | loss scale: 262144.0 | grad norm: 2.390 | number of skipped iterations:   1 | number of nan iterations:   0 |
time (ms) | forward-compute: 5496.01 | backward-compute: 8157.56 | backward-params-all-reduce: 180.41 | backward-embedding-all-reduce: 0.03 | optimizer-copy-to-main-grad: 3.07 | optimizer-unscale-and-check-inf: 2.51 | optimizer-clip-main-grad: 4.17 | optimizer-copy-main-to-model-params: 3.03 | optimizer: 20.07 | batch-generator: 5.32
[after training is done] datetime: 2022-06-15 12:33:38 
------------------------------------------------------------------------------------------------------------------
 validation loss at the end of training for val data | lm loss value: 7.902287E+00 | lm loss PPL: 2.703457E+03 | 
------------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------------------------
 validation loss at the end of training for test data | lm loss value: 7.733377E+00 | lm loss PPL: 2.283299E+03 | 
-------------------------------------------------------------------------------------------------------------------
INFO:torch.distributed.elastic.agent.server.api:[default] worker group successfully finished. Waiting 300 seconds for other agents to finish.
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (SUCCEEDED). Waiting 300 seconds for other agents to finish
/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:70: FutureWarning: This is an experimental API and will be changed in future.
  warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.00033092498779296875 seconds
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 0, "group_rank": 0, "worker_id": "44177", "role": "default", "hostname": "iv-2udaavw4l02thdv8lcrl", "state": "SUCCEEDED", "total_run_time": 3002, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [0], \"role_rank\": [0], \"role_world_size\": [4]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 1, "group_rank": 0, "worker_id": "44178", "role": "default", "hostname": "iv-2udaavw4l02thdv8lcrl", "state": "SUCCEEDED", "total_run_time": 3002, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [1], \"role_rank\": [1], \"role_world_size\": [4]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 2, "group_rank": 0, "worker_id": "44179", "role": "default", "hostname": "iv-2udaavw4l02thdv8lcrl", "state": "SUCCEEDED", "total_run_time": 3002, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [2], \"role_rank\": [2], \"role_world_size\": [4]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 3, "group_rank": 0, "worker_id": "44180", "role": "default", "hostname": "iv-2udaavw4l02thdv8lcrl", "state": "SUCCEEDED", "total_run_time": 3002, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [3], \"role_rank\": [3], \"role_world_size\": [4]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "iv-2udaavw4l02thdv8lcrl", "state": "SUCCEEDED", "total_run_time": 3002, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\"}", "agent_restarts": 0}}
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************