The module torch.distributed.launch is deprecated and going to be removed in future. Migrate to torch.distributed.run
WARNING:torch.distributed.run:--use_env is deprecated and will be removed in future releases. Please read local_rank from `os.environ['LOCAL_RANK']` instead.
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
  entrypoint       : pretrain_bert.py
  min_nodes        : 1
  max_nodes        : 1
  nproc_per_node   : 1
  run_id           : none
  rdzv_backend     : static
  rdzv_endpoint    : 127.0.0.1:6000
  rdzv_configs     : {'rank': 0, 'timeout': 900}
  max_restarts     : 3
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_q3g4xepq/none_5m50g26j
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future.
  warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=127.0.0.1
  master_port=6000
  group_rank=0
  group_world_size=1
  local_ranks=[0]
  role_ranks=[0]
  global_ranks=[0]
  role_world_sizes=[1]
  global_world_sizes=[1]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_q3g4xepq/none_5m50g26j/attempt_0/0/error.json
using world size: 1, data-parallel-size: 1, tensor-model-parallel size: 1, pipeline-model-parallel size: 1
using torch.float16 for parameters ...
Persistent fused layer norm kernel is supported from pytorch v1.11 (nvidia pytorch container paired with v1.11).
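The deprecation warning above asks for the local rank to be read from the environment rather than from a `--use_env`-style `--local_rank` argument. A minimal sketch of that pattern (the fallback defaults are an assumption for single-process runs, not part of the launcher's contract):

```python
import os

# torchrun / torch.distributed.run exports LOCAL_RANK (plus RANK and
# WORLD_SIZE) into each worker's environment instead of passing --local_rank.
# Fall back to single-process defaults when launched without torchrun.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))
```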
Defaulting to no_persist_layer_norm=True
------------------------ arguments ------------------------
  accumulate_allreduce_grads_in_fp32 .............. False
  activations_checkpoint_method ................... uniform
  activations_checkpoint_num_layers ............... 1
  adam_beta1 ...................................... 0.9
  adam_beta2 ...................................... 0.999
  adam_eps ........................................ 1e-08
  adlr_autoresume ................................. False
  adlr_autoresume_interval ........................ 1000
  apply_query_key_layer_scaling ................... True
  apply_residual_connection_post_layernorm ........ False
  attention_dropout ............................... 0.1
  attention_softmax_in_fp32 ....................... False
  bert_binary_head ................................ True
  bert_load ....................................... None
  bf16 ............................................ False
  bias_dropout_fusion ............................. True
  bias_gelu_fusion ................................ True
  biencoder_projection_dim ........................ 0
  biencoder_shared_query_context_model ............ False
  block_data_path ................................. None
  classes_fraction ................................ 1.0
  clip_grad ....................................... 1.0
  consumed_train_samples .......................... 0
  consumed_valid_samples .......................... 0
  data_impl ....................................... mmap
  data_parallel_random_init ....................... False
  data_parallel_size .............................. 1
  data_path ....................................... ['/dataset/source/dataset/loss_compara_content_sentence']
  data_per_class_fraction ......................... 1.0
  data_sharding ................................... True
  dataloader_type ................................. single
  DDP_impl ........................................ local
  decoder_seq_length .............................. None
  distribute_checkpointed_activations ............. False
  distributed_backend ............................. nccl
  embedding_path .................................. None
  empty_unused_memory_level ....................... 0
  encoder_seq_length .............................. 512
  eod_mask_loss ................................... False
  eval_interval ................................... 1000
  eval_iters ...................................... 10
  evidence_data_path .............................. None
  exit_duration_in_mins ........................... None
  exit_interval ................................... None
  exit_signal_handler ............................. False
  ffn_hidden_size ................................. 4096
  finetune ........................................ False
  fp16 ............................................ True
  fp16_lm_cross_entropy ........................... False
  fp32_residual_connection ........................ False
  global_batch_size ............................... 1024
  hidden_dropout .................................. 0.1
  hidden_size ..................................... 1024
  hysteresis ...................................... 2
  ict_head_size ................................... None
  ict_load ........................................ None
  img_h ........................................... 224
  img_w ........................................... 224
  indexer_batch_size .............................. 128
  indexer_log_interval ............................ 1000
  inference_batch_times_seqlen_threshold .......... 512
  init_method_std ................................. 0.02
  init_method_xavier_uniform ...................... False
  initial_loss_scale .............................. 4294967296
  kv_channels ..................................... 64
  layernorm_epsilon ............................... 1e-05
  lazy_mpu_init ................................... None
  load ............................................ None
  local_rank ...................................... 0
  log_batch_size_to_tensorboard ................... False
  log_interval .................................... 100
  log_learning_rate_to_tensorboard ................ True
  log_loss_scale_to_tensorboard ................... True
  log_memory_to_tensorboard ....................... False
  log_num_zeros_in_grad ........................... False
  log_params_norm ................................. False
  log_timers_to_tensorboard ....................... False
  log_validation_ppl_to_tensorboard ............... False
  log_world_size_to_tensorboard ................... False
  loss_scale ...................................... None
  loss_scale_window ............................... 1000
  lr .............................................. 0.0001
  lr_decay_iters .................................. 990000
  lr_decay_samples ................................ None
  lr_decay_style .................................. linear
  lr_warmup_fraction .............................. 0.01
  lr_warmup_iters ................................. 0
  lr_warmup_samples ............................... 0
  make_vocab_size_divisible_by .................... 128
  mask_prob ....................................... 0.15
  masked_softmax_fusion ........................... True
  max_position_embeddings ......................... 512
  merge_file ...................................... None
  micro_batch_size ................................ 128
  min_loss_scale .................................. 1.0
  min_lr .......................................... 1e-05
  mmap_warmup ..................................... False
  no_async_tensor_model_parallel_allreduce ........ False
  no_load_optim ................................... None
  no_load_rng ..................................... None
  no_persist_layer_norm ........................... True
  no_save_optim ................................... None
  no_save_rng ..................................... None
  num_attention_heads ............................. 16
  num_channels .................................... 3
  num_classes ..................................... 1000
  num_layers ...................................... 24
  num_layers_per_virtual_pipeline_stage ........... None
  num_workers ..................................... 2
  onnx_safe ....................................... None
  openai_gelu ..................................... False
  optimizer ....................................... adam
  override_lr_scheduler ........................... False
  params_dtype .................................... torch.float16
  patch_dim ....................................... 16
  pipeline_model_parallel_size .................... 1
  pipeline_model_parallel_split_rank .............. None
  query_in_block_prob ............................. 0.1
  rampup_batch_size ............................... None
  rank ............................................ 0
  reset_attention_mask ............................ False
  reset_position_ids .............................. False
  retriever_report_topk_accuracies ................ []
  retriever_score_scaling ......................... False
  retriever_seq_length ............................ 256
  sample_rate ..................................... 1.0
  save ............................................ None
  save_interval ................................... 10000
  scatter_gather_tensors_in_pipeline .............. True
  seed ............................................ 1234
  seq_length ...................................... 512
  sgd_momentum .................................... 0.9
  short_seq_prob .................................. 0.1
  split ........................................... 949,50,1
  tensor_model_parallel_size ...................... 1
  tensorboard_dir ................................. None
  tensorboard_log_interval ........................ 1
  tensorboard_queue_size .......................... 1000
  titles_data_path ................................ None
  tokenizer_type .................................. BertWordPieceLowerCase
  train_iters ..................................... 210
  train_samples ................................... None
  use_checkpoint_lr_scheduler ..................... False
  use_contiguous_buffers_in_local_ddp ............. True
  use_cpu_initialization .......................... None
  use_one_sent_docs ............................... False
  virtual_pipeline_model_parallel_size ............ None
  vocab_extra_ids ................................. 0
  vocab_file ...................................... /dataset/source/dataset/bert-base-chinese-vocab.txt
  weight_decay .................................... 0.01
  world_size ...................................... 1
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 8
> building BertWordPieceLowerCase tokenizer ...
 > padded vocab (size: 21130) with 118 dummy tokens (new size: 21248)
> initializing torch distributed ...
> initializing tensor model parallel with size 1
> initializing pipeline model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
> compiling dataset index builder ...
make: Entering directory '/dataset/workspace/Megatron-LM/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/dataset/workspace/Megatron-LM/megatron/data'
>>> done with dataset index builder. Compilation time: 0.035 seconds
> compiling and loading fused kernels ...
Detected CUDA files, patching ldflags
Emitting ninja build file /dataset/workspace/Megatron-LM/megatron/fused_kernels/build/build.ninja...
Building extension module scaled_upper_triang_masked_softmax_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module scaled_upper_triang_masked_softmax_cuda...
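Two of the derived values in the log follow directly from the arguments: the constant 8 micro-batches and the padded vocab of 21248 (118 dummy tokens on top of 21130). A minimal sketch of that arithmetic (function names are illustrative, not Megatron-LM's API):

```python
def num_micro_batches(global_batch_size, micro_batch_size, data_parallel_size):
    # The global batch is split across data-parallel replicas, each of which
    # accumulates gradients over several micro-batches per iteration.
    assert global_batch_size % (micro_batch_size * data_parallel_size) == 0
    return global_batch_size // (micro_batch_size * data_parallel_size)

def pad_vocab_size(orig_size, divisible_by, tensor_model_parallel_size):
    # Pad with dummy tokens until the vocab is divisible by
    # make_vocab_size_divisible_by * tensor_model_parallel_size,
    # so the embedding table splits evenly across tensor-parallel ranks.
    multiple = divisible_by * tensor_model_parallel_size
    return ((orig_size + multiple - 1) // multiple) * multiple

print(num_micro_batches(1024, 128, 1))  # 8
print(pad_vocab_size(21130, 128, 1))    # 21248
```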
Detected CUDA files, patching ldflags
Emitting ninja build file /dataset/workspace/Megatron-LM/megatron/fused_kernels/build/build.ninja...
Building extension module scaled_masked_softmax_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module scaled_masked_softmax_cuda...
Detected CUDA files, patching ldflags
Emitting ninja build file /dataset/workspace/Megatron-LM/megatron/fused_kernels/build/build.ninja...
Building extension module scaled_softmax_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module scaled_softmax_cuda...
Detected CUDA files, patching ldflags
Emitting ninja build file /dataset/workspace/Megatron-LM/megatron/fused_kernels/build/build.ninja...
Building extension module fused_mix_prec_layer_norm_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_mix_prec_layer_norm_cuda...
[W ProcessGroupNCCL.cpp:1671] Rank 0 using best-guess GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device.
iv-2udaavw4l02thdv8lcrl:60709:60709 [0] NCCL INFO Bootstrap : Using eth0:192.168.11.42<0>
iv-2udaavw4l02thdv8lcrl:60709:60709 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
iv-2udaavw4l02thdv8lcrl:60709:60709 [0] NCCL INFO P2P plugin IBext
iv-2udaavw4l02thdv8lcrl:60709:60709 [0] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1.
iv-2udaavw4l02thdv8lcrl:60709:60709 [0] NCCL INFO NET/IB : Using [0]mlx5_1:1/RoCE ; OOB eth0:192.168.11.42<0>
iv-2udaavw4l02thdv8lcrl:60709:60709 [0] NCCL INFO Using network IBext
NCCL version 2.10.3+cuda11.4
iv-2udaavw4l02thdv8lcrl:60709:60798 [0] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3.
iv-2udaavw4l02thdv8lcrl:60709:60798 [0] NCCL INFO NCCL_IB_TIMEOUT set by environment to 23.
iv-2udaavw4l02thdv8lcrl:60709:60798 [0] NCCL INFO NCCL_IB_RETRY_CNT set by environment to 7.
iv-2udaavw4l02thdv8lcrl:60709:60798 [0] NCCL INFO Channel 00/32 : 0
iv-2udaavw4l02thdv8lcrl:60709:60798 [0] NCCL INFO Channel 01/32 : 0
iv-2udaavw4l02thdv8lcrl:60709:60798 [0] NCCL INFO Channel 02/32 : 0
iv-2udaavw4l02thdv8lcrl:60709:60798 [0] NCCL INFO Channel 03/32 : 0
iv-2udaavw4l02thdv8lcrl:60709:60798 [0] NCCL INFO Channel 04/32 : 0
iv-2udaavw4l02thdv8lcrl:60709:60798 [0] NCCL INFO Channel 05/32 : 0
iv-2udaavw4l02thdv8lcrl:60709:60798 [0] NCCL INFO Channel 06/32 : 0
iv-2udaavw4l02thdv8lcrl:60709:60798 [0] NCCL INFO Channel 07/32 : 0
iv-2udaavw4l02thdv8lcrl:60709:60798 [0] NCCL INFO Channel 08/32 : 0
iv-2udaavw4l02thdv8lcrl:60709:60798 [0] NCCL INFO Channel 09/32 : 0
iv-2udaavw4l02thdv8lcrl:60709:60798 [0] NCCL INFO Channel 10/32 : 0
iv-2udaavw4l02thdv8lcrl:60709:60798 [0] NCCL INFO Channel 11/32 : 0
iv-2udaavw4l02thdv8lcrl:60709:60798 [0] NCCL INFO Channel 12/32 : 0
iv-2udaavw4l02thdv8lcrl:60709:60798 [0] NCCL INFO Channel 13/32 : 0
iv-2udaavw4l02thdv8lcrl:60709:60798 [0] NCCL INFO Channel 14/32 : 0
iv-2udaavw4l02thdv8lcrl:60709:60798 [0] NCCL INFO Channel 15/32 : 0
iv-2udaavw4l02thdv8lcrl:60709:60798 [0] NCCL INFO Channel 16/32 : 0
iv-2udaavw4l02thdv8lcrl:60709:60798 [0] NCCL INFO Channel 17/32 : 0
iv-2udaavw4l02thdv8lcrl:60709:60798 [0] NCCL INFO Channel 18/32 : 0
iv-2udaavw4l02thdv8lcrl:60709:60798 [0] NCCL INFO Channel 19/32 : 0
iv-2udaavw4l02thdv8lcrl:60709:60798 [0] NCCL INFO Channel 20/32 : 0
iv-2udaavw4l02thdv8lcrl:60709:60798 [0] NCCL INFO Channel 21/32 : 0
iv-2udaavw4l02thdv8lcrl:60709:60798 [0] NCCL INFO Channel 22/32 : 0
iv-2udaavw4l02thdv8lcrl:60709:60798 [0] NCCL INFO Channel 23/32 : 0
iv-2udaavw4l02thdv8lcrl:60709:60798 [0] NCCL INFO Channel 24/32 : 0
iv-2udaavw4l02thdv8lcrl:60709:60798 [0] NCCL INFO Channel 25/32 : 0
iv-2udaavw4l02thdv8lcrl:60709:60798 [0] NCCL INFO Channel 26/32 : 0
iv-2udaavw4l02thdv8lcrl:60709:60798 [0] NCCL INFO Channel 27/32 : 0
iv-2udaavw4l02thdv8lcrl:60709:60798 [0] NCCL INFO Channel 28/32 : 0
iv-2udaavw4l02thdv8lcrl:60709:60798 [0] NCCL INFO Channel 29/32 : 0
iv-2udaavw4l02thdv8lcrl:60709:60798 [0] NCCL INFO Channel 30/32 : 0
iv-2udaavw4l02thdv8lcrl:60709:60798 [0] NCCL INFO Channel 31/32 : 0
iv-2udaavw4l02thdv8lcrl:60709:60798 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1
iv-2udaavw4l02thdv8lcrl:60709:60798 [0] NCCL INFO Setting affinity for GPU 0 to 03ff,ffffffff
iv-2udaavw4l02thdv8lcrl:60709:60798 [0] NCCL INFO Connected all rings
iv-2udaavw4l02thdv8lcrl:60709:60798 [0] NCCL INFO Connected all trees
iv-2udaavw4l02thdv8lcrl:60709:60798 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
iv-2udaavw4l02thdv8lcrl:60709:60798 [0] NCCL INFO comm 0x7f8ba4008fb0 rank 0 nranks 1 cudaDev 0 busId 65010 - Init COMPLETE
>>> done with compiling and loading fused kernels.
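The `NCCL_IB_*` settings above were picked up from the environment, so the job script presumably exported them before launch. A sketch of what that looks like, using the values seen in this log (these are tuned for this cluster's RoCE fabric and are not universal defaults):

```shell
# RoCE/InfiniBand tuning knobs that NCCL reads at init time
# (values as observed in this run's log; adjust for your own fabric)
export NCCL_IB_PCI_RELAXED_ORDERING=1   # allow relaxed-ordering PCIe writes
export NCCL_IB_GID_INDEX=3              # GID table index (RoCE v2 on this host)
export NCCL_IB_TIMEOUT=23               # IB transport timeout exponent
export NCCL_IB_RETRY_CNT=7              # IB transport retry count
```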
Compilation time: 2.541 seconds
time to initialize megatron (seconds): 2.610
[after megatron is initialized] datetime: 2022-06-15 13:16:51
building BERT model ...
 > number of parameters on (tensor, pipeline) model parallel rank (0, 0): 326720258
> learning rate decay style: linear
[after model, optimizer, and learning rate scheduler are built] datetime: 2022-06-15 13:16:51
> building train, validation, and test datasets ...
 > datasets target sizes (minimum size):
    train:      215040
    validation: 10240
    test:       10240
> building train, validation, and test datasets for BERT ...
 > building dataset index ...
    reading sizes...
    reading pointers...
    reading document index...
    creating numpy buffer of mmap...
    creating memory view of numpy buffer...
 > finished creating indexed dataset in 0.002801 seconds
 > indexed dataset stats:
    number of documents: 50000
    number of sentences: 1249934
 > dataset split:
    train:
     document indices in [0, 47450) total of 47450 documents
     sentence indices in [0, 1188464) total of 1188464 sentences
    validation:
     document indices in [47450, 49950) total of 2500 documents
     sentence indices in [1188464, 1248643) total of 60179 sentences
    test:
     document indices in [49950, 50000) total of 50 documents
     sentence indices in [1248643, 1249934) total of 1291 sentences
iv-2udaavw4l02thdv8lcrl:60709:60802 [0] NCCL INFO Setting affinity for GPU 0 to 03ff,ffffffff
iv-2udaavw4l02thdv8lcrl:60709:60802 [0] NCCL INFO Connected all rings
iv-2udaavw4l02thdv8lcrl:60709:60802 [0] NCCL INFO Connected all trees
iv-2udaavw4l02thdv8lcrl:60709:60802 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
iv-2udaavw4l02thdv8lcrl:60709:60802 [0] NCCL INFO comm 0x7f8aec008fb0 rank 0 nranks 1 cudaDev 0 busId 65010 - Init COMPLETE
iv-2udaavw4l02thdv8lcrl:60709:60806 [0] NCCL INFO Setting affinity for GPU 0 to 03ff,ffffffff
iv-2udaavw4l02thdv8lcrl:60709:60806 [0] NCCL INFO Connected all rings
iv-2udaavw4l02thdv8lcrl:60709:60806 [0] NCCL INFO Connected all trees
iv-2udaavw4l02thdv8lcrl:60709:60806 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
iv-2udaavw4l02thdv8lcrl:60709:60806 [0] NCCL INFO comm 0x7f8ae4008fb0 rank 0 nranks 1 cudaDev 0 busId 65010 - Init COMPLETE
 > loading indexed mapping from /dataset/source/dataset/loss_compara_content_sentence_train_indexmap_215040mns_509msl_0.10ssp_1234s.npy
    loaded indexed file in 0.002 seconds
    total number of samples: 226136
 > loading indexed mapping from /dataset/source/dataset/loss_compara_content_sentence_valid_indexmap_10240mns_509msl_0.10ssp_1234s.npy
    loaded indexed file in 0.001 seconds
    total number of samples: 11758
 > loading indexed mapping from /dataset/source/dataset/loss_compara_content_sentence_test_indexmap_10240mns_509msl_0.10ssp_1234s.npy
    loaded indexed file in 0.000 seconds
    total number of samples: 10295
> finished creating BERT datasets ...
iv-2udaavw4l02thdv8lcrl:60709:60810 [0] NCCL INFO Setting affinity for GPU 0 to 03ff,ffffffff
iv-2udaavw4l02thdv8lcrl:60709:60810 [0] NCCL INFO Connected all rings
iv-2udaavw4l02thdv8lcrl:60709:60810 [0] NCCL INFO Connected all trees
iv-2udaavw4l02thdv8lcrl:60709:60810 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
iv-2udaavw4l02thdv8lcrl:60709:60810 [0] NCCL INFO comm 0x7f8ae8008fb0 rank 0 nranks 1 cudaDev 0 busId 65010 - Init COMPLETE
[W pthreadpool-cpp.cc:99] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[after dataloaders are built] datetime: 2022-06-15 13:16:52
done with setup ...
time (ms) | model-and-optimizer-setup: 133.34 | train/valid/test-data-iterators-setup: 316.15
training ...
[before the start of training step] datetime: 2022-06-15 13:16:52
/dataset/workspace/Megatron-LM/megatron/model/transformer.py:536: UserWarning: AutoNonVariableTypeMode is deprecated and will be removed in 1.10 release. For kernel implementations please use AutoDispatchBelowADInplaceOrView instead. If you are looking for a user facing API to enable running your inference-only workload, please use c10::InferenceMode. Using AutoDispatchBelowADInplaceOrView in user code is under risk of producing silent wrong result in some edge cases. See Note [AutoDispatchBelowAutograd] for more details. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/core/LegacyTypeDispatch.h:74.)
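The document counts in the dataset split above follow from applying `split 949,50,1` to the 50000 indexed documents: 47450 train, 2500 validation, 50 test. A sketch of that boundary arithmetic (loosely modeled on Megatron-LM's split handling; the function name here is illustrative):

```python
def split_documents(num_docs, weights):
    # Normalize the comma-separated split weights and turn them into
    # cumulative document-index boundaries: [0, b1), [b1, b2), [b2, b3).
    total = sum(weights)
    bounds = [0]
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(int(round(acc * num_docs)))
    bounds[-1] = num_docs  # absorb any rounding drift in the last split
    return bounds

print(split_documents(50000, [949, 50, 1]))  # [0, 47450, 49950, 50000]
```

With these weights, the train/validation/test document ranges are [0, 47450), [47450, 49950), and [49950, 50000), matching the log.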
  output = bias_dropout_add_func(
iv-2udaavw4l02thdv8lcrl:60709:61230 [0] NCCL INFO Setting affinity for GPU 0 to 03ff,ffffffff
iv-2udaavw4l02thdv8lcrl:60709:61230 [0] NCCL INFO Connected all rings
iv-2udaavw4l02thdv8lcrl:60709:61230 [0] NCCL INFO Connected all trees
iv-2udaavw4l02thdv8lcrl:60709:61230 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
iv-2udaavw4l02thdv8lcrl:60709:61230 [0] NCCL INFO comm 0x7f85e8008fb0 rank 0 nranks 1 cudaDev 0 busId 65010 - Init COMPLETE
 iteration 100/ 210 | consumed samples: 102400 | elapsed time per iteration (ms): 30479.0 | tpt: 33.6 samples/s | global batch size: 1024 | lm loss: 9.559973E+00 | sop loss: 6.977597E-01 | loss scale: 262144.0 | grad norm: 3.606 | number of skipped iterations: 15 | number of nan iterations: 0 |
[Rank 0] (after 100 iterations) memory (MB) | allocated: 6235.033203125 | max allocated: 16772.453125 | reserved: 18232.0 | max reserved: 18232.0
time (ms) | forward-compute: 8306.88 | backward-compute: 22133.07 | backward-params-all-reduce: 2.22 | backward-embedding-all-reduce: 0.04 | optimizer-copy-to-main-grad: 4.31 | optimizer-unscale-and-check-inf: 6.12 | optimizer-clip-main-grad: 5.60 | optimizer-copy-main-to-model-params: 3.99 | optimizer: 31.45 | batch-generator: 58.10
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2022/06/15 14:08:10.665, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 55 %, 32510 MiB, 13150 MiB, 19360 MiB
2022/06/15 14:08:10.665, Tesla V100-SXM2-32GB, 470.57.02, 0 %, 0 %, 32510 MiB, 32507 MiB, 3 MiB
2022/06/15 14:08:10.666, Tesla V100-SXM2-32GB, 470.57.02, 0 %, 0 %, 32510 MiB, 32507 MiB, 3 MiB
2022/06/15 14:08:10.667, Tesla V100-SXM2-32GB, 470.57.02, 0 %, 0 %, 32510 MiB, 32507 MiB, 3 MiB
2022/06/15 14:08:10.667, Tesla V100-SXM2-32GB, 470.57.02, 0 %, 0 %, 32510 MiB, 32507 MiB, 3 MiB
2022/06/15 14:08:10.668, Tesla V100-SXM2-32GB, 470.57.02, 0 %, 0 %, 32510 MiB, 32507 MiB, 3 MiB
2022/06/15 14:08:10.669, Tesla V100-SXM2-32GB, 470.57.02, 0 %, 0 %, 32510 MiB, 32507 MiB, 3 MiB
2022/06/15 14:08:10.669, Tesla V100-SXM2-32GB, 470.57.02, 0 %, 0 %, 32510 MiB, 32507 MiB, 3 MiB
 iteration 200/ 210 | consumed samples: 204800 | elapsed time per iteration (ms): 30459.0 | tpt: 33.6 samples/s | global batch size: 1024 | lm loss: 8.948593E+00 | sop loss: 6.938206E-01 | loss scale: 262144.0 | grad norm: 1.656 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms) | forward-compute: 8269.29 | backward-compute: 22148.43 | backward-params-all-reduce: 2.22 | backward-embedding-all-reduce: 0.03 | optimizer-copy-to-main-grad: 4.30 | optimizer-unscale-and-check-inf: 4.23 | optimizer-clip-main-grad: 6.58 | optimizer-copy-main-to-model-params: 4.69 | optimizer: 33.12 | batch-generator: 34.65
[after training is done] datetime: 2022-06-15 15:03:30
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
validation loss at the end of
training for val data | lm loss value: 8.708968E+00 | lm loss PPL: 6.056989E+03 | sop loss value: 6.929042E-01 | sop loss PPL: 1.999514E+00 | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- validation loss at the end of training for test data | lm loss value: 8.665636E+00 | lm loss PPL: 5.800133E+03 | sop loss value: 6.943576E-01 | sop loss PPL: 2.002422E+00 | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- INFO:torch.distributed.elastic.agent.server.api:[default] worker group successfully finished. Waiting 300 seconds for other agents to finish. INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (SUCCEEDED). Waiting 300 seconds for other agents to finish /opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:70: FutureWarning: This is an experimental API and will be changed in future. warnings.warn( INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. 
Elapsed: 0.0003333091735839844 seconds {"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 0, "group_rank": 0, "worker_id": "60709", "role": "default", "hostname": "iv-2udaavw4l02thdv8lcrl", "state": "SUCCEEDED", "total_run_time": 6556, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [0], \"role_rank\": [0], \"role_world_size\": [1]}", "agent_restarts": 0}} {"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "iv-2udaavw4l02thdv8lcrl", "state": "SUCCEEDED", "total_run_time": 6556, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\"}", "agent_restarts": 0}}
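The reported perplexities can be sanity-checked against the loss values: assuming the usual convention that PPL = exp(loss) (which the numbers in this log are consistent with), the validation figures reproduce to within rounding. A minimal check:

```python
import math

# Sanity check, assuming PPL is reported as exp(loss):
# val-set figures from the log above.
lm_loss = 8.708968    # reported lm loss PPL: 6.056989E+03
sop_loss = 0.6929042  # reported sop loss PPL: 1.999514E+00

print(math.exp(lm_loss))   # ~6056.99, matches the logged lm loss PPL
print(math.exp(sop_loss))  # ~1.999514, matches the logged sop loss PPL
```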
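For tracking throughput and loss across a run, the per-iteration lines above are easy to scrape. The sketch below is a hypothetical helper (not part of Megatron-LM) whose regex assumes the exact `iteration N/ M | ... | tpt: X samples/s | ... | lm loss: Y | ...` layout seen in this log:

```python
import re

# Matches Megatron-LM-style iteration log lines as printed above.
ITER_RE = re.compile(
    r"iteration\s+(\d+)/\s*(\d+)\s*\|.*?"
    r"tpt:\s*([\d.]+)\s*samples/s.*?"
    r"lm loss:\s*([\dEe+.-]+)"
)

def parse_iteration_line(line):
    """Return (iteration, total_iters, samples_per_sec, lm_loss) or None."""
    m = ITER_RE.search(line)
    if not m:
        return None
    it, total, tpt, loss = m.groups()
    return int(it), int(total), float(tpt), float(loss)

line = ("iteration 100/ 210 | consumed samples: 102400 | "
        "elapsed time per iteration (ms): 30479.0 | tpt: 33.6 samples/s | "
        "global batch size: 1024 | lm loss: 9.559973E+00 | sop loss: 6.977597E-01 |")
print(parse_iteration_line(line))  # (100, 210, 33.6, 9.559973)
```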