loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
W20220706 14:03:26.057368 3615 rpc_client.cpp:190] LoadServer 198.18.8.34 Failed at 0 times error_code 14 error_message failed to connect to all addresses
[07/06 14:03:37 libai]: Rank of current process: 0. World size: 32
[07/06 14:03:37 libai]: Command line arguments: Namespace(config_file='configs/gpt2_nl24_nah16_hs1024.py', eval_only=False, fast_dev_run=False, opts=['model.cfg.num_layers=24', 'train.dist.pipeline_num_layers=24', 'train.train_micro_batch_size=4', 'train.global_batch_size=128', 'train.dist.tensor_parallel_size=1', 'train.dist.pipeline_parallel_size=1', 'train.amp.enabled=true', 'train.activation_checkpoint.enabled=false', 'train.train_iter=220', 'train.log_period=100', 'train.output_dir=test_logs/01b1d32/4n8g/LibAI_gpt2_nl24_nah16_hs1024_FP16_acfalse_mp1_pp1_mb4_gb128_4n8g_20220706_140324618820537'], resume=False)
[07/06 14:03:37 libai]: Contents of args.config_file=configs/gpt2_nl24_nah16_hs1024.py:
from libai.config import LazyCall
from libai.evaluation import PPLEvaluator

from .common.models.gpt import pretrain_model as model
from .common.train import train
from .common.optim import optim
from .common.data.gpt_dataset import dataloader, tokenization
from .common.models.graph import graph

#vocab_file = "/workspace/dataset/gpt2-vocab.json"
#merges_file = "/workspace/dataset/gpt2-merges.txt"
#data_prefix = "/workspace/dataset/loss_compara_content_sentence"
vocab_file = "/dataset/source/dataset/gpt2-vocab.json"
merges_file = "/dataset/source/dataset/gpt2-merges.txt"
data_prefix = "/dataset/source/dataset/loss_compara_content_sentence"

tokenization.tokenizer.vocab_file = vocab_file
tokenization.tokenizer.merges_file = merges_file
dataloader.train.dataset[0].data_prefix = data_prefix
dataloader.train.dataset[0].indexed_dataset.data_prefix = data_prefix
# dataloader.train.num_workers = 4

# GPT-2 model config
model.cfg.embedding_dropout_prob = 0.1
model.cfg.attention_dropout_prob = 0.1
model.cfg.num_attention_heads = 16
model.cfg.hidden_size = 1024
model.cfg.ffn_hidden_size = 4096
#model.cfg.num_layers = 24
model.cfg.max_seq_length = 1024
#model.cfg.initializer_range = 0.006
# model.cfg.bias_dropout_fusion = True
# model.cfg.bias_gelu_fusion = True
# model.cfg.scale_mask_softmax_fusion = True

train.input_placement_device = "cpu"

for ds in dataloader.train.dataset:
    ds.max_seq_length = model.cfg.max_seq_length

optim.lr = 1.5e-4
#train.dist.pipeline_num_layers = model.cfg.num_layers
train.test_micro_batch_size = 4

train.evaluation.evaluator = LazyCall(PPLEvaluator)()
train.evaluation.enabled = False
train.evaluation.eval_iter = 30
[07/06 14:03:37 libai]: Full config saved to test_logs/01b1d32/4n8g/LibAI_gpt2_nl24_nah16_hs1024_FP16_acfalse_mp1_pp1_mb4_gb128_4n8g_20220706_140324618820537/config.yaml
[07/06 14:03:37 lb.engine.default]: > compiling dataset index builder ...
make: Entering directory '/dataset/xyn/libai_bench/libai/libai/data/data_utils'
make: Nothing to be done for 'default'.
make: Leaving directory '/dataset/xyn/libai_bench/libai/libai/data/data_utils'
[07/06 14:03:37 lb.engine.default]: >>> done with dataset index builder. Compilation time: 0.041 seconds
[07/06 14:03:37 lb.engine.default]: >>> done with compiling. Compilation time: 0.042 seconds
[07/06 14:03:38 lb.engine.default]: Prepare training, validating, testing set
[07/06 14:03:38 lb.data.data_utils.indexed_dataset]: building dataset index ...
[07/06 14:03:38 lb.data.data_utils.indexed_dataset]: warming up index mmap file...
[07/06 14:03:38 lb.data.data_utils.indexed_dataset]: reading sizes...
[07/06 14:03:38 lb.data.data_utils.indexed_dataset]: reading pointers...
[07/06 14:03:38 lb.data.data_utils.indexed_dataset]: reading document index...
[07/06 14:03:38 lb.data.data_utils.indexed_dataset]: warming up data mmap file...
[07/06 14:03:38 lb.data.data_utils.indexed_dataset]: creating numpy buffer of mmap...
[07/06 14:03:38 lb.data.data_utils.indexed_dataset]: creating memory view of numpy buffer...
[07/06 14:03:38 lb.data.data_utils.indexed_dataset]: Finished creating indexed dataset in 0.096363 seconds
[07/06 14:03:38 lb.data.data_utils.indexed_dataset]: indexed dataset stats:
[07/06 14:03:38 lb.data.data_utils.indexed_dataset]: number of documents: 50000
[07/06 14:03:38 lb.data.data_utils.indexed_dataset]: number of sentences: 1249934
[07/06 14:03:38 lb.data.datasets.gpt_dataset]:  > loading doc-idx mapping from /dataset/source/dataset/loss_compara_content_sentence_gpt-2_indexmap_28160ns_1024sl_1234s_doc_idx.npy
[07/06 14:03:38 lb.data.datasets.gpt_dataset]:  > loading sample-idx mapping from /dataset/source/dataset/loss_compara_content_sentence_gpt-2_indexmap_28160ns_1024sl_1234s_sample_idx.npy
[07/06 14:03:38 lb.data.datasets.gpt_dataset]:  > loading shuffle-idx mapping from /dataset/source/dataset/loss_compara_content_sentence_gpt-2_indexmap_28160ns_1024sl_1234s_shuffle_idx.npy
[07/06 14:03:38 lb.data.datasets.gpt_dataset]:  loaded indexed file in 0.025 seconds
[07/06 14:03:38 lb.data.datasets.gpt_dataset]:  total number of samples: 57333
[07/06 14:03:38 lb.data.datasets.gpt_dataset]:  total number of epochs: 1
[07/06 14:03:38 lb.data.datasets.gpt_dataset]:  > WARNING: could not find index map files, building the indices on rank 0 ...
[07/06 14:03:38 lb.data.datasets.gpt_dataset]:  > only one epoch required, setting separate_last_epoch to False
[07/06 14:03:38 lb.data.datasets.gpt_dataset]: start to build and save doc-idx mapping ...
[07/06 14:03:38 lb.data.datasets.gpt_dataset]:  > elapsed time to build and save doc-idx mapping (seconds): 0.127502
[07/06 14:03:38 lb.data.datasets.gpt_dataset]: start to build and save sample-idx mapping ...
    using:
    number of documents: 1249934
    number of epochs: 1
    sequence length: 1024
    total number of samples: 57332
[07/06 14:03:38 lb.data.datasets.gpt_dataset]:  > elapsed time to build and save sample-idx mapping (seconds): 0.034071
[07/06 14:03:38 lb.data.datasets.gpt_dataset]:  > building shuffle index with split [0, 57332) and [57332, 57332) ...
[07/06 14:03:38 lb.data.datasets.gpt_dataset]:  > elapsed time to build and save shuffle-idx mapping (seconds): 0.022995
[07/06 14:03:38 lb.data.datasets.gpt_dataset]:  > loading doc-idx mapping from /dataset/source/dataset/loss_compara_content_sentence_gpt-2_indexmap_128ns_1024sl_1234s_doc_idx.npy
[07/06 14:03:38 lb.data.datasets.gpt_dataset]:  > loading sample-idx mapping from /dataset/source/dataset/loss_compara_content_sentence_gpt-2_indexmap_128ns_1024sl_1234s_sample_idx.npy
[07/06 14:03:38 lb.data.datasets.gpt_dataset]:  > loading shuffle-idx mapping from /dataset/source/dataset/loss_compara_content_sentence_gpt-2_indexmap_128ns_1024sl_1234s_shuffle_idx.npy
[07/06 14:03:38 lb.data.datasets.gpt_dataset]:  loaded indexed file in 0.014 seconds
[07/06 14:03:38 lb.data.datasets.gpt_dataset]:  total number of samples: 57333
[07/06 14:03:38 lb.data.datasets.gpt_dataset]:  total number of epochs: 1
[07/06 14:03:38 lb.data.datasets.gpt_dataset]:  > loading doc-idx mapping from /dataset/source/dataset/loss_compara_content_sentence_gpt-2_indexmap_128ns_1024sl_1234s_doc_idx.npy
[07/06 14:03:38 lb.data.datasets.gpt_dataset]:  > loading sample-idx mapping from
/dataset/source/dataset/loss_compara_content_sentence_gpt-2_indexmap_128ns_1024sl_1234s_sample_idx.npy
[07/06 14:03:38 lb.data.datasets.gpt_dataset]:  > loading shuffle-idx mapping from /dataset/source/dataset/loss_compara_content_sentence_gpt-2_indexmap_128ns_1024sl_1234s_shuffle_idx.npy
[07/06 14:03:38 lb.data.datasets.gpt_dataset]:  loaded indexed file in 0.002 seconds
[07/06 14:03:38 lb.data.datasets.gpt_dataset]:  total number of samples: 57333
[07/06 14:03:38 lb.data.datasets.gpt_dataset]:  total number of epochs: 1
[07/06 14:03:43 lb.engine.default]: Auto-scaling the config to train.train_iter=220, train.warmup_iter=0
[07/06 14:03:46 lb.engine.default]: Model: GPTForPreTraining(
  (GPT_model): GPTModel(
    (embeddings): GPTEmbedding(
      (token_embeddings): VocabEmbedding(num_embeddings=50304, embedding_dim=1024)
      (position_embeddings): Embedding(num_embeddings=1024, embedding_dim=1024)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layers): ModuleList(
        (0): TransformerLayer(
          (drop_path): Identity()
          (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (self_attention): MultiheadAttention(
            hidden_size=1024, num_heads=16, is_cross_attention=False
            (dropout): Dropout(p=0.1, inplace=False)
            (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col)
            (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row)
          )
          (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (mlp): MLP(
            bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0
            (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col)
            (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row)
          )
        )
        (1): TransformerLayer(
          (drop_path): Identity()
          (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (self_attention): MultiheadAttention(
            hidden_size=1024, num_heads=16, is_cross_attention=False
            (dropout): Dropout(p=0.1, inplace=False)
            (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col)
            (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row)
          )
          (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (mlp): MLP(
            bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0
            (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col)
            (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row)
          )
        )
        (2): TransformerLayer(
          (drop_path): Identity()
          (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (self_attention): MultiheadAttention(
            hidden_size=1024, num_heads=16, is_cross_attention=False
            (dropout): Dropout(p=0.1, inplace=False)
            (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col)
            (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row)
          )
          (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (mlp): MLP(
            bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0
            (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col)
            (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row)
          )
        )
        (3): TransformerLayer(
          (drop_path): Identity()
          (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (self_attention): MultiheadAttention(
            hidden_size=1024, num_heads=16, is_cross_attention=False
            (dropout): Dropout(p=0.1, inplace=False)
            (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col)
            (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row)
          )
          (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (mlp): MLP(
            bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0
            (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col)
            (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row)
          )
        )
        (4): TransformerLayer(
          (drop_path): Identity()
          (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (self_attention): MultiheadAttention(
            hidden_size=1024, num_heads=16, is_cross_attention=False
            (dropout): Dropout(p=0.1, inplace=False)
            (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col)
            (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row)
          )
          (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (mlp): MLP(
            bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0
            (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col)
            (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row)
          )
        )
        (5): TransformerLayer(
          (drop_path): Identity()
          (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (self_attention): MultiheadAttention(
            hidden_size=1024, num_heads=16, is_cross_attention=False
            (dropout): Dropout(p=0.1, inplace=False)
            (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col)
            (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row)
          )
          (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (mlp): MLP(
            bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0
            (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col)
            (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row)
          )
        )
        (6): TransformerLayer(
          (drop_path): Identity()
          (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (self_attention): MultiheadAttention(
            hidden_size=1024, num_heads=16, is_cross_attention=False
            (dropout): Dropout(p=0.1, inplace=False)
            (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col)
            (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row)
          )
          (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (mlp): MLP(
            bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0
            (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col)
            (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row)
          )
        )
        (7): TransformerLayer(
          (drop_path): Identity()
          (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (self_attention): MultiheadAttention(
            hidden_size=1024, num_heads=16, is_cross_attention=False
            (dropout): Dropout(p=0.1, inplace=False)
            (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col)
            (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row)
          )
          (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (mlp): MLP(
            bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0
            (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col)
            (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row)
          )
        )
        (8): TransformerLayer(
          (drop_path): Identity()
          (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (self_attention): MultiheadAttention(
            hidden_size=1024, num_heads=16, is_cross_attention=False
            (dropout): Dropout(p=0.1, inplace=False)
            (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col)
            (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row)
          )
          (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (mlp): MLP(
            bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0
            (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col)
            (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row)
          )
        )
        (9): TransformerLayer(
          (drop_path): Identity()
          (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (self_attention): MultiheadAttention(
            hidden_size=1024, num_heads=16, is_cross_attention=False
            (dropout): Dropout(p=0.1, inplace=False)
            (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col)
            (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row)
          )
          (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (mlp): MLP(
            bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0
            (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col)
            (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row)
          )
        )
        (10): TransformerLayer(
          (drop_path): Identity()
          (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (self_attention): MultiheadAttention(
            hidden_size=1024, num_heads=16, is_cross_attention=False
            (dropout): Dropout(p=0.1, inplace=False)
            (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col)
            (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row)
          )
          (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (mlp): MLP(
            bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0
            (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col)
            (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row)
          )
        )
        (11): TransformerLayer(
          (drop_path): Identity()
          (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (self_attention): MultiheadAttention(
            hidden_size=1024, num_heads=16, is_cross_attention=False
            (dropout): Dropout(p=0.1, inplace=False)
            (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col)
            (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row)
          )
          (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (mlp): MLP(
            bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0
            (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col)
            (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row)
          )
        )
        (12): TransformerLayer(
          (drop_path): Identity()
          (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (self_attention): MultiheadAttention(
            hidden_size=1024, num_heads=16, is_cross_attention=False
            (dropout): Dropout(p=0.1, inplace=False)
            (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col)
            (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row)
          )
          (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (mlp): MLP(
            bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0
            (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col)
            (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row)
          )
        )
        (13): TransformerLayer(
          (drop_path): Identity()
          (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (self_attention): MultiheadAttention(
            hidden_size=1024, num_heads=16, is_cross_attention=False
            (dropout): Dropout(p=0.1, inplace=False)
            (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col)
            (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row)
          )
          (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (mlp): MLP(
            bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0
            (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col)
            (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row)
          )
        )
        (14): TransformerLayer(
          (drop_path): Identity()
          (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (self_attention): MultiheadAttention(
            hidden_size=1024, num_heads=16, is_cross_attention=False
            (dropout): Dropout(p=0.1, inplace=False)
            (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col)
            (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row)
          )
          (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (mlp): MLP(
            bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0
            (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col)
            (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row)
          )
        )
        (15): TransformerLayer(
          (drop_path): Identity()
          (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (self_attention): MultiheadAttention(
            hidden_size=1024, num_heads=16, is_cross_attention=False
            (dropout): Dropout(p=0.1, inplace=False)
            (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col)
            (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row)
          )
          (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (mlp): MLP(
            bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0
            (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col)
            (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row)
          )
        )
        (16): TransformerLayer(
          (drop_path): Identity()
          (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (self_attention): MultiheadAttention(
            hidden_size=1024, num_heads=16, is_cross_attention=False
            (dropout): Dropout(p=0.1, inplace=False)
            (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col)
            (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row)
          )
          (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (mlp): MLP(
            bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0
            (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col)
            (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row)
          )
        )
        (17): TransformerLayer(
          (drop_path): Identity()
          (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (self_attention): MultiheadAttention(
            hidden_size=1024, num_heads=16, is_cross_attention=False
            (dropout): Dropout(p=0.1, inplace=False)
            (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col)
            (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row)
          )
          (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (mlp): MLP(
            bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0
            (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col)
            (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row)
          )
        )
        (18): TransformerLayer(
          (drop_path): Identity()
          (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (self_attention): MultiheadAttention(
            hidden_size=1024, num_heads=16, is_cross_attention=False
            (dropout): Dropout(p=0.1, inplace=False)
            (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col)
            (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row)
          )
          (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (mlp): MLP(
            bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0
            (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col)
            (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row)
          )
        )
        (19): TransformerLayer(
          (drop_path): Identity()
          (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (self_attention): MultiheadAttention(
            hidden_size=1024, num_heads=16, is_cross_attention=False
            (dropout): Dropout(p=0.1, inplace=False)
            (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col)
            (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row)
          )
          (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (mlp): MLP(
            bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0
            (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col)
            (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row)
          )
        )
        (20): TransformerLayer(
          (drop_path): Identity()
          (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (self_attention): MultiheadAttention(
            hidden_size=1024, num_heads=16, is_cross_attention=False
            (dropout): Dropout(p=0.1, inplace=False)
            (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col)
            (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row)
          )
          (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (mlp): MLP(
            bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0
            (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col)
            (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row)
          )
        )
        (21): TransformerLayer(
          (drop_path): Identity()
          (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (self_attention): MultiheadAttention(
            hidden_size=1024, num_heads=16, is_cross_attention=False
            (dropout): Dropout(p=0.1, inplace=False)
            (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col)
            (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row)
          )
          (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (mlp): MLP(
            bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0
            (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col)
            (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row)
          )
        )
        (22): TransformerLayer(
          (drop_path): Identity()
          (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (self_attention): MultiheadAttention(
            hidden_size=1024, num_heads=16, is_cross_attention=False
            (dropout): Dropout(p=0.1, inplace=False)
            (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col)
            (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row)
          )
          (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (mlp): MLP(
            bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0
            (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col)
            (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row)
          )
        )
        (23): TransformerLayer(
          (drop_path): Identity()
          (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (self_attention): MultiheadAttention(
            hidden_size=1024, num_heads=16, is_cross_attention=False
            (dropout): Dropout(p=0.1, inplace=False)
            (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col)
            (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row)
          )
          (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (mlp): MLP(
            bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0
            (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col)
            (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row)
          )
        )
      )
      (layernorm_f): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
    )
    (lm_head): LMLogits()
  )
  (loss_func): GPTLoss(
    (lm_loss): ParallelCrossEntropyLoss()
  )
)
WARNING [07/06 14:03:46 lb.scheduler.lr_scheduler]: warmup iters equals to zero, return CosineLR
iv-ybpu7pvmiu5m57lh5kdd:3612:4814 [0] NCCL INFO Bootstrap : Using eth0:192.168.11.230<0>
iv-ybpu7pvmiu5m57lh5kdd:3612:4814 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v5 symbol.
iv-ybpu7pvmiu5m57lh5kdd:3612:4814 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v5 symbol.
iv-ybpu7pvmiu5m57lh5kdd:3612:4814 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so iv-ybpu7pvmiu5m57lh5kdd:3612:4814 [0] NCCL INFO P2P plugin IBext iv-ybpu7pvmiu5m57lh5kdd:3612:4814 [0] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1. iv-ybpu7pvmiu5m57lh5kdd:3612:4814 [0] NCCL INFO NET/IB : Using [0]mlx5_1:1/RoCE ; OOB eth0:192.168.11.230<0> iv-ybpu7pvmiu5m57lh5kdd:3612:4814 [0] NCCL INFO Using network IBext NCCL version 2.12.10+cuda11.2 iv-ybpu7pvmiu5m57lh5kdd:3613:4821 [1] NCCL INFO Bootstrap : Using eth0:192.168.11.230<0> iv-ybpu7pvmiu5m57lh5kdd:3617:4813 [5] NCCL INFO Bootstrap : Using eth0:192.168.11.230<0> iv-ybpu7pvmiu5m57lh5kdd:3615:4824 [3] NCCL INFO Bootstrap : Using eth0:192.168.11.230<0> iv-ybpu7pvmiu5m57lh5kdd:3618:4815 [6] NCCL INFO Bootstrap : Using eth0:192.168.11.230<0> iv-ybpu7pvmiu5m57lh5kdd:3616:4812 [4] NCCL INFO Bootstrap : Using eth0:192.168.11.230<0> iv-ybpu7pvmiu5m57lh5kdd:3613:4821 [1] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v5 symbol. iv-ybpu7pvmiu5m57lh5kdd:3613:4821 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v5 symbol. iv-ybpu7pvmiu5m57lh5kdd:3613:4821 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so iv-ybpu7pvmiu5m57lh5kdd:3613:4821 [1] NCCL INFO P2P plugin IBext iv-ybpu7pvmiu5m57lh5kdd:3613:4821 [1] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1. iv-ybpu7pvmiu5m57lh5kdd:3617:4813 [5] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v5 symbol. iv-ybpu7pvmiu5m57lh5kdd:3615:4824 [3] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v5 symbol. iv-ybpu7pvmiu5m57lh5kdd:3617:4813 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v5 symbol. iv-ybpu7pvmiu5m57lh5kdd:3615:4824 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v5 symbol. 
iv-ybpu7pvmiu5m57lh5kdd:3617:4813 [5] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so iv-ybpu7pvmiu5m57lh5kdd:3617:4813 [5] NCCL INFO P2P plugin IBext iv-ybpu7pvmiu5m57lh5kdd:3617:4813 [5] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1. iv-ybpu7pvmiu5m57lh5kdd:3615:4824 [3] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so iv-ybpu7pvmiu5m57lh5kdd:3615:4824 [3] NCCL INFO P2P plugin IBext iv-ybpu7pvmiu5m57lh5kdd:3615:4824 [3] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1. iv-ybpu7pvmiu5m57lh5kdd:3616:4812 [4] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v5 symbol. iv-ybpu7pvmiu5m57lh5kdd:3618:4815 [6] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v5 symbol. iv-ybpu7pvmiu5m57lh5kdd:3616:4812 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v5 symbol. iv-ybpu7pvmiu5m57lh5kdd:3616:4812 [4] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so iv-ybpu7pvmiu5m57lh5kdd:3618:4815 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v5 symbol. iv-ybpu7pvmiu5m57lh5kdd:3616:4812 [4] NCCL INFO P2P plugin IBext iv-ybpu7pvmiu5m57lh5kdd:3616:4812 [4] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1. iv-ybpu7pvmiu5m57lh5kdd:3618:4815 [6] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so iv-ybpu7pvmiu5m57lh5kdd:3618:4815 [6] NCCL INFO P2P plugin IBext iv-ybpu7pvmiu5m57lh5kdd:3618:4815 [6] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1. 
iv-ybpu7pvmiu5m57lh5kdd:3613:4821 [1] NCCL INFO NET/IB : Using [0]mlx5_1:1/RoCE ; OOB eth0:192.168.11.230<0> iv-ybpu7pvmiu5m57lh5kdd:3613:4821 [1] NCCL INFO Using network IBext iv-ybpu7pvmiu5m57lh5kdd:3617:4813 [5] NCCL INFO NET/IB : Using [0]mlx5_1:1/RoCE ; OOB eth0:192.168.11.230<0> iv-ybpu7pvmiu5m57lh5kdd:3617:4813 [5] NCCL INFO Using network IBext iv-ybpu7pvmiu5m57lh5kdd:3615:4824 [3] NCCL INFO NET/IB : Using [0]mlx5_1:1/RoCE ; OOB eth0:192.168.11.230<0> iv-ybpu7pvmiu5m57lh5kdd:3615:4824 [3] NCCL INFO Using network IBext iv-ybpu7pvmiu5m57lh5kdd:3618:4815 [6] NCCL INFO NET/IB : Using [0]mlx5_1:1/RoCE ; OOB eth0:192.168.11.230<0> iv-ybpu7pvmiu5m57lh5kdd:3618:4815 [6] NCCL INFO Using network IBext iv-ybpu7pvmiu5m57lh5kdd:3616:4812 [4] NCCL INFO NET/IB : Using [0]mlx5_1:1/RoCE ; OOB eth0:192.168.11.230<0> iv-ybpu7pvmiu5m57lh5kdd:3616:4812 [4] NCCL INFO Using network IBext iv-ybpu7pvmiu5m57lh5kdd:3614:4816 [2] NCCL INFO Bootstrap : Using eth0:192.168.11.230<0> iv-ybpu7pvmiu5m57lh5kdd:3619:4819 [7] NCCL INFO Bootstrap : Using eth0:192.168.11.230<0> iv-ybpu7pvmiu5m57lh5kdd:3614:4816 [2] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v5 symbol. iv-ybpu7pvmiu5m57lh5kdd:3614:4816 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v5 symbol. iv-ybpu7pvmiu5m57lh5kdd:3614:4816 [2] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so iv-ybpu7pvmiu5m57lh5kdd:3614:4816 [2] NCCL INFO P2P plugin IBext iv-ybpu7pvmiu5m57lh5kdd:3614:4816 [2] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1. iv-ybpu7pvmiu5m57lh5kdd:3619:4819 [7] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v5 symbol. iv-ybpu7pvmiu5m57lh5kdd:3619:4819 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v5 symbol. 
iv-ybpu7pvmiu5m57lh5kdd:3619:4819 [7] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
iv-ybpu7pvmiu5m57lh5kdd:3619:4819 [7] NCCL INFO P2P plugin IBext
iv-ybpu7pvmiu5m57lh5kdd:3619:4819 [7] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1.
iv-ybpu7pvmiu5m57lh5kdd:3614:4816 [2] NCCL INFO NET/IB : Using [0]mlx5_1:1/RoCE ; OOB eth0:192.168.11.230<0>
iv-ybpu7pvmiu5m57lh5kdd:3614:4816 [2] NCCL INFO Using network IBext
iv-ybpu7pvmiu5m57lh5kdd:3619:4819 [7] NCCL INFO NET/IB : Using [0]mlx5_1:1/RoCE ; OOB eth0:192.168.11.230<0>
iv-ybpu7pvmiu5m57lh5kdd:3619:4819 [7] NCCL INFO Using network IBext
iv-ybpu7pvmiu5m57lh5kdd:3614:4816 [2] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3.
iv-ybpu7pvmiu5m57lh5kdd:3613:4821 [1] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3.
iv-ybpu7pvmiu5m57lh5kdd:3615:4824 [3] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3.
iv-ybpu7pvmiu5m57lh5kdd:3612:4814 [0] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3.
iv-ybpu7pvmiu5m57lh5kdd:3617:4813 [5] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3.
iv-ybpu7pvmiu5m57lh5kdd:3619:4819 [7] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3.
iv-ybpu7pvmiu5m57lh5kdd:3616:4812 [4] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3.
iv-ybpu7pvmiu5m57lh5kdd:3618:4815 [6] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3.
iv-ybpu7pvmiu5m57lh5kdd:3616:4812 [4] NCCL INFO NCCL_IB_TIMEOUT set by environment to 23.
iv-ybpu7pvmiu5m57lh5kdd:3619:4819 [7] NCCL INFO NCCL_IB_TIMEOUT set by environment to 23.
iv-ybpu7pvmiu5m57lh5kdd:3616:4812 [4] NCCL INFO NCCL_IB_RETRY_CNT set by environment to 7.
iv-ybpu7pvmiu5m57lh5kdd:3619:4819 [7] NCCL INFO NCCL_IB_RETRY_CNT set by environment to 7.
iv-ybpu7pvmiu5m57lh5kdd:3614:4816 [2] NCCL INFO NCCL_IB_TIMEOUT set by environment to 23.
iv-ybpu7pvmiu5m57lh5kdd:3614:4816 [2] NCCL INFO NCCL_IB_RETRY_CNT set by environment to 7.
iv-ybpu7pvmiu5m57lh5kdd:3617:4813 [5] NCCL INFO NCCL_IB_TIMEOUT set by environment to 23.
iv-ybpu7pvmiu5m57lh5kdd:3617:4813 [5] NCCL INFO NCCL_IB_RETRY_CNT set by environment to 7.
iv-ybpu7pvmiu5m57lh5kdd:3612:4814 [0] NCCL INFO NCCL_IB_TIMEOUT set by environment to 23.
iv-ybpu7pvmiu5m57lh5kdd:3612:4814 [0] NCCL INFO NCCL_IB_RETRY_CNT set by environment to 7.
iv-ybpu7pvmiu5m57lh5kdd:3618:4815 [6] NCCL INFO NCCL_IB_TIMEOUT set by environment to 23.
iv-ybpu7pvmiu5m57lh5kdd:3613:4821 [1] NCCL INFO NCCL_IB_TIMEOUT set by environment to 23.
iv-ybpu7pvmiu5m57lh5kdd:3618:4815 [6] NCCL INFO NCCL_IB_RETRY_CNT set by environment to 7.
iv-ybpu7pvmiu5m57lh5kdd:3613:4821 [1] NCCL INFO NCCL_IB_RETRY_CNT set by environment to 7.
iv-ybpu7pvmiu5m57lh5kdd:3615:4824 [3] NCCL INFO NCCL_IB_TIMEOUT set by environment to 23.
iv-ybpu7pvmiu5m57lh5kdd:3615:4824 [3] NCCL INFO NCCL_IB_RETRY_CNT set by environment to 7.
iv-ybpu7pvmiu5m57lh5kdd:3619:4819 [7] NCCL INFO PXN Disabled as plugin is v4
iv-ybpu7pvmiu5m57lh5kdd:3619:4819 [7] NCCL INFO Setting affinity for GPU 7 to 0fffff,fffffc00,00000000
iv-ybpu7pvmiu5m57lh5kdd:3615:4824 [3] NCCL INFO PXN Disabled as plugin is v4
iv-ybpu7pvmiu5m57lh5kdd:3615:4824 [3] NCCL INFO Setting affinity for GPU 3 to 03ff,ffffffff
iv-ybpu7pvmiu5m57lh5kdd:3618:4815 [6] NCCL INFO PXN Disabled as plugin is v4
iv-ybpu7pvmiu5m57lh5kdd:3618:4815 [6] NCCL INFO Setting affinity for GPU 6 to 0fffff,fffffc00,00000000
iv-ybpu7pvmiu5m57lh5kdd:3616:4812 [4] NCCL INFO PXN Disabled as plugin is v4
iv-ybpu7pvmiu5m57lh5kdd:3616:4812 [4] NCCL INFO Setting affinity for GPU 4 to 0fffff,fffffc00,00000000
iv-ybpu7pvmiu5m57lh5kdd:3614:4816 [2] NCCL INFO PXN Disabled as plugin is v4
iv-ybpu7pvmiu5m57lh5kdd:3612:4814 [0] NCCL INFO PXN Disabled as plugin is v4
iv-ybpu7pvmiu5m57lh5kdd:3613:4821 [1] NCCL INFO PXN Disabled as plugin is v4
iv-ybpu7pvmiu5m57lh5kdd:3612:4814 [0] NCCL INFO Setting affinity for GPU 0 to 03ff,ffffffff
iv-ybpu7pvmiu5m57lh5kdd:3613:4821 [1] NCCL INFO Setting affinity for GPU 1 to 03ff,ffffffff
iv-ybpu7pvmiu5m57lh5kdd:3614:4816 [2] NCCL INFO Setting affinity for GPU 2 to 03ff,ffffffff
iv-ybpu7pvmiu5m57lh5kdd:3617:4813 [5] NCCL INFO PXN Disabled as plugin is v4
iv-ybpu7pvmiu5m57lh5kdd:3617:4813 [5] NCCL INFO Setting affinity for GPU 5 to 0fffff,fffffc00,00000000
iv-ybpu7pvmiu5m57lh5kdd:3614:4816 [2] NCCL INFO Trees [0] 3/-1/-1->2->0 [1] 3/-1/-1->2->0
iv-ybpu7pvmiu5m57lh5kdd:3618:4815 [6] NCCL INFO Trees [0] -1/-1/-1->6->1 [1] -1/-1/-1->6->1
iv-ybpu7pvmiu5m57lh5kdd:3619:4819 [7] NCCL INFO Trees [0] 0/-1/-1->7->5 [1] 0/-1/-1->7->5
iv-ybpu7pvmiu5m57lh5kdd:3612:4814 [0] NCCL INFO Channel 00/02 : 0 1 3 2 5 12 14 15 8 9 11 10 13 20 22 23 16 17 19 18
iv-ybpu7pvmiu5m57lh5kdd:3617:4813 [5] NCCL INFO Trees [0] 7/-1/-1->5->4 [1] 7/-1/-1->5->4
iv-ybpu7pvmiu5m57lh5kdd:3613:4821 [1] NCCL INFO Trees [0] 6/-1/-1->1->3 [1] 6/-1/-1->1->3
iv-ybpu7pvmiu5m57lh5kdd:3616:4812 [4] NCCL INFO Trees [0] 5/20/-1->4->-1 [1] 5/-1/-1->4->12
iv-ybpu7pvmiu5m57lh5kdd:3612:4814 [0] NCCL INFO Channel 01/02 : 0 1 3 2 5 12 14 15 8 9 11 10 13 20 22 23 16 17 19 18
iv-ybpu7pvmiu5m57lh5kdd:3612:4814 [0] NCCL INFO Trees [0] 2/-1/-1->0->7 [1] 2/-1/-1->0->7
iv-ybpu7pvmiu5m57lh5kdd:3615:4824 [3] NCCL INFO Trees [0] 1/-1/-1->3->2 [1] 1/-1/-1->3->2
iv-ybpu7pvmiu5m57lh5kdd:3618:4815 [6] NCCL INFO Channel 00 : 6[6b010] -> 7[6b020] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3616:4812 [4] NCCL INFO Channel 00 : 4[69010] -> 6[6b010] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3614:4816 [2] NCCL INFO Channel 00 : 2[67010] -> 5[69020] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3612:4814 [0] NCCL INFO Channel 00 : 0[65010] -> 1[65020] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3618:4815 [6] NCCL INFO Channel 01 : 6[6b010] -> 7[6b020] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3616:4812 [4] NCCL INFO Channel 01 : 4[69010] -> 6[6b010] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3614:4816 [2] NCCL INFO Channel 01 : 2[67010] -> 5[69020] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3612:4814 [0] NCCL INFO Channel 01 : 0[65010] -> 1[65020] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3619:4819 [7] NCCL INFO Channel 00 : 7[6b020] -> 0[65010] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3613:4821 [1] NCCL INFO Channel 00 : 1[65020] -> 3[67020] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3619:4819 [7] NCCL INFO Channel 01 : 7[6b020] -> 0[65010] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3613:4821 [1] NCCL INFO Channel 01 : 1[65020] -> 3[67020] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3617:4813 [5] NCCL INFO Channel 00/0 : 5[69020] -> 12[69010] [send] via NET/IBext/0/GDRDMA
iv-ybpu7pvmiu5m57lh5kdd:3618:4815 [6] NCCL INFO Connected all rings
iv-ybpu7pvmiu5m57lh5kdd:3612:4814 [0] NCCL INFO Connected all rings
iv-ybpu7pvmiu5m57lh5kdd:3612:4814 [0] NCCL INFO Channel 00 : 0[65010] -> 2[67010] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3619:4819 [7] NCCL INFO Connected all rings
iv-ybpu7pvmiu5m57lh5kdd:3615:4824 [3] NCCL INFO Channel 00 : 3[67020] -> 2[67010] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3613:4821 [1] NCCL INFO Connected all rings
iv-ybpu7pvmiu5m57lh5kdd:3613:4821 [1] NCCL INFO Channel 00 : 1[65020] -> 6[6b010] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3616:4812 [4] NCCL INFO Channel 00/0 : 29[69020] -> 4[69010] [receive] via NET/IBext/0/GDRDMA
iv-ybpu7pvmiu5m57lh5kdd:3612:4814 [0] NCCL INFO Channel 01 : 0[65010] -> 2[67010] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3613:4821 [1] NCCL INFO Channel 01 : 1[65020] -> 6[6b010] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3615:4824 [3] NCCL INFO Channel 01 : 3[67020] -> 2[67010] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3618:4815 [6] NCCL INFO Channel 00 : 6[6b010] -> 1[65020] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3614:4816 [2] NCCL INFO Connected all rings
iv-ybpu7pvmiu5m57lh5kdd:3614:4816 [2] NCCL INFO Channel 00 : 2[67010] -> 3[67020] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3615:4824 [3] NCCL INFO Connected all rings
iv-ybpu7pvmiu5m57lh5kdd:3618:4815 [6] NCCL INFO Channel 01 : 6[6b010] -> 1[65020] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3614:4816 [2] NCCL INFO Channel 01 : 2[67010] -> 3[67020] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3618:4815 [6] NCCL INFO Connected all trees
iv-ybpu7pvmiu5m57lh5kdd:3618:4815 [6] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 8/8/512
iv-ybpu7pvmiu5m57lh5kdd:3618:4815 [6] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
iv-ybpu7pvmiu5m57lh5kdd:3615:4824 [3] NCCL INFO Channel 00 : 3[67020] -> 1[65020] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3615:4824 [3] NCCL INFO Channel 01 : 3[67020] -> 1[65020] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3614:4816 [2] NCCL INFO Channel 00 : 2[67010] -> 0[65010] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3612:4814 [0] NCCL INFO Channel 00 : 0[65010] -> 7[6b020] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3614:4816 [2] NCCL INFO Channel 01 : 2[67010] -> 0[65010] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3612:4814 [0] NCCL INFO Channel 01 : 0[65010] -> 7[6b020] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3613:4821 [1] NCCL INFO Connected all trees
iv-ybpu7pvmiu5m57lh5kdd:3613:4821 [1] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 8/8/512
iv-ybpu7pvmiu5m57lh5kdd:3613:4821 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
iv-ybpu7pvmiu5m57lh5kdd:3613:4821 [1] NCCL INFO Channel 00 : 1[65020] -> 4[69010] via P2P/indirect/6[6b010]
iv-ybpu7pvmiu5m57lh5kdd:3615:4824 [3] NCCL INFO Connected all trees
iv-ybpu7pvmiu5m57lh5kdd:3615:4824 [3] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 8/8/512
iv-ybpu7pvmiu5m57lh5kdd:3615:4824 [3] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
iv-ybpu7pvmiu5m57lh5kdd:3615:4824 [3] NCCL INFO Channel 00 : 3[67020] -> 5[69020] via P2P/indirect/4[69010]
iv-ybpu7pvmiu5m57lh5kdd:3613:4821 [1] NCCL INFO Channel 01 : 1[65020] -> 4[69010] via P2P/indirect/6[6b010]
iv-ybpu7pvmiu5m57lh5kdd:3615:4824 [3] NCCL INFO Channel 01 : 3[67020] -> 5[69020] via P2P/indirect/4[69010]
iv-ybpu7pvmiu5m57lh5kdd:3617:4813 [5] NCCL INFO Channel 01/0 : 5[69020] -> 12[69010] [send] via NET/IBext/0/GDRDMA
iv-ybpu7pvmiu5m57lh5kdd:3616:4812 [4] NCCL INFO Channel 01/0 : 29[69020] -> 4[69010] [receive] via NET/IBext/0/GDRDMA
iv-ybpu7pvmiu5m57lh5kdd:3617:4813 [5] NCCL INFO Connected all rings
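The `set by environment` messages above record the IB/RoCE tuning this job was launched with. As a minimal sketch (values read straight from the log; whether other NCCL variables were also set is unknown), the implied environment could be reproduced like this:

```python
import os

# NCCL tuning implied by the "set by environment" log messages above.
# Values are taken verbatim from the log; set them before launching the job.
NCCL_ENV = {
    "NCCL_IB_PCI_RELAXED_ORDERING": "1",  # relaxed PCIe ordering for IB transfers
    "NCCL_IB_GID_INDEX": "3",             # GID index used for RoCE traffic
    "NCCL_IB_TIMEOUT": "23",              # IB queue-pair timeout exponent
    "NCCL_IB_RETRY_CNT": "7",             # IB transport retry count
}
os.environ.update(NCCL_ENV)
```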
iv-ybpu7pvmiu5m57lh5kdd:3616:4812 [4] NCCL INFO Connected all rings
iv-ybpu7pvmiu5m57lh5kdd:3616:4812 [4] NCCL INFO Channel 00 : 4[69010] -> 5[69020] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3616:4812 [4] NCCL INFO Channel 01 : 4[69010] -> 5[69020] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3617:4813 [5] NCCL INFO Channel 00 : 5[69020] -> 7[6b020] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3617:4813 [5] NCCL INFO Channel 01 : 5[69020] -> 7[6b020] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3619:4819 [7] NCCL INFO Channel 00 : 7[6b020] -> 5[69020] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3619:4819 [7] NCCL INFO Channel 01 : 7[6b020] -> 5[69020] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3619:4819 [7] NCCL INFO Connected all trees
iv-ybpu7pvmiu5m57lh5kdd:3619:4819 [7] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 8/8/512
iv-ybpu7pvmiu5m57lh5kdd:3619:4819 [7] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
iv-ybpu7pvmiu5m57lh5kdd:3617:4813 [5] NCCL INFO Channel 00 : 5[69020] -> 4[69010] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3617:4813 [5] NCCL INFO Channel 01 : 5[69020] -> 4[69010] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3614:4816 [2] NCCL INFO Connected all trees
iv-ybpu7pvmiu5m57lh5kdd:3614:4816 [2] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 8/8/512
iv-ybpu7pvmiu5m57lh5kdd:3614:4816 [2] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
iv-ybpu7pvmiu5m57lh5kdd:3614:4816 [2] NCCL INFO Channel 00 : 2[67010] -> 4[69010] via P2P/indirect/5[69020]
iv-ybpu7pvmiu5m57lh5kdd:3614:4816 [2] NCCL INFO Channel 01 : 2[67010] -> 4[69010] via P2P/indirect/5[69020]
iv-ybpu7pvmiu5m57lh5kdd:3612:4814 [0] NCCL INFO Connected all trees
iv-ybpu7pvmiu5m57lh5kdd:3612:4814 [0] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 8/8/512
iv-ybpu7pvmiu5m57lh5kdd:3612:4814 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
iv-ybpu7pvmiu5m57lh5kdd:3612:4814 [0] NCCL INFO Channel 00 : 0[65010] -> 4[69010] via P2P/indirect/3[67020]
iv-ybpu7pvmiu5m57lh5kdd:3612:4814 [0] NCCL INFO Channel 01 : 0[65010] -> 4[69010] via P2P/indirect/3[67020]
iv-ybpu7pvmiu5m57lh5kdd:3616:4812 [4] NCCL INFO Channel 01/0 : 4[69010] -> 12[69010] [send] via NET/IBext/0/GDRDMA
iv-ybpu7pvmiu5m57lh5kdd:3616:4812 [4] NCCL INFO Channel 00/0 : 20[69010] -> 4[69010] [receive] via NET/IBext/0/GDRDMA
iv-ybpu7pvmiu5m57lh5kdd:3616:4812 [4] NCCL INFO Channel 00/0 : 4[69010] -> 20[69010] [send] via NET/IBext/0/GDRDMA
iv-ybpu7pvmiu5m57lh5kdd:3616:4812 [4] NCCL INFO Channel 01/0 : 12[69010] -> 4[69010] [receive] via NET/IBext/0/GDRDMA
iv-ybpu7pvmiu5m57lh5kdd:3616:4812 [4] NCCL INFO Connected all trees
iv-ybpu7pvmiu5m57lh5kdd:3616:4812 [4] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 8/8/512
iv-ybpu7pvmiu5m57lh5kdd:3616:4812 [4] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
iv-ybpu7pvmiu5m57lh5kdd:3617:4813 [5] NCCL INFO Connected all trees
iv-ybpu7pvmiu5m57lh5kdd:3617:4813 [5] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 8/8/512
iv-ybpu7pvmiu5m57lh5kdd:3617:4813 [5] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
iv-ybpu7pvmiu5m57lh5kdd:3614:4816 [2] NCCL INFO Channel 00 : 2[67010] -> 6[6b010] via P2P/indirect/1[65020]
iv-ybpu7pvmiu5m57lh5kdd:3615:4824 [3] NCCL INFO Channel 00 : 3[67020] -> 6[6b010] via P2P/indirect/1[65020]
iv-ybpu7pvmiu5m57lh5kdd:3614:4816 [2] NCCL INFO Channel 01 : 2[67010] -> 6[6b010] via P2P/indirect/1[65020]
iv-ybpu7pvmiu5m57lh5kdd:3615:4824 [3] NCCL INFO Channel 01 : 3[67020] -> 6[6b010] via P2P/indirect/1[65020]
iv-ybpu7pvmiu5m57lh5kdd:3613:4821 [1] NCCL INFO Channel 00 : 1[65020] -> 5[69020] via P2P/indirect/6[6b010]
iv-ybpu7pvmiu5m57lh5kdd:3613:4821 [1] NCCL INFO Channel 01 : 1[65020] -> 5[69020] via P2P/indirect/6[6b010]
iv-ybpu7pvmiu5m57lh5kdd:3615:4824 [3] NCCL INFO Channel 00 : 3[67020] -> 7[6b020] via P2P/indirect/0[65010]
iv-ybpu7pvmiu5m57lh5kdd:3616:4812 [4] NCCL INFO Channel 00 : 4[69010] -> 0[65010] via P2P/indirect/7[6b020]
iv-ybpu7pvmiu5m57lh5kdd:3613:4821 [1] NCCL INFO Channel 00 : 1[65020] -> 7[6b020] via P2P/indirect/0[65010]
iv-ybpu7pvmiu5m57lh5kdd:3616:4812 [4] NCCL INFO Channel 01 : 4[69010] -> 0[65010] via P2P/indirect/7[6b020]
iv-ybpu7pvmiu5m57lh5kdd:3612:4814 [0] NCCL INFO Channel 00 : 0[65010] -> 5[69020] via P2P/indirect/7[6b020]
iv-ybpu7pvmiu5m57lh5kdd:3615:4824 [3] NCCL INFO Channel 01 : 3[67020] -> 7[6b020] via P2P/indirect/0[65010]
iv-ybpu7pvmiu5m57lh5kdd:3613:4821 [1] NCCL INFO Channel 01 : 1[65020] -> 7[6b020] via P2P/indirect/0[65010]
iv-ybpu7pvmiu5m57lh5kdd:3612:4814 [0] NCCL INFO Channel 01 : 0[65010] -> 5[69020] via P2P/indirect/7[6b020]
iv-ybpu7pvmiu5m57lh5kdd:3614:4816 [2] NCCL INFO Channel 00 : 2[67010] -> 7[6b020] via P2P/indirect/0[65010]
iv-ybpu7pvmiu5m57lh5kdd:3614:4816 [2] NCCL INFO Channel 01 : 2[67010] -> 7[6b020] via P2P/indirect/0[65010]
iv-ybpu7pvmiu5m57lh5kdd:3617:4813 [5] NCCL INFO Channel 00 : 5[69020] -> 0[65010] via P2P/indirect/7[6b020]
iv-ybpu7pvmiu5m57lh5kdd:3612:4814 [0] NCCL INFO Channel 00 : 0[65010] -> 6[6b010] via P2P/indirect/1[65020]
iv-ybpu7pvmiu5m57lh5kdd:3617:4813 [5] NCCL INFO Channel 01 : 5[69020] -> 0[65010] via P2P/indirect/7[6b020]
iv-ybpu7pvmiu5m57lh5kdd:3612:4814 [0] NCCL INFO Channel 01 : 0[65010] -> 6[6b010] via P2P/indirect/1[65020]
iv-ybpu7pvmiu5m57lh5kdd:3618:4815 [6] NCCL INFO Channel 00 : 6[6b010] -> 0[65010] via P2P/indirect/7[6b020]
iv-ybpu7pvmiu5m57lh5kdd:3618:4815 [6] NCCL INFO Channel 01 : 6[6b010] -> 0[65010] via P2P/indirect/7[6b020]
iv-ybpu7pvmiu5m57lh5kdd:3618:4815 [6] NCCL INFO Channel 00 : 6[6b010] -> 2[67010] via P2P/indirect/5[69020]
iv-ybpu7pvmiu5m57lh5kdd:3618:4815 [6] NCCL INFO Channel 01 : 6[6b010] -> 2[67010] via P2P/indirect/5[69020]
iv-ybpu7pvmiu5m57lh5kdd:3619:4819 [7] NCCL INFO Channel 00 : 7[6b020] -> 1[65020] via P2P/indirect/6[6b010]
iv-ybpu7pvmiu5m57lh5kdd:3619:4819 [7] NCCL INFO Channel 01 : 7[6b020] -> 1[65020] via P2P/indirect/6[6b010]
iv-ybpu7pvmiu5m57lh5kdd:3617:4813 [5] NCCL INFO Channel 00 : 5[69020] -> 1[65020] via P2P/indirect/6[6b010]
iv-ybpu7pvmiu5m57lh5kdd:3619:4819 [7] NCCL INFO Channel 00 : 7[6b020] -> 2[67010] via P2P/indirect/0[65010]
iv-ybpu7pvmiu5m57lh5kdd:3617:4813 [5] NCCL INFO Channel 01 : 5[69020] -> 1[65020] via P2P/indirect/6[6b010]
iv-ybpu7pvmiu5m57lh5kdd:3619:4819 [7] NCCL INFO Channel 01 : 7[6b020] -> 2[67010] via P2P/indirect/0[65010]
iv-ybpu7pvmiu5m57lh5kdd:3616:4812 [4] NCCL INFO Channel 00 : 4[69010] -> 1[65020] via P2P/indirect/6[6b010]
iv-ybpu7pvmiu5m57lh5kdd:3619:4819 [7] NCCL INFO Channel 00 : 7[6b020] -> 3[67020] via P2P/indirect/0[65010]
iv-ybpu7pvmiu5m57lh5kdd:3616:4812 [4] NCCL INFO Channel 01 : 4[69010] -> 1[65020] via P2P/indirect/6[6b010]
iv-ybpu7pvmiu5m57lh5kdd:3617:4813 [5] NCCL INFO Channel 00 : 5[69020] -> 3[67020] via P2P/indirect/2[67010]
iv-ybpu7pvmiu5m57lh5kdd:3617:4813 [5] NCCL INFO Channel 01 : 5[69020] -> 3[67020] via P2P/indirect/2[67010]
iv-ybpu7pvmiu5m57lh5kdd:3619:4819 [7] NCCL INFO Channel 01 : 7[6b020] -> 3[67020] via P2P/indirect/0[65010]
iv-ybpu7pvmiu5m57lh5kdd:3618:4815 [6] NCCL INFO Channel 00 : 6[6b010] -> 3[67020] via P2P/indirect/1[65020]
iv-ybpu7pvmiu5m57lh5kdd:3616:4812 [4] NCCL INFO Channel 00 : 4[69010] -> 2[67010] via P2P/indirect/3[67020]
iv-ybpu7pvmiu5m57lh5kdd:3616:4812 [4] NCCL INFO Channel 01 : 4[69010] -> 2[67010] via P2P/indirect/3[67020]
iv-ybpu7pvmiu5m57lh5kdd:3618:4815 [6] NCCL INFO Channel 01 : 6[6b010] -> 3[67020] via P2P/indirect/1[65020]
iv-ybpu7pvmiu5m57lh5kdd:3618:4815 [6] NCCL INFO comm 0x7f2b94d2f730 rank 6 nranks 32 cudaDev 6 busId 6b010 - Init COMPLETE
iv-ybpu7pvmiu5m57lh5kdd:3619:4819 [7] NCCL INFO comm 0x7f8a154ac8d0 rank 7 nranks 32 cudaDev 7 busId 6b020 - Init COMPLETE
iv-ybpu7pvmiu5m57lh5kdd:3614:4816 [2] NCCL INFO comm 0x7fd5e9019d00 rank 2 nranks 32 cudaDev 2 busId 67010 - Init COMPLETE
iv-ybpu7pvmiu5m57lh5kdd:3617:4813 [5] NCCL INFO comm 0x7f8de93060e0 rank 5 nranks 32 cudaDev 5 busId 69020 - Init COMPLETE
iv-ybpu7pvmiu5m57lh5kdd:3616:4812 [4] NCCL INFO comm 0x7f45a56795d0 rank 4 nranks 32 cudaDev 4 busId 69010 - Init COMPLETE
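Each `Init COMPLETE` line above ties a communicator rank to its CUDA device and PCI bus ID. A minimal sketch (the regex pattern and helper name are illustrative, not part of any tool used here) that recovers the rank-to-busId mapping from such lines:

```python
import re

# Matches NCCL "Init COMPLETE" records like the ones in the log above.
PAT = re.compile(
    r"rank (\d+) nranks (\d+) cudaDev (\d+) busId ([0-9a-f]+) - Init COMPLETE"
)

def rank_to_busid(log_text):
    """Return {rank: busId} for every Init COMPLETE record found."""
    return {int(m.group(1)): m.group(4) for m in PAT.finditer(log_text)}

line = ("iv-ybpu7pvmiu5m57lh5kdd:3616:4812 [4] NCCL INFO comm 0x7f45a56795d0 "
        "rank 4 nranks 32 cudaDev 4 busId 69010 - Init COMPLETE")
print(rank_to_busid(line))  # → {4: '69010'}
```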
iv-ybpu7pvmiu5m57lh5kdd:3615:4824 [3] NCCL INFO comm 0x7f71247c4810 rank 3 nranks 32 cudaDev 3 busId 67020 - Init COMPLETE
iv-ybpu7pvmiu5m57lh5kdd:3612:4814 [0] NCCL INFO comm 0x7f4a34c4aa40 rank 0 nranks 32 cudaDev 0 busId 65010 - Init COMPLETE
iv-ybpu7pvmiu5m57lh5kdd:3613:4821 [1] NCCL INFO comm 0x7fb32b8c0080 rank 1 nranks 32 cudaDev 1 busId 65020 - Init COMPLETE
iv-ybpu7pvmiu5m57lh5kdd:3612:4814 [0] NCCL INFO Launch mode Parallel
[07/06 14:03:58 lb.engine.trainer]: Starting training from iteration 0
[07/06 14:03:58 lb.models.utils.graph_base]: Start compling the train graph which may take some time. Please wait for a moment ...
iv-ybpu7pvmiu5m57lh5kdd:3617:5922 [5] NCCL INFO Setting affinity for GPU 5 to 0fffff,fffffc00,00000000
iv-ybpu7pvmiu5m57lh5kdd:3618:5848 [6] NCCL INFO Setting affinity for GPU 6 to 0fffff,fffffc00,00000000
iv-ybpu7pvmiu5m57lh5kdd:3619:5940 [7] NCCL INFO Setting affinity for GPU 7 to 0fffff,fffffc00,00000000
iv-ybpu7pvmiu5m57lh5kdd:3615:5820 [3] NCCL INFO Setting affinity for GPU 3 to 03ff,ffffffff
iv-ybpu7pvmiu5m57lh5kdd:3613:5868 [1] NCCL INFO Setting affinity for GPU 1 to 03ff,ffffffff
iv-ybpu7pvmiu5m57lh5kdd:3616:5939 [4] NCCL INFO Setting affinity for GPU 4 to 0fffff,fffffc00,00000000
iv-ybpu7pvmiu5m57lh5kdd:3614:5964 [2] NCCL INFO Setting affinity for GPU 2 to 03ff,ffffffff
iv-ybpu7pvmiu5m57lh5kdd:3612:5796 [0] NCCL INFO Setting affinity for GPU 0 to 03ff,ffffffff
iv-ybpu7pvmiu5m57lh5kdd:3614:5964 [2] NCCL INFO Trees [0] 3/-1/-1->2->0 [1] 3/-1/-1->2->0
iv-ybpu7pvmiu5m57lh5kdd:3618:5848 [6] NCCL INFO Trees [0] -1/-1/-1->6->1 [1] -1/-1/-1->6->1
iv-ybpu7pvmiu5m57lh5kdd:3619:5940 [7] NCCL INFO Trees [0] 0/-1/-1->7->5 [1] 0/-1/-1->7->5
iv-ybpu7pvmiu5m57lh5kdd:3617:5922 [5] NCCL INFO Trees [0] 7/-1/-1->5->4 [1] 7/-1/-1->5->4
iv-ybpu7pvmiu5m57lh5kdd:3615:5820 [3] NCCL INFO Trees [0] 1/-1/-1->3->2 [1] 1/-1/-1->3->2
iv-ybpu7pvmiu5m57lh5kdd:3616:5939 [4] NCCL INFO Trees [0] 5/20/-1->4->-1 [1] 5/-1/-1->4->12
iv-ybpu7pvmiu5m57lh5kdd:3612:5796 [0] NCCL INFO Channel 00/02 : 0 1 3 2 5 12 14 15 8 9 11 10 13 20 22 23 16 17 19 18
iv-ybpu7pvmiu5m57lh5kdd:3613:5868 [1] NCCL INFO Trees [0] 6/-1/-1->1->3 [1] 6/-1/-1->1->3
iv-ybpu7pvmiu5m57lh5kdd:3612:5796 [0] NCCL INFO Channel 01/02 : 0 1 3 2 5 12 14 15 8 9 11 10 13 20 22 23 16 17 19 18
iv-ybpu7pvmiu5m57lh5kdd:3612:5796 [0] NCCL INFO Trees [0] 2/-1/-1->0->7 [1] 2/-1/-1->0->7
iv-ybpu7pvmiu5m57lh5kdd:3616:5939 [4] NCCL INFO Channel 00 : 4[69010] -> 6[6b010] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3618:5848 [6] NCCL INFO Channel 00 : 6[6b010] -> 7[6b020] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3614:5964 [2] NCCL INFO Channel 00 : 2[67010] -> 5[69020] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3612:5796 [0] NCCL INFO Channel 00 : 0[65010] -> 1[65020] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3616:5939 [4] NCCL INFO Channel 01 : 4[69010] -> 6[6b010] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3618:5848 [6] NCCL INFO Channel 01 : 6[6b010] -> 7[6b020] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3614:5964 [2] NCCL INFO Channel 01 : 2[67010] -> 5[69020] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3612:5796 [0] NCCL INFO Channel 01 : 0[65010] -> 1[65020] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3619:5940 [7] NCCL INFO Channel 00 : 7[6b020] -> 0[65010] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3613:5868 [1] NCCL INFO Channel 00 : 1[65020] -> 3[67020] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3619:5940 [7] NCCL INFO Channel 01 : 7[6b020] -> 0[65010] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3613:5868 [1] NCCL INFO Channel 01 : 1[65020] -> 3[67020] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3617:5922 [5] NCCL INFO Channel 00/0 : 5[69020] -> 12[69010] [send] via NET/IBext/0/GDRDMA
iv-ybpu7pvmiu5m57lh5kdd:3618:5848 [6] NCCL INFO Connected all rings
iv-ybpu7pvmiu5m57lh5kdd:3613:5868 [1] NCCL INFO Connected all rings
iv-ybpu7pvmiu5m57lh5kdd:3613:5868 [1] NCCL INFO Channel 00 : 1[65020] -> 6[6b010] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3615:5820 [3] NCCL INFO Channel 00 : 3[67020] -> 2[67010] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3619:5940 [7] NCCL INFO Connected all rings
iv-ybpu7pvmiu5m57lh5kdd:3612:5796 [0] NCCL INFO Connected all rings
iv-ybpu7pvmiu5m57lh5kdd:3612:5796 [0] NCCL INFO Channel 00 : 0[65010] -> 2[67010] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3616:5939 [4] NCCL INFO Channel 00/0 : 29[69020] -> 4[69010] [receive] via NET/IBext/0/GDRDMA
iv-ybpu7pvmiu5m57lh5kdd:3613:5868 [1] NCCL INFO Channel 01 : 1[65020] -> 6[6b010] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3615:5820 [3] NCCL INFO Channel 01 : 3[67020] -> 2[67010] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3612:5796 [0] NCCL INFO Channel 01 : 0[65010] -> 2[67010] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3618:5848 [6] NCCL INFO Channel 00 : 6[6b010] -> 1[65020] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3615:5820 [3] NCCL INFO Connected all rings
iv-ybpu7pvmiu5m57lh5kdd:3614:5964 [2] NCCL INFO Connected all rings
iv-ybpu7pvmiu5m57lh5kdd:3614:5964 [2] NCCL INFO Channel 00 : 2[67010] -> 3[67020] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3618:5848 [6] NCCL INFO Channel 01 : 6[6b010] -> 1[65020] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3614:5964 [2] NCCL INFO Channel 01 : 2[67010] -> 3[67020] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3618:5848 [6] NCCL INFO Connected all trees
iv-ybpu7pvmiu5m57lh5kdd:3618:5848 [6] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 8/8/512
iv-ybpu7pvmiu5m57lh5kdd:3618:5848 [6] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
iv-ybpu7pvmiu5m57lh5kdd:3615:5820 [3] NCCL INFO Channel 00 : 3[67020] -> 1[65020] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3615:5820 [3] NCCL INFO Channel 01 : 3[67020] -> 1[65020] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3614:5964 [2] NCCL INFO Channel 00 : 2[67010] -> 0[65010] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3612:5796 [0] NCCL INFO Channel 00 : 0[65010] -> 7[6b020] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3615:5820 [3] NCCL INFO Connected all trees
iv-ybpu7pvmiu5m57lh5kdd:3615:5820 [3] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 8/8/512
iv-ybpu7pvmiu5m57lh5kdd:3615:5820 [3] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
iv-ybpu7pvmiu5m57lh5kdd:3615:5820 [3] NCCL INFO Channel 00 : 3[67020] -> 5[69020] via P2P/indirect/4[69010]
iv-ybpu7pvmiu5m57lh5kdd:3613:5868 [1] NCCL INFO Connected all trees
iv-ybpu7pvmiu5m57lh5kdd:3613:5868 [1] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 8/8/512
iv-ybpu7pvmiu5m57lh5kdd:3613:5868 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
iv-ybpu7pvmiu5m57lh5kdd:3613:5868 [1] NCCL INFO Channel 00 : 1[65020] -> 4[69010] via P2P/indirect/6[6b010]
iv-ybpu7pvmiu5m57lh5kdd:3614:5964 [2] NCCL INFO Channel 01 : 2[67010] -> 0[65010] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3612:5796 [0] NCCL INFO Channel 01 : 0[65010] -> 7[6b020] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3615:5820 [3] NCCL INFO Channel 01 : 3[67020] -> 5[69020] via P2P/indirect/4[69010]
iv-ybpu7pvmiu5m57lh5kdd:3613:5868 [1] NCCL INFO Channel 01 : 1[65020] -> 4[69010] via P2P/indirect/6[6b010]
iv-ybpu7pvmiu5m57lh5kdd:3617:5922 [5] NCCL INFO Channel 01/0 : 5[69020] -> 12[69010] [send] via NET/IBext/0/GDRDMA
iv-ybpu7pvmiu5m57lh5kdd:3616:5939 [4] NCCL INFO Channel 01/0 : 29[69020] -> 4[69010] [receive] via NET/IBext/0/GDRDMA
iv-ybpu7pvmiu5m57lh5kdd:3617:5922 [5] NCCL INFO Connected all rings
iv-ybpu7pvmiu5m57lh5kdd:3616:5939 [4] NCCL INFO Connected all rings
iv-ybpu7pvmiu5m57lh5kdd:3616:5939 [4] NCCL INFO Channel 00 : 4[69010] -> 5[69020] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3616:5939 [4] NCCL INFO Channel 01 : 4[69010] -> 5[69020] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3617:5922 [5] NCCL INFO Channel 00 : 5[69020] -> 7[6b020] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3617:5922 [5] NCCL INFO Channel 01 : 5[69020] -> 7[6b020] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3619:5940 [7] NCCL INFO Channel 00 : 7[6b020] -> 5[69020] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3619:5940 [7] NCCL INFO Channel 01 : 7[6b020] -> 5[69020] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3617:5922 [5] NCCL INFO Channel 00 : 5[69020] -> 4[69010] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3619:5940 [7] NCCL INFO Connected all trees
iv-ybpu7pvmiu5m57lh5kdd:3619:5940 [7] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 8/8/512
iv-ybpu7pvmiu5m57lh5kdd:3619:5940 [7] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
iv-ybpu7pvmiu5m57lh5kdd:3614:5964 [2] NCCL INFO Connected all trees
iv-ybpu7pvmiu5m57lh5kdd:3614:5964 [2] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 8/8/512
iv-ybpu7pvmiu5m57lh5kdd:3614:5964 [2] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
iv-ybpu7pvmiu5m57lh5kdd:3614:5964 [2] NCCL INFO Channel 00 : 2[67010] -> 4[69010] via P2P/indirect/5[69020]
iv-ybpu7pvmiu5m57lh5kdd:3617:5922 [5] NCCL INFO Channel 01 : 5[69020] -> 4[69010] via P2P/IPC
iv-ybpu7pvmiu5m57lh5kdd:3612:5796 [0] NCCL INFO Connected all trees
iv-ybpu7pvmiu5m57lh5kdd:3612:5796 [0] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 8/8/512
iv-ybpu7pvmiu5m57lh5kdd:3612:5796 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
iv-ybpu7pvmiu5m57lh5kdd:3612:5796 [0] NCCL INFO Channel 00 : 0[65010] -> 4[69010] via P2P/indirect/3[67020]
iv-ybpu7pvmiu5m57lh5kdd:3614:5964 [2] NCCL INFO Channel 01 : 2[67010] -> 4[69010] via P2P/indirect/5[69020]
iv-ybpu7pvmiu5m57lh5kdd:3612:5796 [0] NCCL INFO Channel 01 : 0[65010] -> 4[69010] via P2P/indirect/3[67020]
iv-ybpu7pvmiu5m57lh5kdd:3616:5939 [4] NCCL INFO Channel 01/0 : 4[69010] -> 12[69010] [send] via NET/IBext/0/GDRDMA
iv-ybpu7pvmiu5m57lh5kdd:3616:5939 [4] NCCL INFO Channel 00/0 : 20[69010] -> 4[69010] [receive] via NET/IBext/0/GDRDMA
iv-ybpu7pvmiu5m57lh5kdd:3616:5939 [4] NCCL INFO Channel 00/0 : 4[69010] -> 20[69010] [send] via NET/IBext/0/GDRDMA
iv-ybpu7pvmiu5m57lh5kdd:3616:5939 [4] NCCL INFO Channel 01/0 : 12[69010] -> 4[69010] [receive] via NET/IBext/0/GDRDMA
iv-ybpu7pvmiu5m57lh5kdd:3616:5939 [4] NCCL INFO Connected all trees
iv-ybpu7pvmiu5m57lh5kdd:3616:5939 [4] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 8/8/512
iv-ybpu7pvmiu5m57lh5kdd:3616:5939 [4] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
iv-ybpu7pvmiu5m57lh5kdd:3617:5922 [5] NCCL INFO Connected all trees
iv-ybpu7pvmiu5m57lh5kdd:3617:5922 [5] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 8/8/512
iv-ybpu7pvmiu5m57lh5kdd:3617:5922 [5] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
iv-ybpu7pvmiu5m57lh5kdd:3615:5820 [3] NCCL INFO Channel 00 : 3[67020] -> 6[6b010] via P2P/indirect/1[65020]
iv-ybpu7pvmiu5m57lh5kdd:3615:5820 [3] NCCL INFO Channel 01 : 3[67020] -> 6[6b010] via P2P/indirect/1[65020]
iv-ybpu7pvmiu5m57lh5kdd:3614:5964 [2] NCCL INFO Channel 00 : 2[67010] -> 6[6b010] via P2P/indirect/1[65020]
iv-ybpu7pvmiu5m57lh5kdd:3615:5820 [3] NCCL INFO Channel 00 : 3[67020] -> 7[6b020] via P2P/indirect/0[65010]
iv-ybpu7pvmiu5m57lh5kdd:3615:5820 [3] NCCL INFO Channel 01 : 3[67020] -> 7[6b020] via P2P/indirect/0[65010]
iv-ybpu7pvmiu5m57lh5kdd:3614:5964 [2] NCCL INFO Channel 01 : 2[67010] -> 6[6b010] via P2P/indirect/1[65020]
iv-ybpu7pvmiu5m57lh5kdd:3613:5868 [1] NCCL INFO Channel 00 : 1[65020] -> 5[69020] via P2P/indirect/6[6b010]
iv-ybpu7pvmiu5m57lh5kdd:3614:5964 [2] NCCL INFO Channel 00 : 2[67010] -> 7[6b020] via P2P/indirect/0[65010]
iv-ybpu7pvmiu5m57lh5kdd:3613:5868 [1] NCCL INFO Channel 01 : 1[65020] -> 5[69020] via P2P/indirect/6[6b010]
iv-ybpu7pvmiu5m57lh5kdd:3614:5964 [2] NCCL INFO Channel 01 : 2[67010] -> 7[6b020] via P2P/indirect/0[65010]
iv-ybpu7pvmiu5m57lh5kdd:3616:5939 [4] NCCL INFO Channel 00 : 4[69010] -> 0[65010] via P2P/indirect/7[6b020]
iv-ybpu7pvmiu5m57lh5kdd:3613:5868 [1] NCCL INFO Channel 00 : 1[65020] -> 7[6b020] via P2P/indirect/0[65010]
iv-ybpu7pvmiu5m57lh5kdd:3612:5796 [0] NCCL INFO Channel 00 : 0[65010] -> 5[69020] via P2P/indirect/7[6b020]
iv-ybpu7pvmiu5m57lh5kdd:3616:5939 [4] NCCL INFO Channel 01 : 4[69010] -> 0[65010] via P2P/indirect/7[6b020]
iv-ybpu7pvmiu5m57lh5kdd:3613:5868 [1] NCCL INFO Channel 01 : 1[65020] -> 7[6b020] via P2P/indirect/0[65010]
iv-ybpu7pvmiu5m57lh5kdd:3612:5796 [0] NCCL INFO Channel 01 : 0[65010] -> 5[69020] via P2P/indirect/7[6b020]
iv-ybpu7pvmiu5m57lh5kdd:3619:5940 [7] NCCL INFO Channel 00 : 7[6b020] -> 1[65020] via P2P/indirect/6[6b010]
iv-ybpu7pvmiu5m57lh5kdd:3617:5922 [5] NCCL INFO Channel 00 : 5[69020] -> 0[65010] via P2P/indirect/7[6b020]
iv-ybpu7pvmiu5m57lh5kdd:3612:5796 [0] NCCL INFO Channel 00 : 0[65010] -> 6[6b010] via P2P/indirect/1[65020]
iv-ybpu7pvmiu5m57lh5kdd:3619:5940 [7] NCCL INFO Channel 01 : 7[6b020] -> 1[65020] via P2P/indirect/6[6b010]
iv-ybpu7pvmiu5m57lh5kdd:3617:5922 [5] NCCL INFO Channel 01 : 5[69020] -> 0[65010] via P2P/indirect/7[6b020]
iv-ybpu7pvmiu5m57lh5kdd:3612:5796 [0] NCCL INFO Channel 01 : 0[65010] -> 6[6b010] via P2P/indirect/1[65020]
iv-ybpu7pvmiu5m57lh5kdd:3619:5940 [7] NCCL INFO Channel 00 : 7[6b020] -> 2[67010] via P2P/indirect/0[65010]
iv-ybpu7pvmiu5m57lh5kdd:3618:5848 [6] NCCL INFO Channel 00 : 6[6b010] -> 0[65010] via P2P/indirect/7[6b020]
iv-ybpu7pvmiu5m57lh5kdd:3619:5940 [7] NCCL INFO Channel 01 : 7[6b020] -> 2[67010] via P2P/indirect/0[65010]
iv-ybpu7pvmiu5m57lh5kdd:3618:5848 [6] NCCL INFO Channel 01 : 6[6b010] -> 0[65010] via P2P/indirect/7[6b020]
iv-ybpu7pvmiu5m57lh5kdd:3619:5940 [7] NCCL INFO Channel 00 : 7[6b020] -> 3[67020] via P2P/indirect/0[65010]
iv-ybpu7pvmiu5m57lh5kdd:3619:5940 [7] NCCL INFO Channel 01 : 7[6b020] -> 3[67020] via P2P/indirect/0[65010]
iv-ybpu7pvmiu5m57lh5kdd:3618:5848 [6] NCCL INFO Channel 00 : 6[6b010] -> 2[67010] via P2P/indirect/5[69020]
iv-ybpu7pvmiu5m57lh5kdd:3618:5848 [6] NCCL INFO Channel 01 : 6[6b010] -> 2[67010] via P2P/indirect/5[69020]
iv-ybpu7pvmiu5m57lh5kdd:3618:5848 [6] NCCL INFO Channel 00 : 6[6b010] -> 3[67020] via P2P/indirect/1[65020]
iv-ybpu7pvmiu5m57lh5kdd:3618:5848 [6] NCCL INFO Channel 01 : 6[6b010] -> 3[67020] via P2P/indirect/1[65020]
iv-ybpu7pvmiu5m57lh5kdd:3617:5922 [5] NCCL INFO Channel 00 : 5[69020] -> 1[65020] via P2P/indirect/6[6b010]
iv-ybpu7pvmiu5m57lh5kdd:3617:5922 [5] NCCL INFO Channel 01 : 5[69020] -> 1[65020] via P2P/indirect/6[6b010]
iv-ybpu7pvmiu5m57lh5kdd:3616:5939 [4] NCCL INFO Channel 00 : 4[69010] -> 1[65020] via P2P/indirect/6[6b010]
iv-ybpu7pvmiu5m57lh5kdd:3617:5922 [5] NCCL INFO Channel 00 : 5[69020] -> 3[67020] via P2P/indirect/2[67010]
iv-ybpu7pvmiu5m57lh5kdd:3617:5922 [5] NCCL INFO Channel 01 : 5[69020] -> 3[67020] via P2P/indirect/2[67010]
iv-ybpu7pvmiu5m57lh5kdd:3616:5939 [4] NCCL INFO Channel 01 : 4[69010] -> 1[65020] via P2P/indirect/6[6b010]
iv-ybpu7pvmiu5m57lh5kdd:3616:5939 [4] NCCL INFO Channel 00 : 4[69010] -> 2[67010] via P2P/indirect/3[67020]
iv-ybpu7pvmiu5m57lh5kdd:3616:5939 [4] NCCL INFO Channel 01 : 4[69010] -> 2[67010] via P2P/indirect/3[67020]
iv-ybpu7pvmiu5m57lh5kdd:3619:5940 [7] NCCL INFO comm 0x7f84f40196e0 rank 7 nranks 32 cudaDev 7 busId 6b020 - Init COMPLETE
iv-ybpu7pvmiu5m57lh5kdd:3618:5848 [6] NCCL INFO comm 0x7f26740196e0 rank 6 nranks 32 cudaDev 6 busId 6b010 - Init COMPLETE
iv-ybpu7pvmiu5m57lh5kdd:3615:5820 [3] NCCL INFO comm 0x7f6c100196e0 rank 3 nranks 32 cudaDev 3 busId 67020 - Init COMPLETE
iv-ybpu7pvmiu5m57lh5kdd:3616:5939 [4] NCCL INFO comm 0x7f40780196e0 rank 4 nranks 32 cudaDev 4 busId 69010 - Init COMPLETE
iv-ybpu7pvmiu5m57lh5kdd:3617:5922 [5] NCCL INFO comm 0x7f88c40196e0 rank 5 nranks 32 cudaDev 5 busId 69020 - Init COMPLETE
iv-ybpu7pvmiu5m57lh5kdd:3614:5964 [2] NCCL INFO comm 0x7fd0cc0196e0 rank 2 nranks 32 cudaDev 2 busId 67010 - Init COMPLETE
iv-ybpu7pvmiu5m57lh5kdd:3613:5868 [1] NCCL INFO comm 0x7fae000196e0 rank 1 nranks 32 cudaDev 1 busId 65020 - Init COMPLETE
iv-ybpu7pvmiu5m57lh5kdd:3612:5796 [0] NCCL INFO comm 0x7f487c58d5a0 rank 0 nranks 32 cudaDev 0 busId 65010 - Init COMPLETE
iv-ybpu7pvmiu5m57lh5kdd:3612:5768 [0] NCCL INFO Launch mode Parallel
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2022/07/06 14:09:01.879, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 61 %, 32510 MiB, 11670 MiB, 20840 MiB
2022/07/06 14:09:01.882, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 50 %, 32510 MiB, 11686 MiB, 20824 MiB
2022/07/06 14:09:01.893, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 48 %, 32510 MiB, 11862 MiB, 20648 MiB
2022/07/06 14:09:01.897, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 60 %, 32510 MiB, 11822 MiB, 20688 MiB
2022/07/06 14:09:01.899, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 51 %, 32510 MiB, 11790 MiB, 20720 MiB
2022/07/06 14:09:01.900, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 63 %, 32510 MiB, 11742 MiB, 20768 MiB
2022/07/06 14:09:01.901, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 62 %, 32510 MiB, 11678 MiB, 20832 MiB
2022/07/06 14:09:01.901, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 63 %, 32510 MiB, 11686 MiB, 20824 MiB
2022/07/06 14:09:01.902, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 61 %, 32510 MiB, 11670 MiB, 20840 MiB
2022/07/06 14:09:01.904, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 43 %, 32510 MiB, 11670 MiB, 20840 MiB
2022/07/06 14:09:01.905, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 50 %, 32510 MiB, 11686 MiB, 20824 MiB
2022/07/06 14:09:01.915, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 50 %, 32510 MiB, 11686 MiB, 20824 MiB
2022/07/06 14:09:01.915, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 48 %, 32510 MiB, 11862 MiB, 20648 MiB
2022/07/06 14:09:01.916, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 43 %, 32510 MiB, 11670 MiB, 20840 MiB
2022/07/06 14:09:01.918, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 48 %, 32510 MiB, 11862 MiB, 20648 MiB
2022/07/06 14:09:01.918, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 60 %, 32510 MiB, 11822 MiB, 20688 MiB
2022/07/06 14:09:01.919, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 50 %, 32510 MiB, 11686 MiB, 20824 MiB
2022/07/06 14:09:01.923, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 43 %, 32510 MiB, 11670 MiB, 20840 MiB
2022/07/06 14:09:01.923, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 43 %, 32510 MiB, 11670 MiB, 20840 MiB
2022/07/06 14:09:01.923, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 43 %, 32510 MiB, 11670 MiB, 20840 MiB
2022/07/06 14:09:01.924, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 60 %, 32510 MiB, 11822 MiB, 20688 MiB
2022/07/06 14:09:01.924, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 51 %, 32510 MiB, 11790 MiB, 20720 MiB
2022/07/06 14:09:01.929, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 48 %, 32510 MiB, 11862 MiB, 20648 MiB
2022/07/06 14:09:01.931, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 50 %, 32510 MiB, 11686 MiB, 20824 MiB
2022/07/06 14:09:01.931, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 50 %, 32510 MiB, 11686 MiB, 20824 MiB
2022/07/06 14:09:01.932, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 50 %, 32510 MiB, 11686 MiB, 20824 MiB
2022/07/06 14:09:01.932, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 51 %, 32510 MiB, 11790 MiB, 20720 MiB
2022/07/06 14:09:01.933, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 63 %, 32510 MiB, 11742 MiB, 20768 MiB
2022/07/06 14:09:01.935, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 60 %, 32510 MiB, 11822 MiB, 20688 MiB
2022/07/06 14:09:01.936, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 48 %, 32510 MiB, 11862 MiB, 20648 MiB
2022/07/06 14:09:01.937, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 48 %, 32510 MiB, 11862 MiB, 20648 MiB
2022/07/06 14:09:01.937, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 48 %, 32510 MiB, 11862 MiB, 20648 MiB
2022/07/06 14:09:01.938, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 63 %, 32510 MiB, 11742 MiB, 20768 MiB
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2022/07/06 14:09:02.175, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 41 %, 32510 MiB, 11670 MiB, 20840 MiB
2022/07/06 14:09:02.177, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 32 %, 32510 MiB, 11686 MiB, 20824 MiB
2022/07/06 14:09:02.178, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 36 %, 32510 MiB, 11862 MiB, 20648 MiB
2022/07/06 14:09:02.180, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 22 %, 32510 MiB, 11822 MiB, 20688 MiB
2022/07/06 14:09:02.180, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 30 %, 32510 MiB, 11790 MiB, 20720 MiB
2022/07/06 14:09:02.181, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 62 %, 32510 MiB, 11742 MiB, 20768 MiB
2022/07/06 14:09:02.182, Tesla V100-SXM2-32GB, 470.57.02, 99 %, 56 %, 32510 MiB, 11678 MiB, 20832 MiB
2022/07/06 14:09:02.183, Tesla V100-SXM2-32GB, 470.57.02, 100 %, 68 %, 32510 MiB, 11686 MiB, 20824 MiB
[07/06 14:09:02 lb.utils.events]:  eta: 0:00:45  iteration: 99/220  consumed_samples: 12800  total_loss: 7.272  time: 0.3837 s/iter  data_time: 0.0198 s/iter  total_throughput: 333.58 samples/s  lr: 8.74e-05
[07/06 14:09:41 lb.utils.events]:  eta: 0:00:07  iteration: 199/220  consumed_samples: 25600  total_loss: 6.956  time: 0.3852 s/iter  data_time: 0.0158 s/iter  total_throughput: 332.27 samples/s  lr: 4.81e-06
[07/06 14:09:49 lb.utils.events]:  eta: 0:00:00  iteration: 219/220  consumed_samples: 28160  total_loss: 6.703  time: 0.3884 s/iter  data_time: 0.0624 s/iter  total_throughput: 329.58 samples/s  lr: 1.51e-06
[07/06 14:09:49 lb.engine.hooks]: Overall training speed: 218 iterations in 0:01:24 (0.3884 s / it)
[07/06 14:09:49 lb.engine.hooks]: Total training time: 0:01:24 (0:00:00 on hooks)
iv-ybpu7pvmiu5m57lh5kdd:3616:3616 [4] NCCL INFO comm 0x7f40780196e0 rank 4 nranks 32 cudaDev 4 busId 69010 - Destroy COMPLETE
iv-ybpu7pvmiu5m57lh5kdd:3615:3615 [3] NCCL INFO comm 0x7f6c100196e0 rank 3 nranks 32 cudaDev 3 busId 67020 - Destroy COMPLETE
iv-ybpu7pvmiu5m57lh5kdd:3619:3619 [7] NCCL INFO comm 0x7f84f40196e0 rank 7 nranks 32 cudaDev 7 busId 6b020 - Destroy COMPLETE
iv-ybpu7pvmiu5m57lh5kdd:3617:3617 [5] NCCL INFO comm 0x7f88c40196e0 rank 5 nranks 32 cudaDev 5 busId 69020 - Destroy COMPLETE
iv-ybpu7pvmiu5m57lh5kdd:3614:3614 [2] NCCL INFO comm 0x7fd0cc0196e0 rank 2 nranks 32 cudaDev 2 busId 67010 - Destroy COMPLETE
iv-ybpu7pvmiu5m57lh5kdd:3618:3618 [6] NCCL INFO comm 0x7f26740196e0 rank 6 nranks 32 cudaDev 6 busId 6b010 - Destroy COMPLETE
iv-ybpu7pvmiu5m57lh5kdd:3612:3612 [0] NCCL INFO comm 0x7f487c58d5a0 rank 0 nranks 32 cudaDev 0 busId 65010 - Destroy COMPLETE
iv-ybpu7pvmiu5m57lh5kdd:3613:3613 [1] NCCL INFO comm 0x7fae000196e0 rank 1 nranks 32 cudaDev 1 busId 65020 - Destroy COMPLETE
iv-ybpu7pvmiu5m57lh5kdd:3615:3615 [3] NCCL INFO comm 0x7f71247c4810 rank 3 nranks 32 cudaDev 3 busId 67020 - Destroy COMPLETE
iv-ybpu7pvmiu5m57lh5kdd:3614:3614 [2] NCCL INFO comm 0x7fd5e9019d00 rank 2 nranks 32 cudaDev 2 busId 67010 - Destroy COMPLETE
iv-ybpu7pvmiu5m57lh5kdd:3617:3617 [5] NCCL INFO comm 0x7f8de93060e0 rank 5 nranks 32 cudaDev 5 busId 69020 - Destroy COMPLETE
iv-ybpu7pvmiu5m57lh5kdd:3616:3616 [4] NCCL INFO comm 0x7f45a56795d0 rank 4 nranks 32 cudaDev 4 busId 69010 - Destroy COMPLETE
iv-ybpu7pvmiu5m57lh5kdd:3613:3613 [1] NCCL INFO comm 0x7fb32b8c0080 rank 1 nranks 32 cudaDev 1 busId 65020 - Destroy COMPLETE
iv-ybpu7pvmiu5m57lh5kdd:3619:3619 [7] NCCL INFO comm 0x7f8a154ac8d0 rank 7 nranks 32 cudaDev 7 busId 6b020 - Destroy COMPLETE
iv-ybpu7pvmiu5m57lh5kdd:3618:3618 [6] NCCL INFO comm 0x7f2b94d2f730 rank 6 nranks 32 cudaDev 6 busId 6b010 - Destroy COMPLETE
iv-ybpu7pvmiu5m57lh5kdd:3612:3612 [0] NCCL INFO comm 0x7f4a34c4aa40 rank 0 nranks 32 cudaDev 0 busId 65010 - Destroy COMPLETE
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
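The throughput figures reported by `lb.utils.events` are consistent with the run configuration: with `train.global_batch_size=128` (from the command-line arguments logged above), samples/s is just the global batch size divided by the per-iteration time, to within rounding. A minimal sketch to check this (the values are copied from the event lines above; nothing here is part of LibAI itself):

```python
# Cross-check reported throughput against global_batch_size / (s/iter).
# GLOBAL_BATCH_SIZE matches train.global_batch_size=128 in the logged args.
GLOBAL_BATCH_SIZE = 128

# (time in s/iter, reported total_throughput in samples/s) per event line
events = [
    (0.3837, 333.58),
    (0.3852, 332.27),
    (0.3884, 329.58),
]

for s_per_iter, reported in events:
    derived = GLOBAL_BATCH_SIZE / s_per_iter
    print(f"{s_per_iter:.4f} s/iter -> {derived:.2f} samples/s (log reports {reported})")
```

The small discrepancies (a few hundredths of a sample/s) come from the logger averaging times at higher precision than the printed four decimal places.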