[03/05 16:29:28 libai]: Rank of current process: 0. World size: 8
[03/05 16:29:28 libai]: Command line arguments: Namespace(config_file='configs/bert_large_pretrain.py', eval_only=False, fast_dev_run=False, opts=['model.cfg.hidden_dropout_prob=0.1', 'model.cfg.attention_probs_dropout_prob=0.1', 'model.cfg.bias_dropout_fusion=true', 'model.cfg.hidden_layers=24', 'model.cfg.hidden_size=1024', 'model.cfg.num_attention_heads=16', 'model.cfg.intermediate_size=4096', 'model.cfg.ffn_hidden_size=4096', 'model.cfg.head_size=64', 'graph.enabled=true', 'train.dist.pipeline_num_layers=24', 'train.train_micro_batch_size=32', 'train.global_batch_size=512', 'train.dist.tensor_parallel_size=2', 'train.dist.pipeline_parallel_size=2', 'train.amp.enabled=true', 'train.activation_checkpoint.enabled=true', 'train.num_accumulation_steps=8', 'train.evaluation.enabled=false', 'train.train_iter=220', 'train.train_epoch=0', 'train.log_period=100', 'train.zero_optimization.enabled=true', 'train.zero_optimization.stage=2', 'train.load_weight=', 'train.output_dir=test_logs/oneflow-28/NVIDIA_GeForce_RTX_3080_Ti/7d07caf/LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP2_MP2_PP2_zerotrue_stage2_mbs32_gbs512_acc8_1n8g'], resume=False)
[03/05 16:29:28 libai]: Contents of args.config_file=configs/bert_large_pretrain.py:
from libai.config import LazyCall
from libai.evaluation import PPLEvaluator
from .common.models.bert import pretrain_model as model
from .common.models.graph import graph
from .common.train import train
from .common.optim import optim
from .common.data.bert_dataset import dataloader, tokenization

vocab_file = "./data_test/bert_data/bert-base-chinese-vocab.txt"
data_prefix = "./data_test/bert_data/loss_compara_content_sentence"

tokenization.tokenizer.vocab_file = vocab_file
dataloader.train.dataset[0].data_prefix = data_prefix
dataloader.train.dataset[0].indexed_dataset.data_prefix = data_prefix

# Bert-large model config
model.cfg.num_attention_heads = 16
model.cfg.hidden_size = 768
model.cfg.hidden_layers = 8

train.input_placement_device = "cpu"
train.dist.pipeline_num_layers = model.cfg.hidden_layers
train.train_micro_batch_size = 16
train.amp.enabled = True

for ds in dataloader.train.dataset:
    ds.max_seq_length = model.cfg.max_position_embeddings

train.evaluation.evaluator = LazyCall(PPLEvaluator)()

train.output_dir = "output/bert_output"
[03/05 16:29:28 libai]: Full config saved to test_logs/oneflow-28/NVIDIA_GeForce_RTX_3080_Ti/7d07caf/LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP2_MP2_PP2_zerotrue_stage2_mbs32_gbs512_acc8_1n8g/config.yaml
[03/05 16:29:28 lb.engine.default]: > compiling dataset index builder ...
make: Entering directory '/ssd/home/ouyangyu/libai_week_test/libai/libai/data/data_utils'
make: Nothing to be done for 'default'.
make: Leaving directory '/ssd/home/ouyangyu/libai_week_test/libai/libai/data/data_utils'
[03/05 16:29:28 lb.engine.default]: >>> done with dataset index builder. Compilation time: 0.054 seconds
[03/05 16:29:28 lb.engine.default]: >>> done with compiling. Compilation time: 0.056 seconds
[03/05 16:29:28 lb.engine.default]: Prepare training, validating, testing set
[03/05 16:29:28 lb.data.data_utils.indexed_dataset]: building dataset index ...
[03/05 16:29:28 lb.data.data_utils.indexed_dataset]: warming up index mmap file...
[03/05 16:29:28 lb.data.data_utils.indexed_dataset]: reading sizes...
[03/05 16:29:28 lb.data.data_utils.indexed_dataset]: reading pointers...
[03/05 16:29:28 lb.data.data_utils.indexed_dataset]: reading document index...
[03/05 16:29:28 lb.data.data_utils.indexed_dataset]: warming up data mmap file...
[03/05 16:29:28 lb.data.data_utils.indexed_dataset]: creating numpy buffer of mmap...
[03/05 16:29:28 lb.data.data_utils.indexed_dataset]: creating memory view of numpy buffer...
[03/05 16:29:28 lb.data.data_utils.indexed_dataset]: Finished creating indexed dataset in 0.074312 seconds
[03/05 16:29:28 lb.data.data_utils.indexed_dataset]: indexed dataset stats:
[03/05 16:29:28 lb.data.data_utils.indexed_dataset]: number of documents: 50000
[03/05 16:29:28 lb.data.data_utils.indexed_dataset]: number of sentences: 1249934
[03/05 16:29:28 lb.data.data_utils.dataset_utils]: > loading indexed mapping from ./data_test/bert_data/loss_compara_content_sentence_bert_indexmap_112640mns_509msl_0.10ssp_1234s.npy
[03/05 16:29:28 lb.data.data_utils.dataset_utils]: loaded indexed file in 0.003 seconds
[03/05 16:29:28 lb.data.data_utils.dataset_utils]: total number of samples: 113036
[03/05 16:29:28 lb.data.data_utils.dataset_utils]: > loading indexed mapping from ./data_test/bert_data/loss_compara_content_sentence_bert_indexmap_64mns_509msl_0.10ssp_1234s.npy
[03/05 16:29:28 lb.data.data_utils.dataset_utils]: loaded indexed file in 0.000 seconds
[03/05 16:29:28 lb.data.data_utils.dataset_utils]: total number of samples: 5884
[03/05 16:29:28 lb.data.data_utils.dataset_utils]: > loading indexed mapping from ./data_test/bert_data/loss_compara_content_sentence_bert_indexmap_64mns_509msl_0.10ssp_1234s.npy
[03/05 16:29:28 lb.data.data_utils.dataset_utils]: loaded indexed file in 0.000 seconds
[03/05 16:29:28 lb.data.data_utils.dataset_utils]: total number of samples: 5884
[03/05 16:29:37 lb.engine.default]: Auto-scaling the config to train.train_iter=220, train.warmup_iter=0
[03/05 16:29:37 libai]: > Start building model...
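Note: the parallel layout, global batch size, and the 112640 samples requested by the train index map above can be cross-checked from the command-line overrides. The snippet below is an illustrative sketch (plain Python arithmetic, not LibAI code; variable names are ours):

    # 1 node x 8 GPUs split into tensor-/pipeline-/data-parallel groups.
    world_size = 8
    tensor_parallel_size = 2        # train.dist.tensor_parallel_size=2
    pipeline_parallel_size = 2      # train.dist.pipeline_parallel_size=2
    data_parallel_size = world_size // (tensor_parallel_size * pipeline_parallel_size)  # -> 2 (the "DP2" in the output dir)

    micro_batch_size = 32           # train.train_micro_batch_size=32
    num_accumulation_steps = 8      # train.num_accumulation_steps=8
    global_batch_size = micro_batch_size * data_parallel_size * num_accumulation_steps
    assert global_batch_size == 512  # matches train.global_batch_size=512

    train_iter = 220                # train.train_iter=220
    assert train_iter * global_batch_size == 112640  # matches "..._112640mns_..." and the final consumed_samples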
[03/05 16:29:39 lb.engine.default]: Model: BertForPreTraining( (bert): BertModel( (embeddings): BertEmbeddings( (vocab_embeddings): VocabEmbedding(num_embeddings=21248, embedding_dim=1024) (position_embeddings): Embedding(num_embeddings=512, embedding_dim=1024) (tokentype_embeddings): Embedding(num_embeddings=2, embedding_dim=1024) (embedding_dropout): Dropout(p=0.1, inplace=False) ) (extended_attn_mask): BertExtendedAttnMask() (encoders): ModuleList( (0): TransformerLayer( (drop_path): Identity() (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (self_attention): MultiheadAttention( hidden_size=1024, num_heads=16, is_cross_attention=False (dropout): Dropout(p=0.1, inplace=False) (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col) (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row) ) (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (mlp): MLP( bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.1 (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col) (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row) ) ) (1): TransformerLayer( (drop_path): Identity() (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (self_attention): MultiheadAttention( hidden_size=1024, num_heads=16, is_cross_attention=False (dropout): Dropout(p=0.1, inplace=False) (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col) (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row) ) (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (mlp): MLP( bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.1 (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col) (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row) ) ) (2): TransformerLayer( (drop_path): Identity() (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (self_attention): MultiheadAttention( hidden_size=1024, num_heads=16, is_cross_attention=False (dropout): Dropout(p=0.1, inplace=False) (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col) (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row) ) (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (mlp): MLP( bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.1 (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col) (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row) ) ) (3): TransformerLayer( (drop_path): Identity() (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (self_attention): MultiheadAttention( hidden_size=1024, num_heads=16, is_cross_attention=False (dropout): Dropout(p=0.1, inplace=False) (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col) (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row) ) (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (mlp): MLP( bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.1 (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col) (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row) ) ) (4): TransformerLayer( (drop_path): 
Identity() (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (self_attention): MultiheadAttention( hidden_size=1024, num_heads=16, is_cross_attention=False (dropout): Dropout(p=0.1, inplace=False) (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col) (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row) ) (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (mlp): MLP( bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.1 (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col) (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row) ) ) (5): TransformerLayer( (drop_path): Identity() (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (self_attention): MultiheadAttention( hidden_size=1024, num_heads=16, is_cross_attention=False (dropout): Dropout(p=0.1, inplace=False) (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col) (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row) ) (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (mlp): MLP( bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.1 (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col) (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row) ) ) (6): TransformerLayer( (drop_path): Identity() (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (self_attention): MultiheadAttention( hidden_size=1024, num_heads=16, is_cross_attention=False (dropout): Dropout(p=0.1, inplace=False) (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col) (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row) ) (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (mlp): MLP( bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.1 (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col) (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row) ) ) (7): TransformerLayer( (drop_path): Identity() (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (self_attention): MultiheadAttention( hidden_size=1024, num_heads=16, is_cross_attention=False (dropout): Dropout(p=0.1, inplace=False) (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col) (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row) ) (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (mlp): MLP( bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.1 (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col) (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row) ) ) (8): TransformerLayer( (drop_path): Identity() (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (self_attention): MultiheadAttention( hidden_size=1024, num_heads=16, is_cross_attention=False (dropout): Dropout(p=0.1, inplace=False) (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col) (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row) ) (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (mlp): MLP( 
bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.1 (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col) (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row) ) ) (9): TransformerLayer( (drop_path): Identity() (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (self_attention): MultiheadAttention( hidden_size=1024, num_heads=16, is_cross_attention=False (dropout): Dropout(p=0.1, inplace=False) (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col) (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row) ) (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (mlp): MLP( bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.1 (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col) (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row) ) ) (10): TransformerLayer( (drop_path): Identity() (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (self_attention): MultiheadAttention( hidden_size=1024, num_heads=16, is_cross_attention=False (dropout): Dropout(p=0.1, inplace=False) (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col) (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row) ) (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (mlp): MLP( bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.1 (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col) (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row) ) ) (11): TransformerLayer( (drop_path): Identity() (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (self_attention): MultiheadAttention( hidden_size=1024, num_heads=16, is_cross_attention=False (dropout): Dropout(p=0.1, inplace=False) (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col) (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row) ) (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (mlp): MLP( bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.1 (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col) (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row) ) ) (12): TransformerLayer( (drop_path): Identity() (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (self_attention): MultiheadAttention( hidden_size=1024, num_heads=16, is_cross_attention=False (dropout): Dropout(p=0.1, inplace=False) (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col) (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row) ) (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (mlp): MLP( bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.1 (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col) (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row) ) ) (13): TransformerLayer( (drop_path): Identity() (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (self_attention): MultiheadAttention( hidden_size=1024, num_heads=16, is_cross_attention=False (dropout): Dropout(p=0.1, 
inplace=False) (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col) (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row) ) (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (mlp): MLP( bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.1 (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col) (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row) ) ) (14): TransformerLayer( (drop_path): Identity() (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (self_attention): MultiheadAttention( hidden_size=1024, num_heads=16, is_cross_attention=False (dropout): Dropout(p=0.1, inplace=False) (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col) (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row) ) (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (mlp): MLP( bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.1 (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col) (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row) ) ) (15): TransformerLayer( (drop_path): Identity() (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (self_attention): MultiheadAttention( hidden_size=1024, num_heads=16, is_cross_attention=False (dropout): Dropout(p=0.1, inplace=False) (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col) (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row) ) (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (mlp): MLP( bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.1 (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col) (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row) ) ) (16): TransformerLayer( (drop_path): Identity() (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (self_attention): MultiheadAttention( hidden_size=1024, num_heads=16, is_cross_attention=False (dropout): Dropout(p=0.1, inplace=False) (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col) (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row) ) (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (mlp): MLP( bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.1 (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col) (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row) ) ) (17): TransformerLayer( (drop_path): Identity() (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (self_attention): MultiheadAttention( hidden_size=1024, num_heads=16, is_cross_attention=False (dropout): Dropout(p=0.1, inplace=False) (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col) (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row) ) (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (mlp): MLP( bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.1 (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col) (dense_4h_to_h): Linear1D(in_features=4096, 
out_features=1024, bias=True, parallel=row) ) ) (18): TransformerLayer( (drop_path): Identity() (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (self_attention): MultiheadAttention( hidden_size=1024, num_heads=16, is_cross_attention=False (dropout): Dropout(p=0.1, inplace=False) (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col) (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row) ) (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (mlp): MLP( bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.1 (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col) (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row) ) ) (19): TransformerLayer( (drop_path): Identity() (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (self_attention): MultiheadAttention( hidden_size=1024, num_heads=16, is_cross_attention=False (dropout): Dropout(p=0.1, inplace=False) (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col) (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row) ) (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (mlp): MLP( bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.1 (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col) (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row) ) ) (20): TransformerLayer( (drop_path): Identity() (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (self_attention): MultiheadAttention( hidden_size=1024, num_heads=16, is_cross_attention=False (dropout): Dropout(p=0.1, inplace=False) (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col) (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row) ) (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (mlp): MLP( bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.1 (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col) (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row) ) ) (21): TransformerLayer( (drop_path): Identity() (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (self_attention): MultiheadAttention( hidden_size=1024, num_heads=16, is_cross_attention=False (dropout): Dropout(p=0.1, inplace=False) (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col) (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row) ) (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (mlp): MLP( bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.1 (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col) (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row) ) ) (22): TransformerLayer( (drop_path): Identity() (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (self_attention): MultiheadAttention( hidden_size=1024, num_heads=16, is_cross_attention=False (dropout): Dropout(p=0.1, inplace=False) (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col) (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row) ) 
(post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (mlp): MLP( bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.1 (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col) (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row) ) ) (23): TransformerLayer( (drop_path): Identity() (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (self_attention): MultiheadAttention( hidden_size=1024, num_heads=16, is_cross_attention=False (dropout): Dropout(p=0.1, inplace=False) (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col) (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row) ) (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (mlp): MLP( bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.1 (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col) (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row) ) ) ) (final_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (pooler): BertPooler( (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=col) (activation_func): Tanh() ) ) (cls_head): BertPreTrainingHeads( (predictions): BertLMPredictionHead( (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=data) (activation_func): GELU() (layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) ) (seq_relationship): Linear1D(in_features=1024, out_features=2, bias=True, parallel=data) (lm_logits): LMLogits() (loss_func): BertLoss( (lm_loss): ParallelCrossEntropyLoss() ) ) )
[03/05 16:29:39 libai]: >>> done with building model. Building time: 2.015 seconds
WARNING [03/05 16:29:39 lb.scheduler.lr_scheduler]: warmup iters equals to zero, return CosineLR
[03/05 16:29:39 lb.engine.trainer]: Starting training from iteration 0
W20230305 16:29:39.919983 1905438 eager_local_op_interpreter.cpp:256] Casting a local tensor to a global tensor with Broadcast sbp will modify the data of input! If you want to keep the input local tensor unchanged, please set the arg copy to True.
[03/05 16:29:40 lb.models.utils.graph_base]: Start compiling the train graph which may take some time. Please wait for a moment ...
W20230305 16:30:06.501688 1905439 insert_nccl_logical_op_pass.cpp:1150] In Graph: GraphBase_0 Placement: cuda-@0:0-@1:1-@2:2-@3:3 the total_op_num = 1115 and has 2 different nccl stream which is possible to trigger cuda stream kernel launch upper limit. So the nccl logical kernel will from async to sync exec, which may affect performance.
W20230305 16:30:06.547113 1905443 insert_nccl_logical_op_pass.cpp:1150] In Graph: GraphBase_0 Placement: cuda-@0:0-@1:1-@2:2-@3:3 the total_op_num = 1115 and has 2 different nccl stream which is possible to trigger cuda stream kernel launch upper limit. So the nccl logical kernel will from async to sync exec, which may affect performance.
W20230305 16:30:06.777287 1905442 insert_nccl_logical_op_pass.cpp:1150] In Graph: GraphBase_0 Placement: cuda-@0:0-@1:1-@2:2-@3:3 the total_op_num = 1115 and has 2 different nccl stream which is possible to trigger cuda stream kernel launch upper limit. So the nccl logical kernel will from async to sync exec, which may affect performance.
W20230305 16:30:06.847115 1905441 insert_nccl_logical_op_pass.cpp:1150] In Graph: GraphBase_0 Placement: cuda-@0:0-@1:1-@2:2-@3:3 the total_op_num = 1115 and has 2 different nccl stream which is possible to trigger cuda stream kernel launch upper limit. So the nccl logical kernel will from async to sync exec, which may affect performance.
W20230305 16:30:07.013103 1905440 insert_nccl_logical_op_pass.cpp:1150] In Graph: GraphBase_0 Placement: cuda-@0:0-@1:1-@2:2-@3:3 the total_op_num = 1115 and has 2 different nccl stream which is possible to trigger cuda stream kernel launch upper limit. So the nccl logical kernel will from async to sync exec, which may affect performance.
W20230305 16:30:07.289845 1905438 insert_nccl_logical_op_pass.cpp:1150] In Graph: GraphBase_0 Placement: cuda-@0:0-@1:1-@2:2-@3:3 the total_op_num = 1115 and has 2 different nccl stream which is possible to trigger cuda stream kernel launch upper limit. So the nccl logical kernel will from async to sync exec, which may affect performance.
W20230305 16:30:07.483566 1905447 insert_nccl_logical_op_pass.cpp:1150] In Graph: GraphBase_0 Placement: cuda-@0:0-@1:1-@2:2-@3:3 the total_op_num = 1115 and has 2 different nccl stream which is possible to trigger cuda stream kernel launch upper limit. So the nccl logical kernel will from async to sync exec, which may affect performance.
W20230305 16:30:07.489763 1905445 insert_nccl_logical_op_pass.cpp:1150] In Graph: GraphBase_0 Placement: cuda-@0:0-@1:1-@2:2-@3:3 the total_op_num = 1115 and has 2 different nccl stream which is possible to trigger cuda stream kernel launch upper limit. So the nccl logical kernel will from async to sync exec, which may affect performance.
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2023/03/05 16:44:57.582, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 97 %, 18 %, 12288 MiB, 6064 MiB, 5989 MiB
2023/03/05 16:44:57.584, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 99 %, 18 %, 12288 MiB, 6737 MiB, 5316 MiB
2023/03/05 16:44:57.588, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 84 %, 14 %, 12288 MiB, 6733 MiB, 5320 MiB
2023/03/05 16:44:57.593, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 99 %, 17 %, 12288 MiB, 6733 MiB, 5320 MiB
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2023/03/05 16:44:57.596, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 0 %, 0 %, 12288 MiB, 7563 MiB, 4490 MiB
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2023/03/05 16:44:57.596, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 97 %, 18 %, 12288 MiB, 6064 MiB, 5989 MiB
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2023/03/05 16:44:57.597, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 3 %, 1 %, 12288 MiB, 7583 MiB, 4470 MiB
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2023/03/05 16:44:57.597, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 97 %, 18 %, 12288 MiB, 6064 MiB, 5989 MiB
2023/03/05 16:44:57.600, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 99 %, 18 %, 12288 MiB, 6737 MiB, 5316 MiB
2023/03/05 16:44:57.600, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 78 %, 12 %, 12288 MiB, 6064 MiB, 5989 MiB
2023/03/05 16:44:57.601, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 78 %, 12 %, 12288 MiB, 6064 MiB, 5989 MiB
2023/03/05 16:44:57.601, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 78 %, 12 %, 12288 MiB, 6064 MiB, 5989 MiB
2023/03/05 16:44:57.601, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 0 %, 0 %, 12288 MiB, 7583 MiB, 4470 MiB
2023/03/05 16:44:57.602, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 78 %, 12 %, 12288 MiB, 6064 MiB, 5989 MiB
2023/03/05 16:44:57.604, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 99 %, 18 %, 12288 MiB, 6737 MiB, 5316 MiB
2023/03/05 16:44:57.606, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 84 %, 14 %, 12288 MiB, 6733 MiB, 5320 MiB
2023/03/05 16:44:57.611, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 99 %, 18 %, 12288 MiB, 6737 MiB, 5316 MiB
2023/03/05 16:44:57.612, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 99 %, 18 %, 12288 MiB, 6737 MiB, 5316 MiB
2023/03/05 16:44:57.614, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 99 %, 18 %, 12288 MiB, 6737 MiB, 5316 MiB
2023/03/05 16:44:57.614, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 5 %, 1 %, 12288 MiB, 7583 MiB, 4470 MiB
2023/03/05 16:44:57.616, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 79 %, 12 %, 12288 MiB, 6737 MiB, 5316 MiB
2023/03/05 16:44:57.618, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 84 %, 14 %, 12288 MiB, 6733 MiB, 5320 MiB
2023/03/05 16:44:57.619, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 99 %, 17 %, 12288 MiB, 6733 MiB, 5320 MiB
2023/03/05 16:44:57.621, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 84 %, 14 %, 12288 MiB, 6733 MiB, 5320 MiB
2023/03/05 16:44:57.620, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 84 %, 14 %, 12288 MiB, 6733 MiB, 5320 MiB
2023/03/05 16:44:57.623, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 84 %, 14 %, 12288 MiB, 6733 MiB, 5320 MiB
2023/03/05 16:44:57.627, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 84 %, 14 %, 12288 MiB, 6733 MiB, 5320 MiB
2023/03/05 16:44:57.630, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 80 %, 12 %, 12288 MiB, 6733 MiB, 5320 MiB
2023/03/05 16:44:57.631, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 0 %, 0 %, 12288 MiB, 7563 MiB, 4490 MiB
2023/03/05 16:44:57.632, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 80 %, 12 %, 12288 MiB, 6733 MiB, 5320 MiB
2023/03/05 16:44:57.634, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 80 %, 12 %, 12288 MiB, 6733 MiB, 5320 MiB
2023/03/05 16:44:57.634, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 80 %, 12 %, 12288 MiB, 6733 MiB, 5320 MiB
2023/03/05 16:44:57.637, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 80 %, 12 %, 12288 MiB, 6733 MiB, 5320 MiB
2023/03/05 16:44:57.642, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 0 %, 0 %, 12288 MiB, 7563 MiB, 4490 MiB
2023/03/05 16:44:57.643, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 3 %, 1 %, 12288 MiB, 7583 MiB, 4470 MiB
2023/03/05 16:44:57.644, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 0 %, 0 %, 12288 MiB, 7563 MiB, 4490 MiB
2023/03/05 16:44:57.645, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 0 %, 0 %, 12288 MiB, 7563 MiB, 4490 MiB
2023/03/05 16:44:57.647, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 0 %, 0 %, 12288 MiB, 7563 MiB, 4490 MiB
2023/03/05 16:44:57.649, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 0 %, 0 %, 12288 MiB, 7563 MiB, 4490 MiB
2023/03/05 16:44:57.651, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 3 %, 1 %, 12288 MiB, 7583 MiB, 4470 MiB
2023/03/05 16:44:57.652, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 0 %, 0 %, 12288 MiB, 7583 MiB, 4470 MiB
2023/03/05 16:44:57.653, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 3 %, 1 %, 12288 MiB, 7583 MiB, 4470 MiB
2023/03/05 16:44:57.654, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 3 %, 1 %, 12288 MiB, 7583 MiB, 4470 MiB
2023/03/05 16:44:57.655, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 3 %, 1 %, 12288 MiB, 7583 MiB, 4470 MiB
2023/03/05 16:44:57.656, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 3 %, 1 %, 12288 MiB, 7583 MiB, 4470 MiB
2023/03/05 16:44:57.659, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 0 %, 0 %, 12288 MiB, 7583 MiB, 4470 MiB
2023/03/05 16:44:57.660, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 5 %, 1 %, 12288 MiB, 7583 MiB, 4470 MiB
2023/03/05 16:44:57.662, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 0 %, 0 %, 12288 MiB, 7583 MiB, 4470 MiB
2023/03/05 16:44:57.662, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 0 %, 0 %, 12288 MiB, 7583 MiB, 4470 MiB
2023/03/05 16:44:57.663, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 0 %, 0 %, 12288 MiB, 7583 MiB, 4470 MiB
2023/03/05 16:44:57.664, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 0 %, 0 %, 12288 MiB, 7583 MiB, 4470 MiB
2023/03/05 16:44:57.666, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 5 %, 1 %, 12288 MiB, 7583 MiB, 4470 MiB
2023/03/05 16:44:57.667, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 5 %, 1 %, 12288 MiB, 7583 MiB, 4470 MiB
2023/03/05 16:44:57.668, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 5 %, 1 %, 12288 MiB, 7583 MiB, 4470 MiB
2023/03/05 16:44:57.668, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 5 %, 1 %, 12288 MiB, 7583 MiB, 4470 MiB
2023/03/05 16:44:57.670, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 5 %, 1 %, 12288 MiB, 7583 MiB, 4470 MiB
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2023/03/05 16:45:06.468, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 93 %, 15 %, 12288 MiB, 6064 MiB, 5989 MiB
2023/03/05 16:45:06.469, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 86 %, 14 %, 12288 MiB, 6737 MiB, 5316 MiB
2023/03/05 16:45:06.470, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 99 %, 18 %, 12288 MiB, 6733 MiB, 5320 MiB
2023/03/05 16:45:06.471, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 80 %, 13 %, 12288 MiB, 6733 MiB, 5320 MiB
2023/03/05 16:45:06.472, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 1 %, 1 %, 12288 MiB, 7563 MiB, 4490 MiB
2023/03/05 16:45:06.474, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 0 %, 0 %, 12288 MiB, 7583 MiB, 4470 MiB
2023/03/05 16:45:06.475, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 0 %, 0 %, 12288 MiB, 7583 MiB, 4470 MiB
2023/03/05 16:45:06.477, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 0 %, 0 %, 12288 MiB, 7583 MiB, 4470 MiB
[03/05 16:45:15 lb.utils.events]: eta: 0:17:54 iteration: 99/220 consumed_samples: 51200 total_loss: 7.953 lm_loss: 7.24 sop_loss: 0.7052 time: 8.9653 s/iter data_time: 0.0152 s/iter total_throughput: 57.11 samples/s lr: 5.82e-05
[03/05 17:00:09 lb.utils.events]: eta: 0:02:58 iteration: 199/220 consumed_samples: 102400 total_loss: 7.903 lm_loss: 7.206 sop_loss: 0.6955 time: 8.9533 s/iter data_time: 0.0157 s/iter total_throughput: 57.19 samples/s lr: 3.21e-06
[03/05 17:03:08 lb.utils.events]: eta: 0:00:00 iteration: 219/220 consumed_samples: 112640 total_loss: 7.896 lm_loss: 7.2 sop_loss: 0.6947 time: 8.9538 s/iter data_time: 0.0158 s/iter total_throughput: 57.18 samples/s lr: 1.01e-06
[03/05 17:03:08 lb.engine.hooks]: Overall training speed: 218 iterations in 0:32:31 (8.9538 s / it)
[03/05 17:03:08 lb.engine.hooks]: Total training time: 0:32:32 (0:00:00 on hooks)
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
oneflow-version(git_commit)=0.9.1.dev20230304+cu117
oneflow-commit(git_commit)=7d07caf
oneflow-libai(git_commit)=50a973dc5de635b8613ad7666c073c763e238850
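Note: the reported throughput is consistent with the global batch size and per-iteration time logged above; a quick check (illustrative Python, not part of the original log output):

    # throughput ~= global_batch_size / time_per_iteration
    global_batch_size = 512
    seconds_per_iter = 8.9538                 # "time: 8.9538 s/iter" at iteration 219/220
    throughput = global_batch_size / seconds_per_iter
    print(round(throughput, 2))               # 57.18, matching "total_throughput: 57.18 samples/s"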