[03/10 15:53:47 libai]: Rank of current process: 0. World size: 8
[03/10 15:53:47 libai]: Command line arguments: Namespace(config_file='configs/gpt2_pretrain.py', eval_only=False, fast_dev_run=False, opts=['model.cfg.hidden_dropout_prob=0.1', 'model.cfg.attention_probs_dropout_prob=0.1', 'model.cfg.bias_dropout_fusion=true', 'model.cfg.hidden_layers=24', 'model.cfg.hidden_size=1024', 'model.cfg.num_attention_heads=16', 'model.cfg.intermediate_size=4096', 'model.cfg.ffn_hidden_size=4096', 'model.cfg.head_size=64', 'graph.enabled=true', 'train.dist.pipeline_num_layers=24', 'train.train_micro_batch_size=8', 'train.global_batch_size=128', 'train.dist.tensor_parallel_size=2', 'train.dist.pipeline_parallel_size=2', 'train.amp.enabled=true', 'train.activation_checkpoint.enabled=true', 'train.num_accumulation_steps=8', 'train.evaluation.enabled=false', 'train.train_iter=220', 'train.train_epoch=0', 'train.log_period=100', 'train.zero_optimization.enabled=true', 'train.zero_optimization.stage=2', 'train.load_weight=', 'train.output_dir=test_logs/oneflow-28/NVIDIA_GeForce_RTX_3080_Ti/1ea2bb7/LibAI_gpt2_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP2_MP2_PP2_zerotrue_stage2_mbs8_gbs128_acc8_1n8g'], resume=False)
[03/10 15:53:47 libai]: Contents of args.config_file=configs/gpt2_pretrain.py:
from libai.config import LazyCall
from libai.evaluation import PPLEvaluator
from .common.models.gpt import pretrain_model as model
from .common.train import train
from .common.optim import optim
from .common.data.gpt_dataset import dataloader, tokenization
from .common.models.graph import graph

vocab_file = "./data_test/gpt_data/gpt2-vocab.json"
merge_files = "./data_test/gpt_data/gpt2-merges.txt"
data_prefix = "./data_test/gpt_data/loss_compara_content_sentence"

tokenization.tokenizer.vocab_file = vocab_file
tokenization.tokenizer.merges_file = merge_files
dataloader.train.dataset[0].data_prefix = data_prefix
dataloader.train.dataset[0].indexed_dataset.data_prefix = data_prefix

# GPT-2 model config
model.cfg.embedding_dropout_prob = 0.1
model.cfg.attention_dropout_prob = 0.1
model.cfg.num_attention_heads = 16
model.cfg.hidden_size = 384
model.cfg.ffn_hidden_size = 1536
model.cfg.hidden_layers = 6
model.cfg.max_seq_length = 1024

train.input_placement_device = "cpu"
train.dist.pipeline_num_layers = model.cfg.hidden_layers

for ds in dataloader.train.dataset:
    ds.max_seq_length = model.cfg.max_seq_length

optim.lr = 1.5e-4
train.train_micro_batch_size = 4
train.amp.enabled = True
train.evaluation.evaluator = LazyCall(PPLEvaluator)()
train.output_dir = "./output/gpt2_output"

[03/10 15:53:47 libai]: Full config saved to test_logs/oneflow-28/NVIDIA_GeForce_RTX_3080_Ti/1ea2bb7/LibAI_gpt2_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP2_MP2_PP2_zerotrue_stage2_mbs8_gbs128_acc8_1n8g/config.yaml
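Note that the base config above defines a small 6-layer, 384-hidden test model; the opts list in the Namespace overrides it up to the 24-layer, 1024-hidden configuration that the output-dir name (nl24_nah16_hs1024) and the model printout below reflect. A minimal sketch of how those dotted-key overrides are applied, assuming LibAI keeps the detectron2-style LazyConfig helpers (load / apply_overrides / save) that its launcher uses to consume args.opts:

from libai.config import LazyConfig

# Sketch only: helper names follow detectron2's LazyConfig API, from which
# LibAI's lazy config system is ported.
cfg = LazyConfig.load("configs/gpt2_pretrain.py")
cfg = LazyConfig.apply_overrides(cfg, [
    "model.cfg.hidden_layers=24",
    "model.cfg.hidden_size=1024",
    "model.cfg.num_attention_heads=16",
    "train.dist.tensor_parallel_size=2",
    "train.dist.pipeline_parallel_size=2",
    "train.zero_optimization.enabled=true",
    "train.zero_optimization.stage=2",
    # ... the remaining opts from the Namespace above ...
])
LazyConfig.save(cfg, "config.yaml")  # the "Full config saved to ..." step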
[03/10 15:53:47 lb.engine.default]: > compiling dataset index builder ...
make: Entering directory '/ssd/home/ouyangyu/libai_week_test/libai/libai/data/data_utils'
make: Nothing to be done for 'default'.
make: Leaving directory '/ssd/home/ouyangyu/libai_week_test/libai/libai/data/data_utils'
[03/10 15:53:47 lb.engine.default]: >>> done with dataset index builder. Compilation time: 0.057 seconds
[03/10 15:53:47 lb.engine.default]: >>> done with compiling. Compilation time: 0.058 seconds
[03/10 15:53:47 lb.engine.default]: Prepare training, validating, testing set
[03/10 15:53:47 lb.data.data_utils.indexed_dataset]: building dataset index ...
[03/10 15:53:47 lb.data.data_utils.indexed_dataset]: warming up index mmap file...
[03/10 15:53:47 lb.data.data_utils.indexed_dataset]: reading sizes...
[03/10 15:53:47 lb.data.data_utils.indexed_dataset]: reading pointers...
[03/10 15:53:47 lb.data.data_utils.indexed_dataset]: reading document index...
[03/10 15:53:47 lb.data.data_utils.indexed_dataset]: warming up data mmap file...
[03/10 15:53:48 lb.data.data_utils.indexed_dataset]: creating numpy buffer of mmap...
[03/10 15:53:48 lb.data.data_utils.indexed_dataset]: creating memory view of numpy buffer...
[03/10 15:53:48 lb.data.data_utils.indexed_dataset]: Finished creating indexed dataset in 0.082721 seconds
[03/10 15:53:48 lb.data.data_utils.indexed_dataset]: indexed dataset stats:
[03/10 15:53:48 lb.data.data_utils.indexed_dataset]: number of documents: 50000
[03/10 15:53:48 lb.data.data_utils.indexed_dataset]: number of sentences: 1249934
[03/10 15:53:48 lb.data.datasets.gpt_dataset]: > loading doc-idx mapping from ./data_test/gpt_data/loss_compara_content_sentence_gpt-2_indexmap_28160ns_1024sl_1234s_doc_idx.npy
[03/10 15:53:48 lb.data.datasets.gpt_dataset]: > loading sample-idx mapping from ./data_test/gpt_data/loss_compara_content_sentence_gpt-2_indexmap_28160ns_1024sl_1234s_sample_idx.npy
[03/10 15:53:48 lb.data.datasets.gpt_dataset]: > loading shuffle-idx mapping from ./data_test/gpt_data/loss_compara_content_sentence_gpt-2_indexmap_28160ns_1024sl_1234s_shuffle_idx.npy
[03/10 15:53:48 lb.data.datasets.gpt_dataset]: loaded indexed file in 0.005 seconds
[03/10 15:53:48 lb.data.datasets.gpt_dataset]: total number of samples: 57333
[03/10 15:53:48 lb.data.datasets.gpt_dataset]: total number of epochs: 1
[03/10 15:53:48 lb.data.datasets.gpt_dataset]: > loading doc-idx mapping from ./data_test/gpt_data/loss_compara_content_sentence_gpt-2_indexmap_64ns_1024sl_1234s_doc_idx.npy
[03/10 15:53:48 lb.data.datasets.gpt_dataset]: > loading sample-idx mapping from ./data_test/gpt_data/loss_compara_content_sentence_gpt-2_indexmap_64ns_1024sl_1234s_sample_idx.npy
[03/10 15:53:48 lb.data.datasets.gpt_dataset]: > loading shuffle-idx mapping from ./data_test/gpt_data/loss_compara_content_sentence_gpt-2_indexmap_64ns_1024sl_1234s_shuffle_idx.npy
[03/10 15:53:48 lb.data.datasets.gpt_dataset]: loaded indexed file in 0.001 seconds
[03/10 15:53:48 lb.data.datasets.gpt_dataset]: total number of samples: 57333
[03/10 15:53:48 lb.data.datasets.gpt_dataset]: total number of epochs: 1
[03/10 15:53:48 lb.data.datasets.gpt_dataset]: > loading doc-idx mapping from ./data_test/gpt_data/loss_compara_content_sentence_gpt-2_indexmap_64ns_1024sl_1234s_doc_idx.npy
[03/10 15:53:48 lb.data.datasets.gpt_dataset]: > loading sample-idx mapping from ./data_test/gpt_data/loss_compara_content_sentence_gpt-2_indexmap_64ns_1024sl_1234s_sample_idx.npy
[03/10 15:53:48 lb.data.datasets.gpt_dataset]: > loading shuffle-idx mapping from ./data_test/gpt_data/loss_compara_content_sentence_gpt-2_indexmap_64ns_1024sl_1234s_shuffle_idx.npy
[03/10 15:53:48 lb.data.datasets.gpt_dataset]: loaded indexed file in 0.001 seconds
[03/10 15:53:48 lb.data.datasets.gpt_dataset]: total number of samples: 57333
[03/10 15:53:48 lb.data.datasets.gpt_dataset]: total number of epochs: 1
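The numbers embedded in the index-map filenames follow from the configuration: 28160ns is the number of training samples the run consumes (train_iter × global_batch_size), 1024sl is the sequence length, and 1234s is presumably the shuffle seed; the _64ns_ maps serve the 64-sample validation and test splits. A quick arithmetic check (plain Python, values from the config above):

train_iter = 220
global_batch_size = 128

# "28160ns" in the *_indexmap_* filenames: total training samples requested.
needed_train_samples = train_iter * global_batch_size
assert needed_train_samples == 28160

# One epoch of the indexed dataset provides 57333 samples of length 1024,
# so a single pass ("total number of epochs: 1") covers the whole run.
assert needed_train_samples <= 57333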
[03/10 15:53:57 lb.engine.default]: Auto-scaling the config to train.train_iter=220, train.warmup_iter=0
[03/10 15:53:57 libai]: > Start building model...
[03/10 15:53:59 lb.engine.default]: Model: GPTForPreTraining(
  (GPT_model): GPTModel(
    (embeddings): GPTEmbedding(
      (token_embeddings): VocabEmbedding(num_embeddings=50432, embedding_dim=1024)
      (position_embeddings): Embedding(num_embeddings=1024, embedding_dim=1024)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layers): ModuleList(
        (0): TransformerLayer(
          (drop_path): Identity()
          (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (self_attention): MultiheadAttention(
            hidden_size=1024, num_heads=16, is_cross_attention=False
            (dropout): Dropout(p=0.1, inplace=False)
            (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col)
            (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row)
          )
          (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (mlp): MLP(
            bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0
            (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col)
            (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row)
          )
        )
        (1)-(23): 23 more TransformerLayer modules, identical to (0)
      )
      (layernorm_f): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
    )
    (lm_head): LMLogits()
  )
  (loss_func): GPTLoss(
    (lm_loss): ParallelCrossEntropyLoss()
  )
)
[03/10 15:53:59 libai]: >>> done with building model. Building time: 1.988 seconds
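The Linear1D modules printed with parallel=col / parallel=row are the tensor-parallel shards of each layer; the full 3D layout is encoded in the output-dir name (DP2_MP2_PP2 on 1n8g). Those dimensions must be mutually consistent with the world size and batch settings logged above; a quick sanity check:

world_size = 8                       # 1n8g: one node, eight GPUs
tensor_parallel = 2                  # train.dist.tensor_parallel_size (MP2)
pipeline_parallel = 2                # train.dist.pipeline_parallel_size (PP2)
data_parallel = world_size // (tensor_parallel * pipeline_parallel)
assert data_parallel == 2            # DP2

micro_batch_size = 8                 # train.train_micro_batch_size (mbs8)
accumulation_steps = 8               # train.num_accumulation_steps (acc8)
assert micro_batch_size * accumulation_steps * data_parallel == 128  # gbs128

# Assuming LibAI splits train.dist.pipeline_num_layers evenly across stages:
assert 24 // pipeline_parallel == 12  # 12 transformer layers per pipeline stage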
WARNING [03/10 15:53:59 lb.scheduler.lr_scheduler]: warmup iters equals to zero, return CosineLR
[03/10 15:53:59 lb.engine.trainer]: Starting training from iteration 0
W20230310 15:53:59.574290 3130659 eager_local_op_interpreter.cpp:256] Casting a local tensor to a global tensor with Broadcast sbp will modify the data of input! If you want to keep the input local tensor unchanged, please set the arg copy to True.
[03/10 15:53:59 lb.models.utils.graph_base]: Start compiling the train graph which may take some time. Please wait for a moment ...
W20230310 15:54:25.105031 3130668 insert_nccl_logical_op_pass.cpp:1088] In Graph: GraphBase_0 Placement: cuda-@0:0-@1:1-@2:2-@3:3 the total_op_num = 1032 and has 2 different nccl stream which is possible to trigger cuda stream kernel launch upper limit. So the nccl logical kernel will from async to sync exec, which may affect performance.
(the same nccl warning is emitted once per worker process; the other seven copies are omitted)

timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2023/03/10 16:03:07.982, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 94 %, 18 %, 12288 MiB, 7313 MiB, 4740 MiB
2023/03/10 16:03:07.985, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 87 %, 15 %, 12288 MiB, 7387 MiB, 4666 MiB
2023/03/10 16:03:07.987, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 97 %, 19 %, 12288 MiB, 7387 MiB, 4666 MiB
2023/03/10 16:03:07.992, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 95 %, 19 %, 12288 MiB, 7387 MiB, 4666 MiB
2023/03/10 16:03:08.002, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 0 %, 0 %, 12288 MiB, 7481 MiB, 4572 MiB
2023/03/10 16:03:08.012, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 1 %, 1 %, 12288 MiB, 7481 MiB, 4572 MiB
2023/03/10 16:03:08.021, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 2 %, 1 %, 12288 MiB, 7481 MiB, 4572 MiB
2023/03/10 16:03:08.024, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 0 %, 0 %, 12288 MiB, 7481 MiB, 4572 MiB
(this snapshot was queried concurrently from all eight ranks, so each header and row was printed several times; the interleaved duplicates are omitted)

timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2023/03/10 16:03:13.147, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 96 %, 18 %, 12288 MiB, 7313 MiB, 4740 MiB
2023/03/10 16:03:13.148, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 98 %, 21 %, 12288 MiB, 7387 MiB, 4666 MiB
2023/03/10 16:03:13.149, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 97 %, 19 %, 12288 MiB, 7387 MiB, 4666 MiB
2023/03/10 16:03:13.150, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 96 %, 18 %, 12288 MiB, 7387 MiB, 4666 MiB
2023/03/10 16:03:13.151, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 0 %, 0 %, 12288 MiB, 7481 MiB, 4572 MiB
2023/03/10 16:03:13.152, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 1 %, 1 %, 12288 MiB, 7481 MiB, 4572 MiB
2023/03/10 16:03:13.153, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 0 %, 0 %, 12288 MiB, 7481 MiB, 4572 MiB
2023/03/10 16:03:13.154, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 0 %, 0 %, 12288 MiB, 7481 MiB, 4572 MiB

[03/10 16:03:18 lb.utils.events]: eta: 0:10:26  iteration: 99/220  consumed_samples: 12800  total_loss: 7.262  time: 5.2183 s/iter  data_time: 0.0061 s/iter  total_throughput: 24.53 samples/s  lr: 8.74e-05
[03/10 16:12:00 lb.utils.events]: eta: 0:01:44  iteration: 199/220  consumed_samples: 25600  total_loss: 6.954  time: 5.2180 s/iter  data_time: 0.0066 s/iter  total_throughput: 24.53 samples/s  lr: 4.81e-06
[03/10 16:13:44 lb.utils.events]: eta: 0:00:00  iteration: 219/220  consumed_samples: 28160  total_loss: 6.702  time: 5.2180 s/iter  data_time: 0.0062 s/iter  total_throughput: 24.53 samples/s  lr: 1.51e-06
[03/10 16:13:44 lb.engine.hooks]: Overall training speed: 218 iterations in 0:18:57 (5.2180 s / it)
[03/10 16:13:44 lb.engine.hooks]: Total training time: 0:18:57 (0:00:00 on hooks)
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
oneflow-version(git_commit)=0.9.1.dev20230309+cu117
oneflow-commit(git_commit)=1ea2bb7
oneflow-libai(git_commit)=50a973dc5de635b8613ad7666c073c763e238850
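The closing statistics are internally consistent: the throughput and total wall-clock time can be re-derived from the logged per-iteration time.

global_batch_size = 128
sec_per_iter = 5.2180                    # from the lb.utils.events lines

throughput = global_batch_size / sec_per_iter
assert round(throughput, 2) == 24.53     # samples/s, as logged

timed_iters = 218                        # from the lb.engine.hooks summary
total_sec = timed_iters * sec_per_iter   # ~1137.5 s, i.e. 0:18:57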