[03/05 16:17:17 libai]: Rank of current process: 0. World size: 8
[03/05 16:17:17 libai]: Command line arguments: Namespace(config_file='configs/bert_large_pretrain.py', eval_only=False, fast_dev_run=False, opts=['model.cfg.hidden_dropout_prob=0.1', 'model.cfg.attention_probs_dropout_prob=0.1', 'model.cfg.bias_dropout_fusion=true', 'model.cfg.hidden_layers=24', 'model.cfg.hidden_size=1024', 'model.cfg.num_attention_heads=16', 'model.cfg.intermediate_size=4096', 'model.cfg.ffn_hidden_size=4096', 'model.cfg.head_size=64', 'graph.enabled=true', 'train.dist.pipeline_num_layers=24', 'train.train_micro_batch_size=32', 'train.global_batch_size=512', 'train.dist.tensor_parallel_size=1', 'train.dist.pipeline_parallel_size=4', 'train.amp.enabled=true', 'train.activation_checkpoint.enabled=true', 'train.num_accumulation_steps=8', 'train.evaluation.enabled=false', 'train.train_iter=220', 'train.train_epoch=0', 'train.log_period=100', 'train.zero_optimization.enabled=true', 'train.zero_optimization.stage=2', 'train.load_weight=', 'train.output_dir=test_logs/oneflow-28/NVIDIA_GeForce_RTX_3080_Ti/7d07caf/LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP2_MP1_PP4_zerotrue_stage2_mbs32_gbs512_acc8_1n8g'], resume=False)
[03/05 16:17:17 libai]: Contents of args.config_file=configs/bert_large_pretrain.py:

from libai.config import LazyCall
from libai.evaluation import PPLEvaluator
from .common.models.bert import pretrain_model as model
from .common.models.graph import graph
from .common.train import train
from .common.optim import optim
from .common.data.bert_dataset import dataloader, tokenization

vocab_file = "./data_test/bert_data/bert-base-chinese-vocab.txt"
data_prefix = "./data_test/bert_data/loss_compara_content_sentence"

tokenization.tokenizer.vocab_file = vocab_file
dataloader.train.dataset[0].data_prefix = data_prefix
dataloader.train.dataset[0].indexed_dataset.data_prefix = data_prefix

# Bert-large model config
model.cfg.num_attention_heads = 16
model.cfg.hidden_size = 768
model.cfg.hidden_layers = 8

train.input_placement_device = "cpu"
train.dist.pipeline_num_layers = model.cfg.hidden_layers
train.train_micro_batch_size = 16
train.amp.enabled = True

for ds in dataloader.train.dataset:
    ds.max_seq_length = model.cfg.max_position_embeddings

train.evaluation.evaluator = LazyCall(PPLEvaluator)()
train.output_dir = "output/bert_output"

[03/05 16:17:18 libai]: Full config saved to test_logs/oneflow-28/NVIDIA_GeForce_RTX_3080_Ti/7d07caf/LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP2_MP1_PP4_zerotrue_stage2_mbs32_gbs512_acc8_1n8g/config.yaml
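Note that the command-line opts override the baseline values in the config file (hidden_size 768 -> 1024, hidden_layers 8 -> 24, micro batch 16 -> 32), which is why the run name encodes nl24_nah16_hs1024. The parallel layout also follows from the opts: with a world size of 8, tensor_parallel_size=1 and pipeline_parallel_size=4 leave a data-parallel degree of 2, and global_batch_size=512 is exactly micro batch x accumulation steps x data-parallel ranks. A minimal sketch of that arithmetic (plain Python; the computation is illustrative, not LibAI API):

# Batch/parallelism arithmetic implied by the opts above
# (values copied from the command line).
world_size = 8
tensor_parallel_size = 1
pipeline_parallel_size = 4
train_micro_batch_size = 32
num_accumulation_steps = 8

data_parallel_size = world_size // (tensor_parallel_size * pipeline_parallel_size)
global_batch_size = train_micro_batch_size * num_accumulation_steps * data_parallel_size

assert data_parallel_size == 2   # the DP2 in the run name
assert global_batch_size == 512  # matches train.global_batch_size=512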
[03/05 16:17:18 lb.engine.default]: > compiling dataset index builder ...
make: Entering directory '/ssd/home/ouyangyu/libai_week_test/libai/libai/data/data_utils'
make: Nothing to be done for 'default'.
make: Leaving directory '/ssd/home/ouyangyu/libai_week_test/libai/libai/data/data_utils'
[03/05 16:17:18 lb.engine.default]: >>> done with dataset index builder. Compilation time: 0.054 seconds
[03/05 16:17:18 lb.engine.default]: >>> done with compiling. Compilation time: 0.055 seconds
[03/05 16:17:18 lb.engine.default]: Prepare training, validating, testing set
[03/05 16:17:18 lb.data.data_utils.indexed_dataset]: building dataset index ...
[03/05 16:17:18 lb.data.data_utils.indexed_dataset]: warming up index mmap file...
[03/05 16:17:18 lb.data.data_utils.indexed_dataset]: reading sizes...
[03/05 16:17:18 lb.data.data_utils.indexed_dataset]: reading pointers...
[03/05 16:17:18 lb.data.data_utils.indexed_dataset]: reading document index...
[03/05 16:17:18 lb.data.data_utils.indexed_dataset]: warming up data mmap file...
[03/05 16:17:18 lb.data.data_utils.indexed_dataset]: creating numpy buffer of mmap...
[03/05 16:17:18 lb.data.data_utils.indexed_dataset]: creating memory view of numpy buffer...
[03/05 16:17:18 lb.data.data_utils.indexed_dataset]: Finished creating indexed dataset in 0.076941 seconds
[03/05 16:17:18 lb.data.data_utils.indexed_dataset]: indexed dataset stats:
[03/05 16:17:18 lb.data.data_utils.indexed_dataset]: number of documents: 50000
[03/05 16:17:18 lb.data.data_utils.indexed_dataset]: number of sentences: 1249934
[03/05 16:17:18 lb.data.data_utils.dataset_utils]: > WARNING: could not find index map file ./data_test/bert_data/loss_compara_content_sentence_bert_indexmap_112640mns_509msl_0.10ssp_1234s.npy, building the indices on rank 0 ...
[03/05 16:17:18 lb.data.data_utils.dataset_utils]: > building samples index mapping for bert ...
    using uint32 for data mapping...
    using:
        number of documents:           47450
        sentences range:               [0, 1188464)
        total number of sentences:     1188464
        number of epochs:              2147483646
        maximum number of samples:     112640
        maximum sequence length:       509
        short sequence probability:    0.1
        short sequence ratio (1/prob): 10
        seed:                          1234
    reached 112640 samples after 1 epochs ...
    number of empty documents: 0
    number of documents with one sentence: 711
    number of documents with long sentences: 2092
    will create mapping for 113036 samples
[03/05 16:17:18 lb.data.data_utils.dataset_utils]: > done building samples index mapping
[03/05 16:17:18 lb.data.data_utils.dataset_utils]: > saved the index mapping in ./data_test/bert_data/loss_compara_content_sentence_bert_indexmap_112640mns_509msl_0.10ssp_1234s.npy
[03/05 16:17:18 lb.data.data_utils.dataset_utils]: > elapsed time to build and save samples mapping (seconds): 0.014842
[03/05 16:17:18 lb.data.data_utils.dataset_utils]: > loading indexed mapping from ./data_test/bert_data/loss_compara_content_sentence_bert_indexmap_112640mns_509msl_0.10ssp_1234s.npy
[03/05 16:17:18 lb.data.data_utils.dataset_utils]: loaded indexed file in 0.003 seconds
[03/05 16:17:18 lb.data.data_utils.dataset_utils]: total number of samples: 113036
[03/05 16:17:18 lb.data.data_utils.dataset_utils]: > WARNING: could not find index map file ./data_test/bert_data/loss_compara_content_sentence_bert_indexmap_64mns_509msl_0.10ssp_1234s.npy, building the indices on rank 0 ...
[03/05 16:17:18 lb.data.data_utils.dataset_utils]: > building samples index mapping for bert ...
    using uint32 for data mapping...
    using:
        number of documents:           2500
        sentences range:               [1188464, 1248643)
        total number of sentences:     60179
        number of epochs:              2147483646
        maximum number of samples:     64
        maximum sequence length:       509
        short sequence probability:    0.1
        short sequence ratio (1/prob): 10
        seed:                          1234
    reached 64 samples after 1 epochs ...
    number of empty documents: 0
    number of documents with one sentence: 51
    number of documents with long sentences: 89
    will create mapping for 5884 samples
[03/05 16:17:18 lb.data.data_utils.dataset_utils]: > done building samples index mapping
[03/05 16:17:18 lb.data.data_utils.dataset_utils]: > saved the index mapping in ./data_test/bert_data/loss_compara_content_sentence_bert_indexmap_64mns_509msl_0.10ssp_1234s.npy
[03/05 16:17:18 lb.data.data_utils.dataset_utils]: > elapsed time to build and save samples mapping (seconds): 0.001019
[03/05 16:17:18 lb.data.data_utils.dataset_utils]: > loading indexed mapping from ./data_test/bert_data/loss_compara_content_sentence_bert_indexmap_64mns_509msl_0.10ssp_1234s.npy
[03/05 16:17:18 lb.data.data_utils.dataset_utils]: loaded indexed file in 0.000 seconds
[03/05 16:17:18 lb.data.data_utils.dataset_utils]: total number of samples: 5884
[03/05 16:17:18 lb.data.data_utils.dataset_utils]: > loading indexed mapping from ./data_test/bert_data/loss_compara_content_sentence_bert_indexmap_64mns_509msl_0.10ssp_1234s.npy
[03/05 16:17:18 lb.data.data_utils.dataset_utils]: loaded indexed file in 0.000 seconds
[03/05 16:17:18 lb.data.data_utils.dataset_utils]: total number of samples: 5884
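The index-map filenames encode their build parameters: 112640mns is the maximum number of samples (global_batch_size 512 x train_iter 220), 509msl the maximum sequence length, 0.10ssp the short-sequence probability, and 1234s the seed. A sketch of that naming pattern as inferred from the log (the actual formatting lives in lb.data.data_utils.dataset_utils; treat this helper as an assumption):

# Hypothetical reconstruction of the index-map naming pattern observed above.
def indexmap_name(prefix, max_num_samples, max_seq_length, short_seq_prob, seed):
    return (f"{prefix}_bert_indexmap_{max_num_samples}mns_{max_seq_length}msl_"
            f"{short_seq_prob:.2f}ssp_{seed}s.npy")

train_name = indexmap_name("./data_test/bert_data/loss_compara_content_sentence",
                           512 * 220,  # global_batch_size * train_iter = 112640
                           509, 0.1, 1234)
assert train_name.endswith("_bert_indexmap_112640mns_509msl_0.10ssp_1234s.npy")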
[03/05 16:17:27 lb.engine.default]: Auto-scaling the config to train.train_iter=220, train.warmup_iter=0
[03/05 16:17:27 libai]: > Start building model...
[03/05 16:17:29 lb.engine.default]: Model: BertForPreTraining(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (vocab_embeddings): VocabEmbedding(num_embeddings=21248, embedding_dim=1024)
      (position_embeddings): Embedding(num_embeddings=512, embedding_dim=1024)
      (tokentype_embeddings): Embedding(num_embeddings=2, embedding_dim=1024)
      (embedding_dropout): Dropout(p=0.1, inplace=False)
    )
    (extended_attn_mask): BertExtendedAttnMask()
    (encoders): ModuleList(
      (0): TransformerLayer(
        (drop_path): Identity()
        (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (self_attention): MultiheadAttention(
          hidden_size=1024, num_heads=16, is_cross_attention=False
          (dropout): Dropout(p=0.1, inplace=False)
          (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col)
          (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row)
        )
        (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.1
          (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col)
          (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row)
        )
      )
      (1)-(23): 23 further TransformerLayer blocks, identical in structure to (0) [repeated repr elided]
    )
    (final_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
    (pooler): BertPooler(
      (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=col)
      (activation_func): Tanh()
    )
  )
  (cls_head): BertPreTrainingHeads(
    (predictions): BertLMPredictionHead(
      (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=data)
      (activation_func): GELU()
      (layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
    )
    (seq_relationship): Linear1D(in_features=1024, out_features=2, bias=True, parallel=data)
    (lm_logits): LMLogits()
    (loss_func): BertLoss(
      (lm_loss): ParallelCrossEntropyLoss()
    )
  )
)
[03/05 16:17:30 libai]: >>> done with building model. Building time: 2.211 seconds
WARNING [03/05 16:17:30 lb.scheduler.lr_scheduler]: warmup iters equals to zero, return CosineLR
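From the repr above the parameter budget can be tallied directly: each of the 24 layers carries a fused QKV projection (1024 -> 3072), an attention output projection (1024 -> 1024), and two MLP projections (1024 -> 4096 and back), plus biases and two LayerNorms; the embeddings add vocab (21248), position (512), and token-type (2) tables. A quick illustrative tally:

# Rough parameter tally for the model printed above (weights + biases).
h, ffn, layers = 1024, 4096, 24
qkv       = h * 3 * h + 3 * h               # query_key_value: Linear1D(1024 -> 3072)
attn_out  = h * h + h                       # dense: Linear1D(1024 -> 1024)
mlp       = (h * ffn + ffn) + (ffn * h + h) # dense_h_to_4h + dense_4h_to_h
layernorm = 2 * (2 * h)                     # input_layernorm + post_attention_layernorm
per_layer = qkv + attn_out + mlp + layernorm       # ~12.6 M per layer
embeddings = (21248 + 512 + 2) * h                 # ~22.3 M
print(f"{(layers * per_layer + embeddings) / 1e6:.1f} M")  # ~324.6 M, before pooler/heads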
[03/05 16:17:30 lb.engine.trainer]: Starting training from iteration 0
[03/05 16:17:30 lb.models.utils.graph_base]: Start compiling the train graph which may take some time. Please wait for a moment ...
[interleaved duplicate nvidia-smi output from eight concurrent queries at 16:22:50 elided; one clean snapshot of the same eight GPUs follows]
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2023/03/05 16:22:53.349, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 84 %, 43 %, 12288 MiB, 4836 MiB, 7217 MiB
2023/03/05 16:22:53.350, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 85 %, 45 %, 12288 MiB, 5419 MiB, 6634 MiB
2023/03/05 16:22:53.351, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 1 %, 1 %, 12288 MiB, 5775 MiB, 6278 MiB
2023/03/05 16:22:53.352, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 20 %, 6 %, 12288 MiB, 5775 MiB, 6278 MiB
2023/03/05 16:22:53.353, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 1 %, 1 %, 12288 MiB, 5793 MiB, 6260 MiB
2023/03/05 16:22:53.353, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 0 %, 0 %, 12288 MiB, 5793 MiB, 6260 MiB
2023/03/05 16:22:53.354, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 1 %, 1 %, 12288 MiB, 8743 MiB, 3310 MiB
2023/03/05 16:22:53.355, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 0 %, 0 %, 12288 MiB, 8743 MiB, 3310 MiB
[03/05 16:22:56 lb.utils.events]: eta: 0:06:03  iteration: 99/220  consumed_samples: 51200  total_loss: 7.944  lm_loss: 7.242  sop_loss: 0.696  time: 3.0181 s/iter  data_time: 0.0111 s/iter  total_throughput: 169.64 samples/s  lr: 5.82e-05
[03/05 16:27:59 lb.utils.events]: eta: 0:01:00  iteration: 199/220  consumed_samples: 102400  total_loss: 7.902  lm_loss: 7.207  sop_loss: 0.6948  time: 3.0246 s/iter  data_time: 0.0114 s/iter  total_throughput: 169.28 samples/s  lr: 3.21e-06
[03/05 16:29:00 lb.utils.events]: eta: 0:00:00  iteration: 219/220  consumed_samples: 112640  total_loss: 7.896  lm_loss: 7.201  sop_loss: 0.6945  time: 3.0247 s/iter  data_time: 0.0121 s/iter  total_throughput: 169.27 samples/s  lr: 1.01e-06
[03/05 16:29:00 lb.engine.hooks]: Overall training speed: 218 iterations in 0:10:59 (3.0247 s / it)
[03/05 16:29:00 lb.engine.hooks]: Total training time: 0:10:59 (0:00:00 on hooks)
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
oneflow-version(git_commit)=0.9.1.dev20230304+cu117
oneflow-commit(git_commit)=7d07caf
oneflow-libai(git_commit)=50a973dc5de635b8613ad7666c073c763e238850
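The logged throughput is consistent with the batch arithmetic above: 512 samples per optimizer step at ~3.02 s/iter is ~169 samples/s, and 220 iterations consume 220 x 512 = 112640 samples, exactly the index-map size built at startup. A one-line sanity check (illustrative):

# Cross-check of the logged throughput and consumed-sample counters.
global_batch, sec_per_iter, iters = 512, 3.0247, 220
print(round(global_batch / sec_per_iter, 2))  # 169.27 samples/s, as logged
print(iters * global_batch)                   # 112640 consumed samples, as logged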