[03/10 22:44:13 libai]: Rank of current process: 0. World size: 8
[03/10 22:44:13 libai]: Command line arguments: Namespace(config_file='configs/bert_large_pretrain.py', eval_only=False, fast_dev_run=False, opts=['model.cfg.hidden_dropout_prob=0.1', 'model.cfg.attention_probs_dropout_prob=0.1', 'model.cfg.bias_dropout_fusion=true', 'model.cfg.hidden_layers=24', 'model.cfg.hidden_size=1024', 'model.cfg.num_attention_heads=16', 'model.cfg.intermediate_size=4096', 'model.cfg.ffn_hidden_size=4096', 'model.cfg.head_size=64', 'graph.enabled=true', 'train.dist.pipeline_num_layers=24', 'train.train_micro_batch_size=32', 'train.global_batch_size=512', 'train.dist.tensor_parallel_size=2', 'train.dist.pipeline_parallel_size=2', 'train.amp.enabled=true', 'train.activation_checkpoint.enabled=true', 'train.num_accumulation_steps=8', 'train.evaluation.enabled=false', 'train.train_iter=220', 'train.train_epoch=0', 'train.log_period=100', 'train.zero_optimization.enabled=true', 'train.zero_optimization.stage=2', 'train.load_weight=', 'train.output_dir=test_logs/oneflow-28/NVIDIA_GeForce_RTX_3080_Ti/1ea2bb7/LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP2_MP2_PP2_zerotrue_stage2_mbs32_gbs512_acc8_1n8g'], resume=False)
[03/10 22:44:13 libai]: Contents of args.config_file=configs/bert_large_pretrain.py:
from libai.config import LazyCall
from libai.evaluation import PPLEvaluator
from .common.models.bert import pretrain_model as model
from .common.models.graph import graph
from .common.train import train
from .common.optim import optim
from .common.data.bert_dataset import dataloader, tokenization

vocab_file = "./data_test/bert_data/bert-base-chinese-vocab.txt"
data_prefix = "./data_test/bert_data/loss_compara_content_sentence"

tokenization.tokenizer.vocab_file = vocab_file
dataloader.train.dataset[0].data_prefix = data_prefix
dataloader.train.dataset[0].indexed_dataset.data_prefix = data_prefix

# Bert-large model config
model.cfg.num_attention_heads = 16
model.cfg.hidden_size = 768
model.cfg.hidden_layers = 8

train.input_placement_device = "cpu"
train.dist.pipeline_num_layers = model.cfg.hidden_layers
train.train_micro_batch_size = 16
train.amp.enabled = True

for ds in dataloader.train.dataset:
    ds.max_seq_length = model.cfg.max_position_embeddings

train.evaluation.evaluator = LazyCall(PPLEvaluator)()

train.output_dir = "output/bert_output"

[03/10 22:44:13 libai]: Full config saved to test_logs/oneflow-28/NVIDIA_GeForce_RTX_3080_Ti/1ea2bb7/LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP2_MP2_PP2_zerotrue_stage2_mbs32_gbs512_acc8_1n8g/config.yaml
[03/10 22:44:13 lb.engine.default]: > compiling dataset index builder ...
make: Entering directory '/ssd/home/ouyangyu/libai_week_test/libai/libai/data/data_utils'
make: Nothing to be done for 'default'.
make: Leaving directory '/ssd/home/ouyangyu/libai_week_test/libai/libai/data/data_utils'
[03/10 22:44:13 lb.engine.default]: >>> done with dataset index builder. Compilation time: 0.051 seconds
[03/10 22:44:13 lb.engine.default]: >>> done with compiling. Compilation time: 0.053 seconds
[03/10 22:44:13 lb.engine.default]: Prepare training, validating, testing set
[03/10 22:44:13 lb.data.data_utils.indexed_dataset]: building dataset index ...
[03/10 22:44:13 lb.data.data_utils.indexed_dataset]: warming up index mmap file...
[03/10 22:44:13 lb.data.data_utils.indexed_dataset]: reading sizes...
[03/10 22:44:13 lb.data.data_utils.indexed_dataset]: reading pointers...
[03/10 22:44:13 lb.data.data_utils.indexed_dataset]: reading document index...
[03/10 22:44:13 lb.data.data_utils.indexed_dataset]: warming up data mmap file...
[03/10 22:44:13 lb.data.data_utils.indexed_dataset]: creating numpy buffer of mmap...
[03/10 22:44:13 lb.data.data_utils.indexed_dataset]: creating memory view of numpy buffer...
[03/10 22:44:13 lb.data.data_utils.indexed_dataset]: Finished creating indexed dataset in 0.071630 seconds
[03/10 22:44:13 lb.data.data_utils.indexed_dataset]: indexed dataset stats:
[03/10 22:44:13 lb.data.data_utils.indexed_dataset]: number of documents: 50000
[03/10 22:44:13 lb.data.data_utils.indexed_dataset]: number of sentences: 1249934
[03/10 22:44:13 lb.data.data_utils.dataset_utils]: > loading indexed mapping from ./data_test/bert_data/loss_compara_content_sentence_bert_indexmap_112640mns_509msl_0.10ssp_1234s.npy
[03/10 22:44:13 lb.data.data_utils.dataset_utils]: loaded indexed file in 0.003 seconds
[03/10 22:44:13 lb.data.data_utils.dataset_utils]: total number of samples: 113036
[03/10 22:44:13 lb.data.data_utils.dataset_utils]: > loading indexed mapping from ./data_test/bert_data/loss_compara_content_sentence_bert_indexmap_64mns_509msl_0.10ssp_1234s.npy
[03/10 22:44:13 lb.data.data_utils.dataset_utils]: loaded indexed file in 0.000 seconds
[03/10 22:44:13 lb.data.data_utils.dataset_utils]: total number of samples: 5884
[03/10 22:44:13 lb.data.data_utils.dataset_utils]: > loading indexed mapping from ./data_test/bert_data/loss_compara_content_sentence_bert_indexmap_64mns_509msl_0.10ssp_1234s.npy
[03/10 22:44:13 lb.data.data_utils.dataset_utils]: loaded indexed file in 0.000 seconds
[03/10 22:44:13 lb.data.data_utils.dataset_utils]: total number of samples: 5884
[03/10 22:44:23 lb.engine.default]: Auto-scaling the config to train.train_iter=220, train.warmup_iter=0
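Note that the command-line opts above override the base config (e.g. hidden_size 768 -> 1024, hidden_layers 8 -> 24), and the DP2_MP2_PP2 tag in train.output_dir pins down the parallel layout. The layout and the global batch size can be cross-checked with the short sketch below (plain Python with illustrative variable names, not LibAI code):

# Cross-check of the distributed/batch settings in the command-line opts (illustrative only).
world_size = 8                       # 1 node x 8 GPUs ("1n8g")
tensor_parallel_size = 2             # train.dist.tensor_parallel_size
pipeline_parallel_size = 2           # train.dist.pipeline_parallel_size
data_parallel_size = world_size // (tensor_parallel_size * pipeline_parallel_size)

micro_batch_size = 32                # train.train_micro_batch_size
num_accumulation_steps = 8           # train.num_accumulation_steps
global_batch_size = micro_batch_size * data_parallel_size * num_accumulation_steps

assert data_parallel_size == 2       # matches the "DP2" tag
assert global_batch_size == 512      # matches train.global_batch_size=512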
[03/10 22:44:23 libai]: > Start building model...
[03/10 22:44:25 lb.engine.default]: Model: BertForPreTraining(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (vocab_embeddings): VocabEmbedding(num_embeddings=21248, embedding_dim=1024)
      (position_embeddings): Embedding(num_embeddings=512, embedding_dim=1024)
      (tokentype_embeddings): Embedding(num_embeddings=2, embedding_dim=1024)
      (embedding_dropout): Dropout(p=0.1, inplace=False)
    )
    (extended_attn_mask): BertExtendedAttnMask()
    (encoders): ModuleList(
      (0): TransformerLayer(
        (drop_path): Identity()
        (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (self_attention): MultiheadAttention(
          hidden_size=1024, num_heads=16, is_cross_attention=False
          (dropout): Dropout(p=0.1, inplace=False)
          (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col)
          (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row)
        )
        (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.1
          (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col)
          (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row)
        )
      )
      (1): TransformerLayer(
        (drop_path): Identity()
        (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (self_attention): MultiheadAttention(
          hidden_size=1024, num_heads=16, is_cross_attention=False
          (dropout): Dropout(p=0.1, inplace=False)
          (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col)
          (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row)
        )
        (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.1
          (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col)
          (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row)
        )
      )
      (2): TransformerLayer(
        (drop_path): Identity()
        (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (self_attention): MultiheadAttention(
          hidden_size=1024, num_heads=16, is_cross_attention=False
          (dropout): Dropout(p=0.1, inplace=False)
          (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col)
          (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row)
        )
        (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.1
          (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col)
          (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row)
        )
      )
      (3): TransformerLayer(
        (drop_path): Identity()
        (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (self_attention): MultiheadAttention(
          hidden_size=1024, num_heads=16, is_cross_attention=False
          (dropout): Dropout(p=0.1, inplace=False)
          (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col)
          (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row)
        )
        (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.1
          (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col)
          (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row)
        )
      )
      (4): TransformerLayer(
        (drop_path): Identity()
        (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (self_attention): MultiheadAttention(
          hidden_size=1024, num_heads=16, is_cross_attention=False
          (dropout): Dropout(p=0.1, inplace=False)
          (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col)
          (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row)
        )
        (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.1
          (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col)
          (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row)
        )
      )
      (5): TransformerLayer(
        (drop_path): Identity()
        (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (self_attention): MultiheadAttention(
          hidden_size=1024, num_heads=16, is_cross_attention=False
          (dropout): Dropout(p=0.1, inplace=False)
          (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col)
          (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row)
        )
        (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.1
          (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col)
          (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row)
        )
      )
      (6): TransformerLayer(
        (drop_path): Identity()
        (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (self_attention): MultiheadAttention(
          hidden_size=1024, num_heads=16, is_cross_attention=False
          (dropout): Dropout(p=0.1, inplace=False)
          (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col)
          (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row)
        )
        (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.1
          (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col)
          (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row)
        )
      )
      (7): TransformerLayer(
        (drop_path): Identity()
        (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (self_attention): MultiheadAttention(
          hidden_size=1024, num_heads=16, is_cross_attention=False
          (dropout): Dropout(p=0.1, inplace=False)
          (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col)
          (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row)
        )
        (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.1
          (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col)
          (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row)
        )
      )
      (8): TransformerLayer(
        (drop_path): Identity()
        (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (self_attention): MultiheadAttention(
          hidden_size=1024, num_heads=16, is_cross_attention=False
          (dropout): Dropout(p=0.1, inplace=False)
          (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col)
          (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row)
        )
        (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.1
          (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col)
          (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row)
        )
      )
      (9): TransformerLayer(
        (drop_path): Identity()
        (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (self_attention): MultiheadAttention(
          hidden_size=1024, num_heads=16, is_cross_attention=False
          (dropout): Dropout(p=0.1, inplace=False)
          (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col)
          (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row)
        )
        (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.1
          (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col)
          (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row)
        )
      )
      (10): TransformerLayer(
        (drop_path): Identity()
        (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (self_attention): MultiheadAttention(
          hidden_size=1024, num_heads=16, is_cross_attention=False
          (dropout): Dropout(p=0.1, inplace=False)
          (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col)
          (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row)
        )
        (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.1
          (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col)
          (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row)
        )
      )
      (11): TransformerLayer(
        (drop_path): Identity()
        (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (self_attention): MultiheadAttention(
          hidden_size=1024, num_heads=16, is_cross_attention=False
          (dropout): Dropout(p=0.1, inplace=False)
          (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col)
          (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row)
        )
        (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.1
          (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col)
          (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row)
        )
      )
      (12): TransformerLayer(
        (drop_path): Identity()
        (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (self_attention): MultiheadAttention(
          hidden_size=1024, num_heads=16, is_cross_attention=False
          (dropout): Dropout(p=0.1, inplace=False)
          (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col)
          (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row)
        )
        (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.1
          (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col)
          (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row)
        )
      )
      (13): TransformerLayer(
        (drop_path): Identity()
        (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (self_attention): MultiheadAttention(
          hidden_size=1024, num_heads=16, is_cross_attention=False
          (dropout): Dropout(p=0.1, inplace=False)
          (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col)
          (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row)
        )
        (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.1
          (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col)
          (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row)
        )
      )
      (14): TransformerLayer(
        (drop_path): Identity()
        (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (self_attention): MultiheadAttention(
          hidden_size=1024, num_heads=16, is_cross_attention=False
          (dropout): Dropout(p=0.1, inplace=False)
          (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col)
          (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row)
        )
        (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.1
          (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col)
          (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row)
        )
      )
      (15): TransformerLayer(
        (drop_path): Identity()
        (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (self_attention): MultiheadAttention(
          hidden_size=1024, num_heads=16, is_cross_attention=False
          (dropout): Dropout(p=0.1, inplace=False)
          (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col)
          (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row)
        )
        (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.1
          (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col)
          (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row)
        )
      )
      (16): TransformerLayer(
        (drop_path): Identity()
        (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (self_attention): MultiheadAttention(
          hidden_size=1024, num_heads=16, is_cross_attention=False
          (dropout): Dropout(p=0.1, inplace=False)
          (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col)
          (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row)
        )
        (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.1
          (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col)
          (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row)
        )
      )
      (17): TransformerLayer(
        (drop_path): Identity()
        (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (self_attention): MultiheadAttention(
          hidden_size=1024, num_heads=16, is_cross_attention=False
          (dropout): Dropout(p=0.1, inplace=False)
          (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col)
          (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row)
        )
        (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.1
          (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col)
          (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row)
        )
      )
      (18): TransformerLayer(
        (drop_path): Identity()
        (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (self_attention): MultiheadAttention(
          hidden_size=1024, num_heads=16, is_cross_attention=False
          (dropout): Dropout(p=0.1, inplace=False)
          (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col)
          (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row)
        )
        (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.1
          (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col)
          (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row)
        )
      )
      (19): TransformerLayer(
        (drop_path): Identity()
        (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (self_attention): MultiheadAttention(
          hidden_size=1024, num_heads=16, is_cross_attention=False
          (dropout): Dropout(p=0.1, inplace=False)
          (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col)
          (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row)
        )
        (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.1
          (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col)
          (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row)
        )
      )
      (20): TransformerLayer(
        (drop_path): Identity()
        (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (self_attention): MultiheadAttention(
          hidden_size=1024, num_heads=16, is_cross_attention=False
          (dropout): Dropout(p=0.1, inplace=False)
          (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col)
          (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row)
        )
        (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.1
          (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col)
          (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row)
        )
      )
      (21): TransformerLayer(
        (drop_path): Identity()
        (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (self_attention): MultiheadAttention(
          hidden_size=1024, num_heads=16, is_cross_attention=False
          (dropout): Dropout(p=0.1, inplace=False)
          (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col)
          (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row)
        )
        (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.1
          (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col)
          (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row)
        )
      )
      (22): TransformerLayer(
        (drop_path): Identity()
        (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (self_attention): MultiheadAttention(
          hidden_size=1024, num_heads=16, is_cross_attention=False
          (dropout): Dropout(p=0.1, inplace=False)
          (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col)
          (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row)
        )
        (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.1
          (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col)
          (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row)
        )
      )
      (23): TransformerLayer(
        (drop_path): Identity()
        (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (self_attention): MultiheadAttention(
          hidden_size=1024, num_heads=16, is_cross_attention=False
          (dropout): Dropout(p=0.1, inplace=False)
          (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col)
          (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row)
        )
        (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.1
          (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col)
          (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row)
        )
      )
    )
    (final_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
    (pooler): BertPooler(
      (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=col)
      (activation_func): Tanh()
    )
  )
  (cls_head): BertPreTrainingHeads(
    (predictions): BertLMPredictionHead(
      (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=data)
      (activation_func): GELU()
      (layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
    )
    (seq_relationship): Linear1D(in_features=1024, out_features=2, bias=True, parallel=data)
    (lm_logits): LMLogits()
    (loss_func): BertLoss(
      (lm_loss): ParallelCrossEntropyLoss()
    )
  )
)
[03/10 22:44:25 libai]: >>> done with building model. Building time: 1.971 seconds
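For reference, the shapes printed in the dump above imply roughly 12.6M parameters per transformer layer and about 0.33B in total. The estimate below is a rough plain-Python sketch; it only counts the weights and biases whose shapes appear in the dump and ignores the tied LMLogits weight and any vocab padding:

# Rough parameter-count estimate from the shapes in the model dump (illustrative only).
hidden, ffn, vocab, max_pos, num_layers = 1024, 4096, 21248, 512, 24

embeddings = (vocab + max_pos + 2) * hidden              # vocab + position + token-type tables
per_layer = (
    2 * 2 * hidden                                       # input/post-attention LayerNorms
    + hidden * 3 * hidden + 3 * hidden                   # query_key_value
    + hidden * hidden + hidden                           # attention output dense
    + hidden * ffn + ffn                                 # dense_h_to_4h
    + ffn * hidden + hidden                              # dense_4h_to_h
)
heads = (
    2 * (hidden * hidden + hidden)                       # pooler dense + prediction dense
    + 2 * 2 * hidden                                     # final_layernorm + prediction layernorm
    + hidden * 2 + 2                                     # seq_relationship
)
total = embeddings + num_layers * per_layer + heads
print(f"{per_layer:,} per layer, {total:,} total")       # ~12.6M per layer, ~327M total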
WARNING [03/10 22:44:25 lb.scheduler.lr_scheduler]: warmup iters equals to zero, return CosineLR
[03/10 22:44:25 lb.engine.trainer]: Starting training from iteration 0
W20230310 22:44:25.951707 3161278 eager_local_op_interpreter.cpp:256] Casting a local tensor to a global tensor with Broadcast sbp will modify the data of input! If you want to keep the input local tensor unchanged, please set the arg copy to True.
[03/10 22:44:26 lb.models.utils.graph_base]: Start compiling the train graph which may take some time. Please wait for a moment ...
W20230310 22:44:53.047030 3161282 insert_nccl_logical_op_pass.cpp:1088] In Graph: GraphBase_0 Placement: cuda-@0:0-@1:1-@2:2-@3:3 the total_op_num = 1115 and has 2 different nccl stream which is possible to trigger cuda stream kernel launch upper limit. So the nccl logical kernel will from async to sync exec, which may affect performance.
W20230310 22:44:53.140492 3161287 insert_nccl_logical_op_pass.cpp:1088] In Graph: GraphBase_0 Placement: cuda-@0:0-@1:1-@2:2-@3:3 the total_op_num = 1115 and has 2 different nccl stream which is possible to trigger cuda stream kernel launch upper limit. So the nccl logical kernel will from async to sync exec, which may affect performance.
W20230310 22:44:53.241844 3161281 insert_nccl_logical_op_pass.cpp:1088] In Graph: GraphBase_0 Placement: cuda-@0:0-@1:1-@2:2-@3:3 the total_op_num = 1115 and has 2 different nccl stream which is possible to trigger cuda stream kernel launch upper limit. So the nccl logical kernel will from async to sync exec, which may affect performance.
W20230310 22:44:53.681345 3161278 insert_nccl_logical_op_pass.cpp:1088] In Graph: GraphBase_0 Placement: cuda-@0:0-@1:1-@2:2-@3:3 the total_op_num = 1115 and has 2 different nccl stream which is possible to trigger cuda stream kernel launch upper limit. So the nccl logical kernel will from async to sync exec, which may affect performance.
W20230310 22:44:53.706555 3161280 insert_nccl_logical_op_pass.cpp:1088] In Graph: GraphBase_0 Placement: cuda-@0:0-@1:1-@2:2-@3:3 the total_op_num = 1115 and has 2 different nccl stream which is possible to trigger cuda stream kernel launch upper limit. So the nccl logical kernel will from async to sync exec, which may affect performance.
W20230310 22:44:53.785876 3161285 insert_nccl_logical_op_pass.cpp:1088] In Graph: GraphBase_0 Placement: cuda-@0:0-@1:1-@2:2-@3:3 the total_op_num = 1115 and has 2 different nccl stream which is possible to trigger cuda stream kernel launch upper limit. So the nccl logical kernel will from async to sync exec, which may affect performance.
W20230310 22:44:53.869421 3161283 insert_nccl_logical_op_pass.cpp:1088] In Graph: GraphBase_0 Placement: cuda-@0:0-@1:1-@2:2-@3:3 the total_op_num = 1115 and has 2 different nccl stream which is possible to trigger cuda stream kernel launch upper limit. So the nccl logical kernel will from async to sync exec, which may affect performance.
W20230310 22:44:53.905472 3161279 insert_nccl_logical_op_pass.cpp:1088] In Graph: GraphBase_0 Placement: cuda-@0:0-@1:1-@2:2-@3:3 the total_op_num = 1115 and has 2 different nccl stream which is possible to trigger cuda stream kernel launch upper limit. So the nccl logical kernel will from async to sync exec, which may affect performance.
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2023/03/10 22:59:41.559, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 88 %, 15 %, 12288 MiB, 6691 MiB, 5362 MiB
2023/03/10 22:59:41.562, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 99 %, 18 %, 12288 MiB, 6765 MiB, 5288 MiB
2023/03/10 22:59:41.562, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 88 %, 15 %, 12288 MiB, 6691 MiB, 5362 MiB
2023/03/10 22:59:41.562, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 88 %, 15 %, 12288 MiB, 6691 MiB, 5362 MiB
2023/03/10 22:59:41.563, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 99 %, 18 %, 12288 MiB, 6765 MiB, 5288 MiB
2023/03/10 22:59:41.564, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 88 %, 15 %, 12288 MiB, 6691 MiB, 5362 MiB
2023/03/10 22:59:41.565, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 99 %, 18 %, 12288 MiB, 6765 MiB, 5288 MiB
2023/03/10 22:59:41.564, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 88 %, 15 %, 12288 MiB, 6691 MiB, 5362 MiB
2023/03/10 22:59:41.565, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 88 %, 15 %, 12288 MiB, 6691 MiB, 5362 MiB
2023/03/10 22:59:41.567, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 99 %, 18 %, 12288 MiB, 6765 MiB, 5288 MiB
2023/03/10 22:59:41.565, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 88 %, 15 %, 12288 MiB, 6691 MiB, 5362 MiB
2023/03/10 22:59:41.570, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 99 %, 18 %, 12288 MiB, 6769 MiB, 5284 MiB
2023/03/10 22:59:41.574, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 99 %, 18 %, 12288 MiB, 6765 MiB, 5288 MiB
2023/03/10 22:59:41.575, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 99 %, 18 %, 12288 MiB, 6765 MiB, 5288 MiB
2023/03/10 22:59:41.576, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 99 %, 18 %, 12288 MiB, 6765 MiB, 5288 MiB
2023/03/10 22:59:41.577, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 79 %, 11 %, 12288 MiB, 6765 MiB, 5288 MiB
2023/03/10 22:59:41.578, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 99 %, 18 %, 12288 MiB, 6765 MiB, 5288 MiB
2023/03/10 22:59:41.578, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 79 %, 11 %, 12288 MiB, 6765 MiB, 5288 MiB
2023/03/10 22:59:41.580, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 3 %, 1 %, 12288 MiB, 7583 MiB, 4470 MiB
2023/03/10 22:59:41.583, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 99 %, 18 %, 12288 MiB, 6765 MiB, 5288 MiB
2023/03/10 22:59:41.585, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 78 %, 11 %, 12288 MiB, 6769 MiB, 5284 MiB
2023/03/10 22:59:41.587, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 99 %, 18 %, 12288 MiB, 6765 MiB, 5288 MiB
2023/03/10 22:59:41.593, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 99 %, 18 %, 12288 MiB, 6765 MiB, 5288 MiB
2023/03/10 22:59:41.595, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 78 %, 11 %, 12288 MiB, 6769 MiB, 5284 MiB
2023/03/10 22:59:41.596, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 99 %, 18 %, 12288 MiB, 6765 MiB, 5288 MiB
2023/03/10 22:59:41.597, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 0 %, 0 %, 12288 MiB, 7571 MiB, 4482 MiB
2023/03/10 22:59:41.600, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 78 %, 11 %, 12288 MiB, 6769 MiB, 5284 MiB
2023/03/10 22:59:41.601, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 3 %, 1 %, 12288 MiB, 7583 MiB, 4470 MiB
2023/03/10 22:59:41.601, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 78 %, 11 %, 12288 MiB, 6769 MiB, 5284 MiB
2023/03/10 22:59:41.602, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 78 %, 11 %, 12288 MiB, 6769 MiB, 5284 MiB
2023/03/10 22:59:41.603, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 3 %, 1 %, 12288 MiB, 7583 MiB, 4470 MiB
2023/03/10 22:59:41.604, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 78 %, 11 %, 12288 MiB, 6769 MiB, 5284 MiB
2023/03/10 22:59:41.605, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 0 %, 0 %, 12288 MiB, 7563 MiB, 4490 MiB
2023/03/10 22:59:41.608, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 3 %, 1 %, 12288 MiB, 7583 MiB, 4470 MiB
2023/03/10 22:59:41.609, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 0 %, 0 %, 12288 MiB, 7571 MiB, 4482 MiB
2023/03/10 22:59:41.610, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 3 %, 1 %, 12288 MiB, 7583 MiB, 4470 MiB
2023/03/10 22:59:41.612, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 3 %, 1 %, 12288 MiB, 7583 MiB, 4470 MiB
2023/03/10 22:59:41.613, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 0 %, 0 %, 12288 MiB, 7571 MiB, 4482 MiB
2023/03/10 22:59:41.613, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 3 %, 1 %, 12288 MiB, 7583 MiB, 4470 MiB
2023/03/10 22:59:41.614, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 6 %, 1 %, 12288 MiB, 7583 MiB, 4470 MiB
2023/03/10 22:59:41.616, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 0 %, 0 %, 12288 MiB, 7571 MiB, 4482 MiB
2023/03/10 22:59:41.617, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 3 %, 1 %, 12288 MiB, 7563 MiB, 4490 MiB
2023/03/10 22:59:41.617, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 0 %, 0 %, 12288 MiB, 7571 MiB, 4482 MiB
2023/03/10 22:59:41.618, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 0 %, 0 %, 12288 MiB, 7571 MiB, 4482 MiB
2023/03/10 22:59:41.619, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 3 %, 1 %, 12288 MiB, 7563 MiB, 4490 MiB
2023/03/10 22:59:41.619, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 0 %, 0 %, 12288 MiB, 7571 MiB, 4482 MiB
2023/03/10 22:59:41.621, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 3 %, 1 %, 12288 MiB, 7563 MiB, 4490 MiB
2023/03/10 22:59:41.625, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 6 %, 1 %, 12288 MiB, 7583 MiB, 4470 MiB
2023/03/10 22:59:41.626, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 3 %, 1 %, 12288 MiB, 7563 MiB, 4490 MiB
2023/03/10 22:59:41.627, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 3 %, 1 %, 12288 MiB, 7563 MiB, 4490 MiB
2023/03/10 22:59:41.627, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 6 %, 1 %, 12288 MiB, 7583 MiB, 4470 MiB
2023/03/10 22:59:41.628, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 3 %, 1 %, 12288 MiB, 7563 MiB, 4490 MiB
2023/03/10 22:59:41.629, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 6 %, 1 %, 12288 MiB, 7583 MiB, 4470 MiB
2023/03/10 22:59:41.631, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 6 %, 1 %, 12288 MiB, 7583 MiB, 4470 MiB
2023/03/10 22:59:41.632, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 6 %, 1 %, 12288 MiB, 7583 MiB, 4470 MiB
2023/03/10 22:59:41.633, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 6 %, 1 %, 12288 MiB, 7583 MiB, 4470 MiB
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2023/03/10 22:59:50.472, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 62 %, 10 %, 12288 MiB, 6691 MiB, 5362 MiB
2023/03/10 22:59:50.473, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 99 %, 16 %, 12288 MiB, 6765 MiB, 5288 MiB
2023/03/10 22:59:50.474, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 95 %, 16 %, 12288 MiB, 6765 MiB, 5288 MiB
2023/03/10 22:59:50.475, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 97 %, 16 %, 12288 MiB, 6769 MiB, 5284 MiB
2023/03/10 22:59:50.476, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 0 %, 0 %, 12288 MiB, 7583 MiB, 4470 MiB
2023/03/10 22:59:50.477, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 1 %, 1 %, 12288 MiB, 7571 MiB, 4482 MiB
2023/03/10 22:59:50.478, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 1 %, 1 %, 12288 MiB, 7563 MiB, 4490 MiB
2023/03/10 22:59:50.479, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 0 %, 0 %, 12288 MiB, 7583 MiB, 4470 MiB
[03/10 22:59:59 lb.utils.events]: eta: 0:17:52 iteration: 99/220 consumed_samples: 51200 total_loss: 7.952 lm_loss: 7.24 sop_loss: 0.7044 time: 8.9356 s/iter data_time: 0.0182 s/iter total_throughput: 57.30 samples/s lr: 5.82e-05
[03/10 23:14:52 lb.utils.events]: eta: 0:02:58 iteration: 199/220 consumed_samples: 102400 total_loss: 7.903 lm_loss: 7.206 sop_loss: 0.6956 time: 8.9317 s/iter data_time: 0.0154 s/iter total_throughput: 57.32 samples/s lr: 3.21e-06
[03/10 23:17:50 lb.utils.events]: eta: 0:00:00 iteration: 219/220 consumed_samples: 112640 total_loss: 7.895 lm_loss: 7.2 sop_loss: 0.6946 time: 8.9319 s/iter data_time: 0.0167 s/iter total_throughput: 57.32 samples/s lr: 1.01e-06
[03/10 23:17:50 lb.engine.hooks]: Overall training speed: 218 iterations in 0:32:27 (8.9319 s / it)
[03/10 23:17:50 lb.engine.hooks]: Total training time: 0:32:27 (0:00:00 on hooks)
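The reported speed is consistent with the configured batch size: at about 8.93 s per iteration, a global batch of 512 gives roughly 57.3 samples/s, and 220 iterations consume 220 x 512 = 112,640 samples, matching consumed_samples above. A minimal check (plain Python, illustrative only):

# Cross-check of the reported throughput and sample counts (illustrative only).
global_batch_size = 512
sec_per_iter = 8.9319                                # from the iteration 219/220 log line
print(round(global_batch_size / sec_per_iter, 2))    # -> 57.32 samples/s, as logged

train_iter = 220
print(train_iter * global_batch_size)                # -> 112640, matching consumed_samples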
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
oneflow-version(git_commit)=0.9.1.dev20230309+cu117
oneflow-commit(git_commit)=1ea2bb7
oneflow-libai(git_commit)=50a973dc5de635b8613ad7666c073c763e238850