[03/10 14:31:00 libai]: Rank of current process: 0. World size: 8
[03/10 14:31:00 libai]: Command line arguments: Namespace(config_file='configs/gpt2_pretrain.py', eval_only=False, fast_dev_run=False, opts=['model.cfg.hidden_dropout_prob=0.1', 'model.cfg.attention_probs_dropout_prob=0.1', 'model.cfg.bias_dropout_fusion=true', 'model.cfg.hidden_layers=48', 'model.cfg.hidden_size=1024', 'model.cfg.num_attention_heads=16', 'model.cfg.intermediate_size=4096', 'model.cfg.ffn_hidden_size=4096', 'model.cfg.head_size=64', 'graph.enabled=true', 'train.dist.pipeline_num_layers=48', 'train.train_micro_batch_size=6', 'train.global_batch_size=96', 'train.dist.tensor_parallel_size=1', 'train.dist.pipeline_parallel_size=8', 'train.amp.enabled=true', 'train.activation_checkpoint.enabled=true', 'train.num_accumulation_steps=16', 'train.evaluation.enabled=false', 'train.train_iter=220', 'train.train_epoch=0', 'train.log_period=100', 'train.zero_optimization.enabled=false', 'train.zero_optimization.stage=0', 'train.load_weight=', 'train.output_dir=test_logs/oneflow-28/NVIDIA_GeForce_RTX_3080_Ti/1ea2bb7/LibAI_gpt2_pretrain_graph_nl48_nah16_hs1024_FP16_actrue_DP1_MP1_PP8_zerofalse_stage0_mbs6_gbs96_acc16_1n8g'], resume=False)
[03/10 14:31:00 libai]: Contents of args.config_file=configs/gpt2_pretrain.py:
from libai.config import LazyCall
from libai.evaluation import PPLEvaluator
from .common.models.gpt import pretrain_model as model
from .common.train import train
from .common.optim import optim
from .common.data.gpt_dataset import dataloader, tokenization
from .common.models.graph import graph

vocab_file = "./data_test/gpt_data/gpt2-vocab.json"
merge_files = "./data_test/gpt_data/gpt2-merges.txt"
data_prefix = "./data_test/gpt_data/loss_compara_content_sentence"

tokenization.tokenizer.vocab_file = vocab_file
tokenization.tokenizer.merges_file = merge_files
dataloader.train.dataset[0].data_prefix = data_prefix
dataloader.train.dataset[0].indexed_dataset.data_prefix = data_prefix

# GPT-2 model config
model.cfg.embedding_dropout_prob = 0.1
model.cfg.attention_dropout_prob = 0.1
model.cfg.num_attention_heads = 16
model.cfg.hidden_size = 384
model.cfg.ffn_hidden_size = 1536
model.cfg.hidden_layers = 6
model.cfg.max_seq_length = 1024

train.input_placement_device = "cpu"
train.dist.pipeline_num_layers = model.cfg.hidden_layers

for ds in dataloader.train.dataset:
    ds.max_seq_length = model.cfg.max_seq_length

optim.lr = 1.5e-4
train.train_micro_batch_size = 4
train.amp.enabled = True

train.evaluation.evaluator = LazyCall(PPLEvaluator)()

train.output_dir = "./output/gpt2_output"

[03/10 14:31:00 libai]: Full config saved to test_logs/oneflow-28/NVIDIA_GeForce_RTX_3080_Ti/1ea2bb7/LibAI_gpt2_pretrain_graph_nl48_nah16_hs1024_FP16_actrue_DP1_MP1_PP8_zerofalse_stage0_mbs6_gbs96_acc16_1n8g/config.yaml
[03/10 14:31:00 lb.engine.default]: > compiling dataset index builder ...
make: Entering directory '/ssd/home/ouyangyu/libai_week_test/libai/libai/data/data_utils'
make: Nothing to be done for 'default'.
make: Leaving directory '/ssd/home/ouyangyu/libai_week_test/libai/libai/data/data_utils'
[03/10 14:31:00 lb.engine.default]: >>> done with dataset index builder. Compilation time: 0.056 seconds
[03/10 14:31:00 lb.engine.default]: >>> done with compiling. Compilation time: 0.057 seconds
[03/10 14:31:00 lb.engine.default]: Prepare training, validating, testing set
[03/10 14:31:00 lb.data.data_utils.indexed_dataset]: building dataset index ...
[03/10 14:31:00 lb.data.data_utils.indexed_dataset]: warming up index mmap file...
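Two things in the setup above are worth pinning down. First, the opts list overrides the config file's small debug values (hidden_size=384, hidden_layers=6) at load time, which is why the model built below has hidden_size=1024 and 48 layers. Second, the batch sizes are linked: with 8 GPUs split as tensor_parallel=1 x pipeline_parallel=8, the data-parallel degree is 1, so the global batch is the micro batch times the accumulation steps. A minimal arithmetic check (plain Python, not LibAI code):

# Plain-arithmetic sketch of how the opts above fit together (not LibAI code).
world_size = 8
tensor_parallel_size = 1      # train.dist.tensor_parallel_size=1
pipeline_parallel_size = 8    # train.dist.pipeline_parallel_size=8
data_parallel_size = world_size // (tensor_parallel_size * pipeline_parallel_size)

train_micro_batch_size = 6    # train.train_micro_batch_size=6
num_accumulation_steps = 16   # train.num_accumulation_steps=16
global_batch_size = train_micro_batch_size * num_accumulation_steps * data_parallel_size

assert data_parallel_size == 1
assert global_batch_size == 96   # matches train.global_batch_size=96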
[03/10 14:31:00 lb.data.data_utils.indexed_dataset]: reading sizes...
[03/10 14:31:00 lb.data.data_utils.indexed_dataset]: reading pointers...
[03/10 14:31:00 lb.data.data_utils.indexed_dataset]: reading document index...
[03/10 14:31:00 lb.data.data_utils.indexed_dataset]: warming up data mmap file...
[03/10 14:31:00 lb.data.data_utils.indexed_dataset]: creating numpy buffer of mmap...
[03/10 14:31:00 lb.data.data_utils.indexed_dataset]: creating memory view of numpy buffer...
[03/10 14:31:00 lb.data.data_utils.indexed_dataset]: Finished creating indexed dataset in 0.073169 seconds
[03/10 14:31:00 lb.data.data_utils.indexed_dataset]: indexed dataset stats:
[03/10 14:31:00 lb.data.data_utils.indexed_dataset]: number of documents: 50000
[03/10 14:31:00 lb.data.data_utils.indexed_dataset]: number of sentences: 1249934
[03/10 14:31:00 lb.data.datasets.gpt_dataset]: > loading doc-idx mapping from ./data_test/gpt_data/loss_compara_content_sentence_gpt-2_indexmap_21120ns_1024sl_1234s_doc_idx.npy
[03/10 14:31:00 lb.data.datasets.gpt_dataset]: > loading sample-idx mapping from ./data_test/gpt_data/loss_compara_content_sentence_gpt-2_indexmap_21120ns_1024sl_1234s_sample_idx.npy
[03/10 14:31:00 lb.data.datasets.gpt_dataset]: > loading shuffle-idx mapping from ./data_test/gpt_data/loss_compara_content_sentence_gpt-2_indexmap_21120ns_1024sl_1234s_shuffle_idx.npy
[03/10 14:31:00 lb.data.datasets.gpt_dataset]: loaded indexed file in 0.005 seconds
[03/10 14:31:00 lb.data.datasets.gpt_dataset]: total number of samples: 57333
[03/10 14:31:00 lb.data.datasets.gpt_dataset]: total number of epochs: 1
[03/10 14:31:00 lb.data.datasets.gpt_dataset]: > loading doc-idx mapping from ./data_test/gpt_data/loss_compara_content_sentence_gpt-2_indexmap_32ns_1024sl_1234s_doc_idx.npy
[03/10 14:31:00 lb.data.datasets.gpt_dataset]: > loading sample-idx mapping from ./data_test/gpt_data/loss_compara_content_sentence_gpt-2_indexmap_32ns_1024sl_1234s_sample_idx.npy
[03/10 14:31:00 lb.data.datasets.gpt_dataset]: > loading shuffle-idx mapping from ./data_test/gpt_data/loss_compara_content_sentence_gpt-2_indexmap_32ns_1024sl_1234s_shuffle_idx.npy
[03/10 14:31:00 lb.data.datasets.gpt_dataset]: loaded indexed file in 0.001 seconds
[03/10 14:31:00 lb.data.datasets.gpt_dataset]: total number of samples: 57333
[03/10 14:31:00 lb.data.datasets.gpt_dataset]: total number of epochs: 1
[03/10 14:31:00 lb.data.datasets.gpt_dataset]: > loading doc-idx mapping from ./data_test/gpt_data/loss_compara_content_sentence_gpt-2_indexmap_32ns_1024sl_1234s_doc_idx.npy
[03/10 14:31:00 lb.data.datasets.gpt_dataset]: > loading sample-idx mapping from ./data_test/gpt_data/loss_compara_content_sentence_gpt-2_indexmap_32ns_1024sl_1234s_sample_idx.npy
[03/10 14:31:00 lb.data.datasets.gpt_dataset]: > loading shuffle-idx mapping from ./data_test/gpt_data/loss_compara_content_sentence_gpt-2_indexmap_32ns_1024sl_1234s_shuffle_idx.npy
[03/10 14:31:00 lb.data.datasets.gpt_dataset]: loaded indexed file in 0.001 seconds
[03/10 14:31:00 lb.data.datasets.gpt_dataset]: total number of samples: 57333
[03/10 14:31:00 lb.data.datasets.gpt_dataset]: total number of epochs: 1
[03/10 14:31:09 lb.engine.default]: Auto-scaling the config to train.train_iter=220, train.warmup_iter=0
[03/10 14:31:09 libai]: > Start building model...
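The index-map filenames encode the sampling plan: 21120ns is the number of training samples (220 train iterations x the global batch of 96), 1024sl is the sequence length, and 1234s is the shuffle seed. A hedged sketch of decoding and memory-mapping one of these files (the file name comes from the log; the parsing and loading code is illustrative, not LibAI's implementation):

# Illustrative sketch, assuming the {N}ns_{L}sl_{S}s naming convention seen above.
import re
import numpy as np

name = ("./data_test/gpt_data/loss_compara_content_sentence"
        "_gpt-2_indexmap_21120ns_1024sl_1234s_doc_idx.npy")

# The suffix encodes: number of samples (ns), sequence length (sl), seed (s).
m = re.search(r"_indexmap_(\d+)ns_(\d+)sl_(\d+)s_(\w+)_idx\.npy$", name)
num_samples, seq_len, seed = int(m.group(1)), int(m.group(2)), int(m.group(3))
print(num_samples, seq_len, seed)  # 21120 1024 1234

# Memory-map rather than read eagerly, which is consistent with the
# millisecond load times the log reports for these files.
doc_idx = np.load(name, allow_pickle=True, mmap_mode="r")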
[03/10 14:31:12 lb.engine.default]: Model: GPTForPreTraining(
  (GPT_model): GPTModel(
    (embeddings): GPTEmbedding(
      (token_embeddings): VocabEmbedding(num_embeddings=50304, embedding_dim=1024)
      (position_embeddings): Embedding(num_embeddings=1024, embedding_dim=1024)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layers): ModuleList(
        (0): TransformerLayer(
          (drop_path): Identity()
          (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (self_attention): MultiheadAttention(
            hidden_size=1024, num_heads=16, is_cross_attention=False
            (dropout): Dropout(p=0.1, inplace=False)
            (query_key_value): Linear1D(in_features=1024, out_features=3072, bias=True, parallel=col)
            (dense): Linear1D(in_features=1024, out_features=1024, bias=True, parallel=row)
          )
          (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (mlp): MLP(
            bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0
            (dense_h_to_4h): Linear1D(in_features=1024, out_features=4096, bias=True, parallel=col)
            (dense_4h_to_h): Linear1D(in_features=4096, out_features=1024, bias=True, parallel=row)
          )
        )
        (1-47): 47 x TransformerLayer(...)  [each identical to layer (0) above]
      )
      (layernorm_f): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
    )
    (lm_head): LMLogits()
  )
  (loss_func): GPTLoss(
    (lm_loss): ParallelCrossEntropyLoss()
  )
)
[03/10 14:31:12 libai]: >>> done with building model. Building time: 2.419 seconds
WARNING [03/10 14:31:12 lb.scheduler.lr_scheduler]: warmup iters equals to zero, return CosineLR
[03/10 14:31:12 lb.engine.trainer]: Starting training from iteration 0
[03/10 14:31:12 lb.models.utils.graph_base]: Start compiling the train graph which may take some time. Please wait for a moment ...
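With train.dist.pipeline_num_layers=48 and pipeline_parallel_size=8, an even split places 6 consecutive transformer layers on each of the 8 pipeline stages. A minimal sketch of that arithmetic, assuming LibAI's default even partition (the helper below is hypothetical, not LibAI's placement API):

# Hypothetical helper showing the even layer-to-stage split implied by the config.
def layer_to_stage(layer_id: int, num_layers: int = 48, num_stages: int = 8) -> int:
    layers_per_stage = num_layers // num_stages   # 6 layers per stage here
    return layer_id // layers_per_stage

assert layer_to_stage(0) == 0     # layers 0-5   -> stage 0
assert layer_to_stage(6) == 1     # layers 6-11  -> stage 1
assert layer_to_stage(47) == 7    # layers 42-47 -> stage 7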
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2023/03/10 14:36:30.257, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 55 %, 36 %, 12288 MiB, 6177 MiB, 5876 MiB
2023/03/10 14:36:30.260, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 40 %, 27 %, 12288 MiB, 7534 MiB, 4519 MiB
2023/03/10 14:36:30.263, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 51 %, 33 %, 12288 MiB, 7130 MiB, 4923 MiB
2023/03/10 14:36:30.271, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 0 %, 0 %, 12288 MiB, 7142 MiB, 4911 MiB
2023/03/10 14:36:30.274, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 0 %, 0 %, 12288 MiB, 7122 MiB, 4931 MiB
2023/03/10 14:36:30.279, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 0 %, 0 %, 12288 MiB, 7134 MiB, 4919 MiB
2023/03/10 14:36:30.290, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 2 %, 2 %, 12288 MiB, 7138 MiB, 4915 MiB
2023/03/10 14:36:30.298, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 6 %, 1 %, 12288 MiB, 8803 MiB, 3250 MiB
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2023/03/10 14:36:33.076, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 59 %, 38 %, 12288 MiB, 6177 MiB, 5876 MiB
2023/03/10 14:36:33.077, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 39 %, 25 %, 12288 MiB, 7534 MiB, 4519 MiB
2023/03/10 14:36:33.078, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 2 %, 2 %, 12288 MiB, 7130 MiB, 4923 MiB
2023/03/10 14:36:33.080, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 14 %, 9 %, 12288 MiB, 7142 MiB, 4911 MiB
2023/03/10 14:36:33.081, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 0 %, 0 %, 12288 MiB, 7122 MiB, 4931 MiB
2023/03/10 14:36:33.082, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 0 %, 0 %, 12288 MiB, 7134 MiB, 4919 MiB
2023/03/10 14:36:33.083, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 2 %, 2 %, 12288 MiB, 7138 MiB, 4915 MiB
2023/03/10 14:36:33.084, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 0 %, 0 %, 12288 MiB, 8803 MiB, 3250 MiB
[03/10 14:36:35 lb.utils.events]: eta: 0:05:39  iteration: 99/220  consumed_samples: 9600  total_loss: 7.254  time: 2.8197 s/iter  data_time: 0.0123 s/iter  total_throughput: 34.05 samples/s  lr: 8.74e-05
[03/10 14:41:19 lb.utils.events]: eta: 0:00:56  iteration: 199/220  consumed_samples: 19200  total_loss: 7.032  time: 2.8298 s/iter  data_time: 0.0121 s/iter  total_throughput: 33.92 samples/s  lr: 4.81e-06
[03/10 14:42:16 lb.utils.events]: eta: 0:00:00  iteration: 219/220  consumed_samples: 21120  total_loss: 6.901  time: 2.8308 s/iter  data_time: 0.0117 s/iter  total_throughput: 33.91 samples/s  lr: 1.51e-06
[03/10 14:42:16 lb.engine.hooks]: Overall training speed: 218 iterations in 0:10:17 (2.8308 s / it)
[03/10 14:42:16 lb.engine.hooks]: Total training time: 0:10:17 (0:00:00 on hooks)
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 by default, to avoid your system being overloaded; please further tune the variable for optimal performance in your application as needed.
*****************************************
oneflow-version(git_commit)=0.9.1.dev20230309+cu117
oneflow-commit(git_commit)=1ea2bb7
oneflow-libai(git_commit)=50a973dc5de635b8613ad7666c073c763e238850
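The logged throughput and eta follow directly from the global batch size and the per-iteration time. A quick cross-check (plain arithmetic, not LibAI code):

# Cross-checking the logged numbers against the configuration.
global_batch_size = 96
sec_per_iter = 2.8308                      # from the iteration 219/220 line
print(global_batch_size / sec_per_iter)    # ~33.91 samples/s, matches total_throughput

remaining_iters = 220 - (199 + 1)          # after the iteration 199/220 line
print(remaining_iters * 2.8298)            # ~56.6 s, matches eta 0:00:56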