loaded library: /lib/x86_64-linux-gnu/libibverbs.so.1
loaded library: /lib/x86_64-linux-gnu/libibverbs.so.1
loaded library: /lib/x86_64-linux-gnu/libibverbs.so.1
loaded library: /lib/x86_64-linux-gnu/libibverbs.so.1
loaded library: /lib/x86_64-linux-gnu/libibverbs.so.1
[10/25 13:46:37 libai]: Rank of current process: 0. World size: 8
[10/25 13:46:37 libai]: Command line arguments: Namespace(config_file='projects/T5/configs/mt5_pretrain.py', eval_only=False, fast_dev_run=False, opts=['train.output_dir=test_logs/2n4g'], resume=False)
[10/25 13:46:37 libai]: Contents of args.config_file=projects/T5/configs/mt5_pretrain.py:

from libai import evaluation
from libai.data.build import build_nlp_train_loader
from omegaconf import OmegaConf

from libai.config import LazyCall
from libai.evaluation import PPLEvaluator, evaluator
from libai.scheduler import WarmupExponentialLR

from configs.common.train import train
from configs.common.models.graph import graph
from projects.T5.configs.optim import optim
from projects.T5.configs.t5_model_config import cfg
from projects.T5.datasets.dataset import UnsuperviseT5Dataset, collate_fn
from projects.T5.models.t5_model import T5ForPreTraining

train_data_path = "projects/T5/data/training_data/part_0"

graph.debug = 1
# graph.auto_parallel.enabled = True

micro_batch_size = 16
optim["lr"] = 1e-5

# dataloader
dataloader = OmegaConf.create()
dataloader.train = LazyCall(build_nlp_train_loader)(
    dataset=[
        LazyCall(UnsuperviseT5Dataset)(
            data_path=train_data_path,
        )
    ],
    collate_fn=collate_fn(
        vocab_size=12902,
        max_seq_length=512,
        noise_density=0.15,
        mean_noise_span_length=3,
        eos_token_id=2,
        pad_token_id=0,
        decoder_start_token_id=1,
    ),
)

model = LazyCall(T5ForPreTraining)(cfg=cfg)

# model config
model.cfg.vocab_size = 12900
model.cfg.hidden_size = 768
model.cfg.hidden_layers = 12
model.cfg.num_attention_heads = 12
model.cfg.head_size = 64
model.cfg.intermediate_size = 3072
model.cfg.hidden_dropout_prob = 0.1
model.cfg.attention_probs_dropout_prob = 0.1
model.cfg.embedding_dropout_prob = 0.1
model.cfg.layernorm_eps = 1e-6
model.cfg.model_type = "mt5"
model.cfg.pretrained_model_path = None

train.update(
    dict(
        output_dir="projects/T5/output/mt5_output",
        train_micro_batch_size=micro_batch_size,
        train_iter=220,
        log_period=1,
        num_accumulation_steps=8,
        amp=dict(enabled=True),
        warmup_ratio=0.01,
        checkpointer=dict(period=10000, max_to_keep=10),
        dist=dict(
            data_parallel_size=4,
            tensor_parallel_size=2,
            pipeline_parallel_size=1,
            pipeline_num_layers=2 * model.cfg.hidden_layers,
        ),
        scheduler=LazyCall(WarmupExponentialLR)(
            warmup_factor=0.001,
            gamma=1.0,
            warmup_method="linear",
            warmup_iter=0.0,
        ),
        evaluation=dict(
            evaluator=LazyCall(PPLEvaluator)(),
            enabled=True,
            eval_iter=20,
            eval_period=5000,
        ),
    )
)

train.zero_optimization.enabled = True
train.zero_optimization.stage = 2

[10/25 13:46:37 libai]: Full config saved to test_logs/2n4g/config.yaml
[10/25 13:46:37 lb.engine.default]: > compiling dataset index builder ...
make: Entering directory '/home/xuyongning/zero_test/t5_test/libai/libai/data/data_utils'
make: Nothing to be done for 'default'.
make: Leaving directory '/home/xuyongning/zero_test/t5_test/libai/libai/data/data_utils'
[10/25 13:46:37 lb.engine.default]: >>> done with dataset index builder. Compilation time: 0.128 seconds
[10/25 13:46:37 lb.engine.default]: >>> done with compiling. Compilation time: 0.130 seconds
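Two things in this config are worth pinning down before reading the shapes below. First, train.zero_optimization.stage = 2 asks ZeRO to shard optimizer states and gradients across the data-parallel ranks. Second, the batch arithmetic: with train_micro_batch_size=16, num_accumulation_steps=8 and data_parallel_size=4, one micro-step sees 16 x 4 = 64 samples and one optimizer step sees 512, which is exactly where the (64, ...) module inputs and (512, ...) graph inputs in the trace below come from. A minimal sanity check (my own sketch, not part of the run; it assumes LibAI treats train_micro_batch_size as per data-parallel rank and rounds the warmup up):

import math

micro_batch_size = 16        # train_micro_batch_size, per data-parallel rank (assumed)
num_accumulation_steps = 8
data_parallel_size = 4
tensor_parallel_size = 2

assert data_parallel_size * tensor_parallel_size == 8      # "World size: 8" above

per_step_batch = micro_batch_size * data_parallel_size     # 64  -> (64, 512) module inputs
global_batch = per_step_batch * num_accumulation_steps     # 512 -> (512, 512) graph inputs
warmup_iter = math.ceil(0.01 * 220)                        # 3   -> "Auto-scaling ... train.warmup_iter=3"
print(per_step_batch, global_batch, warmup_iter)           # 64 512 3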
[10/25 13:46:39 lb.engine.default]: Prepare training, validating, testing set
[10/25 13:46:48 lb.engine.default]: Auto-scaling the config to train.train_iter=220, train.warmup_iter=3
[10/25 13:46:58 lb.engine.default]: Model: T5ForPreTraining(
  (t5_model): T5Model(
    (embedding): T5Embedding(
      (word_embeddings): VocabEmbedding(num_embeddings=12900, embedding_dim=768)
      (embedding_dropout): Dropout(p=0.1, inplace=False)
    )
    (extended_attn_mask): ExtendedMask()
    (encoder): Sequential(
      (layers): ModuleList(
        (0): TransformerLayer(
          (drop_path): Identity()
          (input_layernorm): LayerNorm()
          (self_attention): MultiheadAttention(
            hidden_size=768, num_heads=12, is_cross_attention=False
            (dropout): Dropout(p=0.1, inplace=False)
            (output_dropout): Dropout(p=0.1, inplace=False)
            (query_key_value): Linear1D(in_features=768, out_features=2304, bias=False, parallel=col)
            (dense): Linear1D(in_features=768, out_features=768, bias=False, parallel=row)
            (relative_attention_bias): Embedding(num_embeddings=32, embedding_dim=12)
          )
          (post_attention_layernorm): LayerNorm()
          (mlp): MT5MLP(
            (wi_0): Linear1D(in_features=768, out_features=3072, bias=False, parallel=col)
            (wi_1): Linear1D(in_features=768, out_features=3072, bias=False, parallel=col)
            (activation_func): GeLUTanh()
            (wo): Linear1D(in_features=3072, out_features=768, bias=False, parallel=row)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (1)-(11): 11 x TransformerLayer(
          [same as encoder layer (0), except that self_attention has no (relative_attention_bias)]
        )
      )
      (final_layernorm): LayerNorm()
    )
    (decoder): Sequential(
      (layers): ModuleList(
        (0): TransformerLayer(
          (drop_path): Identity()
          (input_layernorm): LayerNorm()
          (self_attention): MultiheadAttention(
            hidden_size=768, num_heads=12, is_cross_attention=False
            (dropout): Dropout(p=0.1, inplace=False)
            (output_dropout): Dropout(p=0.1, inplace=False)
            (query_key_value): Linear1D(in_features=768, out_features=2304, bias=False, parallel=col)
            (dense): Linear1D(in_features=768, out_features=768, bias=False, parallel=row)
            (relative_attention_bias): Embedding(num_embeddings=32, embedding_dim=12)
          )
          (post_attention_layernorm): LayerNorm()
          (cross_attention): MultiheadAttention(
            hidden_size=768, num_heads=12, is_cross_attention=True
            (dropout): Dropout(p=0.1, inplace=False)
            (output_dropout): Dropout(p=0.1, inplace=False)
            (query): Linear1D(in_features=768, out_features=768, bias=False, parallel=col)
            (key_value): Linear1D(in_features=768, out_features=1536, bias=False, parallel=col)
            (dense): Linear1D(in_features=768, out_features=768, bias=False, parallel=row)
          )
          (post_cross_attention_layernorm): LayerNorm()
          (mlp): MT5MLP(
            (wi_0): Linear1D(in_features=768, out_features=3072, bias=False, parallel=col)
            (wi_1): Linear1D(in_features=768, out_features=3072, bias=False, parallel=col)
            (activation_func): GeLUTanh()
            (wo): Linear1D(in_features=3072, out_features=768, bias=False, parallel=row)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (1)-(11): 11 x TransformerLayer(
          [same as decoder layer (0), except that self_attention has no (relative_attention_bias)]
        )
      )
      (final_layernorm): LayerNorm()
    )
    (lm_head): Linear1D(in_features=768, out_features=12900, bias=False, parallel=data)
  )
  (loss_func): T5Loss(
    (lm_loss): ParallelCrossEntropyLoss()
  )
)
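Only layer (0) of each stack owns a relative_attention_bias table. That is the standard T5/mT5 layout: the first layer computes the relative position bias and every later layer reuses it, which is why the condensed layers (1)-(11) above lack the submodule; the graph trace below shows the hand-off directly (layers.0 is traced with position_bias=None and emits a second (64, 12, 512, 512) output, which layers.1 receives as position_bias). A toy sketch of the pattern (my illustrative names, not LibAI's code):

def toy_layer(x, position_bias=None):
    if position_bias is None:          # only the first layer computes the bias
        position_bias = 0.5            # stands in for relative_attention_bias(...)
    return x + position_bias, position_bias

x, bias = 0.0, None
for _ in range(12):                    # 12 encoder layers
    x, bias = toy_layer(x, bias)       # computed once, then passed along
print(x, bias)                         # 6.0 0.5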
[10/25 13:46:58 libai]: build model time: 9.94480848312378 s
[10/25 13:46:59 lb.engine.default]: Graph debug mode on, automatically output debug info.
[10/25 13:46:59 lb.engine.default]: Graph debug mode on, automatically output debug info.
[10/25 13:47:00 lb.engine.trainer]: Starting training from iteration 0
(GRAPH:GraphBase_0:GraphBase) start building graph.
(GRAPH:GraphBase_0:GraphBase) start building graph builders of parameters and buffers.
(GRAPH:GraphBase_0:GraphBase) end building graph builders of parameters and buffers.
(GRAPH:GraphBase_0:GraphBase) start building graph inputs.
(INPUT:_GraphBase_0_input.1.0_encoder_input_ids:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), size=(512, 512), dtype=oneflow.int64))
(INPUT:_GraphBase_0_input.1.1_decoder_input_ids:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), size=(512, 114), dtype=oneflow.int64))
(INPUT:_GraphBase_0_input.1.2_encoder_attn_mask:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), size=(512, 512), dtype=oneflow.bool))
(INPUT:_GraphBase_0_input.1.3_decoder_attn_mask:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), size=(512, 114), dtype=oneflow.bool))
(INPUT:_GraphBase_0_input.1.4_encoder_decoder_attn_mask:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), size=(512, 114, 512), dtype=oneflow.bool))
(INPUT:_GraphBase_0_input.1.5_lm_labels:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), size=(512, 114), dtype=oneflow.int64))
(INPUT:_GraphBase_0_input.1.6_loss_mask:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), size=(512, 114), dtype=oneflow.int64))
(GRAPH:GraphBase_0:GraphBase) end building graph inputs.
(GRAPH:GraphBase_0:GraphBase) start building graph modules.
[10/25 13:47:12 lb.models.utils.graph_base]: Start compiling the train graph which may take some time. Please wait for a moment ...
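Placement and SBP are the two coordinates to read in every record that follows. The placement fixes which ranks hold a tensor, and ranks=[[0, 1], [2, 3], [4, 5], [6, 7]] is a (4, 2) hierarchy: 4 data-parallel groups (rows) of 2 tensor-parallel ranks each (columns), matching dist=dict(data_parallel_size=4, tensor_parallel_size=2) in the config. The SBP tuple gives one component per hierarchy axis: S(i) splits along tensor dim i, B broadcasts (replicates), and P marks partial sums awaiting a reduction; so sbp=(split(dim=0), broadcast) means batch-sharded across the 4 groups and replicated inside each group. The same notation drives the reshape-op diagnostic further down, which selects candidate idx 1, (S(0), S(2)) -> (S(0), S(2)). A minimal sketch of the notation (mine, not LibAI code; it needs 8 visible CUDA devices, e.g. under python3 -m oneflow.distributed.launch --nproc_per_node 8):

import oneflow as flow

# The (4, 2) device hierarchy from the trace.
placement = flow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]])

# sbp=(split(dim=0), broadcast): dim 0 is sharded across the 4 rows, and both
# tensor-parallel ranks inside a row hold identical replicas.
ids = flow.zeros(512, 512, dtype=flow.int64,
                 placement=placement,
                 sbp=[flow.sbp.split(0), flow.sbp.broadcast])
print(ids.to_local().shape)  # a (128, 512) shard of the (512, 512) input on every rank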
(MODULE:model:T5ForPreTraining())
(INPUT:_model_input.1.0_encoder_input_ids:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512), dtype=oneflow.int64))
(INPUT:_model_input.1.1_decoder_input_ids:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114), dtype=oneflow.int64))
(INPUT:_model_input.1.2_encoder_attn_mask:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512), dtype=oneflow.bool))
(INPUT:_model_input.1.3_decoder_attn_mask:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114), dtype=oneflow.bool))
(INPUT:_model_input.1.4_encoder_decoder_attn_mask:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 512), dtype=oneflow.bool))
(INPUT:_model_input.1.5_lm_labels:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114), dtype=oneflow.int64))
(INPUT:_model_input.1.6_loss_mask:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114), dtype=oneflow.int64))
(MODULE:model.t5_model:T5Model())
(INPUT:_model.t5_model_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512), dtype=oneflow.int64))
(INPUT:_model.t5_model_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114), dtype=oneflow.int64))
(INPUT:_model.t5_model_input.0.2_4:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512), dtype=oneflow.bool))
(INPUT:_model.t5_model_input.0.3_5:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114), dtype=oneflow.bool))
(INPUT:_model.t5_model_input.0.4_6:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 512), dtype=oneflow.bool))
[WARNING](INPUT:_model.t5_model_input.1.0_use_cache:)
(MODULE:model.t5_model.extended_attn_mask:ExtendedMask())
(INPUT:_model.t5_model.extended_attn_mask_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512), dtype=oneflow.bool))
(INPUT:_model.t5_model.extended_attn_mask_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512), dtype=oneflow.bool)) is a Tensor, insert_to_global transformation has been done.
(INPUT:_model.t5_model.extended_attn_mask_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512), dtype=oneflow.bool)) is a Tensor, insert_identity transformation has been done.
(OUTPUT:_model.t5_model.extended_attn_mask_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 1, 512), dtype=oneflow.bool))
(MODULE:model.t5_model.embedding:T5Embedding())
(INPUT:_model.t5_model.embedding_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512), dtype=oneflow.int64))
(INPUT:_model.t5_model.embedding_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512), dtype=oneflow.int64)) is a Tensor, insert_to_global transformation has been done.
(INPUT:_model.t5_model.embedding_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512), dtype=oneflow.int64)) is a Tensor, insert_identity transformation has been done.
(MODULE:model.t5_model.embedding.word_embeddings:VocabEmbedding(num_embeddings=12900, embedding_dim=768))
(INPUT:_model.t5_model.embedding.word_embeddings_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512), dtype=oneflow.int64))
(PARAMETER:model.t5_model.embedding.word_embeddings.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(12900, 768), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.embedding.word_embeddings_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.embedding.embedding_dropout:Dropout(p=0.1, inplace=False))
(INPUT:_model.t5_model.embedding.embedding_dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.embedding.embedding_dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.embedding_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.encoder.layers.0:TransformerLayer())
(INPUT:_model.t5_model.encoder.layers.0_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(INPUT:_model.t5_model.encoder.layers.0_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 1, 512), dtype=oneflow.bool))
[WARNING](INPUT:_model.t5_model.encoder.layers.0_input.1.0_position_bias:)
(INPUT:_model.t5_model.encoder.layers.0_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_to_global transformation has been done.
(INPUT:_model.t5_model.encoder.layers.0_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 1, 512), dtype=oneflow.bool)) is a Tensor, insert_to_global transformation has been done.
[WARNING](INPUT:_model.t5_model.encoder.layers.0_input.1.0_position_bias:) is not a Tensor, insert_to_global transformation will be ignored.
(INPUT:_model.t5_model.encoder.layers.0_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done.
(INPUT:_model.t5_model.encoder.layers.0_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 1, 512), dtype=oneflow.bool)) is a Tensor, insert_identity transformation has been done.
[WARNING](INPUT:_model.t5_model.encoder.layers.0_input.1.0_position_bias:) is not a Tensor, insert_identity transformation will be ignored.
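The [WARNING] lines above are benign: in debug mode the tracer wraps every module input, and keyword inputs that are None at trace time (use_cache, past_key_value, and position_bias for layer 0) simply cannot be converted to global tensors. A guess at the shape of that logic (illustrative only, not LibAI's implementation):

import oneflow as flow

def maybe_to_global(value, placement, sbp):
    # Tensor inputs get the insert_to_global transformation; anything else
    # (e.g. a None flag) is skipped, producing the "[WARNING] ... is not a
    # Tensor ... will be ignored" lines seen in the trace.
    if isinstance(value, flow.Tensor):
        return value.to_global(placement=placement, sbp=sbp)
    return value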
(MODULE:model.t5_model.encoder.layers.0.input_layernorm:LayerNorm())
(INPUT:_model.t5_model.encoder.layers.0.input_layernorm_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.encoder.layers.0.input_layernorm.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), size=(768,), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.encoder.layers.0.input_layernorm_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.encoder.layers.0.self_attention:MultiheadAttention(hidden_size=768, num_heads=12, is_cross_attention=False))
(INPUT:_model.t5_model.encoder.layers.0.self_attention_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(INPUT:_model.t5_model.encoder.layers.0.self_attention_input.1.0_attention_mask:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 1, 512), dtype=oneflow.bool))
[WARNING](INPUT:_model.t5_model.encoder.layers.0.self_attention_input.1.1_past_key_value:)
[WARNING](INPUT:_model.t5_model.encoder.layers.0.self_attention_input.1.2_position_bias:)
[WARNING](INPUT:_model.t5_model.encoder.layers.0.self_attention_input.1.3_use_cache:)
(MODULE:model.t5_model.encoder.layers.0.self_attention.query_key_value:Linear1D(in_features=768, out_features=2304, bias=False, parallel=col))
(INPUT:_model.t5_model.encoder.layers.0.self_attention.query_key_value_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.encoder.layers.0.self_attention.query_key_value.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(2304, 768), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.encoder.layers.0.self_attention.query_key_value_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 512, 2304), dtype=oneflow.float32, grad_fn=))
Producer: (S(0), S(2)), placement: hierarchy: (4,2), device: cuda, Shape: (64,512,2304)
  idx: 0, sbp: (S(0), S(0)), placement: hierarchy: (4,2), device: cuda
  idx: 1, sbp: (S(0), S(2)), placement: hierarchy: (4,2), device: cuda
op: `model.t5_model.encoder.layers.0.self_attention-reshape-29` can't find an available sbp signature.
candidate nd sbp signatures are: (in_0) -> (out_0): [
  ((S(0), S(0))) -> ((S(0), S(0))), ((S(0), S(2))) -> ((S(0), S(2))), ((S(0), S(1))) -> ((S(0), S(1))), ((S(0), P)) -> ((S(0), P)), ((S(0), B)) -> ((S(0), B)),
  ((S(2), S(0))) -> ((S(2), S(0))), ((S(2), S(2))) -> ((S(2), S(2))), ((S(2), S(1))) -> ((S(2), S(1))), ((S(2), P)) -> ((S(2), P)), ((S(2), B)) -> ((S(2), B)),
  ((S(1), S(0))) -> ((S(1), S(0))), ((S(1), S(2))) -> ((S(1), S(2))), ((S(1), S(1))) -> ((S(1), S(1))), ((S(1), P)) -> ((S(1), P)), ((S(1), B)) -> ((S(1), B)),
  ((P, S(0))) -> ((P, S(0))), ((P, S(2))) -> ((P, S(2))), ((P, S(1))) -> ((P, S(1))), ((P, P)) -> ((P, P)), ((P, B)) -> ((P, B)),
  ((B, S(0))) -> ((B, S(0))), ((B, S(2))) -> ((B, S(2))), ((B, S(1))) -> ((B, S(1))), ((B, P)) -> ((B, P)), ((B, B)) -> ((B, B)),
], but inputs sbp are: in_0: (S(0), S(2)); select idx: 1
(MODULE:model.t5_model.encoder.layers.0.self_attention.relative_attention_bias:Embedding(num_embeddings=32, embedding_dim=12))
(INPUT:_model.t5_model.encoder.layers.0.self_attention.relative_attention_bias_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), is_lazy='True', size=(512, 512), dtype=oneflow.int64))
(PARAMETER:model.t5_model.encoder.layers.0.self_attention.relative_attention_bias.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), size=(32, 12), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.encoder.layers.0.self_attention.relative_attention_bias_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), is_lazy='True', size=(512, 512, 12), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.encoder.layers.0.self_attention.dropout:Dropout(p=0.1, inplace=False))
(INPUT:_model.t5_model.encoder.layers.0.self_attention.dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 512, 512), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.encoder.layers.0.self_attention.dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 512, 512), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.encoder.layers.0.self_attention.dense:Linear1D(in_features=768, out_features=768, bias=False, parallel=row))
(INPUT:_model.t5_model.encoder.layers.0.self_attention.dense_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.encoder.layers.0.self_attention.dense.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=1)), size=(768, 768), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.encoder.layers.0.self_attention.dense_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.encoder.layers.0.self_attention.output_dropout:Dropout(p=0.1, inplace=False))
(INPUT:_model.t5_model.encoder.layers.0.self_attention.output_dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.encoder.layers.0.self_attention.output_dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.encoder.layers.0.self_attention_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.encoder.layers.0.self_attention_output.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 512, 512), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.encoder.layers.0.drop_path:Identity())
(INPUT:_model.t5_model.encoder.layers.0.drop_path_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.encoder.layers.0.drop_path_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.encoder.layers.0.post_attention_layernorm:LayerNorm())
(INPUT:_model.t5_model.encoder.layers.0.post_attention_layernorm_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.encoder.layers.0.post_attention_layernorm.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), size=(768,), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.encoder.layers.0.post_attention_layernorm_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.encoder.layers.0.mlp:MT5MLP())
(INPUT:_model.t5_model.encoder.layers.0.mlp_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.encoder.layers.0.mlp.wi_0:Linear1D(in_features=768, out_features=3072, bias=False, parallel=col))
(INPUT:_model.t5_model.encoder.layers.0.mlp.wi_0_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.encoder.layers.0.mlp.wi_0.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(3072, 768), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.encoder.layers.0.mlp.wi_0_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 512, 3072), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.encoder.layers.0.mlp.activation_func:GeLUTanh())
(INPUT:_model.t5_model.encoder.layers.0.mlp.activation_func_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 512, 3072), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.encoder.layers.0.mlp.activation_func_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 512, 3072), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.encoder.layers.0.mlp.wi_1:Linear1D(in_features=768, out_features=3072, bias=False, parallel=col))
(INPUT:_model.t5_model.encoder.layers.0.mlp.wi_1_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.encoder.layers.0.mlp.wi_1.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(3072, 768), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.encoder.layers.0.mlp.wi_1_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 512, 3072), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.encoder.layers.0.mlp.wo:Linear1D(in_features=3072, out_features=768, bias=False, parallel=row))
(INPUT:_model.t5_model.encoder.layers.0.mlp.wo_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 512, 3072), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.encoder.layers.0.mlp.wo.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=1)), size=(768, 3072), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.encoder.layers.0.mlp.wo_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.encoder.layers.0.mlp.dropout:Dropout(p=0.1, inplace=False))
(INPUT:_model.t5_model.encoder.layers.0.mlp.dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.encoder.layers.0.mlp.dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.encoder.layers.0.mlp_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.encoder.layers.0.drop_path:Identity())
(INPUT:_model.t5_model.encoder.layers.0.drop_path_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.encoder.layers.0.drop_path_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.encoder.layers.0_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.encoder.layers.0_output.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 512, 512), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.encoder.layers.1:TransformerLayer())
(INPUT:_model.t5_model.encoder.layers.1_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(INPUT:_model.t5_model.encoder.layers.1_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 1, 512), dtype=oneflow.bool))
(INPUT:_model.t5_model.encoder.layers.1_input.1.0_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 512, 512), dtype=oneflow.float32, grad_fn=))
(INPUT:_model.t5_model.encoder.layers.1_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_to_global transformation has been done.
(INPUT:_model.t5_model.encoder.layers.1_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 1, 512), dtype=oneflow.bool)) is a Tensor, insert_to_global transformation has been done.
(INPUT:_model.t5_model.encoder.layers.1_input.1.0_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 512, 512), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_to_global transformation has been done.
(INPUT:_model.t5_model.encoder.layers.1_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done.
(INPUT:_model.t5_model.encoder.layers.1_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 1, 512), dtype=oneflow.bool)) is a Tensor, insert_identity transformation has been done.
(INPUT:_model.t5_model.encoder.layers.1_input.1.0_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 512, 512), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done.
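Within each layer the SBP transitions follow the Megatron-style column/row pairing visible above: parallel=col layers (query_key_value, wi_0, wi_1) take (S(0), B) activations to (S(0), S(2)) by splitting output features across the 2 tensor-parallel ranks, and parallel=row layers (dense, wo) take (S(0), S(2)) back to (S(0), B) via a reduction. A plain-numpy sketch of mine, with two simulated tensor-parallel ranks, showing why the pairing reproduces the serial result exactly:

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 768))          # replicated input (sbp: B)
w_col = rng.standard_normal((768, 2304))   # qkv weight, split along out features
w_row = rng.standard_normal((2304, 768))   # dense weight, split along in features

# Column parallel: each rank computes its slice of the output features (S(2)).
y0, y1 = x @ w_col[:, :1152], x @ w_col[:, 1152:]
# Row parallel: each rank consumes its slice, producing partial sums (P); the
# all-reduce that turns P into B is modelled here by the final "+".
z = y0 @ w_row[:1152, :] + y1 @ w_row[1152:, :]

assert np.allclose(z, (x @ w_col) @ w_row)  # matches the unsharded computation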
(MODULE:model.t5_model.encoder.layers.1:TransformerLayer())
(INPUT:_model.t5_model.encoder.layers.1_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(INPUT:_model.t5_model.encoder.layers.1_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 1, 512), dtype=oneflow.bool))
(INPUT:_model.t5_model.encoder.layers.1_input.1.0_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 512, 512), dtype=oneflow.float32, grad_fn=))
(INPUT:_model.t5_model.encoder.layers.1_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_to_global transformation has been done.
(INPUT:_model.t5_model.encoder.layers.1_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 1, 512), dtype=oneflow.bool)) is a Tensor, insert_to_global transformation has been done.
(INPUT:_model.t5_model.encoder.layers.1_input.1.0_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 512, 512), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_to_global transformation has been done.
(INPUT:_model.t5_model.encoder.layers.1_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done.
(INPUT:_model.t5_model.encoder.layers.1_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 1, 512), dtype=oneflow.bool)) is a Tensor, insert_identity transformation has been done.
(INPUT:_model.t5_model.encoder.layers.1_input.1.0_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 512, 512), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done.
(MODULE:model.t5_model.encoder.layers.1.input_layernorm:LayerNorm())
(INPUT:_model.t5_model.encoder.layers.1.input_layernorm_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.encoder.layers.1.input_layernorm.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), size=(768,), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.encoder.layers.1.input_layernorm_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.encoder.layers.1.self_attention:MultiheadAttention(hidden_size=768, num_heads=12, is_cross_attention=False))
(INPUT:_model.t5_model.encoder.layers.1.self_attention_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(INPUT:_model.t5_model.encoder.layers.1.self_attention_input.1.0_attention_mask:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 1, 512), dtype=oneflow.bool))
[WARNING](INPUT:_model.t5_model.encoder.layers.1.self_attention_input.1.1_past_key_value:)
(INPUT:_model.t5_model.encoder.layers.1.self_attention_input.1.2_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 512, 512), dtype=oneflow.float32, grad_fn=))
[WARNING](INPUT:_model.t5_model.encoder.layers.1.self_attention_input.1.3_use_cache:)
(MODULE:model.t5_model.encoder.layers.1.self_attention.query_key_value:Linear1D(in_features=768, out_features=2304, bias=False, parallel=col))
(INPUT:_model.t5_model.encoder.layers.1.self_attention.query_key_value_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.encoder.layers.1.self_attention.query_key_value.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(2304, 768), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.encoder.layers.1.self_attention.query_key_value_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 512, 2304), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.encoder.layers.1.self_attention.dropout:Dropout(p=0.1, inplace=False))
(INPUT:_model.t5_model.encoder.layers.1.self_attention.dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 512, 512), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.encoder.layers.1.self_attention.dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 512, 512), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.encoder.layers.1.self_attention.dense:Linear1D(in_features=768, out_features=768, bias=False, parallel=row))
(INPUT:_model.t5_model.encoder.layers.1.self_attention.dense_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.encoder.layers.1.self_attention.dense.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=1)), size=(768, 768), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.encoder.layers.1.self_attention.dense_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.encoder.layers.1.self_attention.output_dropout:Dropout(p=0.1, inplace=False))
(INPUT:_model.t5_model.encoder.layers.1.self_attention.output_dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.encoder.layers.1.self_attention.output_dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.encoder.layers.1.self_attention_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.encoder.layers.1.self_attention_output.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 512, 512), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.encoder.layers.1.drop_path:Identity())
(INPUT:_model.t5_model.encoder.layers.1.drop_path_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.encoder.layers.1.drop_path_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.encoder.layers.1.post_attention_layernorm:LayerNorm())
(INPUT:_model.t5_model.encoder.layers.1.post_attention_layernorm_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.encoder.layers.1.post_attention_layernorm.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), size=(768,), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.encoder.layers.1.post_attention_layernorm_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.encoder.layers.1.mlp:MT5MLP())
(INPUT:_model.t5_model.encoder.layers.1.mlp_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.encoder.layers.1.mlp.wi_0:Linear1D(in_features=768, out_features=3072, bias=False, parallel=col))
(INPUT:_model.t5_model.encoder.layers.1.mlp.wi_0_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.encoder.layers.1.mlp.wi_0.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(3072, 768), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.encoder.layers.1.mlp.wi_0_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 512, 3072), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.encoder.layers.1.mlp.activation_func:GeLUTanh())
(INPUT:_model.t5_model.encoder.layers.1.mlp.activation_func_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 512, 3072), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.encoder.layers.1.mlp.activation_func_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 512, 3072), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.encoder.layers.1.mlp.wi_1:Linear1D(in_features=768, out_features=3072, bias=False, parallel=col))
(INPUT:_model.t5_model.encoder.layers.1.mlp.wi_1_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.encoder.layers.1.mlp.wi_1.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(3072, 768), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.encoder.layers.1.mlp.wi_1_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 512, 3072), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.encoder.layers.1.mlp.wo:Linear1D(in_features=3072, out_features=768, bias=False, parallel=row))
(INPUT:_model.t5_model.encoder.layers.1.mlp.wo_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 512, 3072), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.encoder.layers.1.mlp.wo.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=1)), size=(768, 3072), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.encoder.layers.1.mlp.wo_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.encoder.layers.1.mlp.dropout:Dropout(p=0.1, inplace=False))
(INPUT:_model.t5_model.encoder.layers.1.mlp.dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.encoder.layers.1.mlp.dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.encoder.layers.1.mlp_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
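The SBP bookkeeping through MT5MLP above is the Megatron-style column/row pairing: wi_0 and wi_1 are column-parallel (weight (3072, 768) with sbp (broadcast, split(dim=0)), so the (64, 512, 3072) hidden activation comes out split(dim=2) on the tensor-parallel axis), the gated activation is applied elementwise on those aligned shards, and wo is row-parallel (weight (768, 3072) with sbp (broadcast, split(dim=1))), whose per-rank partial products are reduced back to the (split(dim=0), broadcast) signature of the residual stream. A single-process NumPy simulation of why that reduction is exact (an illustration only, not LiBai or OneFlow code; the gated-GeLU form follows the wi_0 / activation_func / wi_1 order in the trace):

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 768))        # stand-in for a (batch*seq, 768) slice
wi0 = rng.standard_normal((3072, 768))   # column-parallel weight, sharded S(0)
wi1 = rng.standard_normal((3072, 768))   # column-parallel weight, sharded S(0)
wo = rng.standard_normal((768, 3072))    # row-parallel weight, sharded S(1)

def gelu_tanh(t):
    # tanh-approximated GeLU, as in the GeLUTanh module above
    return 0.5 * t * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (t + 0.044715 * t**3)))

# Reference: the full, unsharded gated MLP.
y_ref = (gelu_tanh(x @ wi0.T) * (x @ wi1.T)) @ wo.T

# Tensor-parallel degree 2, matching the 4x2 mesh's second axis.
y = np.zeros_like(y_ref)
for k in range(2):
    wi0_k = np.split(wi0, 2, axis=0)[k]
    wi1_k = np.split(wi1, 2, axis=0)[k]
    wo_k = np.split(wo, 2, axis=1)[k]
    h_k = gelu_tanh(x @ wi0_k.T) * (x @ wi1_k.T)  # (4, 1536): the S(2)-split hidden
    y += h_k @ wo_k.T                             # partial sums; += plays the all-reduce

assert np.allclose(y, y_ref)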
(MODULE:model.t5_model.encoder.layers.1.drop_path:Identity())
(INPUT:_model.t5_model.encoder.layers.1.drop_path_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.encoder.layers.1.drop_path_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.encoder.layers.1_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.encoder.layers.1_output.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 512, 512), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.encoder.layers.2:TransformerLayer())
(INPUT:_model.t5_model.encoder.layers.2_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(INPUT:_model.t5_model.encoder.layers.2_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 1, 512), dtype=oneflow.bool))
(INPUT:_model.t5_model.encoder.layers.2_input.1.0_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 512, 512), dtype=oneflow.float32, grad_fn=))
(INPUT:_model.t5_model.encoder.layers.2_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_to_global transformation has been done.
(INPUT:_model.t5_model.encoder.layers.2_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 1, 512), dtype=oneflow.bool)) is a Tensor, insert_to_global transformation has been done.
(INPUT:_model.t5_model.encoder.layers.2_input.1.0_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 512, 512), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_to_global transformation has been done.
(INPUT:_model.t5_model.encoder.layers.2_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done.
(INPUT:_model.t5_model.encoder.layers.2_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 1, 512), dtype=oneflow.bool)) is a Tensor, insert_identity transformation has been done.
(INPUT:_model.t5_model.encoder.layers.2_input.1.0_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 512, 512), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done.
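The repeated "insert_to_global ... has been done" and "insert_identity ... has been done" lines are emitted while the static graph is being built: each free input of a layer is first promoted to a global tensor carrying an explicit placement/SBP signature, and is then additionally recorded as passing through an identity transformation at the layer boundary. A minimal sketch of the to_global step itself (a standalone illustration, not LiBai's implementation; a 1-GPU placement keeps it runnable anywhere):

import oneflow as flow

# Promote a local tensor to a global one with an explicit placement and SBP
# signature -- the operation the trace reports for each layer input (there with
# the 4x2 mesh and a (split(0), broadcast) signature).
placement = flow.placement(type="cuda", ranks=[0])
x = flow.randn(16, 512, 768)                 # plain local tensor
x_g = x.to_global(placement=placement, sbp=flow.sbp.broadcast)
print(x_g.is_global)                         # True
print(x_g.placement, x_g.sbp)

# Calling to_global again with a different sbp redistributes an existing global
# tensor, which is how a consumer signature such as split(dim=2) into a
# row-parallel Linear1D is reconciled with a producer that emitted broadcast.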
[... the remainder of the encoder layer 2 trace, and the complete traces for encoder layers 3 and 4, repeat the layer 1 records above verbatim, differing only in the layer index ...]
(MODULE:model.t5_model.encoder.layers.5:TransformerLayer())
(INPUT:_model.t5_model.encoder.layers.5_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(INPUT:_model.t5_model.encoder.layers.5_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]),
sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 1, 512), dtype=oneflow.bool)) (INPUT:_model.t5_model.encoder.layers.5_input.1.0_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 512, 512), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.encoder.layers.5_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_to_global transformation has been done. (INPUT:_model.t5_model.encoder.layers.5_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 1, 512), dtype=oneflow.bool)) is a Tensor, insert_to_global transformation has been done. (INPUT:_model.t5_model.encoder.layers.5_input.1.0_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 512, 512), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_to_global transformation has been done. (INPUT:_model.t5_model.encoder.layers.5_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done. (INPUT:_model.t5_model.encoder.layers.5_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 1, 512), dtype=oneflow.bool)) is a Tensor, insert_identity transformation has been done. (INPUT:_model.t5_model.encoder.layers.5_input.1.0_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 512, 512), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done. 
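Note on the six messages above: before tracing each TransformerLayer, the nn.Graph wrapper promotes every eager block input to a global tensor (insert_to_global) and then apparently routes it through an identity op (insert_identity) so the block boundary is explicit in the compiled graph. A minimal sketch of the same promotion, using only the public oneflow API (shapes and the placement/SBP signature are taken from the trace; the file name and variable names are illustrative):

    # Run under a distributed launcher on all 8 GPUs, e.g.:
    #   python3 -m oneflow.distributed.launch --nproc_per_node 8 sketch.py
    import oneflow as flow

    # 2-D device hierarchy from the trace: axis 0 = 4 data-parallel groups,
    # axis 1 = 2 tensor-parallel ranks per group (dist.data_parallel_size=4,
    # dist.tensor_parallel_size=2).
    P = flow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]])

    # Each rank holds a micro-batch of 16 (train_micro_batch_size); the two
    # ranks in a tensor-parallel group are assumed to hold identical data.
    local_shard = flow.randn(16, 512, 768, device="cuda")

    # split(0) over the data-parallel axis concatenates the 4 group shards
    # into the global batch of 64; broadcast replicates over the
    # tensor-parallel axis -- exactly the signature of the layer inputs above.
    hidden_states = local_shard.to_global(
        placement=P, sbp=(flow.sbp.split(0), flow.sbp.broadcast)
    )
    print(hidden_states.shape)  # oneflow.Size([64, 512, 768])
    print(hidden_states.sbp)    # (oneflow.sbp.split(dim=0), oneflow.sbp.broadcast)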
[... the per-module trace repeats for encoder layers 5, 6, and 7 with only the layer index changed: the same modules, SBP signatures, shapes, insert_to_global/insert_identity messages, and [WARNING] lines for the unset past_key_value and use_cache inputs ...]
(MODULE:model.t5_model.encoder.layers.8:TransformerLayer())
(INPUT:_model.t5_model.encoder.layers.8_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(INPUT:_model.t5_model.encoder.layers.8_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 1, 512), dtype=oneflow.bool))
(INPUT:_model.t5_model.encoder.layers.8_input.1.0_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 512, 512), dtype=oneflow.float32, grad_fn=))
(INPUT:_model.t5_model.encoder.layers.8_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_to_global transformation has been done.
(INPUT:_model.t5_model.encoder.layers.8_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 1, 512), dtype=oneflow.bool)) is a Tensor, insert_to_global transformation has been done.
(INPUT:_model.t5_model.encoder.layers.8_input.1.0_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 512, 512), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_to_global transformation has been done.
(INPUT:_model.t5_model.encoder.layers.8_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done.
(INPUT:_model.t5_model.encoder.layers.8_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 1, 512), dtype=oneflow.bool)) is a Tensor, insert_identity transformation has been done.
(INPUT:_model.t5_model.encoder.layers.8_input.1.0_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 512, 512), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done.
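Note on the SBP signatures in this trace: they spell out the Megatron-style tensor-parallel split. Column-parallel Linear1D layers (query_key_value, wi_0, wi_1) shard the weight along the output dimension, so activations leave with sbp split(dim=2); row-parallel layers (dense, wo) shard along the input dimension, consume split(dim=2) activations, and their per-rank partial sums are reduced back to broadcast. A hedged sketch of that algebra with plain matmuls (weights are stored (in, out) here for brevity, whereas Linear1D parameters in the trace are (out, in); names are illustrative, and the row-parallel projection is fed the qkv output directly just to show the SBP flow):

    import oneflow as flow

    P = flow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]])
    S, B = flow.sbp.split, flow.sbp.broadcast

    # Activations: batch split over the data-parallel axis, replicated over
    # the tensor-parallel axis.
    x = flow.randn(64, 512, 768, placement=P, sbp=(S(0), B))

    # Column-parallel weight: split on the output features (2304 = 3 * 768
    # for fused q/k/v).
    w_col = flow.randn(768, 2304, placement=P, sbp=(B, S(1)))
    h = flow.matmul(x, w_col)
    print(h.sbp)  # expected: (split(dim=0), split(dim=2)), as in query_key_value_output

    # Row-parallel weight: split on the input features, so each tensor-parallel
    # rank produces a partial sum that must be all-reduced within its group.
    w_row = flow.randn(2304, 768, placement=P, sbp=(B, S(0)))
    y = flow.matmul(h, w_row)
    y = y.to_global(placement=P, sbp=(S(0), B))  # reduce partials -> broadcast
    print(y.sbp)  # expected: (split(dim=0), broadcast), as in dense_output

    # Shard check for the attention scores: sbp=(split(0), split(1)) on the
    # (64, 12, 512, 512) tensor leaves each rank 64/4 = 16 samples and
    # 12/2 = 6 of the 12 heads.
    scores = flow.randn(64, 12, 512, 512, placement=P, sbp=(S(0), S(1)))
    print(scores.to_local().shape)  # oneflow.Size([16, 6, 512, 512])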
(MODULE:model.t5_model.encoder.layers.8.input_layernorm:LayerNorm()) (INPUT:_model.t5_model.encoder.layers.8.input_layernorm_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.encoder.layers.8.input_layernorm.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), size=(768,), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.encoder.layers.8.input_layernorm_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.encoder.layers.8.self_attention:MultiheadAttention(hidden_size=768, num_heads=12, is_cross_attention=False)) (INPUT:_model.t5_model.encoder.layers.8.self_attention_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.encoder.layers.8.self_attention_input.1.0_attention_mask:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 1, 512), dtype=oneflow.bool)) [WARNING](INPUT:_model.t5_model.encoder.layers.8.self_attention_input.1.1_past_key_value:) (INPUT:_model.t5_model.encoder.layers.8.self_attention_input.1.2_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 512, 512), dtype=oneflow.float32, grad_fn=)) [WARNING](INPUT:_model.t5_model.encoder.layers.8.self_attention_input.1.3_use_cache:) (MODULE:model.t5_model.encoder.layers.8.self_attention.query_key_value:Linear1D(in_features=768, out_features=2304, bias=False, parallel=col)) (INPUT:_model.t5_model.encoder.layers.8.self_attention.query_key_value_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.encoder.layers.8.self_attention.query_key_value.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(2304, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.encoder.layers.8.self_attention.query_key_value_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 512, 2304), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.encoder.layers.8.self_attention.dropout:Dropout(p=0.1, inplace=False)) (INPUT:_model.t5_model.encoder.layers.8.self_attention.dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 512, 512), dtype=oneflow.float32, grad_fn=)) 
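The global sizes logged here hide the per-rank shards: a dim carrying split(dim=d) on a mesh axis is divided by that axis's size (4 for data parallel, 2 for tensor parallel), while broadcast dims are held in full. So the query_key_value output, size=(64, 512, 2304) with sbp (S(0), S(2)), is a (16, 512, 1152) slice per GPU, and the attention-probability dropout input, (64, 12, 512, 512) with (S(0), S(1)), puts 6 of the 12 heads on each tensor-parallel rank. A throwaway helper (hypothetical, not part of LiBai or OneFlow) for that bookkeeping:

def local_shape(global_shape, sbp, mesh_shape=(4, 2)):
    """Per-rank shard shape on a 2D mesh.

    sbp entries: int d for split(dim=d), None for broadcast (which keeps
    the full global extent on every rank along that mesh axis).
    """
    shape = list(global_shape)
    for axis_size, s in zip(mesh_shape, sbp):
        if s is not None:
            assert shape[s] % axis_size == 0, "uneven shard"
            shape[s] //= axis_size
    return tuple(shape)

# qkv output, (S(0), S(2)) on the 4x2 mesh -> (16, 512, 1152) per GPU
print(local_shape((64, 512, 2304), (0, 2)))
# attention probs, (S(0), S(1)) -> (16, 6, 512, 512) per GPU
print(local_shape((64, 12, 512, 512), (0, 1)))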
(OUTPUT:_model.t5_model.encoder.layers.8.self_attention.dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 512, 512), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.encoder.layers.8.self_attention.dense:Linear1D(in_features=768, out_features=768, bias=False, parallel=row)) (INPUT:_model.t5_model.encoder.layers.8.self_attention.dense_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.encoder.layers.8.self_attention.dense.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=1)), size=(768, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.encoder.layers.8.self_attention.dense_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.encoder.layers.8.self_attention.output_dropout:Dropout(p=0.1, inplace=False)) (INPUT:_model.t5_model.encoder.layers.8.self_attention.output_dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.encoder.layers.8.self_attention.output_dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.encoder.layers.8.self_attention_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.encoder.layers.8.self_attention_output.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 512, 512), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.encoder.layers.8.drop_path:Identity()) (INPUT:_model.t5_model.encoder.layers.8.drop_path_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.encoder.layers.8.drop_path_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.encoder.layers.8.post_attention_layernorm:LayerNorm()) (INPUT:_model.t5_model.encoder.layers.8.post_attention_layernorm_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 
512, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.encoder.layers.8.post_attention_layernorm.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), size=(768,), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.encoder.layers.8.post_attention_layernorm_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.encoder.layers.8.mlp:MT5MLP()) (INPUT:_model.t5_model.encoder.layers.8.mlp_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.encoder.layers.8.mlp.wi_0:Linear1D(in_features=768, out_features=3072, bias=False, parallel=col)) (INPUT:_model.t5_model.encoder.layers.8.mlp.wi_0_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.encoder.layers.8.mlp.wi_0.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(3072, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.encoder.layers.8.mlp.wi_0_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 512, 3072), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.encoder.layers.8.mlp.activation_func:GeLUTanh()) (INPUT:_model.t5_model.encoder.layers.8.mlp.activation_func_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 512, 3072), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.encoder.layers.8.mlp.activation_func_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 512, 3072), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.encoder.layers.8.mlp.wi_1:Linear1D(in_features=768, out_features=3072, bias=False, parallel=col)) (INPUT:_model.t5_model.encoder.layers.8.mlp.wi_1_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.encoder.layers.8.mlp.wi_1.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(3072, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.encoder.layers.8.mlp.wi_1_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 512, 3072), dtype=oneflow.float32, grad_fn=)) 
(MODULE:model.t5_model.encoder.layers.8.mlp.wo:Linear1D(in_features=3072, out_features=768, bias=False, parallel=row)) (INPUT:_model.t5_model.encoder.layers.8.mlp.wo_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 512, 3072), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.encoder.layers.8.mlp.wo.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=1)), size=(768, 3072), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.encoder.layers.8.mlp.wo_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.encoder.layers.8.mlp.dropout:Dropout(p=0.1, inplace=False)) (INPUT:_model.t5_model.encoder.layers.8.mlp.dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.encoder.layers.8.mlp.dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.encoder.layers.8.mlp_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.encoder.layers.8.drop_path:Identity()) (INPUT:_model.t5_model.encoder.layers.8.drop_path_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.encoder.layers.8.drop_path_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.encoder.layers.8_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.encoder.layers.8_output.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 512, 512), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.encoder.layers.9:TransformerLayer()) (INPUT:_model.t5_model.encoder.layers.9_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.encoder.layers.9_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), 
sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 1, 512), dtype=oneflow.bool)) (INPUT:_model.t5_model.encoder.layers.9_input.1.0_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 512, 512), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.encoder.layers.9_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_to_global transformation has been done. (INPUT:_model.t5_model.encoder.layers.9_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 1, 512), dtype=oneflow.bool)) is a Tensor, insert_to_global transformation has been done. (INPUT:_model.t5_model.encoder.layers.9_input.1.0_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 512, 512), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_to_global transformation has been done. (INPUT:_model.t5_model.encoder.layers.9_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done. (INPUT:_model.t5_model.encoder.layers.9_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 1, 512), dtype=oneflow.bool)) is a Tensor, insert_identity transformation has been done. (INPUT:_model.t5_model.encoder.layers.9_input.1.0_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 512, 512), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done. 
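The MLP records repeat one tensor-parallel round trip per layer: the column-parallel wi_0/wi_1 weights, sbp (B, S(0)) over their (3072, 768) shape, emit activations as (S(0), S(2)) with the 3072-wide feature dim sharded over the tensor-parallel axis; the row-parallel wo weight, (B, S(1)) over (768, 3072), consumes that sharded dim, and its output is logged as (S(0), B) again, meaning the partial products from the two tensor-parallel ranks have been summed. A sketch of that last hop under OneFlow's sbp inference (random data; treat the inference details as an assumption -- this is not LiBai's Linear1D code):

import oneflow as flow

placement = flow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]])

# activations with the feature dim sharded over the tensor-parallel axis
x = flow.randn(64, 512, 3072, placement=placement,
               sbp=[flow.sbp.split(0), flow.sbp.split(2)])
# row-parallel weight in (out_features, in_features) layout, input dim sharded
w = flow.randn(768, 3072, placement=placement,
               sbp=[flow.sbp.broadcast, flow.sbp.split(1)])

# contracting two sharded dims yields partial sums on the tensor-parallel axis
y = flow.matmul(x, w.transpose(0, 1))
# casting back to (S(0), B) -- the sbp the wo output records show -- performs
# the all-reduce across each tensor-parallel pair
y = y.to_global(placement=placement,
                sbp=[flow.sbp.split(0), flow.sbp.broadcast])
print(y.shape, y.sbp)  # (64, 512, 768), (S(0), B)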
(MODULE:model.t5_model.encoder.layers.9.input_layernorm:LayerNorm()) (INPUT:_model.t5_model.encoder.layers.9.input_layernorm_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.encoder.layers.9.input_layernorm.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), size=(768,), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.encoder.layers.9.input_layernorm_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.encoder.layers.9.self_attention:MultiheadAttention(hidden_size=768, num_heads=12, is_cross_attention=False)) (INPUT:_model.t5_model.encoder.layers.9.self_attention_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.encoder.layers.9.self_attention_input.1.0_attention_mask:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 1, 512), dtype=oneflow.bool)) [WARNING](INPUT:_model.t5_model.encoder.layers.9.self_attention_input.1.1_past_key_value:) (INPUT:_model.t5_model.encoder.layers.9.self_attention_input.1.2_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 512, 512), dtype=oneflow.float32, grad_fn=)) [WARNING](INPUT:_model.t5_model.encoder.layers.9.self_attention_input.1.3_use_cache:) (MODULE:model.t5_model.encoder.layers.9.self_attention.query_key_value:Linear1D(in_features=768, out_features=2304, bias=False, parallel=col)) (INPUT:_model.t5_model.encoder.layers.9.self_attention.query_key_value_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.encoder.layers.9.self_attention.query_key_value.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(2304, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.encoder.layers.9.self_attention.query_key_value_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 512, 2304), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.encoder.layers.9.self_attention.dropout:Dropout(p=0.1, inplace=False)) (INPUT:_model.t5_model.encoder.layers.9.self_attention.dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 512, 512), dtype=oneflow.float32, grad_fn=)) 
(OUTPUT:_model.t5_model.encoder.layers.9.self_attention.dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 512, 512), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.encoder.layers.9.self_attention.dense:Linear1D(in_features=768, out_features=768, bias=False, parallel=row)) (INPUT:_model.t5_model.encoder.layers.9.self_attention.dense_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.encoder.layers.9.self_attention.dense.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=1)), size=(768, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.encoder.layers.9.self_attention.dense_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.encoder.layers.9.self_attention.output_dropout:Dropout(p=0.1, inplace=False)) (INPUT:_model.t5_model.encoder.layers.9.self_attention.output_dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.encoder.layers.9.self_attention.output_dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.encoder.layers.9.self_attention_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.encoder.layers.9.self_attention_output.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 512, 512), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.encoder.layers.9.drop_path:Identity()) (INPUT:_model.t5_model.encoder.layers.9.drop_path_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.encoder.layers.9.drop_path_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.encoder.layers.9.post_attention_layernorm:LayerNorm()) (INPUT:_model.t5_model.encoder.layers.9.post_attention_layernorm_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 
512, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.encoder.layers.9.post_attention_layernorm.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), size=(768,), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.encoder.layers.9.post_attention_layernorm_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.encoder.layers.9.mlp:MT5MLP()) (INPUT:_model.t5_model.encoder.layers.9.mlp_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.encoder.layers.9.mlp.wi_0:Linear1D(in_features=768, out_features=3072, bias=False, parallel=col)) (INPUT:_model.t5_model.encoder.layers.9.mlp.wi_0_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.encoder.layers.9.mlp.wi_0.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(3072, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.encoder.layers.9.mlp.wi_0_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 512, 3072), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.encoder.layers.9.mlp.activation_func:GeLUTanh()) (INPUT:_model.t5_model.encoder.layers.9.mlp.activation_func_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 512, 3072), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.encoder.layers.9.mlp.activation_func_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 512, 3072), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.encoder.layers.9.mlp.wi_1:Linear1D(in_features=768, out_features=3072, bias=False, parallel=col)) (INPUT:_model.t5_model.encoder.layers.9.mlp.wi_1_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.encoder.layers.9.mlp.wi_1.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(3072, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.encoder.layers.9.mlp.wi_1_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 512, 3072), dtype=oneflow.float32, grad_fn=)) 
(MODULE:model.t5_model.encoder.layers.9.mlp.wo:Linear1D(in_features=3072, out_features=768, bias=False, parallel=row)) (INPUT:_model.t5_model.encoder.layers.9.mlp.wo_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 512, 3072), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.encoder.layers.9.mlp.wo.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=1)), size=(768, 3072), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.encoder.layers.9.mlp.wo_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.encoder.layers.9.mlp.dropout:Dropout(p=0.1, inplace=False)) (INPUT:_model.t5_model.encoder.layers.9.mlp.dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.encoder.layers.9.mlp.dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.encoder.layers.9.mlp_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.encoder.layers.9.drop_path:Identity()) (INPUT:_model.t5_model.encoder.layers.9.drop_path_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.encoder.layers.9.drop_path_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.encoder.layers.9_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.encoder.layers.9_output.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 512, 512), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.encoder.layers.10:TransformerLayer()) (INPUT:_model.t5_model.encoder.layers.10_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.encoder.layers.10_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), 
sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 1, 512), dtype=oneflow.bool)) (INPUT:_model.t5_model.encoder.layers.10_input.1.0_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 512, 512), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.encoder.layers.10_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_to_global transformation has been done. (INPUT:_model.t5_model.encoder.layers.10_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 1, 512), dtype=oneflow.bool)) is a Tensor, insert_to_global transformation has been done. (INPUT:_model.t5_model.encoder.layers.10_input.1.0_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 512, 512), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_to_global transformation has been done. (INPUT:_model.t5_model.encoder.layers.10_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done. (INPUT:_model.t5_model.encoder.layers.10_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 1, 512), dtype=oneflow.bool)) is a Tensor, insert_identity transformation has been done. (INPUT:_model.t5_model.encoder.layers.10_input.1.0_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 512, 512), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done. 
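Each TransformerLayer in this trace returns two values, the (64, 512, 768) hidden states (..._output.0.0_2) and the (64, 12, 512, 512) position bias (..._output.0.1_3), and the next layer takes the bias back as its 1.0_position_bias input: the relative-position bias is materialized once at the bottom of the encoder and reused by every later layer, which is why only layer 0 carries a relative_attention_bias embedding. Schematically (an illustration of the dataflow visible in the records, not LiBai's actual encoder code):

def run_encoder(layers, hidden_states, attention_mask):
    # layer 0 owns relative_attention_bias and produces position_bias;
    # every later layer receives and returns the same bias tensor
    position_bias = None
    for layer in layers:
        hidden_states, position_bias = layer(
            hidden_states, attention_mask, position_bias=position_bias
        )
    return hidden_states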
(MODULE:model.t5_model.encoder.layers.10.input_layernorm:LayerNorm()) (INPUT:_model.t5_model.encoder.layers.10.input_layernorm_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.encoder.layers.10.input_layernorm.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), size=(768,), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.encoder.layers.10.input_layernorm_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.encoder.layers.10.self_attention:MultiheadAttention(hidden_size=768, num_heads=12, is_cross_attention=False)) (INPUT:_model.t5_model.encoder.layers.10.self_attention_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.encoder.layers.10.self_attention_input.1.0_attention_mask:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 1, 512), dtype=oneflow.bool)) [WARNING](INPUT:_model.t5_model.encoder.layers.10.self_attention_input.1.1_past_key_value:) (INPUT:_model.t5_model.encoder.layers.10.self_attention_input.1.2_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 512, 512), dtype=oneflow.float32, grad_fn=)) [WARNING](INPUT:_model.t5_model.encoder.layers.10.self_attention_input.1.3_use_cache:) (MODULE:model.t5_model.encoder.layers.10.self_attention.query_key_value:Linear1D(in_features=768, out_features=2304, bias=False, parallel=col)) (INPUT:_model.t5_model.encoder.layers.10.self_attention.query_key_value_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.encoder.layers.10.self_attention.query_key_value.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(2304, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.encoder.layers.10.self_attention.query_key_value_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 512, 2304), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.encoder.layers.10.self_attention.dropout:Dropout(p=0.1, inplace=False)) (INPUT:_model.t5_model.encoder.layers.10.self_attention.dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 512, 512), dtype=oneflow.float32, 
grad_fn=)) (OUTPUT:_model.t5_model.encoder.layers.10.self_attention.dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 512, 512), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.encoder.layers.10.self_attention.dense:Linear1D(in_features=768, out_features=768, bias=False, parallel=row)) (INPUT:_model.t5_model.encoder.layers.10.self_attention.dense_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.encoder.layers.10.self_attention.dense.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=1)), size=(768, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.encoder.layers.10.self_attention.dense_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.encoder.layers.10.self_attention.output_dropout:Dropout(p=0.1, inplace=False)) (INPUT:_model.t5_model.encoder.layers.10.self_attention.output_dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.encoder.layers.10.self_attention.output_dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.encoder.layers.10.self_attention_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.encoder.layers.10.self_attention_output.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 512, 512), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.encoder.layers.10.drop_path:Identity()) (INPUT:_model.t5_model.encoder.layers.10.drop_path_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.encoder.layers.10.drop_path_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.encoder.layers.10.post_attention_layernorm:LayerNorm()) (INPUT:_model.t5_model.encoder.layers.10.post_attention_layernorm_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), 
is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.encoder.layers.10.post_attention_layernorm.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), size=(768,), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.encoder.layers.10.post_attention_layernorm_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.encoder.layers.10.mlp:MT5MLP()) (INPUT:_model.t5_model.encoder.layers.10.mlp_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.encoder.layers.10.mlp.wi_0:Linear1D(in_features=768, out_features=3072, bias=False, parallel=col)) (INPUT:_model.t5_model.encoder.layers.10.mlp.wi_0_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.encoder.layers.10.mlp.wi_0.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(3072, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.encoder.layers.10.mlp.wi_0_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 512, 3072), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.encoder.layers.10.mlp.activation_func:GeLUTanh()) (INPUT:_model.t5_model.encoder.layers.10.mlp.activation_func_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 512, 3072), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.encoder.layers.10.mlp.activation_func_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 512, 3072), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.encoder.layers.10.mlp.wi_1:Linear1D(in_features=768, out_features=3072, bias=False, parallel=col)) (INPUT:_model.t5_model.encoder.layers.10.mlp.wi_1_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.encoder.layers.10.mlp.wi_1.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(3072, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.encoder.layers.10.mlp.wi_1_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 512, 3072), 
dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.encoder.layers.10.mlp.wo:Linear1D(in_features=3072, out_features=768, bias=False, parallel=row)) (INPUT:_model.t5_model.encoder.layers.10.mlp.wo_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 512, 3072), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.encoder.layers.10.mlp.wo.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=1)), size=(768, 3072), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.encoder.layers.10.mlp.wo_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.encoder.layers.10.mlp.dropout:Dropout(p=0.1, inplace=False)) (INPUT:_model.t5_model.encoder.layers.10.mlp.dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.encoder.layers.10.mlp.dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.encoder.layers.10.mlp_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.encoder.layers.10.drop_path:Identity()) (INPUT:_model.t5_model.encoder.layers.10.drop_path_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.encoder.layers.10.drop_path_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.encoder.layers.10_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.encoder.layers.10_output.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 512, 512), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.encoder.layers.11:TransformerLayer()) (INPUT:_model.t5_model.encoder.layers.11_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.encoder.layers.11_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", 
ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 1, 512), dtype=oneflow.bool)) (INPUT:_model.t5_model.encoder.layers.11_input.1.0_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 512, 512), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.encoder.layers.11_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_to_global transformation has been done. (INPUT:_model.t5_model.encoder.layers.11_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 1, 512), dtype=oneflow.bool)) is a Tensor, insert_to_global transformation has been done. (INPUT:_model.t5_model.encoder.layers.11_input.1.0_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 512, 512), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_to_global transformation has been done. (INPUT:_model.t5_model.encoder.layers.11_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done. (INPUT:_model.t5_model.encoder.layers.11_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 1, 512), dtype=oneflow.bool)) is a Tensor, insert_identity transformation has been done. (INPUT:_model.t5_model.encoder.layers.11_input.1.0_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 512, 512), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done. 
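Two other fixtures of this trace are worth decoding. The empty [WARNING](INPUT:..._past_key_value:) and ..._use_cache: records mark positional arguments that are not tensors (key/value caching is off during pretraining), so the debug hook has no tensor repr to print for them -- reading them as harmless noise rather than failures is an inference from the empty records, not a documented statement. And each tensor input is echoed again with "insert_to_global transformation has been done" and "insert_identity transformation has been done": these are the boundary ops the nn.Graph compiler wraps around every block, amounting to a cast like the following (a sketch; the message wording is OneFlow's, the internals are an assumption):

import oneflow as flow

placement = flow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]])
x = flow.randn(64, 512, 768, placement=placement,
               sbp=[flow.sbp.split(0), flow.sbp.broadcast])

# already global with the target placement/sbp, so this cast is a no-op --
# the graph still records it as an explicit boundary op
x = x.to_global(placement=placement,
                sbp=[flow.sbp.split(0), flow.sbp.broadcast])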
(MODULE:model.t5_model.encoder.layers.11.input_layernorm:LayerNorm()) (INPUT:_model.t5_model.encoder.layers.11.input_layernorm_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.encoder.layers.11.input_layernorm.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), size=(768,), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.encoder.layers.11.input_layernorm_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.encoder.layers.11.self_attention:MultiheadAttention(hidden_size=768, num_heads=12, is_cross_attention=False)) (INPUT:_model.t5_model.encoder.layers.11.self_attention_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.encoder.layers.11.self_attention_input.1.0_attention_mask:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 1, 512), dtype=oneflow.bool)) [WARNING](INPUT:_model.t5_model.encoder.layers.11.self_attention_input.1.1_past_key_value:) (INPUT:_model.t5_model.encoder.layers.11.self_attention_input.1.2_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 512, 512), dtype=oneflow.float32, grad_fn=)) [WARNING](INPUT:_model.t5_model.encoder.layers.11.self_attention_input.1.3_use_cache:) (MODULE:model.t5_model.encoder.layers.11.self_attention.query_key_value:Linear1D(in_features=768, out_features=2304, bias=False, parallel=col)) (INPUT:_model.t5_model.encoder.layers.11.self_attention.query_key_value_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.encoder.layers.11.self_attention.query_key_value.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(2304, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.encoder.layers.11.self_attention.query_key_value_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 512, 2304), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.encoder.layers.11.self_attention.dropout:Dropout(p=0.1, inplace=False)) (INPUT:_model.t5_model.encoder.layers.11.self_attention.dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 512, 512), dtype=oneflow.float32, 
grad_fn=)) (OUTPUT:_model.t5_model.encoder.layers.11.self_attention.dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 512, 512), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.encoder.layers.11.self_attention.dense:Linear1D(in_features=768, out_features=768, bias=False, parallel=row)) (INPUT:_model.t5_model.encoder.layers.11.self_attention.dense_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.encoder.layers.11.self_attention.dense.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=1)), size=(768, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.encoder.layers.11.self_attention.dense_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.encoder.layers.11.self_attention.output_dropout:Dropout(p=0.1, inplace=False)) (INPUT:_model.t5_model.encoder.layers.11.self_attention.output_dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.encoder.layers.11.self_attention.output_dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.encoder.layers.11.self_attention_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.encoder.layers.11.self_attention_output.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 512, 512), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.encoder.layers.11.drop_path:Identity()) (INPUT:_model.t5_model.encoder.layers.11.drop_path_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.encoder.layers.11.drop_path_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.encoder.layers.11.post_attention_layernorm:LayerNorm()) (INPUT:_model.t5_model.encoder.layers.11.post_attention_layernorm_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), 
is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.encoder.layers.11.post_attention_layernorm.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), size=(768,), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.encoder.layers.11.post_attention_layernorm_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.encoder.layers.11.mlp:MT5MLP()) (INPUT:_model.t5_model.encoder.layers.11.mlp_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.encoder.layers.11.mlp.wi_0:Linear1D(in_features=768, out_features=3072, bias=False, parallel=col)) (INPUT:_model.t5_model.encoder.layers.11.mlp.wi_0_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.encoder.layers.11.mlp.wi_0.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(3072, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.encoder.layers.11.mlp.wi_0_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 512, 3072), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.encoder.layers.11.mlp.activation_func:GeLUTanh()) (INPUT:_model.t5_model.encoder.layers.11.mlp.activation_func_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 512, 3072), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.encoder.layers.11.mlp.activation_func_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 512, 3072), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.encoder.layers.11.mlp.wi_1:Linear1D(in_features=768, out_features=3072, bias=False, parallel=col)) (INPUT:_model.t5_model.encoder.layers.11.mlp.wi_1_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.encoder.layers.11.mlp.wi_1.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(3072, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.encoder.layers.11.mlp.wi_1_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 512, 3072), 
dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.encoder.layers.11.mlp.wo:Linear1D(in_features=3072, out_features=768, bias=False, parallel=row)) (INPUT:_model.t5_model.encoder.layers.11.mlp.wo_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 512, 3072), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.encoder.layers.11.mlp.wo.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=1)), size=(768, 3072), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.encoder.layers.11.mlp.wo_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.encoder.layers.11.mlp.dropout:Dropout(p=0.1, inplace=False)) (INPUT:_model.t5_model.encoder.layers.11.mlp.dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.encoder.layers.11.mlp.dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.encoder.layers.11.mlp_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.encoder.layers.11.drop_path:Identity()) (INPUT:_model.t5_model.encoder.layers.11.drop_path_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.encoder.layers.11.drop_path_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.encoder.layers.11_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.encoder.layers.11_output.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 512, 512), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.encoder.final_layernorm:LayerNorm()) (INPUT:_model.t5_model.encoder.final_layernorm_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.encoder.final_layernorm.weight:tensor(..., 
placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), size=(768,), dtype=oneflow.float32, requires_grad=True)) (INPUT:_model.t5_model.encoder.final_layernorm_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_to_global transformation has been done. (INPUT:_model.t5_model.encoder.final_layernorm_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done. (OUTPUT:_model.t5_model.encoder.final_layernorm_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.extended_attn_mask:ExtendedMask()) (INPUT:_model.t5_model.extended_attn_mask_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114), dtype=oneflow.bool)) (INPUT:_model.t5_model.extended_attn_mask_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114), dtype=oneflow.int64)) [WARNING](INPUT:_model.t5_model.extended_attn_mask_input.1.0_is_decoder:) (INPUT:_model.t5_model.extended_attn_mask_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114), dtype=oneflow.bool)) is a Tensor, insert_to_global transformation has been done. (INPUT:_model.t5_model.extended_attn_mask_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114), dtype=oneflow.int64)) is a Tensor, insert_to_global transformation has been done. [WARNING](INPUT:_model.t5_model.extended_attn_mask_input.1.0_is_decoder:) is not a Tensor, insert_to_global transformation will be ignored. (INPUT:_model.t5_model.extended_attn_mask_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114), dtype=oneflow.bool)) is a Tensor, insert_identity transformation has been done. (INPUT:_model.t5_model.extended_attn_mask_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114), dtype=oneflow.int64)) is a Tensor, insert_identity transformation has been done. [WARNING](INPUT:_model.t5_model.extended_attn_mask_input.1.0_is_decoder:) is not a Tensor, insert_identity transformation will be ignored. 
(OUTPUT:_model.t5_model.extended_attn_mask_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 114), dtype=oneflow.bool)) (MODULE:model.t5_model.extended_attn_mask:ExtendedMask()) (INPUT:_model.t5_model.extended_attn_mask_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 512), dtype=oneflow.bool)) (INPUT:_model.t5_model.extended_attn_mask_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 512), dtype=oneflow.bool)) is a Tensor, insert_to_global transformation has been done. (INPUT:_model.t5_model.extended_attn_mask_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 512), dtype=oneflow.bool)) is a Tensor, insert_identity transformation has been done. (OUTPUT:_model.t5_model.extended_attn_mask_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 512), dtype=oneflow.bool)) (MODULE:model.t5_model.embedding:T5Embedding()) (INPUT:_model.t5_model.embedding_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114), dtype=oneflow.int64)) (INPUT:_model.t5_model.embedding_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114), dtype=oneflow.int64)) is a Tensor, insert_to_global transformation has been done. (INPUT:_model.t5_model.embedding_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114), dtype=oneflow.int64)) is a Tensor, insert_identity transformation has been done. 
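The two ExtendedMask calls traced above are pure shape plumbing: a (64, 114) decoder padding mask becomes the (64, 1, 114, 114) self-attention mask, and a (64, 114, 512) decoder-to-encoder pair mask becomes the (64, 1, 114, 512) cross-attention mask, each gaining a broadcastable head axis. A shape-level sketch (not ExtendedMask's actual code), assuming a causal decoder:

import oneflow as flow

B, T, S = 64, 114, 512
dec_pad = flow.ones(B, T, dtype=flow.bool)            # decoder padding mask, as traced
enc_pad = flow.ones(B, S, dtype=flow.bool)            # encoder padding mask
causal = flow.tril(flow.ones(T, T)).to(flow.bool)     # lower-triangular decoder constraint
self_mask = flow.logical_and(dec_pad[:, None, None, :], causal[None, None, :, :])  # (64, 1, 114, 114)
cross_pair = flow.logical_and(dec_pad[:, :, None], enc_pad[:, None, :])            # (64, 114, 512)
cross_mask = cross_pair[:, None, :, :]                                             # (64, 1, 114, 512)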
(MODULE:model.t5_model.embedding.word_embeddings:VocabEmbedding(num_embeddings=12900, embedding_dim=768)) (INPUT:_model.t5_model.embedding.word_embeddings_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114), dtype=oneflow.int64)) (PARAMETER:model.t5_model.embedding.word_embeddings.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(12900, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.embedding.word_embeddings_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.embedding.embedding_dropout:Dropout(p=0.1, inplace=False)) (INPUT:_model.t5_model.embedding.embedding_dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.embedding.embedding_dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.embedding_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.0:TransformerLayer()) (INPUT:_model.t5_model.decoder.layers.0_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.decoder.layers.0_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 114), dtype=oneflow.bool)) (INPUT:_model.t5_model.decoder.layers.0_input.0.2_4:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.decoder.layers.0_input.0.3_5:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 512), dtype=oneflow.bool)) [WARNING](INPUT:_model.t5_model.decoder.layers.0_input.1.0_past_key_value:) [WARNING](INPUT:_model.t5_model.decoder.layers.0_input.1.1_position_bias:) [WARNING](INPUT:_model.t5_model.decoder.layers.0_input.1.2_encoder_decoder_position_bias:) [WARNING](INPUT:_model.t5_model.decoder.layers.0_input.1.3_use_cache:) (INPUT:_model.t5_model.decoder.layers.0_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) is a 
Tensor, insert_to_global transformation has been done. (INPUT:_model.t5_model.decoder.layers.0_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 114), dtype=oneflow.bool)) is a Tensor, insert_to_global transformation has been done. (INPUT:_model.t5_model.decoder.layers.0_input.0.2_4:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_to_global transformation has been done. (INPUT:_model.t5_model.decoder.layers.0_input.0.3_5:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 512), dtype=oneflow.bool)) is a Tensor, insert_to_global transformation has been done. [WARNING](INPUT:_model.t5_model.decoder.layers.0_input.1.0_past_key_value:) is not a Tensor, insert_to_global transformation will be ignored. [WARNING](INPUT:_model.t5_model.decoder.layers.0_input.1.1_position_bias:) is not a Tensor, insert_to_global transformation will be ignored. [WARNING](INPUT:_model.t5_model.decoder.layers.0_input.1.2_encoder_decoder_position_bias:) is not a Tensor, insert_to_global transformation will be ignored. [WARNING](INPUT:_model.t5_model.decoder.layers.0_input.1.3_use_cache:) is not a Tensor, insert_to_global transformation will be ignored. (INPUT:_model.t5_model.decoder.layers.0_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done. (INPUT:_model.t5_model.decoder.layers.0_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 114), dtype=oneflow.bool)) is a Tensor, insert_identity transformation has been done. (INPUT:_model.t5_model.decoder.layers.0_input.0.2_4:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done. (INPUT:_model.t5_model.decoder.layers.0_input.0.3_5:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 512), dtype=oneflow.bool)) is a Tensor, insert_identity transformation has been done. [WARNING](INPUT:_model.t5_model.decoder.layers.0_input.1.0_past_key_value:) is not a Tensor, insert_identity transformation will be ignored. [WARNING](INPUT:_model.t5_model.decoder.layers.0_input.1.1_position_bias:) is not a Tensor, insert_identity transformation will be ignored. [WARNING](INPUT:_model.t5_model.decoder.layers.0_input.1.2_encoder_decoder_position_bias:) is not a Tensor, insert_identity transformation will be ignored. [WARNING](INPUT:_model.t5_model.decoder.layers.0_input.1.3_use_cache:) is not a Tensor, insert_identity transformation will be ignored. 
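Every decoder-side record runs at sequence length 114 while the encoder stays at 512. That pair follows from the collate_fn settings (max_seq_length=512, noise_density=0.15, mean_noise_span_length=3), assuming the standard T5 span-corruption bookkeeping: non-noise tokens plus one sentinel per span plus EOS on the encoder side, noise tokens plus sentinels plus EOS on the decoder side.

tokens_length = 568                                  # raw tokens consumed per example (derived)
num_noise = round(tokens_length * 0.15)              # 85 tokens masked out
num_spans = round(num_noise / 3)                     # 28 noise spans, one sentinel each
inputs_length = (tokens_length - num_noise) + num_spans + 1   # 483 + 28 + 1 = 512 encoder tokens
targets_length = num_noise + num_spans + 1                    # 85 + 28 + 1 = 114 decoder tokens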
(MODULE:model.t5_model.decoder.layers.0.input_layernorm:LayerNorm()) (INPUT:_model.t5_model.decoder.layers.0.input_layernorm_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.0.input_layernorm.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), size=(768,), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.0.input_layernorm_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.0.self_attention:MultiheadAttention(hidden_size=768, num_heads=12, is_cross_attention=False)) (INPUT:_model.t5_model.decoder.layers.0.self_attention_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.decoder.layers.0.self_attention_input.1.0_attention_mask:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 114), dtype=oneflow.bool)) [WARNING](INPUT:_model.t5_model.decoder.layers.0.self_attention_input.1.1_past_key_value:) [WARNING](INPUT:_model.t5_model.decoder.layers.0.self_attention_input.1.2_position_bias:) [WARNING](INPUT:_model.t5_model.decoder.layers.0.self_attention_input.1.3_use_cache:) (MODULE:model.t5_model.decoder.layers.0.self_attention.query_key_value:Linear1D(in_features=768, out_features=2304, bias=False, parallel=col)) (INPUT:_model.t5_model.decoder.layers.0.self_attention.query_key_value_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.0.self_attention.query_key_value.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(2304, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.0.self_attention.query_key_value_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 2304), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.0.self_attention.relative_attention_bias:Embedding(num_embeddings=32, embedding_dim=12)) (INPUT:_model.t5_model.decoder.layers.0.self_attention.relative_attention_bias_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), is_lazy='True', size=(114, 114), dtype=oneflow.int64)) (PARAMETER:model.t5_model.decoder.layers.0.self_attention.relative_attention_bias.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), 
sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), size=(32, 12), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.0.self_attention.relative_attention_bias_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), is_lazy='True', size=(114, 114, 12), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.0.self_attention.dropout:Dropout(p=0.1, inplace=False)) (INPUT:_model.t5_model.decoder.layers.0.self_attention.dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.0.self_attention.dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.0.self_attention.dense:Linear1D(in_features=768, out_features=768, bias=False, parallel=row)) (INPUT:_model.t5_model.decoder.layers.0.self_attention.dense_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.0.self_attention.dense.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=1)), size=(768, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.0.self_attention.dense_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.0.self_attention.output_dropout:Dropout(p=0.1, inplace=False)) (INPUT:_model.t5_model.decoder.layers.0.self_attention.output_dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.0.self_attention.output_dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.0.self_attention_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.0.self_attention_output.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.0.drop_path:Identity()) (INPUT:_model.t5_model.decoder.layers.0.drop_path_input.0.0_2:tensor(..., 
placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.0.drop_path_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.0.post_attention_layernorm:LayerNorm()) (INPUT:_model.t5_model.decoder.layers.0.post_attention_layernorm_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.0.post_attention_layernorm.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), size=(768,), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.0.post_attention_layernorm_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.0.cross_attention:MultiheadAttention(hidden_size=768, num_heads=12, is_cross_attention=True)) (INPUT:_model.t5_model.decoder.layers.0.cross_attention_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.decoder.layers.0.cross_attention_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.decoder.layers.0.cross_attention_input.1.0_attention_mask:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 512), dtype=oneflow.bool)) [WARNING](INPUT:_model.t5_model.decoder.layers.0.cross_attention_input.1.1_past_key_value:) [WARNING](INPUT:_model.t5_model.decoder.layers.0.cross_attention_input.1.2_position_bias:) [WARNING](INPUT:_model.t5_model.decoder.layers.0.cross_attention_input.1.3_use_cache:) [WARNING](INPUT:_model.t5_model.decoder.layers.0.cross_attention_input.1.4_query_length:) (MODULE:model.t5_model.decoder.layers.0.cross_attention.query:Linear1D(in_features=768, out_features=768, bias=False, parallel=col)) (INPUT:_model.t5_model.decoder.layers.0.cross_attention.query_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.0.cross_attention.query.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(768, 768), dtype=oneflow.float32, requires_grad=True)) 
(OUTPUT:_model.t5_model.decoder.layers.0.cross_attention.query_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.0.cross_attention.key_value:Linear1D(in_features=768, out_features=1536, bias=False, parallel=col)) (INPUT:_model.t5_model.decoder.layers.0.cross_attention.key_value_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.0.cross_attention.key_value.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(1536, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.0.cross_attention.key_value_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 512, 1536), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.0.cross_attention.dropout:Dropout(p=0.1, inplace=False)) (INPUT:_model.t5_model.decoder.layers.0.cross_attention.dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.0.cross_attention.dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.0.cross_attention.dense:Linear1D(in_features=768, out_features=768, bias=False, parallel=row)) (INPUT:_model.t5_model.decoder.layers.0.cross_attention.dense_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.0.cross_attention.dense.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=1)), size=(768, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.0.cross_attention.dense_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.0.cross_attention.output_dropout:Dropout(p=0.1, inplace=False)) (INPUT:_model.t5_model.decoder.layers.0.cross_attention.output_dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.0.cross_attention.output_dropout_output.0.0_2:tensor(..., 
placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.0.cross_attention_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.0.cross_attention_output.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32)) (MODULE:model.t5_model.decoder.layers.0.drop_path:Identity()) (INPUT:_model.t5_model.decoder.layers.0.drop_path_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.0.drop_path_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.0.post_cross_attention_layernorm:LayerNorm()) (INPUT:_model.t5_model.decoder.layers.0.post_cross_attention_layernorm_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.0.post_cross_attention_layernorm.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), size=(768,), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.0.post_cross_attention_layernorm_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.0.mlp:MT5MLP()) (INPUT:_model.t5_model.decoder.layers.0.mlp_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.0.mlp.wi_0:Linear1D(in_features=768, out_features=3072, bias=False, parallel=col)) (INPUT:_model.t5_model.decoder.layers.0.mlp.wi_0_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.0.mlp.wi_0.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(3072, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.0.mlp.wi_0_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), 
sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.0.mlp.activation_func:GeLUTanh()) (INPUT:_model.t5_model.decoder.layers.0.mlp.activation_func_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.0.mlp.activation_func_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.0.mlp.wi_1:Linear1D(in_features=768, out_features=3072, bias=False, parallel=col)) (INPUT:_model.t5_model.decoder.layers.0.mlp.wi_1_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.0.mlp.wi_1.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(3072, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.0.mlp.wi_1_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.0.mlp.wo:Linear1D(in_features=3072, out_features=768, bias=False, parallel=row)) (INPUT:_model.t5_model.decoder.layers.0.mlp.wo_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.0.mlp.wo.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=1)), size=(768, 3072), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.0.mlp.wo_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.0.mlp.dropout:Dropout(p=0.1, inplace=False)) (INPUT:_model.t5_model.decoder.layers.0.mlp.dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.0.mlp.dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.0.mlp_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), 
oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.0.drop_path:Identity()) (INPUT:_model.t5_model.decoder.layers.0.drop_path_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.0.drop_path_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.0_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.0_output.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.0_output.0.2_4:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32)) (MODULE:model.t5_model.decoder.layers.1:TransformerLayer()) (INPUT:_model.t5_model.decoder.layers.1_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.decoder.layers.1_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 114), dtype=oneflow.bool)) (INPUT:_model.t5_model.decoder.layers.1_input.0.2_4:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.decoder.layers.1_input.0.3_5:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 512), dtype=oneflow.bool)) [WARNING](INPUT:_model.t5_model.decoder.layers.1_input.1.0_past_key_value:) (INPUT:_model.t5_model.decoder.layers.1_input.1.1_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.decoder.layers.1_input.1.2_encoder_decoder_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32)) [WARNING](INPUT:_model.t5_model.decoder.layers.1_input.1.3_use_cache:) (INPUT:_model.t5_model.decoder.layers.1_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), 
sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_to_global transformation has been done. (INPUT:_model.t5_model.decoder.layers.1_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 114), dtype=oneflow.bool)) is a Tensor, insert_to_global transformation has been done. (INPUT:_model.t5_model.decoder.layers.1_input.0.2_4:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_to_global transformation has been done. (INPUT:_model.t5_model.decoder.layers.1_input.0.3_5:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 512), dtype=oneflow.bool)) is a Tensor, insert_to_global transformation has been done. [WARNING](INPUT:_model.t5_model.decoder.layers.1_input.1.0_past_key_value:) is not a Tensor, insert_to_global transformation will be ignored. (INPUT:_model.t5_model.decoder.layers.1_input.1.1_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_to_global transformation has been done. (INPUT:_model.t5_model.decoder.layers.1_input.1.2_encoder_decoder_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32)) is a Tensor, insert_to_global transformation has been done. [WARNING](INPUT:_model.t5_model.decoder.layers.1_input.1.3_use_cache:) is not a Tensor, insert_to_global transformation will be ignored. (INPUT:_model.t5_model.decoder.layers.1_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done. (INPUT:_model.t5_model.decoder.layers.1_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 114), dtype=oneflow.bool)) is a Tensor, insert_identity transformation has been done. (INPUT:_model.t5_model.decoder.layers.1_input.0.2_4:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done. (INPUT:_model.t5_model.decoder.layers.1_input.0.3_5:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 512), dtype=oneflow.bool)) is a Tensor, insert_identity transformation has been done. 
[WARNING](INPUT:_model.t5_model.decoder.layers.1_input.1.0_past_key_value:) is not a Tensor, insert_identity transformation will be ignored. (INPUT:_model.t5_model.decoder.layers.1_input.1.1_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done. (INPUT:_model.t5_model.decoder.layers.1_input.1.2_encoder_decoder_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32)) is a Tensor, insert_identity transformation has been done. [WARNING](INPUT:_model.t5_model.decoder.layers.1_input.1.3_use_cache:) is not a Tensor, insert_identity transformation will be ignored. (MODULE:model.t5_model.decoder.layers.1.input_layernorm:LayerNorm()) (INPUT:_model.t5_model.decoder.layers.1.input_layernorm_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.1.input_layernorm.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), size=(768,), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.1.input_layernorm_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.1.self_attention:MultiheadAttention(hidden_size=768, num_heads=12, is_cross_attention=False)) (INPUT:_model.t5_model.decoder.layers.1.self_attention_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.decoder.layers.1.self_attention_input.1.0_attention_mask:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 114), dtype=oneflow.bool)) [WARNING](INPUT:_model.t5_model.decoder.layers.1.self_attention_input.1.1_past_key_value:) (INPUT:_model.t5_model.decoder.layers.1.self_attention_input.1.2_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) [WARNING](INPUT:_model.t5_model.decoder.layers.1.self_attention_input.1.3_use_cache:) (MODULE:model.t5_model.decoder.layers.1.self_attention.query_key_value:Linear1D(in_features=768, out_features=2304, bias=False, parallel=col)) (INPUT:_model.t5_model.decoder.layers.1.self_attention.query_key_value_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) 
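Note the difference from layer 0: position_bias and encoder_decoder_position_bias now arrive as real (64, 12, 114, ...) tensors instead of [WARNING]-flagged non-Tensors, and layer 1's self_attention carries no relative_attention_bias submodule. Only layer 0 owns the Embedding(32, 12) bias table; every later layer reuses its output, T5-style. A toy sketch of that hand-off (a hypothetical stand-in, not the traced TransformerLayer):

import oneflow as flow

class ToyLayer(flow.nn.Module):
    # Hypothetical stand-in: only the first layer builds the relative bias.
    def __init__(self, has_bias):
        super().__init__()
        self.rel_bias = flow.nn.Embedding(32, 12) if has_bias else None

    def forward(self, h, position_bias=None):
        if position_bias is None:
            T = h.shape[1]
            buckets = flow.zeros(T, T, dtype=flow.int64)   # placeholder for bucketed relative positions
            position_bias = self.rel_bias(buckets).permute(2, 0, 1).unsqueeze(0)  # (1, 12, T, T)
        return h, position_bias                             # later layers just pass it along

layers = [ToyLayer(i == 0) for i in range(2)]
h, bias = flow.randn(64, 114, 768), None
for layer in layers:
    h, bias = layer(h, position_bias=bias)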
(PARAMETER:model.t5_model.decoder.layers.1.self_attention.query_key_value.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(2304, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.1.self_attention.query_key_value_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 2304), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.1.self_attention.dropout:Dropout(p=0.1, inplace=False)) (INPUT:_model.t5_model.decoder.layers.1.self_attention.dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.1.self_attention.dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.1.self_attention.dense:Linear1D(in_features=768, out_features=768, bias=False, parallel=row)) (INPUT:_model.t5_model.decoder.layers.1.self_attention.dense_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.1.self_attention.dense.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=1)), size=(768, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.1.self_attention.dense_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.1.self_attention.output_dropout:Dropout(p=0.1, inplace=False)) (INPUT:_model.t5_model.decoder.layers.1.self_attention.output_dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.1.self_attention.output_dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.1.self_attention_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.1.self_attention_output.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 
114), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.1.drop_path:Identity())
(INPUT:_model.t5_model.decoder.layers.1.drop_path_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.1.drop_path_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.1.post_attention_layernorm:LayerNorm())
(INPUT:_model.t5_model.decoder.layers.1.post_attention_layernorm_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.decoder.layers.1.post_attention_layernorm.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), size=(768,), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.decoder.layers.1.post_attention_layernorm_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.1.cross_attention:MultiheadAttention(hidden_size=768, num_heads=12, is_cross_attention=True))
(INPUT:_model.t5_model.decoder.layers.1.cross_attention_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(INPUT:_model.t5_model.decoder.layers.1.cross_attention_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(INPUT:_model.t5_model.decoder.layers.1.cross_attention_input.1.0_attention_mask:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 512), dtype=oneflow.bool))
[WARNING](INPUT:_model.t5_model.decoder.layers.1.cross_attention_input.1.1_past_key_value:)
(INPUT:_model.t5_model.decoder.layers.1.cross_attention_input.1.2_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32))
[WARNING](INPUT:_model.t5_model.decoder.layers.1.cross_attention_input.1.3_use_cache:)
[WARNING](INPUT:_model.t5_model.decoder.layers.1.cross_attention_input.1.4_query_length:)
(MODULE:model.t5_model.decoder.layers.1.cross_attention.query:Linear1D(in_features=768, out_features=768, bias=False, parallel=col))
(INPUT:_model.t5_model.decoder.layers.1.cross_attention.query_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.decoder.layers.1.cross_attention.query.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(768, 768), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.decoder.layers.1.cross_attention.query_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.1.cross_attention.key_value:Linear1D(in_features=768, out_features=1536, bias=False, parallel=col))
(INPUT:_model.t5_model.decoder.layers.1.cross_attention.key_value_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.decoder.layers.1.cross_attention.key_value.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(1536, 768), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.decoder.layers.1.cross_attention.key_value_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 512, 1536), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.1.cross_attention.dropout:Dropout(p=0.1, inplace=False))
(INPUT:_model.t5_model.decoder.layers.1.cross_attention.dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.1.cross_attention.dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.1.cross_attention.dense:Linear1D(in_features=768, out_features=768, bias=False, parallel=row))
(INPUT:_model.t5_model.decoder.layers.1.cross_attention.dense_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.decoder.layers.1.cross_attention.dense.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=1)), size=(768, 768), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.decoder.layers.1.cross_attention.dense_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
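The records above make the parallel layout legible: ranks=[[0, 1], [2, 3], [4, 5], [6, 7]] is a 2-D device hierarchy (four data-parallel groups of two tensor-parallel ranks), so every sbp is a pair, one entry per hierarchy axis. Activations enter a layer as (split(dim=0), broadcast); a parallel=col Linear1D stores its (out, in) weight as (broadcast, split(dim=0)) and emits activations split on the feature dim, while the parallel=row dense stores (broadcast, split(dim=1)) and its output comes back to broadcast. A minimal sketch of that round trip, not LiBai's Linear1D implementation, assuming an 8-GPU launch (e.g. python3 -m oneflow.distributed.launch --nproc_per_node 8 sbp_demo.py):

import oneflow as flow

# Same placement as in the trace: 4 data-parallel groups x 2 tensor-parallel ranks.
P = flow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]])
S0, B = flow.sbp.split(0), flow.sbp.broadcast

x = flow.randn(64, 114, 768, placement=P, sbp=[S0, B])                 # activations
w_col = flow.randn(768, 768, placement=P, sbp=[B, flow.sbp.split(0)])  # (out, in), parallel=col

# Column-parallel linear: each tensor-parallel rank holds a slice of the output
# features, so the result is split on the last activation dim.
y = flow.matmul(x, w_col.transpose(0, 1))
print(y.sbp)  # expect (oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2))

w_row = flow.randn(768, 768, placement=P, sbp=[B, flow.sbp.split(1)])  # (out, in), parallel=row
# Row-parallel linear contracts the split feature dim; redistributing the
# partial result back to broadcast performs the tensor-parallel all-reduce.
z = flow.matmul(y, w_row.transpose(0, 1)).to_global(sbp=[S0, B])
print(z.sbp)  # expect (oneflow.sbp.split(dim=0), oneflow.sbp.broadcast)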
(MODULE:model.t5_model.decoder.layers.1.cross_attention.output_dropout:Dropout(p=0.1, inplace=False))
(INPUT:_model.t5_model.decoder.layers.1.cross_attention.output_dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.1.cross_attention.output_dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.1.cross_attention_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.1.cross_attention_output.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32))
(MODULE:model.t5_model.decoder.layers.1.drop_path:Identity())
(INPUT:_model.t5_model.decoder.layers.1.drop_path_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.1.drop_path_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.1.post_cross_attention_layernorm:LayerNorm())
(INPUT:_model.t5_model.decoder.layers.1.post_cross_attention_layernorm_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.decoder.layers.1.post_cross_attention_layernorm.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), size=(768,), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.decoder.layers.1.post_cross_attention_layernorm_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.1.mlp:MT5MLP())
(INPUT:_model.t5_model.decoder.layers.1.mlp_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.1.mlp.wi_0:Linear1D(in_features=768, out_features=3072, bias=False, parallel=col))
(INPUT:_model.t5_model.decoder.layers.1.mlp.wi_0_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.decoder.layers.1.mlp.wi_0.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(3072, 768), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.decoder.layers.1.mlp.wi_0_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.1.mlp.activation_func:GeLUTanh())
(INPUT:_model.t5_model.decoder.layers.1.mlp.activation_func_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.1.mlp.activation_func_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.1.mlp.wi_1:Linear1D(in_features=768, out_features=3072, bias=False, parallel=col))
(INPUT:_model.t5_model.decoder.layers.1.mlp.wi_1_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.decoder.layers.1.mlp.wi_1.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(3072, 768), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.decoder.layers.1.mlp.wi_1_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.1.mlp.wo:Linear1D(in_features=3072, out_features=768, bias=False, parallel=row))
(INPUT:_model.t5_model.decoder.layers.1.mlp.wo_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.decoder.layers.1.mlp.wo.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=1)), size=(768, 3072), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.decoder.layers.1.mlp.wo_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.1.mlp.dropout:Dropout(p=0.1, inplace=False))
(INPUT:_model.t5_model.decoder.layers.1.mlp.dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.1.mlp.dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.1.mlp_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.1.drop_path:Identity())
(INPUT:_model.t5_model.decoder.layers.1.drop_path_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.1.drop_path_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.1_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.1_output.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.1_output.0.2_4:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32))
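The MT5MLP records that close layer 1 above walk through mT5's gated feed-forward: the GeLUTanh of the wi_0 branch gates the wi_1 branch elementwise, and wo projects back to the hidden size. Both wi outputs are split(dim=2), the elementwise product keeps that split, and the row-parallel wo returns the result to broadcast. A single-device sketch of the computation (illustrative, not LiBai's MT5MLP code):

import oneflow as flow
import oneflow.nn as nn

def gelu_tanh(x):
    # tanh approximation of GELU, i.e. the GeLUTanh module in the trace
    return 0.5 * x * (1.0 + flow.tanh(0.7978845608028654 * (x + 0.044715 * x * x * x)))

class GatedMLP(nn.Module):
    """Single-device illustration of the gated MLP: gate * value, then down-projection."""
    def __init__(self, hidden=768, inter=3072, p=0.1):
        super().__init__()
        self.wi_0 = nn.Linear(hidden, inter, bias=False)  # gate branch (through GeLUTanh)
        self.wi_1 = nn.Linear(hidden, inter, bias=False)  # value branch
        self.wo = nn.Linear(inter, hidden, bias=False)    # down-projection
        self.dropout = nn.Dropout(p)

    def forward(self, x):
        return self.dropout(self.wo(gelu_tanh(self.wi_0(x)) * self.wi_1(x)))

mlp = GatedMLP()
print(mlp(flow.randn(2, 114, 768)).shape)  # oneflow.Size([2, 114, 768])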
(MODULE:model.t5_model.decoder.layers.2:TransformerLayer())
(INPUT:_model.t5_model.decoder.layers.2_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(INPUT:_model.t5_model.decoder.layers.2_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 114), dtype=oneflow.bool))
(INPUT:_model.t5_model.decoder.layers.2_input.0.2_4:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(INPUT:_model.t5_model.decoder.layers.2_input.0.3_5:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 512), dtype=oneflow.bool))
[WARNING](INPUT:_model.t5_model.decoder.layers.2_input.1.0_past_key_value:)
(INPUT:_model.t5_model.decoder.layers.2_input.1.1_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=))
(INPUT:_model.t5_model.decoder.layers.2_input.1.2_encoder_decoder_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32))
[WARNING](INPUT:_model.t5_model.decoder.layers.2_input.1.3_use_cache:)
(INPUT:_model.t5_model.decoder.layers.2_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_to_global transformation has been done.
(INPUT:_model.t5_model.decoder.layers.2_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 114), dtype=oneflow.bool)) is a Tensor, insert_to_global transformation has been done.
(INPUT:_model.t5_model.decoder.layers.2_input.0.2_4:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_to_global transformation has been done.
(INPUT:_model.t5_model.decoder.layers.2_input.0.3_5:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 512), dtype=oneflow.bool)) is a Tensor, insert_to_global transformation has been done.
[WARNING](INPUT:_model.t5_model.decoder.layers.2_input.1.0_past_key_value:) is not a Tensor, insert_to_global transformation will be ignored.
(INPUT:_model.t5_model.decoder.layers.2_input.1.1_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_to_global transformation has been done.
(INPUT:_model.t5_model.decoder.layers.2_input.1.2_encoder_decoder_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32)) is a Tensor, insert_to_global transformation has been done.
[WARNING](INPUT:_model.t5_model.decoder.layers.2_input.1.3_use_cache:) is not a Tensor, insert_to_global transformation will be ignored.
(INPUT:_model.t5_model.decoder.layers.2_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done.
(INPUT:_model.t5_model.decoder.layers.2_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 114), dtype=oneflow.bool)) is a Tensor, insert_identity transformation has been done.
(INPUT:_model.t5_model.decoder.layers.2_input.0.2_4:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done.
(INPUT:_model.t5_model.decoder.layers.2_input.0.3_5:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 512), dtype=oneflow.bool)) is a Tensor, insert_identity transformation has been done.
[WARNING](INPUT:_model.t5_model.decoder.layers.2_input.1.0_past_key_value:) is not a Tensor, insert_identity transformation will be ignored.
(INPUT:_model.t5_model.decoder.layers.2_input.1.1_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done.
(INPUT:_model.t5_model.decoder.layers.2_input.1.2_encoder_decoder_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32)) is a Tensor, insert_identity transformation has been done.
[WARNING](INPUT:_model.t5_model.decoder.layers.2_input.1.3_use_cache:) is not a Tensor, insert_identity transformation will be ignored.
(MODULE:model.t5_model.decoder.layers.2.input_layernorm:LayerNorm())
(INPUT:_model.t5_model.decoder.layers.2.input_layernorm_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.decoder.layers.2.input_layernorm.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), size=(768,), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.decoder.layers.2.input_layernorm_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.2.self_attention:MultiheadAttention(hidden_size=768, num_heads=12, is_cross_attention=False))
(INPUT:_model.t5_model.decoder.layers.2.self_attention_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(INPUT:_model.t5_model.decoder.layers.2.self_attention_input.1.0_attention_mask:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 114), dtype=oneflow.bool))
[WARNING](INPUT:_model.t5_model.decoder.layers.2.self_attention_input.1.1_past_key_value:)
(INPUT:_model.t5_model.decoder.layers.2.self_attention_input.1.2_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=))
[WARNING](INPUT:_model.t5_model.decoder.layers.2.self_attention_input.1.3_use_cache:)
(MODULE:model.t5_model.decoder.layers.2.self_attention.query_key_value:Linear1D(in_features=768, out_features=2304, bias=False, parallel=col))
(INPUT:_model.t5_model.decoder.layers.2.self_attention.query_key_value_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.decoder.layers.2.self_attention.query_key_value.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(2304, 768), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.decoder.layers.2.self_attention.query_key_value_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 2304), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.2.self_attention.dropout:Dropout(p=0.1, inplace=False))
(INPUT:_model.t5_model.decoder.layers.2.self_attention.dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.2.self_attention.dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.2.self_attention.dense:Linear1D(in_features=768, out_features=768, bias=False, parallel=row))
(INPUT:_model.t5_model.decoder.layers.2.self_attention.dense_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.decoder.layers.2.self_attention.dense.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=1)), size=(768, 768), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.decoder.layers.2.self_attention.dense_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.2.self_attention.output_dropout:Dropout(p=0.1, inplace=False))
(INPUT:_model.t5_model.decoder.layers.2.self_attention.output_dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.2.self_attention.output_dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.2.self_attention_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.2.self_attention_output.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.2.drop_path:Identity())
(INPUT:_model.t5_model.decoder.layers.2.drop_path_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.2.drop_path_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.2.post_attention_layernorm:LayerNorm())
(INPUT:_model.t5_model.decoder.layers.2.post_attention_layernorm_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.decoder.layers.2.post_attention_layernorm.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), size=(768,), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.decoder.layers.2.post_attention_layernorm_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.2.cross_attention:MultiheadAttention(hidden_size=768, num_heads=12, is_cross_attention=True))
(INPUT:_model.t5_model.decoder.layers.2.cross_attention_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(INPUT:_model.t5_model.decoder.layers.2.cross_attention_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(INPUT:_model.t5_model.decoder.layers.2.cross_attention_input.1.0_attention_mask:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 512), dtype=oneflow.bool))
[WARNING](INPUT:_model.t5_model.decoder.layers.2.cross_attention_input.1.1_past_key_value:)
(INPUT:_model.t5_model.decoder.layers.2.cross_attention_input.1.2_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32))
[WARNING](INPUT:_model.t5_model.decoder.layers.2.cross_attention_input.1.3_use_cache:)
[WARNING](INPUT:_model.t5_model.decoder.layers.2.cross_attention_input.1.4_query_length:)
(MODULE:model.t5_model.decoder.layers.2.cross_attention.query:Linear1D(in_features=768, out_features=768, bias=False, parallel=col))
(INPUT:_model.t5_model.decoder.layers.2.cross_attention.query_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.decoder.layers.2.cross_attention.query.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(768, 768), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.decoder.layers.2.cross_attention.query_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.2.cross_attention.key_value:Linear1D(in_features=768, out_features=1536, bias=False, parallel=col))
(INPUT:_model.t5_model.decoder.layers.2.cross_attention.key_value_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.decoder.layers.2.cross_attention.key_value.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(1536, 768), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.decoder.layers.2.cross_attention.key_value_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 512, 1536), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.2.cross_attention.dropout:Dropout(p=0.1, inplace=False))
(INPUT:_model.t5_model.decoder.layers.2.cross_attention.dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.2.cross_attention.dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.2.cross_attention.dense:Linear1D(in_features=768, out_features=768, bias=False, parallel=row))
(INPUT:_model.t5_model.decoder.layers.2.cross_attention.dense_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.decoder.layers.2.cross_attention.dense.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=1)), size=(768, 768), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.decoder.layers.2.cross_attention.dense_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.2.cross_attention.output_dropout:Dropout(p=0.1, inplace=False))
(INPUT:_model.t5_model.decoder.layers.2.cross_attention.output_dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.2.cross_attention.output_dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.2.cross_attention_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.2.cross_attention_output.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32))
(MODULE:model.t5_model.decoder.layers.2.drop_path:Identity())
(INPUT:_model.t5_model.decoder.layers.2.drop_path_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.2.drop_path_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.2.post_cross_attention_layernorm:LayerNorm())
(INPUT:_model.t5_model.decoder.layers.2.post_cross_attention_layernorm_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.decoder.layers.2.post_cross_attention_layernorm.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), size=(768,), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.decoder.layers.2.post_cross_attention_layernorm_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.2.mlp:MT5MLP())
(INPUT:_model.t5_model.decoder.layers.2.mlp_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.2.mlp.wi_0:Linear1D(in_features=768, out_features=3072, bias=False, parallel=col))
(INPUT:_model.t5_model.decoder.layers.2.mlp.wi_0_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.decoder.layers.2.mlp.wi_0.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(3072, 768), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.decoder.layers.2.mlp.wi_0_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.2.mlp.activation_func:GeLUTanh())
(INPUT:_model.t5_model.decoder.layers.2.mlp.activation_func_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.2.mlp.activation_func_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.2.mlp.wi_1:Linear1D(in_features=768, out_features=3072, bias=False, parallel=col))
(INPUT:_model.t5_model.decoder.layers.2.mlp.wi_1_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.decoder.layers.2.mlp.wi_1.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(3072, 768), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.decoder.layers.2.mlp.wi_1_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.2.mlp.wo:Linear1D(in_features=3072, out_features=768, bias=False, parallel=row))
(INPUT:_model.t5_model.decoder.layers.2.mlp.wo_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.decoder.layers.2.mlp.wo.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=1)), size=(768, 3072), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.decoder.layers.2.mlp.wo_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.2.mlp.dropout:Dropout(p=0.1, inplace=False))
(INPUT:_model.t5_model.decoder.layers.2.mlp.dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.2.mlp.dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.2.mlp_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.2.drop_path:Identity())
(INPUT:_model.t5_model.decoder.layers.2.drop_path_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.2.drop_path_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.2_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.2_output.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.2_output.0.2_4:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32))
(MODULE:model.t5_model.decoder.layers.3:TransformerLayer())
(INPUT:_model.t5_model.decoder.layers.3_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(INPUT:_model.t5_model.decoder.layers.3_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 114), dtype=oneflow.bool))
(INPUT:_model.t5_model.decoder.layers.3_input.0.2_4:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(INPUT:_model.t5_model.decoder.layers.3_input.0.3_5:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 512), dtype=oneflow.bool))
[WARNING](INPUT:_model.t5_model.decoder.layers.3_input.1.0_past_key_value:)
(INPUT:_model.t5_model.decoder.layers.3_input.1.1_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=))
(INPUT:_model.t5_model.decoder.layers.3_input.1.2_encoder_decoder_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32))
[WARNING](INPUT:_model.t5_model.decoder.layers.3_input.1.3_use_cache:)
(INPUT:_model.t5_model.decoder.layers.3_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_to_global transformation has been done.
(INPUT:_model.t5_model.decoder.layers.3_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 114), dtype=oneflow.bool)) is a Tensor, insert_to_global transformation has been done.
(INPUT:_model.t5_model.decoder.layers.3_input.0.2_4:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_to_global transformation has been done.
(INPUT:_model.t5_model.decoder.layers.3_input.0.3_5:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 512), dtype=oneflow.bool)) is a Tensor, insert_to_global transformation has been done.
[WARNING](INPUT:_model.t5_model.decoder.layers.3_input.1.0_past_key_value:) is not a Tensor, insert_to_global transformation will be ignored.
(INPUT:_model.t5_model.decoder.layers.3_input.1.1_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_to_global transformation has been done.
(INPUT:_model.t5_model.decoder.layers.3_input.1.2_encoder_decoder_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32)) is a Tensor, insert_to_global transformation has been done.
[WARNING](INPUT:_model.t5_model.decoder.layers.3_input.1.3_use_cache:) is not a Tensor, insert_to_global transformation will be ignored.
(INPUT:_model.t5_model.decoder.layers.3_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done.
(INPUT:_model.t5_model.decoder.layers.3_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 114), dtype=oneflow.bool)) is a Tensor, insert_identity transformation has been done.
(INPUT:_model.t5_model.decoder.layers.3_input.0.2_4:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done.
(INPUT:_model.t5_model.decoder.layers.3_input.0.3_5:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 512), dtype=oneflow.bool)) is a Tensor, insert_identity transformation has been done.
[WARNING](INPUT:_model.t5_model.decoder.layers.3_input.1.0_past_key_value:) is not a Tensor, insert_identity transformation will be ignored.
(INPUT:_model.t5_model.decoder.layers.3_input.1.1_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done.
(INPUT:_model.t5_model.decoder.layers.3_input.1.2_encoder_decoder_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32)) is a Tensor, insert_identity transformation has been done.
[WARNING](INPUT:_model.t5_model.decoder.layers.3_input.1.3_use_cache:) is not a Tensor, insert_identity transformation will be ignored.
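Before each TransformerLayer runs, nn.Graph normalizes the module's inputs: every Tensor argument gets the insert_to_global and insert_identity transformations, while non-Tensor arguments (past_key_value=None and the use_cache flag) are skipped with a [WARNING]; those warnings are expected bookkeeping, not errors. The whole MODULE/INPUT/PARAMETER/OUTPUT trace is what graph.debug = 1 in the config turns on. A runnable toy reproduction (the Linear stand-in is hypothetical; LiBai wraps the full T5ForPreTraining training graph, with the optimizer, AMP and ZeRO settings from the config):

import oneflow as flow

class DebugGraph(flow.nn.Graph):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def build(self, x):
        return self.model(x)

model = flow.nn.Linear(4, 4)   # stand-in module for illustration
graph = DebugGraph(model)
graph.debug(1)                 # verbosity 1, as set by graph.debug = 1 in this run
out = graph(flow.randn(2, 4))  # records are emitted during this first, lazy tracing call
                               # (hence is_lazy='True' on every tensor above)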
(MODULE:model.t5_model.decoder.layers.3.input_layernorm:LayerNorm())
(INPUT:_model.t5_model.decoder.layers.3.input_layernorm_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.decoder.layers.3.input_layernorm.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), size=(768,), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.decoder.layers.3.input_layernorm_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.3.self_attention:MultiheadAttention(hidden_size=768, num_heads=12, is_cross_attention=False))
(INPUT:_model.t5_model.decoder.layers.3.self_attention_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(INPUT:_model.t5_model.decoder.layers.3.self_attention_input.1.0_attention_mask:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 114), dtype=oneflow.bool))
[WARNING](INPUT:_model.t5_model.decoder.layers.3.self_attention_input.1.1_past_key_value:)
(INPUT:_model.t5_model.decoder.layers.3.self_attention_input.1.2_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=))
[WARNING](INPUT:_model.t5_model.decoder.layers.3.self_attention_input.1.3_use_cache:)
(MODULE:model.t5_model.decoder.layers.3.self_attention.query_key_value:Linear1D(in_features=768, out_features=2304, bias=False, parallel=col))
(INPUT:_model.t5_model.decoder.layers.3.self_attention.query_key_value_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.decoder.layers.3.self_attention.query_key_value.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(2304, 768), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.decoder.layers.3.self_attention.query_key_value_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 2304), dtype=oneflow.float32, grad_fn=))
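The fused query_key_value projection above maps 768 to 2304 = 3 x 768 features (query, key and value concatenated), and its output carries sbp=(split(dim=0), split(dim=2)): the batch of 64 is sharded over the four data-parallel groups while each of the two tensor-parallel ranks holds half of the fused features. A sketch of the per-rank local shape this implies (same hypothetical 8-rank launch as the earlier sketch):

import oneflow as flow

P = flow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]])
qkv = flow.randn(64, 114, 2304, placement=P,
                 sbp=[flow.sbp.split(0), flow.sbp.split(2)])
print(qkv.to_local().shape)
# expect oneflow.Size([16, 114, 1152]):
#   dim 0: 64 / 4 data-parallel groups; dim 2: 2304 / 2 tensor-parallel ranks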
(MODULE:model.t5_model.decoder.layers.3.self_attention.dropout:Dropout(p=0.1, inplace=False))
(INPUT:_model.t5_model.decoder.layers.3.self_attention.dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.3.self_attention.dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.3.self_attention.dense:Linear1D(in_features=768, out_features=768, bias=False, parallel=row))
(INPUT:_model.t5_model.decoder.layers.3.self_attention.dense_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.decoder.layers.3.self_attention.dense.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=1)), size=(768, 768), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.decoder.layers.3.self_attention.dense_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.3.self_attention.output_dropout:Dropout(p=0.1, inplace=False))
(INPUT:_model.t5_model.decoder.layers.3.self_attention.output_dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.3.self_attention.output_dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.3.self_attention_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.3.self_attention_output.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.3.drop_path:Identity())
(INPUT:_model.t5_model.decoder.layers.3.drop_path_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.3.drop_path_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.3.post_attention_layernorm:LayerNorm())
(INPUT:_model.t5_model.decoder.layers.3.post_attention_layernorm_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.decoder.layers.3.post_attention_layernorm.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), size=(768,), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.decoder.layers.3.post_attention_layernorm_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.3.cross_attention:MultiheadAttention(hidden_size=768, num_heads=12, is_cross_attention=True))
(INPUT:_model.t5_model.decoder.layers.3.cross_attention_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(INPUT:_model.t5_model.decoder.layers.3.cross_attention_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(INPUT:_model.t5_model.decoder.layers.3.cross_attention_input.1.0_attention_mask:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 512), dtype=oneflow.bool))
[WARNING](INPUT:_model.t5_model.decoder.layers.3.cross_attention_input.1.1_past_key_value:)
(INPUT:_model.t5_model.decoder.layers.3.cross_attention_input.1.2_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32))
[WARNING](INPUT:_model.t5_model.decoder.layers.3.cross_attention_input.1.3_use_cache:)
[WARNING](INPUT:_model.t5_model.decoder.layers.3.cross_attention_input.1.4_query_length:)
(MODULE:model.t5_model.decoder.layers.3.cross_attention.query:Linear1D(in_features=768, out_features=768, bias=False, parallel=col))
(INPUT:_model.t5_model.decoder.layers.3.cross_attention.query_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.decoder.layers.3.cross_attention.query.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(768, 768), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.decoder.layers.3.cross_attention.query_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.3.cross_attention.key_value:Linear1D(in_features=768, out_features=1536, bias=False, parallel=col))
(INPUT:_model.t5_model.decoder.layers.3.cross_attention.key_value_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.decoder.layers.3.cross_attention.key_value.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(1536, 768), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.decoder.layers.3.cross_attention.key_value_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 512, 1536), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.3.cross_attention.dropout:Dropout(p=0.1, inplace=False))
(INPUT:_model.t5_model.decoder.layers.3.cross_attention.dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.3.cross_attention.dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.3.cross_attention.dense:Linear1D(in_features=768, out_features=768, bias=False, parallel=row))
(INPUT:_model.t5_model.decoder.layers.3.cross_attention.dense_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.decoder.layers.3.cross_attention.dense.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=1)), size=(768, 768), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.decoder.layers.3.cross_attention.dense_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.3.cross_attention.output_dropout:Dropout(p=0.1, inplace=False))
(INPUT:_model.t5_model.decoder.layers.3.cross_attention.output_dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.3.cross_attention.output_dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.3.cross_attention_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.3.cross_attention_output.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6,
7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32)) (MODULE:model.t5_model.decoder.layers.3.drop_path:Identity()) (INPUT:_model.t5_model.decoder.layers.3.drop_path_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.3.drop_path_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.3.post_cross_attention_layernorm:LayerNorm()) (INPUT:_model.t5_model.decoder.layers.3.post_cross_attention_layernorm_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.3.post_cross_attention_layernorm.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), size=(768,), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.3.post_cross_attention_layernorm_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.3.mlp:MT5MLP()) (INPUT:_model.t5_model.decoder.layers.3.mlp_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.3.mlp.wi_0:Linear1D(in_features=768, out_features=3072, bias=False, parallel=col)) (INPUT:_model.t5_model.decoder.layers.3.mlp.wi_0_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.3.mlp.wi_0.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(3072, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.3.mlp.wi_0_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.3.mlp.activation_func:GeLUTanh()) (INPUT:_model.t5_model.decoder.layers.3.mlp.activation_func_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.3.mlp.activation_func_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), 
sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.3.mlp.wi_1:Linear1D(in_features=768, out_features=3072, bias=False, parallel=col)) (INPUT:_model.t5_model.decoder.layers.3.mlp.wi_1_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.3.mlp.wi_1.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(3072, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.3.mlp.wi_1_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.3.mlp.wo:Linear1D(in_features=3072, out_features=768, bias=False, parallel=row)) (INPUT:_model.t5_model.decoder.layers.3.mlp.wo_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.3.mlp.wo.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=1)), size=(768, 3072), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.3.mlp.wo_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.3.mlp.dropout:Dropout(p=0.1, inplace=False)) (INPUT:_model.t5_model.decoder.layers.3.mlp.dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.3.mlp.dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.3.mlp_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.3.drop_path:Identity()) (INPUT:_model.t5_model.decoder.layers.3.drop_path_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.3.drop_path_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 
114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.3_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.3_output.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.3_output.0.2_4:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32)) (MODULE:model.t5_model.decoder.layers.4:TransformerLayer()) (INPUT:_model.t5_model.decoder.layers.4_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.decoder.layers.4_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 114), dtype=oneflow.bool)) (INPUT:_model.t5_model.decoder.layers.4_input.0.2_4:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.decoder.layers.4_input.0.3_5:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 512), dtype=oneflow.bool)) [WARNING](INPUT:_model.t5_model.decoder.layers.4_input.1.0_past_key_value:) (INPUT:_model.t5_model.decoder.layers.4_input.1.1_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.decoder.layers.4_input.1.2_encoder_decoder_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32)) [WARNING](INPUT:_model.t5_model.decoder.layers.4_input.1.3_use_cache:) (INPUT:_model.t5_model.decoder.layers.4_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_to_global transformation has been done. (INPUT:_model.t5_model.decoder.layers.4_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 114), dtype=oneflow.bool)) is a Tensor, insert_to_global transformation has been done. 
(INPUT:_model.t5_model.decoder.layers.4_input.0.2_4:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_to_global transformation has been done. (INPUT:_model.t5_model.decoder.layers.4_input.0.3_5:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 512), dtype=oneflow.bool)) is a Tensor, insert_to_global transformation has been done. [WARNING](INPUT:_model.t5_model.decoder.layers.4_input.1.0_past_key_value:) is not a Tensor, insert_to_global transformation will be ignored. (INPUT:_model.t5_model.decoder.layers.4_input.1.1_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_to_global transformation has been done. (INPUT:_model.t5_model.decoder.layers.4_input.1.2_encoder_decoder_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32)) is a Tensor, insert_to_global transformation has been done. [WARNING](INPUT:_model.t5_model.decoder.layers.4_input.1.3_use_cache:) is not a Tensor, insert_to_global transformation will be ignored. (INPUT:_model.t5_model.decoder.layers.4_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done. (INPUT:_model.t5_model.decoder.layers.4_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 114), dtype=oneflow.bool)) is a Tensor, insert_identity transformation has been done. (INPUT:_model.t5_model.decoder.layers.4_input.0.2_4:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done. (INPUT:_model.t5_model.decoder.layers.4_input.0.3_5:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 512), dtype=oneflow.bool)) is a Tensor, insert_identity transformation has been done. [WARNING](INPUT:_model.t5_model.decoder.layers.4_input.1.0_past_key_value:) is not a Tensor, insert_identity transformation will be ignored. (INPUT:_model.t5_model.decoder.layers.4_input.1.1_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done. 
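The insert_to_global records above show the graph tracer promoting each positional Tensor argument of the layer call to a global tensor on the traced placement; arguments that are not Tensors, such as past_key_value and use_cache, have nothing to promote, which is all the [WARNING] lines mean. A minimal standalone sketch of the same layout (hypothetical shapes and file name, not taken from the LibAI sources; it assumes a launch across all 8 ranks, e.g. python3 -m oneflow.distributed.launch --nproc_per_node 8 demo.py):

    import oneflow as flow

    # 2-D device mesh from the trace: axis 0 = 4 data-parallel groups,
    # axis 1 = 2 tensor-parallel ranks per group.
    placement = flow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]])
    sbp = (flow.sbp.split(0), flow.sbp.broadcast)  # batch split over dp, replicated over tp

    # Global hidden_states with the traced layout: size=(64, 114, 768),
    # sbp=(split(dim=0), broadcast) -> each dp group holds 64 / 4 = 16 samples.
    hidden = flow.randn(64, 114, 768, placement=placement, sbp=sbp)
    print(hidden.to_local().shape)  # (16, 114, 768): the shard held by this rank

    # Re-layout is also expressed with to_global, e.g. the split(dim=2)
    # activations that a column-parallel Linear1D emits:
    col_like = hidden.to_global(sbp=(flow.sbp.split(0), flow.sbp.split(2)))
    print(col_like.to_local().shape)  # (16, 114, 384): 768 halved across the tp pair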
(MODULE:model.t5_model.decoder.layers.4.input_layernorm:LayerNorm())
(INPUT:_model.t5_model.decoder.layers.4.input_layernorm_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.decoder.layers.4.input_layernorm.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), size=(768,), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.decoder.layers.4.input_layernorm_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.4.self_attention:MultiheadAttention(hidden_size=768, num_heads=12, is_cross_attention=False))
(INPUT:_model.t5_model.decoder.layers.4.self_attention_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(INPUT:_model.t5_model.decoder.layers.4.self_attention_input.1.0_attention_mask:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 114), dtype=oneflow.bool))
[WARNING](INPUT:_model.t5_model.decoder.layers.4.self_attention_input.1.1_past_key_value:)
(INPUT:_model.t5_model.decoder.layers.4.self_attention_input.1.2_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=))
[WARNING](INPUT:_model.t5_model.decoder.layers.4.self_attention_input.1.3_use_cache:)
(MODULE:model.t5_model.decoder.layers.4.self_attention.query_key_value:Linear1D(in_features=768, out_features=2304, bias=False, parallel=col))
(INPUT:_model.t5_model.decoder.layers.4.self_attention.query_key_value_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.decoder.layers.4.self_attention.query_key_value.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(2304, 768), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.decoder.layers.4.self_attention.query_key_value_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 2304), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.4.self_attention.dropout:Dropout(p=0.1, inplace=False))
(INPUT:_model.t5_model.decoder.layers.4.self_attention.dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.4.self_attention.dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.4.self_attention.dense:Linear1D(in_features=768, out_features=768, bias=False, parallel=row))
(INPUT:_model.t5_model.decoder.layers.4.self_attention.dense_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.decoder.layers.4.self_attention.dense.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=1)), size=(768, 768), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.decoder.layers.4.self_attention.dense_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.4.self_attention.output_dropout:Dropout(p=0.1, inplace=False))
(INPUT:_model.t5_model.decoder.layers.4.self_attention.output_dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.4.self_attention.output_dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.4.self_attention_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.4.self_attention_output.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.4.drop_path:Identity())
(INPUT:_model.t5_model.decoder.layers.4.drop_path_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.4.drop_path_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.4.post_attention_layernorm:LayerNorm())
(INPUT:_model.t5_model.decoder.layers.4.post_attention_layernorm_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.decoder.layers.4.post_attention_layernorm.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), size=(768,), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.decoder.layers.4.post_attention_layernorm_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.4.cross_attention:MultiheadAttention(hidden_size=768, num_heads=12, is_cross_attention=True))
(INPUT:_model.t5_model.decoder.layers.4.cross_attention_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(INPUT:_model.t5_model.decoder.layers.4.cross_attention_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(INPUT:_model.t5_model.decoder.layers.4.cross_attention_input.1.0_attention_mask:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 512), dtype=oneflow.bool))
[WARNING](INPUT:_model.t5_model.decoder.layers.4.cross_attention_input.1.1_past_key_value:)
(INPUT:_model.t5_model.decoder.layers.4.cross_attention_input.1.2_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32))
[WARNING](INPUT:_model.t5_model.decoder.layers.4.cross_attention_input.1.3_use_cache:)
[WARNING](INPUT:_model.t5_model.decoder.layers.4.cross_attention_input.1.4_query_length:)
(MODULE:model.t5_model.decoder.layers.4.cross_attention.query:Linear1D(in_features=768, out_features=768, bias=False, parallel=col))
(INPUT:_model.t5_model.decoder.layers.4.cross_attention.query_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.decoder.layers.4.cross_attention.query.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(768, 768), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.decoder.layers.4.cross_attention.query_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.4.cross_attention.key_value:Linear1D(in_features=768, out_features=1536, bias=False, parallel=col))
(INPUT:_model.t5_model.decoder.layers.4.cross_attention.key_value_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.decoder.layers.4.cross_attention.key_value.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(1536, 768), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.decoder.layers.4.cross_attention.key_value_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 512, 1536), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.4.cross_attention.dropout:Dropout(p=0.1, inplace=False))
(INPUT:_model.t5_model.decoder.layers.4.cross_attention.dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.4.cross_attention.dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.4.cross_attention.dense:Linear1D(in_features=768, out_features=768, bias=False, parallel=row))
(INPUT:_model.t5_model.decoder.layers.4.cross_attention.dense_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.decoder.layers.4.cross_attention.dense.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=1)), size=(768, 768), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.decoder.layers.4.cross_attention.dense_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.4.cross_attention.output_dropout:Dropout(p=0.1, inplace=False))
(INPUT:_model.t5_model.decoder.layers.4.cross_attention.output_dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.4.cross_attention.output_dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.4.cross_attention_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.4.cross_attention_output.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32))
(MODULE:model.t5_model.decoder.layers.4.drop_path:Identity())
(INPUT:_model.t5_model.decoder.layers.4.drop_path_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.4.drop_path_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.4.post_cross_attention_layernorm:LayerNorm())
(INPUT:_model.t5_model.decoder.layers.4.post_cross_attention_layernorm_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.decoder.layers.4.post_cross_attention_layernorm.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), size=(768,), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.decoder.layers.4.post_cross_attention_layernorm_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.4.mlp:MT5MLP())
(INPUT:_model.t5_model.decoder.layers.4.mlp_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.4.mlp.wi_0:Linear1D(in_features=768, out_features=3072, bias=False, parallel=col))
(INPUT:_model.t5_model.decoder.layers.4.mlp.wi_0_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.decoder.layers.4.mlp.wi_0.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(3072, 768), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.decoder.layers.4.mlp.wi_0_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.4.mlp.activation_func:GeLUTanh())
(INPUT:_model.t5_model.decoder.layers.4.mlp.activation_func_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.4.mlp.activation_func_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.4.mlp.wi_1:Linear1D(in_features=768, out_features=3072, bias=False, parallel=col))
(INPUT:_model.t5_model.decoder.layers.4.mlp.wi_1_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.decoder.layers.4.mlp.wi_1.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(3072, 768), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.decoder.layers.4.mlp.wi_1_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.4.mlp.wo:Linear1D(in_features=3072, out_features=768, bias=False, parallel=row))
(INPUT:_model.t5_model.decoder.layers.4.mlp.wo_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.decoder.layers.4.mlp.wo.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=1)), size=(768, 3072), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.decoder.layers.4.mlp.wo_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.4.mlp.dropout:Dropout(p=0.1, inplace=False))
(INPUT:_model.t5_model.decoder.layers.4.mlp.dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.4.mlp.dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.4.mlp_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
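The mlp trace above — wi_0 and wi_1 both 768 to 3072, a GeLUTanh applied to the wi_0 branch, then wo back to 768 — is the gated-GELU feed-forward that mT5 uses. A shape-level sketch, with plain nn.Linear standing in for the sharded Linear1D modules and gelu_tanh written out by hand rather than taken from any particular API:

    import oneflow as flow
    from oneflow import nn

    def gelu_tanh(x):
        # tanh approximation of GELU, matching the traced GeLUTanh activation
        return 0.5 * x * (1.0 + flow.tanh(0.7978845608 * (x + 0.044715 * x * x * x)))

    class GatedFFN(nn.Module):
        def __init__(self, d_model=768, d_ff=3072, p=0.1):
            super().__init__()
            self.wi_0 = nn.Linear(d_model, d_ff, bias=False)  # gate branch
            self.wi_1 = nn.Linear(d_model, d_ff, bias=False)  # linear branch
            self.wo = nn.Linear(d_ff, d_model, bias=False)    # projection back down
            self.dropout = nn.Dropout(p)

        def forward(self, x):                           # x: (batch, seq, 768)
            h = gelu_tanh(self.wi_0(x)) * self.wi_1(x)  # gated product, (batch, seq, 3072)
            return self.dropout(self.wo(h))             # (batch, seq, 768)

The gating is why two separate col-parallel 768-to-3072 projections appear in the trace instead of the single wi of classic T5.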
(MODULE:model.t5_model.decoder.layers.4.drop_path:Identity())
(INPUT:_model.t5_model.decoder.layers.4.drop_path_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.4.drop_path_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.4_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.4_output.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.4_output.0.2_4:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32))
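Across the whole layer the same SBP pattern repeats: a parallel=col Linear1D keeps its (out, in) weight split on dim 0 and emits activations split on their last dim, and the following parallel=row Linear1D holds its weight split on dim 1, consumes that split(dim=2) input, and returns a broadcast tensor — the Megatron-style pairing that needs no communication between the two matmuls. A simplified, hypothetical sketch on a 1-D tensor-parallel placement (tp=2, ranks [0, 1]; the trace applies the same pattern on the second axis of its 2-D mesh):

    import oneflow as flow

    P = flow.placement(type="cuda", ranks=[0, 1])
    B = flow.sbp.broadcast

    x = flow.randn(16, 114, 768, placement=P, sbp=B)   # replicated activations

    # parallel=col: (2304, 768) weight split on dim 0, as in query_key_value
    w_col = flow.randn(2304, 768, placement=P, sbp=flow.sbp.split(0))
    h = flow.matmul(x, w_col.transpose(0, 1))          # each rank computes its 1152 columns
    print(h.sbp)                                       # expected: split on the last dim

    # parallel=row: (768, 2304) weight split on dim 1, as in self_attention.dense;
    # each rank's product is only a partial sum, reduced back to a replicated tensor
    w_row = flow.randn(768, 2304, placement=P, sbp=flow.sbp.split(1))
    y = flow.matmul(h, w_row.transpose(0, 1))          # per-rank partial result
    y = y.to_global(placement=P, sbp=B)                # all-reduce -> (16, 114, 768) broadcast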
(MODULE:model.t5_model.decoder.layers.5:TransformerLayer())
(INPUT:_model.t5_model.decoder.layers.5_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(INPUT:_model.t5_model.decoder.layers.5_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 114), dtype=oneflow.bool))
(INPUT:_model.t5_model.decoder.layers.5_input.0.2_4:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(INPUT:_model.t5_model.decoder.layers.5_input.0.3_5:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 512), dtype=oneflow.bool))
[WARNING](INPUT:_model.t5_model.decoder.layers.5_input.1.0_past_key_value:)
(INPUT:_model.t5_model.decoder.layers.5_input.1.1_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=))
(INPUT:_model.t5_model.decoder.layers.5_input.1.2_encoder_decoder_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32))
[WARNING](INPUT:_model.t5_model.decoder.layers.5_input.1.3_use_cache:)
(INPUT:_model.t5_model.decoder.layers.5_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_to_global transformation has been done.
(INPUT:_model.t5_model.decoder.layers.5_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 114), dtype=oneflow.bool)) is a Tensor, insert_to_global transformation has been done.
(INPUT:_model.t5_model.decoder.layers.5_input.0.2_4:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_to_global transformation has been done.
(INPUT:_model.t5_model.decoder.layers.5_input.0.3_5:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 512), dtype=oneflow.bool)) is a Tensor, insert_to_global transformation has been done.
[WARNING](INPUT:_model.t5_model.decoder.layers.5_input.1.0_past_key_value:) is not a Tensor, insert_to_global transformation will be ignored.
(INPUT:_model.t5_model.decoder.layers.5_input.1.1_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_to_global transformation has been done.
(INPUT:_model.t5_model.decoder.layers.5_input.1.2_encoder_decoder_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32)) is a Tensor, insert_to_global transformation has been done.
[WARNING](INPUT:_model.t5_model.decoder.layers.5_input.1.3_use_cache:) is not a Tensor, insert_to_global transformation will be ignored.
(INPUT:_model.t5_model.decoder.layers.5_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done.
(INPUT:_model.t5_model.decoder.layers.5_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 114), dtype=oneflow.bool)) is a Tensor, insert_identity transformation has been done.
(INPUT:_model.t5_model.decoder.layers.5_input.0.2_4:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done.
(INPUT:_model.t5_model.decoder.layers.5_input.0.3_5:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 512), dtype=oneflow.bool)) is a Tensor, insert_identity transformation has been done.
[WARNING](INPUT:_model.t5_model.decoder.layers.5_input.1.0_past_key_value:) is not a Tensor, insert_identity transformation will be ignored.
(INPUT:_model.t5_model.decoder.layers.5_input.1.1_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done.
(INPUT:_model.t5_model.decoder.layers.5_input.1.2_encoder_decoder_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32)) is a Tensor, insert_identity transformation has been done.
[WARNING](INPUT:_model.t5_model.decoder.layers.5_input.1.3_use_cache:) is not a Tensor, insert_identity transformation will be ignored.
(MODULE:model.t5_model.decoder.layers.5.input_layernorm:LayerNorm())
(INPUT:_model.t5_model.decoder.layers.5.input_layernorm_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.decoder.layers.5.input_layernorm.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), size=(768,), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.decoder.layers.5.input_layernorm_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.5.self_attention:MultiheadAttention(hidden_size=768, num_heads=12, is_cross_attention=False))
(INPUT:_model.t5_model.decoder.layers.5.self_attention_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(INPUT:_model.t5_model.decoder.layers.5.self_attention_input.1.0_attention_mask:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 114), dtype=oneflow.bool))
[WARNING](INPUT:_model.t5_model.decoder.layers.5.self_attention_input.1.1_past_key_value:)
(INPUT:_model.t5_model.decoder.layers.5.self_attention_input.1.2_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=))
[WARNING](INPUT:_model.t5_model.decoder.layers.5.self_attention_input.1.3_use_cache:)
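The decoder self-attention mask traced above is boolean with shape (64, 1, 114, 114): the singleton head dimension lets one mask broadcast over all 12 heads of the (64, 12, 114, 114) score tensor. A local, single-device sketch of that broadcast; the causal lower-triangular form shown here is the usual choice for a T5 decoder, though the trace itself only records the mask's shape and dtype:

    import oneflow as flow

    scores = flow.randn(64, 12, 114, 114)              # (batch, heads, q_len, k_len)
    causal = flow.tril(flow.ones(114, 114))            # 1.0 where key position <= query position
    mask = causal.reshape(1, 1, 114, 114).expand(64, 1, 114, 114)
    masked = scores.masked_fill(mask == 0, float("-inf"))  # hide future positions
    probs = flow.softmax(masked, dim=-1)               # masked keys get ~0 attention weight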
(MODULE:model.t5_model.decoder.layers.5.self_attention.query_key_value:Linear1D(in_features=768, out_features=2304, bias=False, parallel=col))
(INPUT:_model.t5_model.decoder.layers.5.self_attention.query_key_value_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.decoder.layers.5.self_attention.query_key_value.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(2304, 768), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.decoder.layers.5.self_attention.query_key_value_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 2304), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.5.self_attention.dropout:Dropout(p=0.1, inplace=False))
(INPUT:_model.t5_model.decoder.layers.5.self_attention.dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.5.self_attention.dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.5.self_attention.dense:Linear1D(in_features=768, out_features=768, bias=False, parallel=row))
(INPUT:_model.t5_model.decoder.layers.5.self_attention.dense_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.decoder.layers.5.self_attention.dense.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=1)), size=(768, 768), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.decoder.layers.5.self_attention.dense_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.5.self_attention.output_dropout:Dropout(p=0.1, inplace=False))
(INPUT:_model.t5_model.decoder.layers.5.self_attention.output_dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.5.self_attention.output_dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.5.self_attention_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.5.self_attention_output.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.5.drop_path:Identity())
(INPUT:_model.t5_model.decoder.layers.5.drop_path_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.5.drop_path_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.5.post_attention_layernorm:LayerNorm())
(INPUT:_model.t5_model.decoder.layers.5.post_attention_layernorm_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.decoder.layers.5.post_attention_layernorm.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), size=(768,), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.decoder.layers.5.post_attention_layernorm_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.5.cross_attention:MultiheadAttention(hidden_size=768, num_heads=12, is_cross_attention=True))
(INPUT:_model.t5_model.decoder.layers.5.cross_attention_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(INPUT:_model.t5_model.decoder.layers.5.cross_attention_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(INPUT:_model.t5_model.decoder.layers.5.cross_attention_input.1.0_attention_mask:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 512), dtype=oneflow.bool))
[WARNING](INPUT:_model.t5_model.decoder.layers.5.cross_attention_input.1.1_past_key_value:)
(INPUT:_model.t5_model.decoder.layers.5.cross_attention_input.1.2_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32))
[WARNING](INPUT:_model.t5_model.decoder.layers.5.cross_attention_input.1.3_use_cache:)
[WARNING](INPUT:_model.t5_model.decoder.layers.5.cross_attention_input.1.4_query_length:)
(MODULE:model.t5_model.decoder.layers.5.cross_attention.query:Linear1D(in_features=768, out_features=768, bias=False, parallel=col))
(INPUT:_model.t5_model.decoder.layers.5.cross_attention.query_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.decoder.layers.5.cross_attention.query.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(768, 768), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.decoder.layers.5.cross_attention.query_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.5.cross_attention.key_value:Linear1D(in_features=768, out_features=1536, bias=False, parallel=col))
(INPUT:_model.t5_model.decoder.layers.5.cross_attention.key_value_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.decoder.layers.5.cross_attention.key_value.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(1536, 768), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.decoder.layers.5.cross_attention.key_value_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 512, 1536), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.5.cross_attention.dropout:Dropout(p=0.1, inplace=False))
(INPUT:_model.t5_model.decoder.layers.5.cross_attention.dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.5.cross_attention.dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.5.cross_attention.dense:Linear1D(in_features=768, out_features=768, bias=False, parallel=row))
(INPUT:_model.t5_model.decoder.layers.5.cross_attention.dense_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.decoder.layers.5.cross_attention.dense.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=1)), size=(768, 768), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.decoder.layers.5.cross_attention.dense_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.5.cross_attention.output_dropout:Dropout(p=0.1, inplace=False))
(INPUT:_model.t5_model.decoder.layers.5.cross_attention.output_dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.5.cross_attention.output_dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.5.cross_attention_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.5.cross_attention_output.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32))
(MODULE:model.t5_model.decoder.layers.5.drop_path:Identity())
(INPUT:_model.t5_model.decoder.layers.5.drop_path_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.5.drop_path_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.5.post_cross_attention_layernorm:LayerNorm())
(INPUT:_model.t5_model.decoder.layers.5.post_cross_attention_layernorm_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.decoder.layers.5.post_cross_attention_layernorm.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), size=(768,), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.decoder.layers.5.post_cross_attention_layernorm_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.5.mlp:MT5MLP())
(INPUT:_model.t5_model.decoder.layers.5.mlp_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.5.mlp.wi_0:Linear1D(in_features=768, out_features=3072, bias=False, parallel=col))
(INPUT:_model.t5_model.decoder.layers.5.mlp.wi_0_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.decoder.layers.5.mlp.wi_0.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2,
3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(3072, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.5.mlp.wi_0_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.5.mlp.activation_func:GeLUTanh()) (INPUT:_model.t5_model.decoder.layers.5.mlp.activation_func_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.5.mlp.activation_func_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.5.mlp.wi_1:Linear1D(in_features=768, out_features=3072, bias=False, parallel=col)) (INPUT:_model.t5_model.decoder.layers.5.mlp.wi_1_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.5.mlp.wi_1.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(3072, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.5.mlp.wi_1_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.5.mlp.wo:Linear1D(in_features=3072, out_features=768, bias=False, parallel=row)) (INPUT:_model.t5_model.decoder.layers.5.mlp.wo_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.5.mlp.wo.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=1)), size=(768, 3072), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.5.mlp.wo_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.5.mlp.dropout:Dropout(p=0.1, inplace=False)) (INPUT:_model.t5_model.decoder.layers.5.mlp.dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.5.mlp.dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), 
sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.5.mlp_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.5.drop_path:Identity()) (INPUT:_model.t5_model.decoder.layers.5.drop_path_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.5.drop_path_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.5_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.5_output.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.5_output.0.2_4:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32)) (MODULE:model.t5_model.decoder.layers.6:TransformerLayer()) (INPUT:_model.t5_model.decoder.layers.6_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.decoder.layers.6_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 114), dtype=oneflow.bool)) (INPUT:_model.t5_model.decoder.layers.6_input.0.2_4:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.decoder.layers.6_input.0.3_5:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 512), dtype=oneflow.bool)) [WARNING](INPUT:_model.t5_model.decoder.layers.6_input.1.0_past_key_value:) (INPUT:_model.t5_model.decoder.layers.6_input.1.1_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.decoder.layers.6_input.1.2_encoder_decoder_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), 
oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32)) [WARNING](INPUT:_model.t5_model.decoder.layers.6_input.1.3_use_cache:) (INPUT:_model.t5_model.decoder.layers.6_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_to_global transformation has been done. (INPUT:_model.t5_model.decoder.layers.6_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 114), dtype=oneflow.bool)) is a Tensor, insert_to_global transformation has been done. (INPUT:_model.t5_model.decoder.layers.6_input.0.2_4:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_to_global transformation has been done. (INPUT:_model.t5_model.decoder.layers.6_input.0.3_5:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 512), dtype=oneflow.bool)) is a Tensor, insert_to_global transformation has been done. [WARNING](INPUT:_model.t5_model.decoder.layers.6_input.1.0_past_key_value:) is not a Tensor, insert_to_global transformation will be ignored. (INPUT:_model.t5_model.decoder.layers.6_input.1.1_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_to_global transformation has been done. (INPUT:_model.t5_model.decoder.layers.6_input.1.2_encoder_decoder_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32)) is a Tensor, insert_to_global transformation has been done. [WARNING](INPUT:_model.t5_model.decoder.layers.6_input.1.3_use_cache:) is not a Tensor, insert_to_global transformation will be ignored. (INPUT:_model.t5_model.decoder.layers.6_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done. (INPUT:_model.t5_model.decoder.layers.6_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 114), dtype=oneflow.bool)) is a Tensor, insert_identity transformation has been done. (INPUT:_model.t5_model.decoder.layers.6_input.0.2_4:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done. 
(INPUT:_model.t5_model.decoder.layers.6_input.0.3_5:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 512), dtype=oneflow.bool)) is a Tensor, insert_identity transformation has been done. [WARNING](INPUT:_model.t5_model.decoder.layers.6_input.1.0_past_key_value:) is not a Tensor, insert_identity transformation will be ignored. (INPUT:_model.t5_model.decoder.layers.6_input.1.1_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done. (INPUT:_model.t5_model.decoder.layers.6_input.1.2_encoder_decoder_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32)) is a Tensor, insert_identity transformation has been done. [WARNING](INPUT:_model.t5_model.decoder.layers.6_input.1.3_use_cache:) is not a Tensor, insert_identity transformation will be ignored. (MODULE:model.t5_model.decoder.layers.6.input_layernorm:LayerNorm()) (INPUT:_model.t5_model.decoder.layers.6.input_layernorm_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.6.input_layernorm.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), size=(768,), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.6.input_layernorm_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.6.self_attention:MultiheadAttention(hidden_size=768, num_heads=12, is_cross_attention=False)) (INPUT:_model.t5_model.decoder.layers.6.self_attention_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.decoder.layers.6.self_attention_input.1.0_attention_mask:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 114), dtype=oneflow.bool)) [WARNING](INPUT:_model.t5_model.decoder.layers.6.self_attention_input.1.1_past_key_value:) (INPUT:_model.t5_model.decoder.layers.6.self_attention_input.1.2_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) [WARNING](INPUT:_model.t5_model.decoder.layers.6.self_attention_input.1.3_use_cache:) (MODULE:model.t5_model.decoder.layers.6.self_attention.query_key_value:Linear1D(in_features=768, out_features=2304, bias=False, parallel=col)) 
(INPUT:_model.t5_model.decoder.layers.6.self_attention.query_key_value_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.6.self_attention.query_key_value.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(2304, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.6.self_attention.query_key_value_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 2304), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.6.self_attention.dropout:Dropout(p=0.1, inplace=False)) (INPUT:_model.t5_model.decoder.layers.6.self_attention.dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.6.self_attention.dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.6.self_attention.dense:Linear1D(in_features=768, out_features=768, bias=False, parallel=row)) (INPUT:_model.t5_model.decoder.layers.6.self_attention.dense_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.6.self_attention.dense.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=1)), size=(768, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.6.self_attention.dense_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.6.self_attention.output_dropout:Dropout(p=0.1, inplace=False)) (INPUT:_model.t5_model.decoder.layers.6.self_attention.output_dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.6.self_attention.output_dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.6.self_attention_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', 
size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.6.self_attention_output.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.6.drop_path:Identity()) (INPUT:_model.t5_model.decoder.layers.6.drop_path_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.6.drop_path_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.6.post_attention_layernorm:LayerNorm()) (INPUT:_model.t5_model.decoder.layers.6.post_attention_layernorm_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.6.post_attention_layernorm.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), size=(768,), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.6.post_attention_layernorm_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.6.cross_attention:MultiheadAttention(hidden_size=768, num_heads=12, is_cross_attention=True)) (INPUT:_model.t5_model.decoder.layers.6.cross_attention_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.decoder.layers.6.cross_attention_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.decoder.layers.6.cross_attention_input.1.0_attention_mask:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 512), dtype=oneflow.bool)) [WARNING](INPUT:_model.t5_model.decoder.layers.6.cross_attention_input.1.1_past_key_value:) (INPUT:_model.t5_model.decoder.layers.6.cross_attention_input.1.2_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32)) [WARNING](INPUT:_model.t5_model.decoder.layers.6.cross_attention_input.1.3_use_cache:) [WARNING](INPUT:_model.t5_model.decoder.layers.6.cross_attention_input.1.4_query_length:) 
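An aside decoding the tensor reprs above: every activation in this dump is a OneFlow global tensor laid out over the 2-D device mesh ranks=[[0, 1], [2, 3], [4, 5], [6, 7]], i.e. four data-parallel groups of two tensor-parallel ranks each. The first SBP entry, oneflow.sbp.split(dim=0), shards the batch dimension across the four groups; the second, oneflow.sbp.broadcast, replicates the tensor across the two ranks inside each group (sharded weights and attention maps instead show split(dim=1) or split(dim=2) there). A minimal sketch that reproduces such a tensor -- assuming an 8-process launch, e.g. via python3 -m oneflow.distributed.launch, and not taken from LiBai's code:

import oneflow as flow

# 4 data-parallel groups x 2 tensor-parallel ranks, exactly as printed above
placement = flow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]])
sbp = (flow.sbp.split(0), flow.sbp.broadcast)

# hidden_states: (micro_batch, tgt_len, hidden) = (64, 114, 768) in the records above
x = flow.randn(64, 114, 768, placement=placement, sbp=sbp)
print(x.placement, x.sbp)  # same placement/SBP signature as the INPUT/OUTPUT lines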
(MODULE:model.t5_model.decoder.layers.6.cross_attention.query:Linear1D(in_features=768, out_features=768, bias=False, parallel=col)) (INPUT:_model.t5_model.decoder.layers.6.cross_attention.query_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.6.cross_attention.query.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(768, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.6.cross_attention.query_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.6.cross_attention.key_value:Linear1D(in_features=768, out_features=1536, bias=False, parallel=col)) (INPUT:_model.t5_model.decoder.layers.6.cross_attention.key_value_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.6.cross_attention.key_value.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(1536, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.6.cross_attention.key_value_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 512, 1536), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.6.cross_attention.dropout:Dropout(p=0.1, inplace=False)) (INPUT:_model.t5_model.decoder.layers.6.cross_attention.dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.6.cross_attention.dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.6.cross_attention.dense:Linear1D(in_features=768, out_features=768, bias=False, parallel=row)) (INPUT:_model.t5_model.decoder.layers.6.cross_attention.dense_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.6.cross_attention.dense.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=1)), size=(768, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.6.cross_attention.dense_output.0.0_2:tensor(..., 
placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.6.cross_attention.output_dropout:Dropout(p=0.1, inplace=False)) (INPUT:_model.t5_model.decoder.layers.6.cross_attention.output_dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.6.cross_attention.output_dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.6.cross_attention_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.6.cross_attention_output.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32)) (MODULE:model.t5_model.decoder.layers.6.drop_path:Identity()) (INPUT:_model.t5_model.decoder.layers.6.drop_path_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.6.drop_path_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.6.post_cross_attention_layernorm:LayerNorm()) (INPUT:_model.t5_model.decoder.layers.6.post_cross_attention_layernorm_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.6.post_cross_attention_layernorm.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), size=(768,), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.6.post_cross_attention_layernorm_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.6.mlp:MT5MLP()) (INPUT:_model.t5_model.decoder.layers.6.mlp_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.6.mlp.wi_0:Linear1D(in_features=768, out_features=3072, bias=False, parallel=col)) 
(INPUT:_model.t5_model.decoder.layers.6.mlp.wi_0_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.6.mlp.wi_0.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(3072, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.6.mlp.wi_0_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.6.mlp.activation_func:GeLUTanh()) (INPUT:_model.t5_model.decoder.layers.6.mlp.activation_func_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.6.mlp.activation_func_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.6.mlp.wi_1:Linear1D(in_features=768, out_features=3072, bias=False, parallel=col)) (INPUT:_model.t5_model.decoder.layers.6.mlp.wi_1_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.6.mlp.wi_1.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(3072, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.6.mlp.wi_1_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.6.mlp.wo:Linear1D(in_features=3072, out_features=768, bias=False, parallel=row)) (INPUT:_model.t5_model.decoder.layers.6.mlp.wo_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.6.mlp.wo.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=1)), size=(768, 3072), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.6.mlp.wo_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.6.mlp.dropout:Dropout(p=0.1, inplace=False)) 
(INPUT:_model.t5_model.decoder.layers.6.mlp.dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.6.mlp.dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.6.mlp_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.6.drop_path:Identity()) (INPUT:_model.t5_model.decoder.layers.6.drop_path_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.6.drop_path_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.6_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.6_output.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.6_output.0.2_4:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32)) (MODULE:model.t5_model.decoder.layers.7:TransformerLayer()) (INPUT:_model.t5_model.decoder.layers.7_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.decoder.layers.7_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 114), dtype=oneflow.bool)) (INPUT:_model.t5_model.decoder.layers.7_input.0.2_4:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.decoder.layers.7_input.0.3_5:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 512), dtype=oneflow.bool)) [WARNING](INPUT:_model.t5_model.decoder.layers.7_input.1.0_past_key_value:) 
(INPUT:_model.t5_model.decoder.layers.7_input.1.1_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.decoder.layers.7_input.1.2_encoder_decoder_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32)) [WARNING](INPUT:_model.t5_model.decoder.layers.7_input.1.3_use_cache:) (INPUT:_model.t5_model.decoder.layers.7_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_to_global transformation has been done. (INPUT:_model.t5_model.decoder.layers.7_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 114), dtype=oneflow.bool)) is a Tensor, insert_to_global transformation has been done. (INPUT:_model.t5_model.decoder.layers.7_input.0.2_4:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_to_global transformation has been done. (INPUT:_model.t5_model.decoder.layers.7_input.0.3_5:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 512), dtype=oneflow.bool)) is a Tensor, insert_to_global transformation has been done. [WARNING](INPUT:_model.t5_model.decoder.layers.7_input.1.0_past_key_value:) is not a Tensor, insert_to_global transformation will be ignored. (INPUT:_model.t5_model.decoder.layers.7_input.1.1_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_to_global transformation has been done. (INPUT:_model.t5_model.decoder.layers.7_input.1.2_encoder_decoder_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32)) is a Tensor, insert_to_global transformation has been done. [WARNING](INPUT:_model.t5_model.decoder.layers.7_input.1.3_use_cache:) is not a Tensor, insert_to_global transformation will be ignored. (INPUT:_model.t5_model.decoder.layers.7_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done. 
(INPUT:_model.t5_model.decoder.layers.7_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 114), dtype=oneflow.bool)) is a Tensor, insert_identity transformation has been done. (INPUT:_model.t5_model.decoder.layers.7_input.0.2_4:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done. (INPUT:_model.t5_model.decoder.layers.7_input.0.3_5:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 512), dtype=oneflow.bool)) is a Tensor, insert_identity transformation has been done. [WARNING](INPUT:_model.t5_model.decoder.layers.7_input.1.0_past_key_value:) is not a Tensor, insert_identity transformation will be ignored. (INPUT:_model.t5_model.decoder.layers.7_input.1.1_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done. (INPUT:_model.t5_model.decoder.layers.7_input.1.2_encoder_decoder_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32)) is a Tensor, insert_identity transformation has been done. [WARNING](INPUT:_model.t5_model.decoder.layers.7_input.1.3_use_cache:) is not a Tensor, insert_identity transformation will be ignored. 
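The paired messages above show what the tracer does at each module boundary: every tensor argument of decoder.layers.7 is re-wrapped with a to_global call (pinning its placement/SBP into the lazy graph) and then an identity node, while non-tensor arguments -- the None past_key_value and the use_cache flag -- are skipped and flagged with [WARNING]. A rough sketch of that dispatch, with a hypothetical helper name rather than LiBai's actual implementation:

import oneflow as flow

def wrap_module_inputs(args, placement, sbp):
    # Tensors: "is a Tensor, insert_to_global transformation has been done."
    # Others:  "is not a Tensor, ... transformation will be ignored."
    wrapped = []
    for a in args:
        if isinstance(a, flow.Tensor):
            wrapped.append(a.to_global(placement=placement, sbp=sbp))
        else:
            wrapped.append(a)
    return tuple(wrapped)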
(MODULE:model.t5_model.decoder.layers.7.input_layernorm:LayerNorm()) (INPUT:_model.t5_model.decoder.layers.7.input_layernorm_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.7.input_layernorm.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), size=(768,), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.7.input_layernorm_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.7.self_attention:MultiheadAttention(hidden_size=768, num_heads=12, is_cross_attention=False)) (INPUT:_model.t5_model.decoder.layers.7.self_attention_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.decoder.layers.7.self_attention_input.1.0_attention_mask:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 114), dtype=oneflow.bool)) [WARNING](INPUT:_model.t5_model.decoder.layers.7.self_attention_input.1.1_past_key_value:) (INPUT:_model.t5_model.decoder.layers.7.self_attention_input.1.2_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) [WARNING](INPUT:_model.t5_model.decoder.layers.7.self_attention_input.1.3_use_cache:) (MODULE:model.t5_model.decoder.layers.7.self_attention.query_key_value:Linear1D(in_features=768, out_features=2304, bias=False, parallel=col)) (INPUT:_model.t5_model.decoder.layers.7.self_attention.query_key_value_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.7.self_attention.query_key_value.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(2304, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.7.self_attention.query_key_value_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 2304), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.7.self_attention.dropout:Dropout(p=0.1, inplace=False)) (INPUT:_model.t5_model.decoder.layers.7.self_attention.dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) 
(OUTPUT:_model.t5_model.decoder.layers.7.self_attention.dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.7.self_attention.dense:Linear1D(in_features=768, out_features=768, bias=False, parallel=row)) (INPUT:_model.t5_model.decoder.layers.7.self_attention.dense_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.7.self_attention.dense.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=1)), size=(768, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.7.self_attention.dense_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.7.self_attention.output_dropout:Dropout(p=0.1, inplace=False)) (INPUT:_model.t5_model.decoder.layers.7.self_attention.output_dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.7.self_attention.output_dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.7.self_attention_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.7.self_attention_output.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.7.drop_path:Identity()) (INPUT:_model.t5_model.decoder.layers.7.drop_path_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.7.drop_path_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.7.post_attention_layernorm:LayerNorm()) (INPUT:_model.t5_model.decoder.layers.7.post_attention_layernorm_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 
114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.7.post_attention_layernorm.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), size=(768,), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.7.post_attention_layernorm_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.7.cross_attention:MultiheadAttention(hidden_size=768, num_heads=12, is_cross_attention=True)) (INPUT:_model.t5_model.decoder.layers.7.cross_attention_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.decoder.layers.7.cross_attention_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.decoder.layers.7.cross_attention_input.1.0_attention_mask:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 512), dtype=oneflow.bool)) [WARNING](INPUT:_model.t5_model.decoder.layers.7.cross_attention_input.1.1_past_key_value:) (INPUT:_model.t5_model.decoder.layers.7.cross_attention_input.1.2_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32)) [WARNING](INPUT:_model.t5_model.decoder.layers.7.cross_attention_input.1.3_use_cache:) [WARNING](INPUT:_model.t5_model.decoder.layers.7.cross_attention_input.1.4_query_length:) (MODULE:model.t5_model.decoder.layers.7.cross_attention.query:Linear1D(in_features=768, out_features=768, bias=False, parallel=col)) (INPUT:_model.t5_model.decoder.layers.7.cross_attention.query_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.7.cross_attention.query.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(768, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.7.cross_attention.query_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.7.cross_attention.key_value:Linear1D(in_features=768, out_features=1536, bias=False, parallel=col)) (INPUT:_model.t5_model.decoder.layers.7.cross_attention.key_value_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), 
oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.7.cross_attention.key_value.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(1536, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.7.cross_attention.key_value_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 512, 1536), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.7.cross_attention.dropout:Dropout(p=0.1, inplace=False)) (INPUT:_model.t5_model.decoder.layers.7.cross_attention.dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.7.cross_attention.dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.7.cross_attention.dense:Linear1D(in_features=768, out_features=768, bias=False, parallel=row)) (INPUT:_model.t5_model.decoder.layers.7.cross_attention.dense_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.7.cross_attention.dense.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=1)), size=(768, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.7.cross_attention.dense_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.7.cross_attention.output_dropout:Dropout(p=0.1, inplace=False)) (INPUT:_model.t5_model.decoder.layers.7.cross_attention.output_dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.7.cross_attention.output_dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.7.cross_attention_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.7.cross_attention_output.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 
7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32)) (MODULE:model.t5_model.decoder.layers.7.drop_path:Identity()) (INPUT:_model.t5_model.decoder.layers.7.drop_path_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.7.drop_path_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.7.post_cross_attention_layernorm:LayerNorm()) (INPUT:_model.t5_model.decoder.layers.7.post_cross_attention_layernorm_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.7.post_cross_attention_layernorm.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), size=(768,), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.7.post_cross_attention_layernorm_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.7.mlp:MT5MLP()) (INPUT:_model.t5_model.decoder.layers.7.mlp_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.7.mlp.wi_0:Linear1D(in_features=768, out_features=3072, bias=False, parallel=col)) (INPUT:_model.t5_model.decoder.layers.7.mlp.wi_0_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.7.mlp.wi_0.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(3072, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.7.mlp.wi_0_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.7.mlp.activation_func:GeLUTanh()) (INPUT:_model.t5_model.decoder.layers.7.mlp.activation_func_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.7.mlp.activation_func_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), 
sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.7.mlp.wi_1:Linear1D(in_features=768, out_features=3072, bias=False, parallel=col)) (INPUT:_model.t5_model.decoder.layers.7.mlp.wi_1_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.7.mlp.wi_1.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(3072, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.7.mlp.wi_1_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.7.mlp.wo:Linear1D(in_features=3072, out_features=768, bias=False, parallel=row)) (INPUT:_model.t5_model.decoder.layers.7.mlp.wo_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.7.mlp.wo.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=1)), size=(768, 3072), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.7.mlp.wo_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.7.mlp.dropout:Dropout(p=0.1, inplace=False)) (INPUT:_model.t5_model.decoder.layers.7.mlp.dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.7.mlp.dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.7.mlp_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.7.drop_path:Identity()) (INPUT:_model.t5_model.decoder.layers.7.drop_path_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.7.drop_path_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 
114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.7_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.7_output.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.7_output.0.2_4:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32)) (MODULE:model.t5_model.decoder.layers.8:TransformerLayer()) (INPUT:_model.t5_model.decoder.layers.8_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.decoder.layers.8_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 114), dtype=oneflow.bool)) (INPUT:_model.t5_model.decoder.layers.8_input.0.2_4:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.decoder.layers.8_input.0.3_5:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 512), dtype=oneflow.bool)) [WARNING](INPUT:_model.t5_model.decoder.layers.8_input.1.0_past_key_value:) (INPUT:_model.t5_model.decoder.layers.8_input.1.1_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.decoder.layers.8_input.1.2_encoder_decoder_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32)) [WARNING](INPUT:_model.t5_model.decoder.layers.8_input.1.3_use_cache:) (INPUT:_model.t5_model.decoder.layers.8_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_to_global transformation has been done. (INPUT:_model.t5_model.decoder.layers.8_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 114), dtype=oneflow.bool)) is a Tensor, insert_to_global transformation has been done. 
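The placement and sbp signatures repeated in every record of this trace follow directly from the dist config: data_parallel_size=4 and tensor_parallel_size=2 give the 2D rank mesh [[0, 1], [2, 3], [4, 5], [6, 7]], whose first axis is data-parallel and whose second axis is tensor-parallel. An activation with sbp=(split(dim=0), broadcast) is therefore batch-sharded across the four DP groups and fully replicated inside each TP pair, and the leading 64 matches train_micro_batch_size=16 concatenated over the four DP groups. A minimal sketch of constructing a tensor with this exact signature by hand (illustrative only, not part of the training script):

import oneflow as flow

# 2D device mesh from dist=dict(data_parallel_size=4, tensor_parallel_size=2):
# mesh axis 0 = data parallel (4 groups), mesh axis 1 = tensor parallel (2 ranks)
placement = flow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]])

# one sbp entry per mesh axis: split(0) shards the batch dim over the DP axis,
# broadcast replicates the tensor inside each TP pair
sbp = (flow.sbp.split(0), flow.sbp.broadcast)

# same signature as the (64, 114, 768) hidden states printed in the trace
hidden_states = flow.randn(64, 114, 768, placement=placement, sbp=sbp)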
(INPUT:_model.t5_model.decoder.layers.8_input.0.2_4:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_to_global transformation has been done. (INPUT:_model.t5_model.decoder.layers.8_input.0.3_5:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 512), dtype=oneflow.bool)) is a Tensor, insert_to_global transformation has been done. [WARNING](INPUT:_model.t5_model.decoder.layers.8_input.1.0_past_key_value:) is not a Tensor, insert_to_global transformation will be ignored. (INPUT:_model.t5_model.decoder.layers.8_input.1.1_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_to_global transformation has been done. (INPUT:_model.t5_model.decoder.layers.8_input.1.2_encoder_decoder_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32)) is a Tensor, insert_to_global transformation has been done. [WARNING](INPUT:_model.t5_model.decoder.layers.8_input.1.3_use_cache:) is not a Tensor, insert_to_global transformation will be ignored. (INPUT:_model.t5_model.decoder.layers.8_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done. (INPUT:_model.t5_model.decoder.layers.8_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 114), dtype=oneflow.bool)) is a Tensor, insert_identity transformation has been done. (INPUT:_model.t5_model.decoder.layers.8_input.0.2_4:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done. (INPUT:_model.t5_model.decoder.layers.8_input.0.3_5:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 512), dtype=oneflow.bool)) is a Tensor, insert_identity transformation has been done. [WARNING](INPUT:_model.t5_model.decoder.layers.8_input.1.0_past_key_value:) is not a Tensor, insert_identity transformation will be ignored. (INPUT:_model.t5_model.decoder.layers.8_input.1.1_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done. 
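The paired messages around this point show the graph wrapper normalizing each layer input while tracing: real tensors are converted to global tensors (insert_to_global) and routed through an identity op (insert_identity) so the module boundary stays visible in the compiled graph, while non-tensor arguments such as past_key_value=None and the Python bool use_cache are skipped with a [WARNING], which is expected and harmless during static tracing. A rough sketch of the guard these messages imply (an assumption for illustration, not LiBai's actual code):

import oneflow as flow

def maybe_to_global(name, value, placement, sbp):
    # Only real tensors can carry placement/sbp; everything else passes through.
    if isinstance(value, flow.Tensor):
        print(f"(INPUT:{name}) is a Tensor, insert_to_global transformation has been done.")
        return value.to_global(placement=placement, sbp=sbp)
    print(f"[WARNING](INPUT:{name}:) is not a Tensor, insert_to_global transformation will be ignored.")
    return value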
(INPUT:_model.t5_model.decoder.layers.8_input.1.2_encoder_decoder_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32)) is a Tensor, insert_identity transformation has been done. [WARNING](INPUT:_model.t5_model.decoder.layers.8_input.1.3_use_cache:) is not a Tensor, insert_identity transformation will be ignored. (MODULE:model.t5_model.decoder.layers.8.input_layernorm:LayerNorm()) (INPUT:_model.t5_model.decoder.layers.8.input_layernorm_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.8.input_layernorm.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), size=(768,), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.8.input_layernorm_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.8.self_attention:MultiheadAttention(hidden_size=768, num_heads=12, is_cross_attention=False)) (INPUT:_model.t5_model.decoder.layers.8.self_attention_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.decoder.layers.8.self_attention_input.1.0_attention_mask:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 114), dtype=oneflow.bool)) [WARNING](INPUT:_model.t5_model.decoder.layers.8.self_attention_input.1.1_past_key_value:) (INPUT:_model.t5_model.decoder.layers.8.self_attention_input.1.2_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) [WARNING](INPUT:_model.t5_model.decoder.layers.8.self_attention_input.1.3_use_cache:) (MODULE:model.t5_model.decoder.layers.8.self_attention.query_key_value:Linear1D(in_features=768, out_features=2304, bias=False, parallel=col)) (INPUT:_model.t5_model.decoder.layers.8.self_attention.query_key_value_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.8.self_attention.query_key_value.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(2304, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.8.self_attention.query_key_value_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), 
oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 2304), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.8.self_attention.dropout:Dropout(p=0.1, inplace=False)) (INPUT:_model.t5_model.decoder.layers.8.self_attention.dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.8.self_attention.dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.8.self_attention.dense:Linear1D(in_features=768, out_features=768, bias=False, parallel=row)) (INPUT:_model.t5_model.decoder.layers.8.self_attention.dense_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.8.self_attention.dense.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=1)), size=(768, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.8.self_attention.dense_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.8.self_attention.output_dropout:Dropout(p=0.1, inplace=False)) (INPUT:_model.t5_model.decoder.layers.8.self_attention.output_dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.8.self_attention.output_dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.8.self_attention_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.8.self_attention_output.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.8.drop_path:Identity()) (INPUT:_model.t5_model.decoder.layers.8.drop_path_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.8.drop_path_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", 
ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.8.post_attention_layernorm:LayerNorm()) (INPUT:_model.t5_model.decoder.layers.8.post_attention_layernorm_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.8.post_attention_layernorm.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), size=(768,), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.8.post_attention_layernorm_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.8.cross_attention:MultiheadAttention(hidden_size=768, num_heads=12, is_cross_attention=True)) (INPUT:_model.t5_model.decoder.layers.8.cross_attention_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.decoder.layers.8.cross_attention_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.decoder.layers.8.cross_attention_input.1.0_attention_mask:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 512), dtype=oneflow.bool)) [WARNING](INPUT:_model.t5_model.decoder.layers.8.cross_attention_input.1.1_past_key_value:) (INPUT:_model.t5_model.decoder.layers.8.cross_attention_input.1.2_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32)) [WARNING](INPUT:_model.t5_model.decoder.layers.8.cross_attention_input.1.3_use_cache:) [WARNING](INPUT:_model.t5_model.decoder.layers.8.cross_attention_input.1.4_query_length:) (MODULE:model.t5_model.decoder.layers.8.cross_attention.query:Linear1D(in_features=768, out_features=768, bias=False, parallel=col)) (INPUT:_model.t5_model.decoder.layers.8.cross_attention.query_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.8.cross_attention.query.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(768, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.8.cross_attention.query_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 
7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.8.cross_attention.key_value:Linear1D(in_features=768, out_features=1536, bias=False, parallel=col)) (INPUT:_model.t5_model.decoder.layers.8.cross_attention.key_value_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.8.cross_attention.key_value.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(1536, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.8.cross_attention.key_value_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 512, 1536), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.8.cross_attention.dropout:Dropout(p=0.1, inplace=False)) (INPUT:_model.t5_model.decoder.layers.8.cross_attention.dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.8.cross_attention.dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.8.cross_attention.dense:Linear1D(in_features=768, out_features=768, bias=False, parallel=row)) (INPUT:_model.t5_model.decoder.layers.8.cross_attention.dense_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.8.cross_attention.dense.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=1)), size=(768, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.8.cross_attention.dense_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.8.cross_attention.output_dropout:Dropout(p=0.1, inplace=False)) (INPUT:_model.t5_model.decoder.layers.8.cross_attention.output_dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.8.cross_attention.output_dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), 
dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.8.cross_attention_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.8.cross_attention_output.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32)) (MODULE:model.t5_model.decoder.layers.8.drop_path:Identity()) (INPUT:_model.t5_model.decoder.layers.8.drop_path_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.8.drop_path_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.8.post_cross_attention_layernorm:LayerNorm()) (INPUT:_model.t5_model.decoder.layers.8.post_cross_attention_layernorm_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.8.post_cross_attention_layernorm.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), size=(768,), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.8.post_cross_attention_layernorm_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.8.mlp:MT5MLP()) (INPUT:_model.t5_model.decoder.layers.8.mlp_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.8.mlp.wi_0:Linear1D(in_features=768, out_features=3072, bias=False, parallel=col)) (INPUT:_model.t5_model.decoder.layers.8.mlp.wi_0_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.8.mlp.wi_0.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(3072, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.8.mlp.wi_0_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=)) 
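The Linear1D records above show the two halves of Megatron-style tensor parallelism. A parallel=col layer (query, key_value, wi_0, wi_1) stores its (out_features, in_features) weight with sbp=(broadcast, split(dim=0)), so each rank of a TP pair holds half the output features and the activation leaves split on its last dim, split(dim=2); a parallel=row layer (dense, wo) stores its weight with sbp=(broadcast, split(dim=1)), consumes that split activation, and its per-rank partial results must be reduced back to broadcast. The cross-attention key_value layer also explains the 1536: it projects the 512 encoder positions into K and V concatenated (2 x 768). A hand-rolled sketch of both steps (illustrative; the real layers are LiBai's Linear1D, and the sbp comments describe the expected inference):

import oneflow as flow

P = flow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]])
B, S = flow.sbp.broadcast, flow.sbp.split

x = flow.randn(64, 512, 768, placement=P, sbp=(S(0), B))       # encoder states
w_kv = flow.randn(1536, 768, placement=P, sbp=(B, S(0)))       # column-parallel weight
kv = flow.matmul(x, w_kv.transpose(0, 1))                      # (64, 512, 1536), expected sbp (S(0), S(2))

ctx = flow.randn(64, 114, 768, placement=P, sbp=(S(0), S(2)))  # attention context, still TP-split
w_dense = flow.randn(768, 768, placement=P, sbp=(B, S(1)))     # row-parallel weight
partial = flow.matmul(ctx, w_dense.transpose(0, 1))            # each TP rank holds a partial sum
out = partial.to_global(placement=P, sbp=(S(0), B))            # all-reduce back to broadcast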
(MODULE:model.t5_model.decoder.layers.8.mlp.activation_func:GeLUTanh()) (INPUT:_model.t5_model.decoder.layers.8.mlp.activation_func_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.8.mlp.activation_func_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.8.mlp.wi_1:Linear1D(in_features=768, out_features=3072, bias=False, parallel=col)) (INPUT:_model.t5_model.decoder.layers.8.mlp.wi_1_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.8.mlp.wi_1.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(3072, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.8.mlp.wi_1_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.8.mlp.wo:Linear1D(in_features=3072, out_features=768, bias=False, parallel=row)) (INPUT:_model.t5_model.decoder.layers.8.mlp.wo_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.8.mlp.wo.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=1)), size=(768, 3072), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.8.mlp.wo_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.8.mlp.dropout:Dropout(p=0.1, inplace=False)) (INPUT:_model.t5_model.decoder.layers.8.mlp.dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.8.mlp.dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.8.mlp_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) 
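The MT5MLP records above trace the gated feed-forward used by mT5 (T5 v1.1): two column-parallel input projections, one passed through GeLUTanh as a gate, multiplied elementwise with the other, then projected back by the row-parallel wo, with dropout last; the shapes run (64, 114, 768) -> (64, 114, 3072) -> (64, 114, 768) exactly as printed. A minimal local (non-parallel) sketch, assuming standard gated-GELU semantics rather than quoting LiBai's implementation:

import oneflow as flow
import oneflow.nn as nn

class GatedMLP(nn.Module):
    def __init__(self, hidden_size=768, intermediate_size=3072, dropout=0.1):
        super().__init__()
        self.wi_0 = nn.Linear(hidden_size, intermediate_size, bias=False)  # gate branch
        self.wi_1 = nn.Linear(hidden_size, intermediate_size, bias=False)  # value branch
        self.wo = nn.Linear(intermediate_size, hidden_size, bias=False)
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, x):
        gate = flow.gelu(self.wi_0(x))  # stand-in for the trace's GeLUTanh
        return self.dropout(self.wo(gate * self.wi_1(x)))

mlp = GatedMLP()
y = mlp(flow.randn(64, 114, 768))  # -> (64, 114, 768), as in the records above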
(MODULE:model.t5_model.decoder.layers.8.drop_path:Identity()) (INPUT:_model.t5_model.decoder.layers.8.drop_path_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.8.drop_path_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.8_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.8_output.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.8_output.0.2_4:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32)) (MODULE:model.t5_model.decoder.layers.9:TransformerLayer()) (INPUT:_model.t5_model.decoder.layers.9_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.decoder.layers.9_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 114), dtype=oneflow.bool)) (INPUT:_model.t5_model.decoder.layers.9_input.0.2_4:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.decoder.layers.9_input.0.3_5:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 512), dtype=oneflow.bool)) [WARNING](INPUT:_model.t5_model.decoder.layers.9_input.1.0_past_key_value:) (INPUT:_model.t5_model.decoder.layers.9_input.1.1_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.decoder.layers.9_input.1.2_encoder_decoder_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32)) [WARNING](INPUT:_model.t5_model.decoder.layers.9_input.1.3_use_cache:) (INPUT:_model.t5_model.decoder.layers.9_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), 
dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_to_global transformation has been done. (INPUT:_model.t5_model.decoder.layers.9_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 114), dtype=oneflow.bool)) is a Tensor, insert_to_global transformation has been done. (INPUT:_model.t5_model.decoder.layers.9_input.0.2_4:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_to_global transformation has been done. (INPUT:_model.t5_model.decoder.layers.9_input.0.3_5:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 512), dtype=oneflow.bool)) is a Tensor, insert_to_global transformation has been done. [WARNING](INPUT:_model.t5_model.decoder.layers.9_input.1.0_past_key_value:) is not a Tensor, insert_to_global transformation will be ignored. (INPUT:_model.t5_model.decoder.layers.9_input.1.1_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_to_global transformation has been done. (INPUT:_model.t5_model.decoder.layers.9_input.1.2_encoder_decoder_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32)) is a Tensor, insert_to_global transformation has been done. [WARNING](INPUT:_model.t5_model.decoder.layers.9_input.1.3_use_cache:) is not a Tensor, insert_to_global transformation will be ignored. (INPUT:_model.t5_model.decoder.layers.9_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done. (INPUT:_model.t5_model.decoder.layers.9_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 114), dtype=oneflow.bool)) is a Tensor, insert_identity transformation has been done. (INPUT:_model.t5_model.decoder.layers.9_input.0.2_4:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done. (INPUT:_model.t5_model.decoder.layers.9_input.0.3_5:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 512), dtype=oneflow.bool)) is a Tensor, insert_identity transformation has been done. [WARNING](INPUT:_model.t5_model.decoder.layers.9_input.1.0_past_key_value:) is not a Tensor, insert_identity transformation will be ignored. 
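Note the two bias inputs each decoder layer receives: position_bias (64, 12, 114, 114) for self-attention and encoder_decoder_position_bias (64, 12, 114, 512) for cross-attention, where 114 is apparently this batch's decoder (target) length and 512 the encoder length. Per the model printout, only layer 0 owns a relative_attention_bias embedding; it computes the bias once and every later layer (8, 9, ... in this trace) reuses it, simply adding it to the raw attention scores before the mask and softmax. A schematic of that step (shapes copied from the trace; local tensors for brevity, and True-means-masked is an assumption):

import oneflow as flow

scores = flow.randn(64, 12, 114, 114)         # (batch, heads, query_len, key_len)
position_bias = flow.randn(64, 12, 114, 114)  # computed by layer 0, threaded through layers 1..11
attention_mask = flow.zeros(64, 1, 114, 114, dtype=flow.bool)  # assumed: True marks masked positions

scores = scores + position_bias               # relative bias is purely additive
scores = scores.masked_fill(attention_mask, float("-inf"))
probs = flow.softmax(scores, dim=-1)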
(INPUT:_model.t5_model.decoder.layers.9_input.1.1_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done. (INPUT:_model.t5_model.decoder.layers.9_input.1.2_encoder_decoder_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32)) is a Tensor, insert_identity transformation has been done. [WARNING](INPUT:_model.t5_model.decoder.layers.9_input.1.3_use_cache:) is not a Tensor, insert_identity transformation will be ignored. (MODULE:model.t5_model.decoder.layers.9.input_layernorm:LayerNorm()) (INPUT:_model.t5_model.decoder.layers.9.input_layernorm_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.9.input_layernorm.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), size=(768,), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.9.input_layernorm_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.9.self_attention:MultiheadAttention(hidden_size=768, num_heads=12, is_cross_attention=False)) (INPUT:_model.t5_model.decoder.layers.9.self_attention_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.decoder.layers.9.self_attention_input.1.0_attention_mask:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 114), dtype=oneflow.bool)) [WARNING](INPUT:_model.t5_model.decoder.layers.9.self_attention_input.1.1_past_key_value:) (INPUT:_model.t5_model.decoder.layers.9.self_attention_input.1.2_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) [WARNING](INPUT:_model.t5_model.decoder.layers.9.self_attention_input.1.3_use_cache:) (MODULE:model.t5_model.decoder.layers.9.self_attention.query_key_value:Linear1D(in_features=768, out_features=2304, bias=False, parallel=col)) (INPUT:_model.t5_model.decoder.layers.9.self_attention.query_key_value_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.9.self_attention.query_key_value.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], 
[6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(2304, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.9.self_attention.query_key_value_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 2304), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.9.self_attention.dropout:Dropout(p=0.1, inplace=False)) (INPUT:_model.t5_model.decoder.layers.9.self_attention.dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.9.self_attention.dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.9.self_attention.dense:Linear1D(in_features=768, out_features=768, bias=False, parallel=row)) (INPUT:_model.t5_model.decoder.layers.9.self_attention.dense_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.9.self_attention.dense.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=1)), size=(768, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.9.self_attention.dense_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.9.self_attention.output_dropout:Dropout(p=0.1, inplace=False)) (INPUT:_model.t5_model.decoder.layers.9.self_attention.output_dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.9.self_attention.output_dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.9.self_attention_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.9.self_attention_output.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.9.drop_path:Identity()) 
(INPUT:_model.t5_model.decoder.layers.9.drop_path_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.9.drop_path_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.9.post_attention_layernorm:LayerNorm()) (INPUT:_model.t5_model.decoder.layers.9.post_attention_layernorm_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.9.post_attention_layernorm.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), size=(768,), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.9.post_attention_layernorm_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.9.cross_attention:MultiheadAttention(hidden_size=768, num_heads=12, is_cross_attention=True)) (INPUT:_model.t5_model.decoder.layers.9.cross_attention_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.decoder.layers.9.cross_attention_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.decoder.layers.9.cross_attention_input.1.0_attention_mask:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 512), dtype=oneflow.bool)) [WARNING](INPUT:_model.t5_model.decoder.layers.9.cross_attention_input.1.1_past_key_value:) (INPUT:_model.t5_model.decoder.layers.9.cross_attention_input.1.2_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32)) [WARNING](INPUT:_model.t5_model.decoder.layers.9.cross_attention_input.1.3_use_cache:) [WARNING](INPUT:_model.t5_model.decoder.layers.9.cross_attention_input.1.4_query_length:) (MODULE:model.t5_model.decoder.layers.9.cross_attention.query:Linear1D(in_features=768, out_features=768, bias=False, parallel=col)) (INPUT:_model.t5_model.decoder.layers.9.cross_attention.query_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) 
(PARAMETER:model.t5_model.decoder.layers.9.cross_attention.query.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(768, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.9.cross_attention.query_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.9.cross_attention.key_value:Linear1D(in_features=768, out_features=1536, bias=False, parallel=col)) (INPUT:_model.t5_model.decoder.layers.9.cross_attention.key_value_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.9.cross_attention.key_value.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(1536, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.9.cross_attention.key_value_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 512, 1536), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.9.cross_attention.dropout:Dropout(p=0.1, inplace=False)) (INPUT:_model.t5_model.decoder.layers.9.cross_attention.dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.9.cross_attention.dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.9.cross_attention.dense:Linear1D(in_features=768, out_features=768, bias=False, parallel=row)) (INPUT:_model.t5_model.decoder.layers.9.cross_attention.dense_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.9.cross_attention.dense.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=1)), size=(768, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.9.cross_attention.dense_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.9.cross_attention.output_dropout:Dropout(p=0.1, inplace=False)) (INPUT:_model.t5_model.decoder.layers.9.cross_attention.output_dropout_input.0.0_2:tensor(..., 
placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.9.cross_attention.output_dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.9.cross_attention_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.9.cross_attention_output.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32)) (MODULE:model.t5_model.decoder.layers.9.drop_path:Identity()) (INPUT:_model.t5_model.decoder.layers.9.drop_path_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.9.drop_path_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.9.post_cross_attention_layernorm:LayerNorm()) (INPUT:_model.t5_model.decoder.layers.9.post_cross_attention_layernorm_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.9.post_cross_attention_layernorm.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), size=(768,), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.9.post_cross_attention_layernorm_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.9.mlp:MT5MLP()) (INPUT:_model.t5_model.decoder.layers.9.mlp_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.9.mlp.wi_0:Linear1D(in_features=768, out_features=3072, bias=False, parallel=col)) (INPUT:_model.t5_model.decoder.layers.9.mlp.wi_0_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.9.mlp.wi_0.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 
3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(3072, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.9.mlp.wi_0_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.9.mlp.activation_func:GeLUTanh()) (INPUT:_model.t5_model.decoder.layers.9.mlp.activation_func_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.9.mlp.activation_func_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.9.mlp.wi_1:Linear1D(in_features=768, out_features=3072, bias=False, parallel=col)) (INPUT:_model.t5_model.decoder.layers.9.mlp.wi_1_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.9.mlp.wi_1.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(3072, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.9.mlp.wi_1_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.9.mlp.wo:Linear1D(in_features=3072, out_features=768, bias=False, parallel=row)) (INPUT:_model.t5_model.decoder.layers.9.mlp.wo_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.9.mlp.wo.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=1)), size=(768, 3072), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.9.mlp.wo_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.9.mlp.dropout:Dropout(p=0.1, inplace=False)) (INPUT:_model.t5_model.decoder.layers.9.mlp.dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.9.mlp.dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), 
sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.9.mlp_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.9.drop_path:Identity()) (INPUT:_model.t5_model.decoder.layers.9.drop_path_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.9.drop_path_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.9_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.9_output.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.9_output.0.2_4:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32)) (MODULE:model.t5_model.decoder.layers.10:TransformerLayer()) (INPUT:_model.t5_model.decoder.layers.10_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.decoder.layers.10_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 114), dtype=oneflow.bool)) (INPUT:_model.t5_model.decoder.layers.10_input.0.2_4:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.decoder.layers.10_input.0.3_5:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 512), dtype=oneflow.bool)) [WARNING](INPUT:_model.t5_model.decoder.layers.10_input.1.0_past_key_value:) (INPUT:_model.t5_model.decoder.layers.10_input.1.1_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.decoder.layers.10_input.1.2_encoder_decoder_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), 
sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32)) [WARNING](INPUT:_model.t5_model.decoder.layers.10_input.1.3_use_cache:) (INPUT:_model.t5_model.decoder.layers.10_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_to_global transformation has been done. (INPUT:_model.t5_model.decoder.layers.10_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 114), dtype=oneflow.bool)) is a Tensor, insert_to_global transformation has been done. (INPUT:_model.t5_model.decoder.layers.10_input.0.2_4:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_to_global transformation has been done. (INPUT:_model.t5_model.decoder.layers.10_input.0.3_5:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 512), dtype=oneflow.bool)) is a Tensor, insert_to_global transformation has been done. [WARNING](INPUT:_model.t5_model.decoder.layers.10_input.1.0_past_key_value:) is not a Tensor, insert_to_global transformation will be ignored. (INPUT:_model.t5_model.decoder.layers.10_input.1.1_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_to_global transformation has been done. (INPUT:_model.t5_model.decoder.layers.10_input.1.2_encoder_decoder_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32)) is a Tensor, insert_to_global transformation has been done. [WARNING](INPUT:_model.t5_model.decoder.layers.10_input.1.3_use_cache:) is not a Tensor, insert_to_global transformation will be ignored. (INPUT:_model.t5_model.decoder.layers.10_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done. (INPUT:_model.t5_model.decoder.layers.10_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 114), dtype=oneflow.bool)) is a Tensor, insert_identity transformation has been done. (INPUT:_model.t5_model.decoder.layers.10_input.0.2_4:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done. 
(INPUT:_model.t5_model.decoder.layers.10_input.0.3_5:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 512), dtype=oneflow.bool)) is a Tensor, insert_identity transformation has been done. [WARNING](INPUT:_model.t5_model.decoder.layers.10_input.1.0_past_key_value:) is not a Tensor, insert_identity transformation will be ignored. (INPUT:_model.t5_model.decoder.layers.10_input.1.1_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done. (INPUT:_model.t5_model.decoder.layers.10_input.1.2_encoder_decoder_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32)) is a Tensor, insert_identity transformation has been done. [WARNING](INPUT:_model.t5_model.decoder.layers.10_input.1.3_use_cache:) is not a Tensor, insert_identity transformation will be ignored. (MODULE:model.t5_model.decoder.layers.10.input_layernorm:LayerNorm()) (INPUT:_model.t5_model.decoder.layers.10.input_layernorm_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.10.input_layernorm.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), size=(768,), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.10.input_layernorm_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.10.self_attention:MultiheadAttention(hidden_size=768, num_heads=12, is_cross_attention=False)) (INPUT:_model.t5_model.decoder.layers.10.self_attention_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.decoder.layers.10.self_attention_input.1.0_attention_mask:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 114), dtype=oneflow.bool)) [WARNING](INPUT:_model.t5_model.decoder.layers.10.self_attention_input.1.1_past_key_value:) (INPUT:_model.t5_model.decoder.layers.10.self_attention_input.1.2_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) [WARNING](INPUT:_model.t5_model.decoder.layers.10.self_attention_input.1.3_use_cache:) (MODULE:model.t5_model.decoder.layers.10.self_attention.query_key_value:Linear1D(in_features=768, out_features=2304, bias=False, 
parallel=col)) (INPUT:_model.t5_model.decoder.layers.10.self_attention.query_key_value_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.10.self_attention.query_key_value.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(2304, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.10.self_attention.query_key_value_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 2304), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.10.self_attention.dropout:Dropout(p=0.1, inplace=False)) (INPUT:_model.t5_model.decoder.layers.10.self_attention.dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.10.self_attention.dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.10.self_attention.dense:Linear1D(in_features=768, out_features=768, bias=False, parallel=row)) (INPUT:_model.t5_model.decoder.layers.10.self_attention.dense_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.10.self_attention.dense.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=1)), size=(768, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.10.self_attention.dense_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.10.self_attention.output_dropout:Dropout(p=0.1, inplace=False)) (INPUT:_model.t5_model.decoder.layers.10.self_attention.output_dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.10.self_attention.output_dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.10.self_attention_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), 
oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.10.self_attention_output.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.10.drop_path:Identity()) (INPUT:_model.t5_model.decoder.layers.10.drop_path_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.10.drop_path_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.10.post_attention_layernorm:LayerNorm()) (INPUT:_model.t5_model.decoder.layers.10.post_attention_layernorm_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.10.post_attention_layernorm.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), size=(768,), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.10.post_attention_layernorm_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.10.cross_attention:MultiheadAttention(hidden_size=768, num_heads=12, is_cross_attention=True)) (INPUT:_model.t5_model.decoder.layers.10.cross_attention_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.decoder.layers.10.cross_attention_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.decoder.layers.10.cross_attention_input.1.0_attention_mask:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 512), dtype=oneflow.bool)) [WARNING](INPUT:_model.t5_model.decoder.layers.10.cross_attention_input.1.1_past_key_value:) (INPUT:_model.t5_model.decoder.layers.10.cross_attention_input.1.2_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32)) [WARNING](INPUT:_model.t5_model.decoder.layers.10.cross_attention_input.1.3_use_cache:) [WARNING](INPUT:_model.t5_model.decoder.layers.10.cross_attention_input.1.4_query_length:) 
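The cross-attention records that follow are easier to read with the shapes decoded: the 114-token decoder states go through a (768 -> 768) column-parallel query projection, the 512-token encoder states through a fused (768 -> 1536) column-parallel key_value projection, so the attention scores, the encoder-decoder position bias, and the cross-attention mask all come out as (batch, heads, 114, 512). A minimal local-tensor shape sketch (hypothetical code, not LiBai's implementation; head_size 64 = 768 / 12 heads, and the chunk order of the fused key_value is an assumption):

import oneflow as flow

q = flow.randn(64, 12, 114, 64)                      # query heads after the (768 -> 768) projection
kv = flow.randn(64, 512, 1536)                       # fused key_value on the encoder states
k, v = kv.chunk(2, dim=-1)                           # each (64, 512, 768)
k = k.view(64, 512, 12, 64).transpose(1, 2)          # (64, 12, 512, 64)
scores = flow.matmul(q, k.transpose(-2, -1))         # (64, 12, 114, 512), the bias/mask shape in the trace
mask = flow.zeros(64, 1, 114, 512, dtype=flow.bool)  # broadcasts over the 12 heads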
(MODULE:model.t5_model.decoder.layers.10.cross_attention.query:Linear1D(in_features=768, out_features=768, bias=False, parallel=col)) (INPUT:_model.t5_model.decoder.layers.10.cross_attention.query_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.10.cross_attention.query.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(768, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.10.cross_attention.query_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.10.cross_attention.key_value:Linear1D(in_features=768, out_features=1536, bias=False, parallel=col)) (INPUT:_model.t5_model.decoder.layers.10.cross_attention.key_value_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.10.cross_attention.key_value.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(1536, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.10.cross_attention.key_value_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 512, 1536), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.10.cross_attention.dropout:Dropout(p=0.1, inplace=False)) (INPUT:_model.t5_model.decoder.layers.10.cross_attention.dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.10.cross_attention.dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.10.cross_attention.dense:Linear1D(in_features=768, out_features=768, bias=False, parallel=row)) (INPUT:_model.t5_model.decoder.layers.10.cross_attention.dense_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.10.cross_attention.dense.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=1)), size=(768, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.10.cross_attention.dense_output.0.0_2:tensor(..., 
placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.10.cross_attention.output_dropout:Dropout(p=0.1, inplace=False)) (INPUT:_model.t5_model.decoder.layers.10.cross_attention.output_dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.10.cross_attention.output_dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.10.cross_attention_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.10.cross_attention_output.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32)) (MODULE:model.t5_model.decoder.layers.10.drop_path:Identity()) (INPUT:_model.t5_model.decoder.layers.10.drop_path_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.10.drop_path_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.10.post_cross_attention_layernorm:LayerNorm()) (INPUT:_model.t5_model.decoder.layers.10.post_cross_attention_layernorm_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.10.post_cross_attention_layernorm.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), size=(768,), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.10.post_cross_attention_layernorm_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.10.mlp:MT5MLP()) (INPUT:_model.t5_model.decoder.layers.10.mlp_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.10.mlp.wi_0:Linear1D(in_features=768, out_features=3072, bias=False, 
parallel=col)) (INPUT:_model.t5_model.decoder.layers.10.mlp.wi_0_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.10.mlp.wi_0.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(3072, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.10.mlp.wi_0_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.10.mlp.activation_func:GeLUTanh()) (INPUT:_model.t5_model.decoder.layers.10.mlp.activation_func_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.10.mlp.activation_func_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.10.mlp.wi_1:Linear1D(in_features=768, out_features=3072, bias=False, parallel=col)) (INPUT:_model.t5_model.decoder.layers.10.mlp.wi_1_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.10.mlp.wi_1.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(3072, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.10.mlp.wi_1_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.10.mlp.wo:Linear1D(in_features=3072, out_features=768, bias=False, parallel=row)) (INPUT:_model.t5_model.decoder.layers.10.mlp.wo_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.10.mlp.wo.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=1)), size=(768, 3072), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.10.mlp.wo_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.10.mlp.dropout:Dropout(p=0.1, inplace=False)) 
(INPUT:_model.t5_model.decoder.layers.10.mlp.dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.10.mlp.dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.10.mlp_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.10.drop_path:Identity()) (INPUT:_model.t5_model.decoder.layers.10.drop_path_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.10.drop_path_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.10_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.10_output.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.10_output.0.2_4:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32)) (MODULE:model.t5_model.decoder.layers.11:TransformerLayer()) (INPUT:_model.t5_model.decoder.layers.11_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.decoder.layers.11_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 114), dtype=oneflow.bool)) (INPUT:_model.t5_model.decoder.layers.11_input.0.2_4:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.decoder.layers.11_input.0.3_5:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 512), dtype=oneflow.bool)) [WARNING](INPUT:_model.t5_model.decoder.layers.11_input.1.0_past_key_value:) 
(INPUT:_model.t5_model.decoder.layers.11_input.1.1_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.decoder.layers.11_input.1.2_encoder_decoder_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32)) [WARNING](INPUT:_model.t5_model.decoder.layers.11_input.1.3_use_cache:) (INPUT:_model.t5_model.decoder.layers.11_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_to_global transformation has been done. (INPUT:_model.t5_model.decoder.layers.11_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 114), dtype=oneflow.bool)) is a Tensor, insert_to_global transformation has been done. (INPUT:_model.t5_model.decoder.layers.11_input.0.2_4:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_to_global transformation has been done. (INPUT:_model.t5_model.decoder.layers.11_input.0.3_5:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 512), dtype=oneflow.bool)) is a Tensor, insert_to_global transformation has been done. [WARNING](INPUT:_model.t5_model.decoder.layers.11_input.1.0_past_key_value:) is not a Tensor, insert_to_global transformation will be ignored. (INPUT:_model.t5_model.decoder.layers.11_input.1.1_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_to_global transformation has been done. (INPUT:_model.t5_model.decoder.layers.11_input.1.2_encoder_decoder_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32)) is a Tensor, insert_to_global transformation has been done. [WARNING](INPUT:_model.t5_model.decoder.layers.11_input.1.3_use_cache:) is not a Tensor, insert_to_global transformation will be ignored. (INPUT:_model.t5_model.decoder.layers.11_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done. 
(INPUT:_model.t5_model.decoder.layers.11_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 114), dtype=oneflow.bool)) is a Tensor, insert_identity transformation has been done. (INPUT:_model.t5_model.decoder.layers.11_input.0.2_4:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done. (INPUT:_model.t5_model.decoder.layers.11_input.0.3_5:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 512), dtype=oneflow.bool)) is a Tensor, insert_identity transformation has been done. [WARNING](INPUT:_model.t5_model.decoder.layers.11_input.1.0_past_key_value:) is not a Tensor, insert_identity transformation will be ignored. (INPUT:_model.t5_model.decoder.layers.11_input.1.1_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done. (INPUT:_model.t5_model.decoder.layers.11_input.1.2_encoder_decoder_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32)) is a Tensor, insert_identity transformation has been done. [WARNING](INPUT:_model.t5_model.decoder.layers.11_input.1.3_use_cache:) is not a Tensor, insert_identity transformation will be ignored. 
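The paired messages above spell out the rule the graph wrapper applies to every forward argument: anything that is a Tensor is converted to a global tensor (and routed through an identity boundary), while non-Tensor arguments such as past_key_value=None and the use_cache flag are skipped with a WARNING. A minimal sketch of that guard, assuming a helper of this shape (maybe_to_global is an invented name, and the snippet needs the same 8-rank launch as this run):

import oneflow as flow

def maybe_to_global(arg, placement, sbp):
    # Mirrors the log: Tensors get to_global, None / bool flags are ignored.
    if isinstance(arg, flow.Tensor):
        return arg.to_global(placement=placement, sbp=sbp)
    return arg

P = flow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]])
SBP = (flow.sbp.split(0), flow.sbp.broadcast)
hidden = flow.randn(64, 114, 768, placement=P, sbp=SBP)
args = [maybe_to_global(a, P, SBP) for a in (hidden, None, False)]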
(MODULE:model.t5_model.decoder.layers.11.input_layernorm:LayerNorm()) (INPUT:_model.t5_model.decoder.layers.11.input_layernorm_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.11.input_layernorm.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), size=(768,), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.11.input_layernorm_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.11.self_attention:MultiheadAttention(hidden_size=768, num_heads=12, is_cross_attention=False)) (INPUT:_model.t5_model.decoder.layers.11.self_attention_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.decoder.layers.11.self_attention_input.1.0_attention_mask:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 114), dtype=oneflow.bool)) [WARNING](INPUT:_model.t5_model.decoder.layers.11.self_attention_input.1.1_past_key_value:) (INPUT:_model.t5_model.decoder.layers.11.self_attention_input.1.2_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) [WARNING](INPUT:_model.t5_model.decoder.layers.11.self_attention_input.1.3_use_cache:) (MODULE:model.t5_model.decoder.layers.11.self_attention.query_key_value:Linear1D(in_features=768, out_features=2304, bias=False, parallel=col)) (INPUT:_model.t5_model.decoder.layers.11.self_attention.query_key_value_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.11.self_attention.query_key_value.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(2304, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.11.self_attention.query_key_value_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 2304), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.11.self_attention.dropout:Dropout(p=0.1, inplace=False)) (INPUT:_model.t5_model.decoder.layers.11.self_attention.dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, 
grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.11.self_attention.dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.11.self_attention.dense:Linear1D(in_features=768, out_features=768, bias=False, parallel=row)) (INPUT:_model.t5_model.decoder.layers.11.self_attention.dense_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.11.self_attention.dense.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=1)), size=(768, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.11.self_attention.dense_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.11.self_attention.output_dropout:Dropout(p=0.1, inplace=False)) (INPUT:_model.t5_model.decoder.layers.11.self_attention.output_dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.11.self_attention.output_dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.11.self_attention_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.11.self_attention_output.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.11.drop_path:Identity()) (INPUT:_model.t5_model.decoder.layers.11.drop_path_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (OUTPUT:_model.t5_model.decoder.layers.11.drop_path_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.11.post_attention_layernorm:LayerNorm()) (INPUT:_model.t5_model.decoder.layers.11.post_attention_layernorm_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), 
is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.11.post_attention_layernorm.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), size=(768,), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.11.post_attention_layernorm_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.11.cross_attention:MultiheadAttention(hidden_size=768, num_heads=12, is_cross_attention=True)) (INPUT:_model.t5_model.decoder.layers.11.cross_attention_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.decoder.layers.11.cross_attention_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=)) (INPUT:_model.t5_model.decoder.layers.11.cross_attention_input.1.0_attention_mask:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 1, 114, 512), dtype=oneflow.bool)) [WARNING](INPUT:_model.t5_model.decoder.layers.11.cross_attention_input.1.1_past_key_value:) (INPUT:_model.t5_model.decoder.layers.11.cross_attention_input.1.2_position_bias:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32)) [WARNING](INPUT:_model.t5_model.decoder.layers.11.cross_attention_input.1.3_use_cache:) [WARNING](INPUT:_model.t5_model.decoder.layers.11.cross_attention_input.1.4_query_length:) (MODULE:model.t5_model.decoder.layers.11.cross_attention.query:Linear1D(in_features=768, out_features=768, bias=False, parallel=col)) (INPUT:_model.t5_model.decoder.layers.11.cross_attention.query_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (PARAMETER:model.t5_model.decoder.layers.11.cross_attention.query.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(768, 768), dtype=oneflow.float32, requires_grad=True)) (OUTPUT:_model.t5_model.decoder.layers.11.cross_attention.query_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) (MODULE:model.t5_model.decoder.layers.11.cross_attention.key_value:Linear1D(in_features=768, out_features=1536, bias=False, parallel=col)) (INPUT:_model.t5_model.decoder.layers.11.cross_attention.key_value_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), 
sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 512, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.decoder.layers.11.cross_attention.key_value.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(1536, 768), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.decoder.layers.11.cross_attention.key_value_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 512, 1536), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.11.cross_attention.dropout:Dropout(p=0.1, inplace=False))
(INPUT:_model.t5_model.decoder.layers.11.cross_attention.dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.11.cross_attention.dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=1)), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.11.cross_attention.dense:Linear1D(in_features=768, out_features=768, bias=False, parallel=row))
(INPUT:_model.t5_model.decoder.layers.11.cross_attention.dense_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.decoder.layers.11.cross_attention.dense.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=1)), size=(768, 768), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.decoder.layers.11.cross_attention.dense_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.11.cross_attention.output_dropout:Dropout(p=0.1, inplace=False))
(INPUT:_model.t5_model.decoder.layers.11.cross_attention.output_dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.11.cross_attention.output_dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.11.cross_attention_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.11.cross_attention_output.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32))
(MODULE:model.t5_model.decoder.layers.11.drop_path:Identity())
(INPUT:_model.t5_model.decoder.layers.11.drop_path_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.11.drop_path_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.11.post_cross_attention_layernorm:LayerNorm())
(INPUT:_model.t5_model.decoder.layers.11.post_cross_attention_layernorm_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.decoder.layers.11.post_cross_attention_layernorm.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), size=(768,), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.decoder.layers.11.post_cross_attention_layernorm_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.11.mlp:MT5MLP())
(INPUT:_model.t5_model.decoder.layers.11.mlp_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.11.mlp.wi_0:Linear1D(in_features=768, out_features=3072, bias=False, parallel=col))
(INPUT:_model.t5_model.decoder.layers.11.mlp.wi_0_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.decoder.layers.11.mlp.wi_0.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(3072, 768), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.decoder.layers.11.mlp.wi_0_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.11.mlp.activation_func:GeLUTanh())
(INPUT:_model.t5_model.decoder.layers.11.mlp.activation_func_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.11.mlp.activation_func_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.11.mlp.wi_1:Linear1D(in_features=768, out_features=3072, bias=False, parallel=col))
(INPUT:_model.t5_model.decoder.layers.11.mlp.wi_1_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.decoder.layers.11.mlp.wi_1.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=0)), size=(3072, 768), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.decoder.layers.11.mlp.wi_1_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.11.mlp.wo:Linear1D(in_features=3072, out_features=768, bias=False, parallel=row))
(INPUT:_model.t5_model.decoder.layers.11.mlp.wo_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.split(dim=2)), is_lazy='True', size=(64, 114, 3072), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.decoder.layers.11.mlp.wo.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.split(dim=1)), size=(768, 3072), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.decoder.layers.11.mlp.wo_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.decoder.layers.11.mlp.dropout:Dropout(p=0.1, inplace=False))
(INPUT:_model.t5_model.decoder.layers.11.mlp.dropout_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.11.mlp.dropout_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.11.mlp_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
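The MT5MLP trace above is the standard Megatron-style column/row pairing: wi_0 and wi_1 are column-parallel, so their (3072, 768) weights carry sbp (broadcast, split(dim=0)) over the tensor-parallel axis and their activations leave as (split(dim=0), split(dim=2)); wo is row-parallel, its (768, 3072) weight carries (broadcast, split(dim=1)), and its output comes back as (split(dim=0), broadcast). A minimal sketch of the same placement/SBP choices with plain global tensors (not LiBai's Linear1D; assumes launching via oneflow.distributed.launch on the same 8 ranks):

import oneflow as flow

# Hierarchy (4, 2): placement axis 0 = data-parallel groups, axis 1 = tensor-parallel pairs.
P = flow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]])
S, B = flow.sbp.split, flow.sbp.broadcast

x = flow.randn(64, 114, 768, placement=P, sbp=[S(0), B])   # batch-split, replicated over TP
w_col = flow.randn(3072, 768, placement=P, sbp=[B, S(0)])  # wi_*: out_features sharded
h = flow.matmul(x, w_col.transpose(0, 1))                  # sbp becomes (S(0), S(2)), as in the trace
w_row = flow.randn(768, 3072, placement=P, sbp=[B, S(1)])  # wo: in_features sharded
y = flow.matmul(h, w_row.transpose(0, 1))                  # sbp (S(0), P): partial sums over TP
y = y.to_global(placement=P, sbp=[S(0), B])                # all-reduce back to broadcast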
(MODULE:model.t5_model.decoder.layers.11.drop_path:Identity())
(INPUT:_model.t5_model.decoder.layers.11.drop_path_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.11.drop_path_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.11_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.11_output.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 114), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model.decoder.layers.11_output.0.2_4:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 12, 114, 512), dtype=oneflow.float32))
(MODULE:model.t5_model.decoder.final_layernorm:LayerNorm())
(INPUT:_model.t5_model.decoder.final_layernorm_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.decoder.final_layernorm.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), size=(768,), dtype=oneflow.float32, requires_grad=True))
(INPUT:_model.t5_model.decoder.final_layernorm_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_to_global transformation has been done.
(INPUT:_model.t5_model.decoder.final_layernorm_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done.
(OUTPUT:_model.t5_model.decoder.final_layernorm_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(MODULE:model.t5_model.lm_head:Linear1D(in_features=768, out_features=12900, bias=False, parallel=data))
(INPUT:_model.t5_model.lm_head_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 768), dtype=oneflow.float32, grad_fn=))
(PARAMETER:model.t5_model.lm_head.weight:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), size=(12900, 768), dtype=oneflow.float32, requires_grad=True))
(OUTPUT:_model.t5_model.lm_head_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 12900), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.t5_model_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 12900), dtype=oneflow.float32, grad_fn=))
(MODULE:model.loss_func:T5Loss())
(INPUT:_model.loss_func_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 12900), dtype=oneflow.float32, grad_fn=))
(INPUT:_model.loss_func_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114), dtype=oneflow.int64))
(INPUT:_model.loss_func_input.0.2_4:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114), dtype=oneflow.int64))
(INPUT:_model.loss_func_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 12900), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_to_global transformation has been done.
(INPUT:_model.loss_func_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114), dtype=oneflow.int64)) is a Tensor, insert_to_global transformation has been done.
(INPUT:_model.loss_func_input.0.2_4:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114), dtype=oneflow.int64)) is a Tensor, insert_to_global transformation has been done.
(INPUT:_model.loss_func_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 12900), dtype=oneflow.float32, grad_fn=)) is a Tensor, insert_identity transformation has been done.
(INPUT:_model.loss_func_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114), dtype=oneflow.int64)) is a Tensor, insert_identity transformation has been done.
(INPUT:_model.loss_func_input.0.2_4:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114), dtype=oneflow.int64)) is a Tensor, insert_identity transformation has been done.
(MODULE:model.loss_func.lm_loss:ParallelCrossEntropyLoss())
(INPUT:_model.loss_func.lm_loss_input.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114, 12900), dtype=oneflow.float32, grad_fn=))
(INPUT:_model.loss_func.lm_loss_input.0.1_3:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(64, 114), dtype=oneflow.int64))
(OUTPUT:_model.loss_func.lm_loss_output.0.0_2:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.split(dim=0), oneflow.sbp.broadcast), is_lazy='True', size=(7296,), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model.loss_func_output.0.0.0_masked_lm_loss:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.partial_sum, oneflow.sbp.broadcast), is_lazy='True', size=(), dtype=oneflow.float32, grad_fn=))
(OUTPUT:_model_output.0.0.0_masked_lm_loss:tensor(..., placement=oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]]), sbp=(oneflow.sbp.partial_sum, oneflow.sbp.broadcast), is_lazy='True', size=(), dtype=oneflow.float32, grad_fn=))
(GRAPH:GraphBase_0:GraphBase) end building graph modules.
(GRAPH:GraphBase_0:GraphBase) start building graph outputs.
(OUTPUT:_GraphBase_0_output.0.0.0_masked_lm_loss:tensor(..., placement=oneflow.placement(type="cpu", ranks=[[0]]), sbp=(oneflow.sbp.broadcast, oneflow.sbp.broadcast), is_lazy='True', size=(), dtype=oneflow.float32, grad_fn=))
(GRAPH:GraphBase_0:GraphBase) end building graph outputs.
(GRAPH:GraphBase_0:GraphBase) start building graph with compile passes.
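The loss bookkeeping above is easy to sanity-check: ParallelCrossEntropyLoss takes logits of shape (64, 114, 12900), produces one loss per decoder token, which is exactly the (7296,) vector in the trace, and the reduced scalar carries sbp (partial_sum, broadcast), meaning each row of the rank grid holds a partial value whose sum is the global masked_lm_loss. A quick arithmetic check with the values read off the trace:

batch, dec_seq_len = 64, 114           # decoder activation shapes in the trace above
assert batch * dec_seq_len == 7296     # matches lm_loss_output size=(7296,)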
Producer (S(0), S(2)), placement: hierarchy: (4,2), device: cuda Shape: (64,512,2304)
idx: 0, sbp: (S(0), S(0)), placement: hierarchy: (4,2), device: cuda
idx: 1, sbp: (S(0), S(2)), placement: hierarchy: (4,2), device: cuda
op: `model.t5_model.encoder.layers.0.self_attention-reshape-29` can't find available sbp signature.
candidate nd sbp signature are: (in_0) -> (out_0): [ ((S(0), S(0))) -> ((S(0), S(0))), ((S(0), S(2))) -> ((S(0), S(2))), ((S(0), S(1))) -> ((S(0), S(1))), ((S(0), P)) -> ((S(0), P)), ((S(0), B)) -> ((S(0), B)), ((S(2), S(0))) -> ((S(2), S(0))), ((S(2), S(2))) -> ((S(2), S(2))), ((S(2), S(1))) -> ((S(2), S(1))), ((S(2), P)) -> ((S(2), P)), ((S(2), B)) -> ((S(2), B)), ((S(1), S(0))) -> ((S(1), S(0))), ((S(1), S(2))) -> ((S(1), S(2))), ((S(1), S(1))) -> ((S(1), S(1))), ((S(1), P)) -> ((S(1), P)), ((S(1), B)) -> ((S(1), B)), ((P, S(0))) -> ((P, S(0))), ((P, S(2))) -> ((P, S(2))), ((P, S(1))) -> ((P, S(1))), ((P, P)) -> ((P, P)), ((P, B)) -> ((P, B)), ((B, S(0))) -> ((B, S(0))), ((B, S(2))) -> ((B, S(2))), ((B, S(1))) -> ((B, S(1))), ((B, P)) -> ((B, P)), ((B, B)) -> ((B, B)), ], but inputs sbp are: in_0: (S(0), S(2)); select idx: 1
(GRAPH:GraphBase_0:GraphBase) end building graph with compile passes.
(GRAPH:GraphBase_0:GraphBase) start re-building graph outputs for optimization.
(GRAPH:GraphBase_0:GraphBase) end re-building graph outputs for optimization.
(GRAPH:GraphBase_0:GraphBase) building graph Done! Cost time: 219.86s.
(GRAPH:GraphBase_0:GraphBase) start building plan.
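The only substantive diagnostic in the compile passes is the block above: the reshape inside encoder layer 0's self-attention consumes a producer of shape (64, 512, 2304) with nd-sbp (S(0), S(2)) on the (4, 2) hierarchy, reports that it can't find an available sbp signature for that layout, and selects candidate idx 1, (S(0), S(2)) -> (S(0), S(2)). What the producer annotation means per device can be checked with a small sketch (assumes launching via oneflow.distributed.launch --nproc_per_node 8 so that all 8 ranks exist):

import oneflow as flow

P = flow.placement(type="cuda", ranks=[[0, 1], [2, 3], [4, 5], [6, 7]])  # hierarchy (4, 2)
qkv = flow.randn(64, 512, 2304, placement=P, sbp=[flow.sbp.split(0), flow.sbp.split(2)])
# dim 0 is split 4 ways across placement axis 0, dim 2 is split 2 ways across axis 1,
# so every rank holds a local (16, 512, 1152) shard of the (64, 512, 2304) producer.
print(qkv.to_local().shape)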
(GRAPH:GraphBase_0:GraphBase) building plan Done! Cost time: 48.31s.
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2022/10/25 13:51:29.275, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 0 %, 0 %, 12288 MiB, 568 MiB, 11485 MiB
2022/10/25 13:51:29.276, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 1 %, 0 %, 12288 MiB, 548 MiB, 11505 MiB
2022/10/25 13:51:29.276, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 0 %, 0 %, 12288 MiB, 568 MiB, 11485 MiB
2022/10/25 13:51:29.277, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 1 %, 0 %, 12288 MiB, 585 MiB, 11468 MiB
2022/10/25 13:51:29.278, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 1 %, 0 %, 12288 MiB, 548 MiB, 11505 MiB
2022/10/25 13:51:29.279, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 0 %, 0 %, 12288 MiB, 548 MiB, 11505 MiB
2022/10/25 13:51:29.280, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 1 %, 0 %, 12288 MiB, 585 MiB, 11468 MiB
2022/10/25 13:51:29.282, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 0 %, 0 %, 12288 MiB, 548 MiB, 11505 MiB
2022/10/25 13:51:29.285, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 1 %, 0 %, 12288 MiB, 568 MiB, 11485 MiB
2022/10/25 13:51:29.291, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 1 %, 0 %, 12288 MiB, 548 MiB, 11505 MiB
2022/10/25 13:51:29.292, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 1 %, 0 %, 12288 MiB, 585 MiB, 11468 MiB
2022/10/25 13:51:29.293, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 0 %, 0 %, 12288 MiB, 548 MiB, 11505 MiB
[10/25 13:51:33 lb.utils.events]:  iteration: 0/220 consumed_samples: 512 total_loss: 9.117 data_time: 0.0462 s/iter lr: 1.00e-08
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2022/10/25 13:51:33.710, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 100 %, 27 %, 12288 MiB, 551 MiB, 11502 MiB
2022/10/25 13:51:33.711, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 100 %, 20 %, 12288 MiB, 536 MiB, 11517 MiB
2022/10/25 13:51:33.712, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 100 %, 24 %, 12288 MiB, 571 MiB, 11482 MiB
2022/10/25 13:51:33.713, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 100 %, 12 %, 12288 MiB, 531 MiB, 11522 MiB
[10/25 13:51:39 lb.utils.events]:  eta: 0:20:32 iteration: 1/220 consumed_samples: 1024 total_loss: 9.118 data_time: 0.0355 s/iter lr: 3.34e-06
[10/25 13:51:43 lb.utils.events]:  eta: 0:15:32 iteration: 2/220 consumed_samples: 1536 total_loss: 9.117 time: 4.2995 s/iter data_time: 0.0335 s/iter total_throughput: 119.08 samples/s lr: 6.67e-06
[10/25 13:51:47 lb.utils.events]:  eta: 0:15:28 iteration: 3/220 consumed_samples: 2048 total_loss: 9.118 time: 4.2979 s/iter data_time: 0.0302 s/iter total_throughput: 119.13 samples/s lr: 1.00e-05
[10/25 13:51:52 lb.utils.events]:  eta: 0:15:24 iteration: 4/220 consumed_samples: 2560 total_loss: 9.118 time: 4.3182 s/iter data_time: 0.0293 s/iter total_throughput: 118.57 samples/s lr: 1.00e-05
[10/25 13:51:56 lb.utils.events]:  eta: 0:15:21 iteration: 5/220 consumed_samples: 3072 total_loss: 9.118 time: 4.3157 s/iter data_time: 0.0281 s/iter total_throughput: 118.64 samples/s lr: 1.00e-05
[10/25 13:52:00 lb.utils.events]:  eta: 0:15:15 iteration: 6/220 consumed_samples: 3584 total_loss: 9.117 time: 4.3042 s/iter data_time: 0.0272 s/iter total_throughput: 118.95 samples/s lr: 1.00e-05
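The GPU tables interleaved with the first iterations are nvidia-smi CSV output; the column set matches a --query-gpu=timestamp,name,driver_version,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used query with --format=csv, though the exact invocation used by this run is an assumption. Each 12288 MiB RTX 3080 Ti already sits near 11.5 GiB used at plan time, and utilization jumps to 100 % once iteration 0 executes. A hedged way to reproduce the same CSV from Python:

import subprocess

fields = ("timestamp,name,driver_version,utilization.gpu,utilization.memory,"
          "memory.total,memory.free,memory.used")
# Standard nvidia-smi query flags; prints one header line plus one row per GPU.
print(subprocess.run(["nvidia-smi", "--query-gpu=" + fields, "--format=csv"],
                     capture_output=True, text=True).stdout)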
[10/25 13:52:05 lb.utils.events]:  eta: 0:15:11 iteration: 7/220 consumed_samples: 4096 total_loss: 9.117 time: 4.2800 s/iter data_time: 0.0264 s/iter total_throughput: 119.63 samples/s lr: 1.00e-05
[10/25 13:52:09 lb.utils.events]:  eta: 0:15:06 iteration: 8/220 consumed_samples: 4608 total_loss: 9.117 time: 4.2671 s/iter data_time: 0.0259 s/iter total_throughput: 119.99 samples/s lr: 1.00e-05
[10/25 13:52:13 lb.utils.events]:  eta: 0:14:58 iteration: 9/220 consumed_samples: 5120 total_loss: 9.094 time: 4.2644 s/iter data_time: 0.0253 s/iter total_throughput: 120.06 samples/s lr: 1.00e-05
[10/25 13:52:17 lb.utils.events]:  eta: 0:14:57 iteration: 10/220 consumed_samples: 5632 total_loss: 9.072 time: 4.2692 s/iter data_time: 0.0248 s/iter total_throughput: 119.93 samples/s lr: 1.00e-05
[10/25 13:52:22 lb.utils.events]:  eta: 0:14:53 iteration: 11/220 consumed_samples: 6144 total_loss: 9.072 time: 4.2767 s/iter data_time: 0.0245 s/iter total_throughput: 119.72 samples/s lr: 1.00e-05
[10/25 13:52:26 lb.utils.events]:  eta: 0:14:49 iteration: 12/220 consumed_samples: 6656 total_loss: 9.071 time: 4.2827 s/iter data_time: 0.0242 s/iter total_throughput: 119.55 samples/s lr: 1.00e-05
[10/25 13:52:30 lb.utils.events]:  eta: 0:14:45 iteration: 13/220 consumed_samples: 7168 total_loss: 9.029 time: 4.2829 s/iter data_time: 0.0240 s/iter total_throughput: 119.54 samples/s lr: 1.00e-05
[10/25 13:52:35 lb.utils.events]:  eta: 0:14:41 iteration: 14/220 consumed_samples: 7680 total_loss: 8.987 time: 4.2848 s/iter data_time: 0.0237 s/iter total_throughput: 119.49 samples/s lr: 1.00e-05
[10/25 13:52:39 lb.utils.events]:  eta: 0:14:37 iteration: 15/220 consumed_samples: 8192 total_loss: 8.922 time: 4.2896 s/iter data_time: 0.0235 s/iter total_throughput: 119.36 samples/s lr: 1.00e-05
[10/25 13:52:43 lb.utils.events]:  eta: 0:14:34 iteration: 16/220 consumed_samples: 8704 total_loss: 8.858 time: 4.2935 s/iter data_time: 0.0234 s/iter total_throughput: 119.25 samples/s lr: 1.00e-05
[10/25 13:52:48 lb.utils.events]:  eta: 0:14:30 iteration: 17/220 consumed_samples: 9216 total_loss: 8.796 time: 4.2969 s/iter data_time: 0.0241 s/iter total_throughput: 119.16 samples/s lr: 1.00e-05
[10/25 13:52:52 lb.utils.events]:  eta: 0:14:25 iteration: 18/220 consumed_samples: 9728 total_loss: 8.734 time: 4.2974 s/iter data_time: 0.0239 s/iter total_throughput: 119.14 samples/s lr: 1.00e-05
[10/25 13:52:56 lb.utils.events]:  eta: 0:14:21 iteration: 19/220 consumed_samples: 10240 total_loss: 8.672 time: 4.2991 s/iter data_time: 0.0237 s/iter total_throughput: 119.09 samples/s lr: 1.00e-05
[10/25 13:53:01 lb.utils.events]:  eta: 0:14:17 iteration: 20/220 consumed_samples: 10752 total_loss: 8.611 time: 4.2992 s/iter data_time: 0.0226 s/iter total_throughput: 119.09 samples/s lr: 1.00e-05
[10/25 13:53:05 lb.utils.events]:  eta: 0:14:12 iteration: 21/220 consumed_samples: 11264 total_loss: 8.564 time: 4.2971 s/iter data_time: 0.0224 s/iter total_throughput: 119.15 samples/s lr: 1.00e-05
[10/25 13:53:09 lb.utils.events]:  eta: 0:14:08 iteration: 22/220 consumed_samples: 11776 total_loss: 8.516 time: 4.2972 s/iter data_time: 0.0220 s/iter total_throughput: 119.15 samples/s lr: 1.00e-05
[10/25 13:53:13 lb.utils.events]:  eta: 0:14:04 iteration: 23/220 consumed_samples: 12288 total_loss: 8.46 time: 4.2978 s/iter data_time: 0.0221 s/iter total_throughput: 119.13 samples/s lr: 1.00e-05
[10/25 13:53:18 lb.utils.events]:  eta: 0:13:59 iteration: 24/220 consumed_samples: 12800 total_loss: 8.404 time: 4.3010 s/iter data_time: 0.0219 s/iter total_throughput: 119.04 samples/s lr: 1.00e-05
[10/25 13:53:22 lb.utils.events]:  eta: 0:13:55 iteration: 25/220 consumed_samples: 13312 total_loss: 8.368 time: 4.3023 s/iter data_time: 0.0219 s/iter total_throughput: 119.01 samples/s lr: 1.00e-05
[10/25 13:53:26 lb.utils.events]:  eta: 0:13:51 iteration: 26/220 consumed_samples: 13824 total_loss: 8.333 time: 4.3037 s/iter data_time: 0.0219 s/iter total_throughput: 118.97 samples/s lr: 1.00e-05
[10/25 13:53:31 lb.utils.events]:  eta: 0:13:47 iteration: 27/220 consumed_samples: 14336 total_loss: 8.305 time: 4.3064 s/iter data_time: 0.0219 s/iter total_throughput: 118.89 samples/s lr: 1.00e-05
[10/25 13:53:35 lb.utils.events]:  eta: 0:13:42 iteration: 28/220 consumed_samples: 14848 total_loss: 8.277 time: 4.3078 s/iter data_time: 0.0219 s/iter total_throughput: 118.85 samples/s lr: 1.00e-05
[10/25 13:53:39 lb.utils.events]:  eta: 0:13:38 iteration: 29/220 consumed_samples: 15360 total_loss: 8.249 time: 4.3079 s/iter data_time: 0.0221 s/iter total_throughput: 118.85 samples/s lr: 1.00e-05
[10/25 13:53:44 lb.utils.events]:  eta: 0:13:34 iteration: 30/220 consumed_samples: 15872 total_loss: 8.222 time: 4.3095 s/iter data_time: 0.0222 s/iter total_throughput: 118.81 samples/s lr: 1.00e-05
[10/25 13:53:48 lb.utils.events]:  eta: 0:13:30 iteration: 31/220 consumed_samples: 16384 total_loss: 8.209 time: 4.3106 s/iter data_time: 0.0225 s/iter total_throughput: 118.78 samples/s lr: 1.00e-05
[10/25 13:53:53 lb.utils.events]:  eta: 0:13:26 iteration: 32/220 consumed_samples: 16896 total_loss: 8.195 time: 4.3120 s/iter data_time: 0.0227 s/iter total_throughput: 118.74 samples/s lr: 1.00e-05
[10/25 13:53:57 lb.utils.events]:  eta: 0:13:23 iteration: 33/220 consumed_samples: 17408 total_loss: 8.178 time: 4.3132 s/iter data_time: 0.0230 s/iter total_throughput: 118.70 samples/s lr: 1.00e-05
[10/25 13:54:01 lb.utils.events]:  eta: 0:13:20 iteration: 34/220 consumed_samples: 17920 total_loss: 8.162 time: 4.3140 s/iter data_time: 0.0231 s/iter total_throughput: 118.68 samples/s lr: 1.00e-05
[10/25 13:54:06 lb.utils.events]:  eta: 0:13:16 iteration: 35/220 consumed_samples: 18432 total_loss: 8.154 time: 4.3154 s/iter data_time: 0.0239 s/iter total_throughput: 118.65 samples/s lr: 1.00e-05
[10/25 13:54:10 lb.utils.events]:  eta: 0:13:12 iteration: 36/220 consumed_samples: 18944 total_loss: 8.146 time: 4.3162 s/iter data_time: 0.0239 s/iter total_throughput: 118.62 samples/s lr: 1.00e-05
[10/25 13:54:14 lb.utils.events]:  eta: 0:13:08 iteration: 37/220 consumed_samples: 19456 total_loss: 8.137 time: 4.3176 s/iter data_time: 0.0232 s/iter total_throughput: 118.58 samples/s lr: 1.00e-05
[10/25 13:54:19 lb.utils.events]:  eta: 0:13:04 iteration: 38/220 consumed_samples: 19968 total_loss: 8.128 time: 4.3181 s/iter data_time: 0.0235 s/iter total_throughput: 118.57 samples/s lr: 1.00e-05
[10/25 13:54:23 lb.utils.events]:  eta: 0:13:00 iteration: 39/220 consumed_samples: 20480 total_loss: 8.117 time: 4.3188 s/iter data_time: 0.0238 s/iter total_throughput: 118.55 samples/s lr: 1.00e-05
[10/25 13:54:27 lb.utils.events]:  eta: 0:12:56 iteration: 40/220 consumed_samples: 20992 total_loss: 8.107 time: 4.3197 s/iter data_time: 0.0237 s/iter total_throughput: 118.53 samples/s lr: 1.00e-05
[10/25 13:54:32 lb.utils.events]:  eta: 0:12:51 iteration: 41/220 consumed_samples: 21504 total_loss: 8.103 time: 4.3199 s/iter data_time: 0.0237 s/iter total_throughput: 118.52 samples/s lr: 1.00e-05
[10/25 13:54:36 lb.utils.events]:  eta: 0:12:47 iteration: 42/220 consumed_samples: 22016 total_loss: 8.099 time: 4.3208 s/iter data_time: 0.0240 s/iter total_throughput: 118.50 samples/s lr: 1.00e-05
[10/25 13:54:40 lb.utils.events]:  eta: 0:12:43 iteration: 43/220 consumed_samples: 22528 total_loss: 8.097 time: 4.3219 s/iter data_time: 0.0238 s/iter total_throughput: 118.47 samples/s lr: 1.00e-05
[10/25 13:54:45 lb.utils.events]:  eta: 0:12:39 iteration: 44/220 consumed_samples: 23040 total_loss: 8.096 time: 4.3225 s/iter data_time: 0.0436 s/iter total_throughput: 118.45 samples/s lr: 1.00e-05
[10/25 13:54:49 lb.utils.events]:  eta: 0:12:34 iteration: 45/220 consumed_samples: 23552 total_loss: 8.091 time: 4.3224 s/iter data_time: 0.0436 s/iter total_throughput: 118.45 samples/s lr: 1.00e-05
[10/25 13:54:53 lb.utils.events]:  eta: 0:12:30 iteration: 46/220 consumed_samples: 24064 total_loss: 8.087 time: 4.3222 s/iter data_time: 0.0439 s/iter total_throughput: 118.46 samples/s lr: 1.00e-05
[10/25 13:54:58 lb.utils.events]:  eta: 0:12:26 iteration: 47/220 consumed_samples: 24576 total_loss: 8.082 time: 4.3226 s/iter data_time: 0.0439 s/iter total_throughput: 118.45 samples/s lr: 1.00e-05
[10/25 13:55:02 lb.utils.events]:  eta: 0:12:21 iteration: 48/220 consumed_samples: 25088 total_loss: 8.078 time: 4.3235 s/iter data_time: 0.0439 s/iter total_throughput: 118.42 samples/s lr: 1.00e-05
[10/25 13:55:06 lb.utils.events]:  eta: 0:12:17 iteration: 49/220 consumed_samples: 25600 total_loss: 8.076 time: 4.3240 s/iter data_time: 0.0437 s/iter total_throughput: 118.41 samples/s lr: 1.00e-05
[10/25 13:55:11 lb.utils.events]:  eta: 0:12:13 iteration: 50/220 consumed_samples: 26112 total_loss: 8.074 time: 4.3248 s/iter data_time: 0.0437 s/iter total_throughput: 118.39 samples/s lr: 1.00e-05
[10/25 13:55:15 lb.utils.events]:  eta: 0:12:09 iteration: 51/220 consumed_samples: 26624 total_loss: 8.07 time: 4.3255 s/iter data_time: 0.0434 s/iter total_throughput: 118.37 samples/s lr: 1.00e-05
[10/25 13:55:20 lb.utils.events]:  eta: 0:12:05 iteration: 52/220 consumed_samples: 27136 total_loss: 8.066 time: 4.3253 s/iter data_time: 0.0432 s/iter total_throughput: 118.37 samples/s lr: 1.00e-05
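The steady-state numbers are internally consistent with the configured micro-batch (16), accumulation steps (8) and data-parallel size (4): consumed_samples advances by 512 per iteration because the effective global batch is 16 × 8 × 4, which is also why the traced global batch dim is 64 = 16 × 4 before the split(dim=0) sharding, and 512 samples every ~4.32 s gives the ~118.5 samples/s reported as total_throughput. As a quick check:

micro_batch, accum_steps, dp_size = 16, 8, 4     # values from the config dump
global_batch = micro_batch * accum_steps * dp_size
assert global_batch == 512                       # consumed_samples step in the events above
print(global_batch / 4.32)                       # ~118.5 samples/s, matching total_throughput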