[03/05 18:59:41 libai]: Rank of current process: 0. World size: 8
[03/05 18:59:41 libai]: Command line arguments: Namespace(config_file='configs/swin_imagenet.py', eval_only=False, fast_dev_run=False, opts=['model.cfg.hidden_dropout_prob=0.1', 'model.cfg.attention_probs_dropout_prob=0.1', 'model.cfg.bias_dropout_fusion=true', 'model.cfg.hidden_layers=12', 'model.cfg.hidden_size=768', 'model.cfg.num_attention_heads=12', 'model.cfg.intermediate_size=3072', 'model.cfg.ffn_hidden_size=3072', 'model.cfg.head_size=64', 'graph.enabled=true', 'train.dist.pipeline_num_layers=12', 'train.train_micro_batch_size=64', 'train.global_batch_size=2048', 'train.dist.tensor_parallel_size=2', 'train.dist.pipeline_parallel_size=1', 'train.amp.enabled=true', 'train.activation_checkpoint.enabled=true', 'train.num_accumulation_steps=8', 'train.evaluation.enabled=false', 'train.train_iter=220', 'train.train_epoch=0', 'train.log_period=100', 'train.zero_optimization.enabled=true', 'train.zero_optimization.stage=2', 'train.load_weight=', 'train.output_dir=test_logs/oneflow-28/NVIDIA_GeForce_RTX_3080_Ti/7d07caf/LibAI_swin_imagenet_graph_nl12_nah12_hs768_FP16_actrue_DP4_MP2_PP1_zerotrue_stage2_mbs64_gbs2048_acc8_1n8g'], resume=False)
[03/05 18:59:41 libai]: Contents of args.config_file=configs/swin_imagenet.py:
[38;5;197mfrom[39m[38;5;15m [39m[38;5;15mlibai[39m[38;5;15m.[39m[38;5;15mconfig[39m[38;5;15m [39m[38;5;197mimport[39m[38;5;15m [39m[38;5;15mLazyCall[39m
[38;5;197mfrom[39m[38;5;15m [39m[38;5;15m.[39m[38;5;15mcommon[39m[38;5;15m.[39m[38;5;15mmodels[39m[38;5;15m.[39m[38;5;15mswin[39m[38;5;15m.[39m[38;5;15mswin_tiny_patch4_window7_224[39m[38;5;15m [39m[38;5;197mimport[39m[38;5;15m [39m[38;5;15mmodel[39m
[38;5;197mfrom[39m[38;5;15m [39m[38;5;15m.[39m[38;5;15mcommon[39m[38;5;15m.[39m[38;5;15mmodels[39m[38;5;15m.[39m[38;5;15mgraph[39m[38;5;15m [39m[38;5;197mimport[39m[38;5;15m [39m[38;5;15mgraph[39m
[38;5;197mfrom[39m[38;5;15m [39m[38;5;15m.[39m[38;5;15mcommon[39m[38;5;15m.[39m[38;5;15mtrain[39m[38;5;15m [39m[38;5;197mimport[39m[38;5;15m [39m[38;5;15mtrain[39m
[38;5;197mfrom[39m[38;5;15m [39m[38;5;15m.[39m[38;5;15mcommon[39m[38;5;15m.[39m[38;5;15moptim[39m[38;5;15m [39m[38;5;197mimport[39m[38;5;15m [39m[38;5;15moptim[39m
[38;5;197mfrom[39m[38;5;15m [39m[38;5;15m.[39m[38;5;15mcommon[39m[38;5;15m.[39m[38;5;15mdata[39m[38;5;15m.[39m[38;5;15mimagenet[39m[38;5;15m [39m[38;5;197mimport[39m[38;5;15m [39m[38;5;15mdataloader[39m

[38;5;197mfrom[39m[38;5;15m [39m[38;5;15mflowvision[39m[38;5;15m.[39m[38;5;15mdata[39m[38;5;15m [39m[38;5;197mimport[39m[38;5;15m [39m[38;5;15mMixup[39m
[38;5;197mfrom[39m[38;5;15m [39m[38;5;15mflowvision[39m[38;5;15m.[39m[38;5;15mloss[39m[38;5;15m.[39m[38;5;15mcross_entropy[39m[38;5;15m [39m[38;5;197mimport[39m[38;5;15m [39m[38;5;15mSoftTargetCrossEntropy[39m

[38;5;242m# Refine data path to imagenet[39m
[38;5;15mdataloader[39m[38;5;197m.[39m[38;5;15mtrain[39m[38;5;197m.[39m[38;5;15mdataset[39m[38;5;15m[[39m[38;5;141m0[39m[38;5;15m][39m[38;5;197m.[39m[38;5;15mroot[39m[38;5;15m [39m[38;5;197m=[39m[38;5;15m [39m[38;5;186m"[39m[38;5;186m/ssd/dataset/ImageNet/extract[39m[38;5;186m"[39m
[38;5;15mdataloader[39m[38;5;197m.[39m[38;5;15mtest[39m[38;5;15m[[39m[38;5;141m0[39m[38;5;15m][39m[38;5;197m.[39m[38;5;15mdataset[39m[38;5;197m.[39m[38;5;15mroot[39m[38;5;15m [39m[38;5;197m=[39m[38;5;15m [39m[38;5;186m"[39m[38;5;186m/ssd/dataset/ImageNet/extract[39m[38;5;186m"[39m

[38;5;242m# Add Mixup Func[39m
[38;5;15mdataloader[39m[38;5;197m.[39m[38;5;15mtrain[39m[38;5;197m.[39m[38;5;15mmixup_func[39m[38;5;15m [39m[38;5;197m=[39m[38;5;15m [39m[38;5;15mLazyCall[39m[38;5;15m([39m[38;5;15mMixup[39m[38;5;15m)[39m[38;5;15m([39m
[38;5;15m    [39m[38;5;15mmixup_alpha[39m[38;5;197m=[39m[38;5;141m0.8[39m[38;5;15m,[39m
[38;5;15m    [39m[38;5;15mcutmix_alpha[39m[38;5;197m=[39m[38;5;141m1.0[39m[38;5;15m,[39m
[38;5;15m    [39m[38;5;15mprob[39m[38;5;197m=[39m[38;5;141m1.0[39m[38;5;15m,[39m
[38;5;15m    [39m[38;5;15mswitch_prob[39m[38;5;197m=[39m[38;5;141m0.5[39m[38;5;15m,[39m
[38;5;15m    [39m[38;5;15mmode[39m[38;5;197m=[39m[38;5;186m"[39m[38;5;186mbatch[39m[38;5;186m"[39m[38;5;15m,[39m
[38;5;15m    [39m[38;5;15mnum_classes[39m[38;5;197m=[39m[38;5;141m1000[39m[38;5;15m,[39m
[38;5;15m)[39m

[38;5;242m# Refine model cfg for vit training on imagenet[39m
[38;5;15mmodel[39m[38;5;197m.[39m[38;5;15mcfg[39m[38;5;197m.[39m[38;5;15mnum_classes[39m[38;5;15m [39m[38;5;197m=[39m[38;5;15m [39m[38;5;141m1000[39m
[38;5;15mmodel[39m[38;5;197m.[39m[38;5;15mcfg[39m[38;5;197m.[39m[38;5;15mloss_func[39m[38;5;15m [39m[38;5;197m=[39m[38;5;15m [39m[38;5;15mSoftTargetCrossEntropy[39m[38;5;15m([39m[38;5;15m)[39m
[38;5;242m# Refine optimizer cfg for vit model[39m
[38;5;15moptim[39m[38;5;197m.[39m[38;5;15mlr[39m[38;5;15m [39m[38;5;197m=[39m[38;5;15m [39m[38;5;141m1e-3[39m
[38;5;15moptim[39m[38;5;197m.[39m[38;5;15meps[39m[38;5;15m [39m[38;5;197m=[39m[38;5;15m [39m[38;5;141m1e-8[39m
[38;5;15moptim[39m[38;5;197m.[39m[38;5;15mweight_decay[39m[38;5;15m [39m[38;5;197m=[39m[38;5;15m [39m[38;5;141m0.05[39m
[38;5;15moptim[39m[38;5;197m.[39m[38;5;15mparams[39m[38;5;197m.[39m[38;5;15mclip_grad_max_norm[39m[38;5;15m [39m[38;5;197m=[39m[38;5;15m [39m[38;5;81mNone[39m
[38;5;15moptim[39m[38;5;197m.[39m[38;5;15mparams[39m[38;5;197m.[39m[38;5;15mclip_grad_norm_type[39m[38;5;15m [39m[38;5;197m=[39m[38;5;15m [39m[38;5;81mNone[39m

[38;5;242m# Refine train cfg for vit model[39m
[38;5;15mtrain[39m[38;5;197m.[39m[38;5;15mtrain_micro_batch_size[39m[38;5;15m [39m[38;5;197m=[39m[38;5;15m [39m[38;5;141m128[39m
[38;5;15mtrain[39m[38;5;197m.[39m[38;5;15mtest_micro_batch_size[39m[38;5;15m [39m[38;5;197m=[39m[38;5;15m [39m[38;5;141m128[39m
[38;5;15mtrain[39m[38;5;197m.[39m[38;5;15mtrain_epoch[39m[38;5;15m [39m[38;5;197m=[39m[38;5;15m [39m[38;5;141m300[39m
[38;5;15mtrain[39m[38;5;197m.[39m[38;5;15mwarmup_ratio[39m[38;5;15m [39m[38;5;197m=[39m[38;5;15m [39m[38;5;141m20[39m[38;5;15m [39m[38;5;197m/[39m[38;5;15m [39m[38;5;141m300[39m
[38;5;15mtrain[39m[38;5;197m.[39m[38;5;15meval_period[39m[38;5;15m [39m[38;5;197m=[39m[38;5;15m [39m[38;5;141m1562[39m
[38;5;15mtrain[39m[38;5;197m.[39m[38;5;15mlog_period[39m[38;5;15m [39m[38;5;197m=[39m[38;5;15m [39m[38;5;141m100[39m

[38;5;242m# Scheduler[39m
[38;5;15mtrain[39m[38;5;197m.[39m[38;5;15mscheduler[39m[38;5;197m.[39m[38;5;15mwarmup_factor[39m[38;5;15m [39m[38;5;197m=[39m[38;5;15m [39m[38;5;141m0.001[39m
[38;5;15mtrain[39m[38;5;197m.[39m[38;5;15mscheduler[39m[38;5;197m.[39m[38;5;15malpha[39m[38;5;15m [39m[38;5;197m=[39m[38;5;15m [39m[38;5;141m0.01[39m
[38;5;15mtrain[39m[38;5;197m.[39m[38;5;15mscheduler[39m[38;5;197m.[39m[38;5;15mwarmup_method[39m[38;5;15m [39m[38;5;197m=[39m[38;5;15m [39m[38;5;186m"[39m[38;5;186mlinear[39m[38;5;186m"[39m

[38;5;242m# Set fp16 ON[39m
[38;5;15mtrain[39m[38;5;197m.[39m[38;5;15mamp[39m[38;5;197m.[39m[38;5;15menabled[39m[38;5;15m [39m[38;5;197m=[39m[38;5;15m [39m[38;5;81mTrue[39m

[03/05 18:59:41 libai]: Full config saved to test_logs/oneflow-28/NVIDIA_GeForce_RTX_3080_Ti/7d07caf/LibAI_swin_imagenet_graph_nl12_nah12_hs768_FP16_actrue_DP4_MP2_PP1_zerotrue_stage2_mbs64_gbs2048_acc8_1n8g/config.yaml
[03/05 18:59:41 lb.engine.default]: > compiling dataset index builder ...
make: Entering directory '/ssd/home/ouyangyu/libai_week_test/libai/libai/data/data_utils'
make: Nothing to be done for 'default'.
make: Leaving directory '/ssd/home/ouyangyu/libai_week_test/libai/libai/data/data_utils'
[03/05 18:59:41 lb.engine.default]: >>> done with dataset index builder. Compilation time: 0.053 seconds
[03/05 18:59:41 lb.engine.default]: >>> done with compiling. Compilation time: 0.055 seconds
[03/05 18:59:41 lb.engine.default]: Prepare training, validating, testing set
[03/05 18:59:45 lb.engine.default]: Prepare testing set
[03/05 18:59:54 lb.engine.default]: Auto-scaling the config to train.train_iter=220, train.warmup_iter=15
[03/05 18:59:54 libai]: > Start building model...
W20230305 18:59:57.695628 1950954 eager_local_op_interpreter.cpp:256] Casting a local tensor to a global tensor with Broadcast sbp will modify the data of input! If you want to keep the input local tensor unchanged, please set the arg copy to True.
[03/05 18:59:59 lb.engine.default]: Model:
SwinTransformer(
  (patch_embed): PatchEmbed(
    (proj): Conv2d(3, 96, kernel_size=(4, 4), stride=(4, 4))
    (norm): LayerNorm((96,), eps=1e-05, elementwise_affine=True)
  )
  (pos_drop): Dropout(p=0.0, inplace=False)
  (layers): ModuleList(
    (0): BasicLayer(
      (blocks): ModuleList(
        (0): SwinTransformerBlock(
          (norm1): LayerNorm((96,), eps=1e-05, elementwise_affine=True)
          (attn): WindowAttention(
            (qkv): Linear1D(in_features=96, out_features=288, bias=True, parallel=data)
            (attn_drop): Dropout(p=0.0, inplace=False)
            (proj): Linear1D(in_features=96, out_features=96, bias=True, parallel=data)
            (proj_drop): Dropout(p=0.0, inplace=False)
            (softmax): Softmax(dim=-1)
          )
          (drop_path): Identity()
          (norm2): LayerNorm((96,), eps=1e-05, elementwise_affine=True)
          (mlp): MLP(
            bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.0
            (dense_h_to_4h): Linear1D(in_features=96, out_features=384, bias=True, parallel=col)
            (dense_4h_to_h): Linear1D(in_features=384, out_features=96, bias=True, parallel=row)
          )
        )
        (1): SwinTransformerBlock(
          (norm1): LayerNorm((96,), eps=1e-05, elementwise_affine=True)
          (attn): WindowAttention(
            (qkv): Linear1D(in_features=96, out_features=288, bias=True, parallel=data)
            (attn_drop): Dropout(p=0.0, inplace=False)
            (proj): Linear1D(in_features=96, out_features=96, bias=True, parallel=data)
            (proj_drop): Dropout(p=0.0, inplace=False)
            (softmax): Softmax(dim=-1)
          )
          (drop_path): DropPath()
          (norm2): LayerNorm((96,), eps=1e-05, elementwise_affine=True)
          (mlp): MLP(
            bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.0
            (dense_h_to_4h): Linear1D(in_features=96, out_features=384, bias=True, parallel=col)
            (dense_4h_to_h): Linear1D(in_features=384, out_features=96, bias=True, parallel=row)
          )
        )
      )
      (downsample): PatchMerging(
        (reduction): Linear1D(in_features=384, out_features=192, bias=False, parallel=data)
        (norm): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
      )
    )
    (1): BasicLayer(
      (blocks): ModuleList(
        (0): SwinTransformerBlock(
          (norm1): LayerNorm((192,), eps=1e-05, elementwise_affine=True)
          (attn): WindowAttention(
            (qkv): Linear1D(in_features=192, out_features=576, bias=True, parallel=data)
            (attn_drop): Dropout(p=0.0, inplace=False)
            (proj): Linear1D(in_features=192, out_features=192, bias=True, parallel=data)
            (proj_drop): Dropout(p=0.0, inplace=False)
            (softmax): Softmax(dim=-1)
          )
          (drop_path): DropPath()
          (norm2): LayerNorm((192,), eps=1e-05, elementwise_affine=True)
          (mlp): MLP(
            bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.0
            (dense_h_to_4h): Linear1D(in_features=192, out_features=768, bias=True, parallel=col)
            (dense_4h_to_h): Linear1D(in_features=768, out_features=192, bias=True, parallel=row)
          )
        )
        (1): SwinTransformerBlock(
          (norm1): LayerNorm((192,), eps=1e-05, elementwise_affine=True)
          (attn): WindowAttention(
            (qkv): Linear1D(in_features=192, out_features=576, bias=True, parallel=data)
            (attn_drop): Dropout(p=0.0, inplace=False)
            (proj): Linear1D(in_features=192, out_features=192, bias=True, parallel=data)
            (proj_drop): Dropout(p=0.0, inplace=False)
            (softmax): Softmax(dim=-1)
          )
          (drop_path): DropPath()
          (norm2): LayerNorm((192,), eps=1e-05, elementwise_affine=True)
          (mlp): MLP(
            bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.0
            (dense_h_to_4h): Linear1D(in_features=192, out_features=768, bias=True, parallel=col)
            (dense_4h_to_h): Linear1D(in_features=768, out_features=192, bias=True, parallel=row)
          )
        )
      )
      (downsample): PatchMerging(
        (reduction): Linear1D(in_features=768, out_features=384, bias=False, parallel=data)
        (norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      )
    )
    (2): BasicLayer(
      (blocks): ModuleList(
        (0): SwinTransformerBlock(
          (norm1): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
          (attn): WindowAttention(
            (qkv): Linear1D(in_features=384, out_features=1152, bias=True, parallel=data)
            (attn_drop): Dropout(p=0.0, inplace=False)
            (proj): Linear1D(in_features=384, out_features=384, bias=True, parallel=data)
            (proj_drop): Dropout(p=0.0, inplace=False)
            (softmax): Softmax(dim=-1)
          )
          (drop_path): DropPath()
          (norm2): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
          (mlp): MLP(
            bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.0
            (dense_h_to_4h): Linear1D(in_features=384, out_features=1536, bias=True, parallel=col)
            (dense_4h_to_h): Linear1D(in_features=1536, out_features=384, bias=True, parallel=row)
          )
        )
        (1): SwinTransformerBlock(
          (norm1): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
          (attn): WindowAttention(
            (qkv): Linear1D(in_features=384, out_features=1152, bias=True, parallel=data)
            (attn_drop): Dropout(p=0.0, inplace=False)
            (proj): Linear1D(in_features=384, out_features=384, bias=True, parallel=data)
            (proj_drop): Dropout(p=0.0, inplace=False)
            (softmax): Softmax(dim=-1)
          )
          (drop_path): DropPath()
          (norm2): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
          (mlp): MLP(
            bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.0
            (dense_h_to_4h): Linear1D(in_features=384, out_features=1536, bias=True, parallel=col)
            (dense_4h_to_h): Linear1D(in_features=1536, out_features=384, bias=True, parallel=row)
          )
        )
        (2): SwinTransformerBlock(
          (norm1): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
          (attn): WindowAttention(
            (qkv): Linear1D(in_features=384, out_features=1152, bias=True, parallel=data)
            (attn_drop): Dropout(p=0.0, inplace=False)
            (proj): Linear1D(in_features=384, out_features=384, bias=True, parallel=data)
            (proj_drop): Dropout(p=0.0, inplace=False)
            (softmax): Softmax(dim=-1)
          )
          (drop_path): DropPath()
          (norm2): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
          (mlp): MLP(
            bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.0
            (dense_h_to_4h): Linear1D(in_features=384, out_features=1536, bias=True, parallel=col)
            (dense_4h_to_h): Linear1D(in_features=1536, out_features=384, bias=True, parallel=row)
          )
        )
        (3): SwinTransformerBlock(
          (norm1): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
          (attn): WindowAttention(
            (qkv): Linear1D(in_features=384, out_features=1152, bias=True, parallel=data)
            (attn_drop): Dropout(p=0.0, inplace=False)
            (proj): Linear1D(in_features=384, out_features=384, bias=True, parallel=data)
            (proj_drop): Dropout(p=0.0, inplace=False)
            (softmax): Softmax(dim=-1)
          )
          (drop_path): DropPath()
          (norm2): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
          (mlp): MLP(
            bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.0
            (dense_h_to_4h): Linear1D(in_features=384, out_features=1536, bias=True, parallel=col)
            (dense_4h_to_h): Linear1D(in_features=1536, out_features=384, bias=True, parallel=row)
          )
        )
        (4): SwinTransformerBlock(
          (norm1): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
          (attn): WindowAttention(
            (qkv): Linear1D(in_features=384, out_features=1152, bias=True, parallel=data)
            (attn_drop): Dropout(p=0.0, inplace=False)
            (proj): Linear1D(in_features=384, out_features=384, bias=True, parallel=data)
            (proj_drop): Dropout(p=0.0, inplace=False)
            (softmax): Softmax(dim=-1)
          )
          (drop_path): DropPath()
          (norm2): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
          (mlp): MLP(
            bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.0
            (dense_h_to_4h): Linear1D(in_features=384, out_features=1536, bias=True, parallel=col)
            (dense_4h_to_h): Linear1D(in_features=1536, out_features=384, bias=True, parallel=row)
          )
        )
        (5): SwinTransformerBlock(
          (norm1): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
          (attn): WindowAttention(
            (qkv): Linear1D(in_features=384, out_features=1152, bias=True, parallel=data)
            (attn_drop): Dropout(p=0.0, inplace=False)
            (proj): Linear1D(in_features=384, out_features=384, bias=True, parallel=data)
            (proj_drop): Dropout(p=0.0, inplace=False)
            (softmax): Softmax(dim=-1)
          )
          (drop_path): DropPath()
          (norm2): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
          (mlp): MLP(
            bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.0
            (dense_h_to_4h): Linear1D(in_features=384, out_features=1536, bias=True, parallel=col)
            (dense_4h_to_h): Linear1D(in_features=1536, out_features=384, bias=True, parallel=row)
          )
        )
      )
      (downsample): PatchMerging(
        (reduction): Linear1D(in_features=1536, out_features=768, bias=False, parallel=data)
        (norm): LayerNorm((1536,), eps=1e-05, elementwise_affine=True)
      )
    )
    (3): BasicLayer(
      (blocks): ModuleList(
        (0): SwinTransformerBlock(
          (norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (attn): WindowAttention(
            (qkv): Linear1D(in_features=768, out_features=2304, bias=True, parallel=data)
            (attn_drop): Dropout(p=0.0, inplace=False)
            (proj): Linear1D(in_features=768, out_features=768, bias=True, parallel=data)
            (proj_drop): Dropout(p=0.0, inplace=False)
            (softmax): Softmax(dim=-1)
          )
          (drop_path): DropPath()
          (norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (mlp): MLP(
            bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.0
            (dense_h_to_4h): Linear1D(in_features=768, out_features=3072, bias=True, parallel=col)
            (dense_4h_to_h): Linear1D(in_features=3072, out_features=768, bias=True, parallel=row)
          )
        )
        (1): SwinTransformerBlock(
          (norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (attn): WindowAttention(
            (qkv): Linear1D(in_features=768, out_features=2304, bias=True, parallel=data)
            (attn_drop): Dropout(p=0.0, inplace=False)
            (proj): Linear1D(in_features=768, out_features=768, bias=True, parallel=data)
            (proj_drop): Dropout(p=0.0, inplace=False)
            (softmax): Softmax(dim=-1)
          )
          (drop_path): DropPath()
          (norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (mlp): MLP(
            bias_gelu_fusion=True, bias_dropout_fusion=True, dropout=0.0
            (dense_h_to_4h): Linear1D(in_features=768, out_features=3072, bias=True, parallel=col)
            (dense_4h_to_h): Linear1D(in_features=3072, out_features=768, bias=True, parallel=row)
          )
        )
      )
    )
  )
  (norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  (avgpool): AdaptiveAvgPool1d()
  (head): Linear1D(in_features=768, out_features=1000, bias=True, parallel=data)
  (loss_func): SoftTargetCrossEntropy()
)
[03/05 18:59:59 libai]: >>> done with building model. Building time: 4.711 seconds
[03/05 18:59:59 lb.engine.trainer]: Starting training from iteration 0
[03/05 19:00:03 lb.models.utils.graph_base]: Start compiling the train graph which may take some time. Please wait for a moment ...
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2023/03/05 19:07:46.257, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 0 %, 0 %, 12288 MiB, 8178 MiB, 3875 MiB
2023/03/05 19:07:46.265, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 2 %, 0 %, 12288 MiB, 8561 MiB, 3492 MiB
2023/03/05 19:07:46.274, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 10 %, 1 %, 12288 MiB, 8561 MiB, 3492 MiB
2023/03/05 19:07:46.275, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 15 %, 1 %, 12288 MiB, 8561 MiB, 3492 MiB
2023/03/05 19:07:46.280, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 0 %, 0 %, 12288 MiB, 8178 MiB, 3875 MiB
2023/03/05 19:07:46.282, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 17 %, 1 %, 12288 MiB, 8561 MiB, 3492 MiB
2023/03/05 19:07:46.281, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 0 %, 0 %, 12288 MiB, 8178 MiB, 3875 MiB
2023/03/05 19:07:46.280, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 0 %, 0 %, 12288 MiB, 8178 MiB, 3875 MiB
2023/03/05 19:07:46.289, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 0 %, 0 %, 12288 MiB, 8178 MiB, 3875 MiB
2023/03/05 19:07:46.291, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 100 %, 0 %, 12288 MiB, 8561 MiB, 3492 MiB
2023/03/05 19:07:46.293, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 33 %, 1 %, 12288 MiB, 8561 MiB, 3492 MiB
2023/03/05 19:07:46.294, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 100 %, 0 %, 12288 MiB, 8561 MiB, 3492 MiB
2023/03/05 19:07:46.297, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 100 %, 0 %, 12288 MiB, 8561 MiB, 3492 MiB
2023/03/05 19:07:46.300, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 100 %, 0 %, 12288 MiB, 8561 MiB, 3492 MiB
2023/03/05 19:07:46.301, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 10 %, 1 %, 12288 MiB, 8561 MiB, 3492 MiB
2023/03/05 19:07:46.306, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 10 %, 1 %, 12288 MiB, 8561 MiB, 3492 MiB
2023/03/05 19:07:46.303, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 17 %, 0 %, 12288 MiB, 8561 MiB, 3492 MiB
2023/03/05 19:07:46.307, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 10 %, 1 %, 12288 MiB, 8561 MiB, 3492 MiB
2023/03/05 19:07:46.310, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 10 %, 1 %, 12288 MiB, 8561 MiB, 3492 MiB
2023/03/05 19:07:46.313, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 15 %, 1 %, 12288 MiB, 8561 MiB, 3492 MiB
2023/03/05 19:07:46.312, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 0 %, 0 %, 12288 MiB, 8178 MiB, 3875 MiB
2023/03/05 19:07:46.314, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 15 %, 1 %, 12288 MiB, 8561 MiB, 3492 MiB
2023/03/05 19:07:46.315, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 35 %, 1 %, 12288 MiB, 8561 MiB, 3492 MiB
2023/03/05 19:07:46.317, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 15 %, 1 %, 12288 MiB, 8561 MiB, 3492 MiB
2023/03/05 19:07:46.325, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 15 %, 1 %, 12288 MiB, 8561 MiB, 3492 MiB
2023/03/05 19:07:46.330, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 17 %, 1 %, 12288 MiB, 8561 MiB, 3492 MiB
2023/03/05 19:07:46.332, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 100 %, 0 %, 12288 MiB, 8561 MiB, 3492 MiB
2023/03/05 19:07:46.332, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 17 %, 1 %, 12288 MiB, 8561 MiB, 3492 MiB
2023/03/05 19:07:46.332, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 17 %, 1 %, 12288 MiB, 8561 MiB, 3492 MiB
2023/03/05 19:07:46.337, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 17 %, 1 %, 12288 MiB, 8561 MiB, 3492 MiB
2023/03/05 19:07:46.339, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 33 %, 1 %, 12288 MiB, 8561 MiB, 3492 MiB
2023/03/05 19:07:46.341, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 33 %, 1 %, 12288 MiB, 8561 MiB, 3492 MiB
2023/03/05 19:07:46.341, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 10 %, 1 %, 12288 MiB, 8561 MiB, 3492 MiB
2023/03/05 19:07:46.342, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 33 %, 1 %, 12288 MiB, 8561 MiB, 3492 MiB
2023/03/05 19:07:46.341, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 0 %, 0 %, 12288 MiB, 8178 MiB, 3875 MiB
2023/03/05 19:07:46.344, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 33 %, 1 %, 12288 MiB, 8561 MiB, 3492 MiB
2023/03/05 19:07:46.346, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 1 %, 1 %, 12288 MiB, 8561 MiB, 3492 MiB
2023/03/05 19:07:46.347, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 1 %, 1 %, 12288 MiB, 8561 MiB, 3492 MiB
2023/03/05 19:07:46.348, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 1 %, 1 %, 12288 MiB, 8561 MiB, 3492 MiB
2023/03/05 19:07:46.348, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 100 %, 0 %, 12288 MiB, 8561 MiB, 3492 MiB
2023/03/05 19:07:46.348, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 15 %, 1 %, 12288 MiB, 8561 MiB, 3492 MiB
2023/03/05 19:07:46.353, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 1 %, 1 %, 12288 MiB, 8561 MiB, 3492 MiB
2023/03/05 19:07:46.358, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 35 %, 1 %, 12288 MiB, 8561 MiB, 3492 MiB
2023/03/05 19:07:46.360, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 35 %, 1 %, 12288 MiB, 8561 MiB, 3492 MiB
2023/03/05 19:07:46.361, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 10 %, 1 %, 12288 MiB, 8561 MiB, 3492 MiB
2023/03/05 19:07:46.358, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 35 %, 1 %, 12288 MiB, 8561 MiB, 3492 MiB
2023/03/05 19:07:46.361, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 0 %, 0 %, 12288 MiB, 8561 MiB, 3492 MiB
2023/03/05 19:07:46.363, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 35 %, 1 %, 12288 MiB, 8561 MiB, 3492 MiB
2023/03/05 19:07:46.366, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 15 %, 1 %, 12288 MiB, 8561 MiB, 3492 MiB
2023/03/05 19:07:46.367, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 33 %, 1 %, 12288 MiB, 8561 MiB, 3492 MiB
2023/03/05 19:07:46.374, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 0 %, 0 %, 12288 MiB, 8561 MiB, 3492 MiB
2023/03/05 19:07:46.379, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 1 %, 1 %, 12288 MiB, 8561 MiB, 3492 MiB
2023/03/05 19:07:46.384, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 33 %, 1 %, 12288 MiB, 8561 MiB, 3492 MiB
2023/03/05 19:07:46.385, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 35 %, 1 %, 12288 MiB, 8561 MiB, 3492 MiB
2023/03/05 19:07:46.396, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 1 %, 1 %, 12288 MiB, 8561 MiB, 3492 MiB
2023/03/05 19:07:46.403, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 35 %, 1 %, 12288 MiB, 8561 MiB, 3492 MiB
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2023/03/05 19:07:49.467, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 41 %, 9 %, 12288 MiB, 8178 MiB, 3875 MiB
2023/03/05 19:07:49.469, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 97 %, 24 %, 12288 MiB, 8561 MiB, 3492 MiB
2023/03/05 19:07:49.470, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 64 %, 16 %, 12288 MiB, 8561 MiB, 3492 MiB
2023/03/05 19:07:49.477, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 41 %, 9 %, 12288 MiB, 8561 MiB, 3492 MiB
2023/03/05 19:07:49.492, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 88 %, 23 %, 12288 MiB, 8561 MiB, 3492 MiB
2023/03/05 19:07:49.493, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 55 %, 15 %, 12288 MiB, 8561 MiB, 3492 MiB
2023/03/05 19:07:49.494, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 71 %, 28 %, 12288 MiB, 8561 MiB, 3492 MiB
2023/03/05 19:07:49.501, NVIDIA GeForce RTX 3080 Ti, 515.65.01, 63 %, 17 %, 12288 MiB, 8561 MiB, 3492 MiB
[03/05 19:07:54 lb.utils.events]:  eta: 0:08:26  iteration: 99/220  consumed_samples: 204800  total_loss: 6.893  time: 4.3606 s/iter  data_time: 1.3422 s/iter total_throughput: 469.66 samples/s lr: 5.82e-04  
[03/05 19:15:45 lb.utils.events]:  eta: 0:01:31  iteration: 199/220  consumed_samples: 409600  total_loss: 6.865  time: 4.5405 s/iter  data_time: 1.3739 s/iter total_throughput: 451.05 samples/s lr: 3.21e-05  
[03/05 19:17:16 lb.utils.events]:  eta: 0:00:00  iteration: 219/220  consumed_samples: 450560  total_loss: 6.856  time: 4.5413 s/iter  data_time: 1.3660 s/iter total_throughput: 450.98 samples/s lr: 1.01e-05  
[03/05 19:17:16 lb.engine.hooks]: Overall training speed: 218 iterations in 0:16:29 (4.5413 s / it)
[03/05 19:17:16 lb.engine.hooks]: Total training time: 0:16:30 (0:00:00 on hooks)
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
oneflow-version(git_commit)=0.9.1.dev20230304+cu117
oneflow-commit(git_commit)=7d07caf
oneflow-libai(git_commit)=50a973dc5de635b8613ad7666c073c763e238850