

===============================================
Starting run 1/2 with 1 layers

Config:
  num_train_samples: 100000
  num_eval_samples: 10000
  sequence_length: 128
  num_jumps: 15
  embed_dim: 512
  ffn_dim: 2048
  num_heads: 8
  reorder_and_upcast_attn: True
  scale_attn_by_inverse_layer_idx: False
  scale_attn_weights: True
  epochs: 30
  warmup_steps: 7820
  lr: 0.0003
  betas: (0.9, 0.98)
  batch_size: 128
  custom_attention: False
  layers: 1
Model has 3,292,672 parameters
Epoch 1: loss: 1.759, accuracy: 27.53%, lr: 3.00e-05
Epoch 2: loss: 1.603, accuracy: 32.71%, lr: 6.00e-05
Epoch 3: loss: 1.561, accuracy: 34.26%, lr: 9.00e-05
Epoch 4: loss: 1.529, accuracy: 35.46%, lr: 1.20e-04
Epoch 5: loss: 1.493, accuracy: 37.51%, lr: 1.50e-04
Epoch 6: loss: 1.468, accuracy: 38.30%, lr: 1.80e-04
Epoch 7: loss: 1.443, accuracy: 39.59%, lr: 2.10e-04
Epoch 8: loss: 1.413, accuracy: 41.06%, lr: 2.40e-04
Epoch 9: loss: 1.379, accuracy: 42.70%, lr: 2.70e-04
Epoch 10: loss: 1.350, accuracy: 43.95%, lr: 3.00e-04
Epoch 11: loss: 1.322, accuracy: 45.14%, lr: 2.98e-04
Epoch 12: loss: 1.308, accuracy: 45.70%, lr: 2.93e-04
Epoch 13: loss: 1.291, accuracy: 46.39%, lr: 2.84e-04
Epoch 14: loss: 1.276, accuracy: 46.97%, lr: 2.71e-04
Epoch 15: loss: 1.265, accuracy: 47.40%, lr: 2.56e-04
Epoch 16: loss: 1.257, accuracy: 47.73%, lr: 2.38e-04
Epoch 17: loss: 1.249, accuracy: 48.00%, lr: 2.18e-04
Epoch 18: loss: 1.243, accuracy: 48.19%, lr: 1.96e-04
Epoch 19: loss: 1.238, accuracy: 48.36%, lr: 1.73e-04
Epoch 20: loss: 1.234, accuracy: 48.51%, lr: 1.50e-04
Epoch 21: loss: 1.231, accuracy: 48.62%, lr: 1.27e-04
Epoch 22: loss: 1.227, accuracy: 48.76%, lr: 1.04e-04
Epoch 23: loss: 1.224, accuracy: 48.85%, lr: 8.19e-05
Epoch 24: loss: 1.222, accuracy: 48.92%, lr: 6.18e-05
Epoch 25: loss: 1.220, accuracy: 49.02%, lr: 4.39e-05
Epoch 26: loss: 1.218, accuracy: 49.06%, lr: 2.86e-05
Epoch 27: loss: 1.217, accuracy: 49.12%, lr: 1.63e-05
Epoch 28: loss: 1.216, accuracy: 49.15%, lr: 7.34e-06
Epoch 29: loss: 1.215, accuracy: 49.17%, lr: 1.85e-06
Epoch 30: loss: 1.215, accuracy: 49.16%, lr: 0.00e+00
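
Note on the lr column: the logged values are consistent with a linear warmup over warmup_steps followed by a cosine decay to zero. The sketch below reproduces the column under assumptions that the log does not state: 782 optimizer steps per epoch (ceil(100000 / 128)), the learning rate reported at the end of each epoch, and a hypothetical helper name lr_at_step. It is an illustration of a schedule that matches the numbers, not the training script's actual scheduler.

```python
import math

# Values taken from the Config block in the log.
BASE_LR = 3e-4
WARMUP_STEPS = 7820
EPOCHS = 30
NUM_TRAIN_SAMPLES = 100_000
BATCH_SIZE = 128

# Assumption: the last partial batch is kept, giving 782 steps per epoch.
STEPS_PER_EPOCH = math.ceil(NUM_TRAIN_SAMPLES / BATCH_SIZE)
TOTAL_STEPS = STEPS_PER_EPOCH * EPOCHS


def lr_at_step(step: int) -> float:
    """Linear warmup to BASE_LR, then cosine decay to zero (hypothetical helper)."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


# Assumption: the log reports the learning rate after the last step of each epoch.
lr_column = [f"{lr_at_step(epoch * STEPS_PER_EPOCH):.2e}" for epoch in range(1, EPOCHS + 1)]

for epoch, lr in enumerate(lr_column, start=1):
    print(f"Epoch {epoch}: lr: {lr}")
```

Under these assumptions the warmup ends exactly at the end of epoch 10 (7820 = 10 * 782), which is where the logged lr peaks at 3.00e-04.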


===============================================
Starting run 2/2 with 1 layers

Config:
  num_train_samples: 100000
  num_eval_samples: 10000
  sequence_length: 128
  num_jumps: 15
  embed_dim: 512
  ffn_dim: 2048
  num_heads: 8
  reorder_and_upcast_attn: True
  scale_attn_by_inverse_layer_idx: False
  scale_attn_weights: True
  epochs: 30
  warmup_steps: 7820
  lr: 0.0003
  betas: (0.9, 0.98)
  batch_size: 128
  custom_attention: False
  layers: 1
Model has 3,292,672 parameters
Epoch 1: loss: 1.755, accuracy: 27.32%, lr: 3.00e-05
Epoch 2: loss: 1.604, accuracy: 32.70%, lr: 6.00e-05
Epoch 3: loss: 1.555, accuracy: 34.49%, lr: 9.00e-05
Epoch 4: loss: 1.515, accuracy: 36.13%, lr: 1.20e-04
Epoch 5: loss: 1.481, accuracy: 37.76%, lr: 1.50e-04
Epoch 6: loss: 1.455, accuracy: 38.99%, lr: 1.80e-04
Epoch 7: loss: 1.434, accuracy: 39.87%, lr: 2.10e-04
Epoch 8: loss: 1.405, accuracy: 41.40%, lr: 2.40e-04
Epoch 9: loss: 1.382, accuracy: 42.32%, lr: 2.70e-04
Epoch 10: loss: 1.358, accuracy: 43.59%, lr: 3.00e-04
Epoch 11: loss: 1.319, accuracy: 45.37%, lr: 2.98e-04
Epoch 12: loss: 1.293, accuracy: 46.44%, lr: 2.93e-04
Epoch 13: loss: 1.272, accuracy: 47.33%, lr: 2.84e-04
Epoch 14: loss: 1.256, accuracy: 47.97%, lr: 2.71e-04
Epoch 15: loss: 1.245, accuracy: 48.42%, lr: 2.56e-04
Epoch 16: loss: 1.234, accuracy: 48.83%, lr: 2.38e-04
Epoch 17: loss: 1.226, accuracy: 49.12%, lr: 2.18e-04
Epoch 18: loss: 1.219, accuracy: 49.38%, lr: 1.96e-04
Epoch 19: loss: 1.214, accuracy: 49.55%, lr: 1.73e-04
Epoch 20: loss: 1.210, accuracy: 49.70%, lr: 1.50e-04
Epoch 21: loss: 1.205, accuracy: 49.85%, lr: 1.27e-04
Epoch 22: loss: 1.202, accuracy: 49.98%, lr: 1.04e-04
Epoch 23: loss: 1.198, accuracy: 50.11%, lr: 8.19e-05
Epoch 24: loss: 1.196, accuracy: 50.19%, lr: 6.18e-05
Epoch 25: loss: 1.194, accuracy: 50.26%, lr: 4.39e-05
Epoch 26: loss: 1.192, accuracy: 50.31%, lr: 2.86e-05
Epoch 27: loss: 1.191, accuracy: 50.35%, lr: 1.63e-05
Epoch 28: loss: 1.190, accuracy: 50.38%, lr: 7.34e-06
Epoch 29: loss: 1.190, accuracy: 50.39%, lr: 1.85e-06
Epoch 30: loss: 1.189, accuracy: 50.40%, lr: 0.00e+00


===============================================
Starting run 1/2 with 2 layers

Config:
  num_train_samples: 100000
  num_eval_samples: 10000
  sequence_length: 128
  num_jumps: 15
  embed_dim: 512
  ffn_dim: 2048
  num_heads: 8
  reorder_and_upcast_attn: True
  scale_attn_by_inverse_layer_idx: False
  scale_attn_weights: True
  epochs: 30
  warmup_steps: 7820
  lr: 0.0003
  betas: (0.9, 0.98)
  batch_size: 128
  custom_attention: False
  layers: 2
Model has 6,445,056 parameters
Epoch 1: loss: 1.684, accuracy: 30.43%, lr: 3.00e-05
Epoch 2: loss: 1.303, accuracy: 45.90%, lr: 6.00e-05
Epoch 3: loss: 1.147, accuracy: 52.10%, lr: 9.00e-05
Epoch 4: loss: 1.082, accuracy: 54.76%, lr: 1.20e-04
Epoch 5: loss: 1.034, accuracy: 56.61%, lr: 1.50e-04
Epoch 6: loss: 0.974, accuracy: 59.32%, lr: 1.80e-04
Epoch 7: loss: 0.910, accuracy: 62.45%, lr: 2.10e-04
Epoch 8: loss: 0.911, accuracy: 63.26%, lr: 2.40e-04
Epoch 9: loss: 0.818, accuracy: 66.35%, lr: 2.70e-04
Epoch 10: loss: 0.804, accuracy: 67.21%, lr: 3.00e-04
Epoch 11: loss: 0.760, accuracy: 68.81%, lr: 2.98e-04
Epoch 12: loss: 0.739, accuracy: 69.70%, lr: 2.93e-04
Epoch 13: loss: 0.727, accuracy: 70.06%, lr: 2.84e-04
Epoch 14: loss: 0.709, accuracy: 70.79%, lr: 2.71e-04
Epoch 15: loss: 0.686, accuracy: 71.68%, lr: 2.56e-04
Epoch 16: loss: 0.671, accuracy: 72.27%, lr: 2.38e-04
Epoch 17: loss: 0.666, accuracy: 72.47%, lr: 2.18e-04
Epoch 18: loss: 0.650, accuracy: 73.03%, lr: 1.96e-04
Epoch 19: loss: 0.641, accuracy: 73.38%, lr: 1.73e-04
Epoch 20: loss: 0.631, accuracy: 73.73%, lr: 1.50e-04
Epoch 21: loss: 0.622, accuracy: 74.10%, lr: 1.27e-04
Epoch 22: loss: 0.616, accuracy: 74.33%, lr: 1.04e-04
Epoch 23: loss: 0.609, accuracy: 74.60%, lr: 8.19e-05
Epoch 24: loss: 0.605, accuracy: 74.74%, lr: 6.18e-05
Epoch 25: loss: 0.600, accuracy: 74.90%, lr: 4.39e-05
Epoch 26: loss: 0.597, accuracy: 75.01%, lr: 2.86e-05
Epoch 27: loss: 0.595, accuracy: 75.08%, lr: 1.63e-05
Epoch 28: loss: 0.594, accuracy: 75.14%, lr: 7.34e-06
Epoch 29: loss: 0.593, accuracy: 75.16%, lr: 1.85e-06
Epoch 30: loss: 0.593, accuracy: 75.17%, lr: 0.00e+00


===============================================
Starting run 2/2 with 2 layers

Config:
  num_train_samples: 100000
  num_eval_samples: 10000
  sequence_length: 128
  num_jumps: 15
  embed_dim: 512
  ffn_dim: 2048
  num_heads: 8
  reorder_and_upcast_attn: True
  scale_attn_by_inverse_layer_idx: False
  scale_attn_weights: True
  epochs: 30
  warmup_steps: 7820
  lr: 0.0003
  betas: (0.9, 0.98)
  batch_size: 128
  custom_attention: False
  layers: 2
Model has 6,445,056 parameters
Epoch 1: loss: 1.725, accuracy: 28.56%, lr: 3.00e-05
Epoch 2: loss: 1.307, accuracy: 46.21%, lr: 6.00e-05
Epoch 3: loss: 1.137, accuracy: 52.45%, lr: 9.00e-05
Epoch 4: loss: 1.085, accuracy: 54.51%, lr: 1.20e-04
Epoch 5: loss: 1.051, accuracy: 55.85%, lr: 1.50e-04
Epoch 6: loss: 1.011, accuracy: 57.88%, lr: 1.80e-04
Epoch 7: loss: 0.964, accuracy: 59.87%, lr: 2.10e-04
Epoch 8: loss: 0.915, accuracy: 61.85%, lr: 2.40e-04
Epoch 9: loss: 0.887, accuracy: 62.86%, lr: 2.70e-04
Epoch 10: loss: 0.854, accuracy: 64.30%, lr: 3.00e-04
Epoch 11: loss: 0.838, accuracy: 64.84%, lr: 2.98e-04
Epoch 12: loss: 0.818, accuracy: 65.54%, lr: 2.93e-04
Epoch 13: loss: 0.801, accuracy: 66.26%, lr: 2.84e-04
Epoch 14: loss: 0.744, accuracy: 69.16%, lr: 2.71e-04
Epoch 15: loss: 0.672, accuracy: 72.49%, lr: 2.56e-04
Epoch 16: loss: 0.638, accuracy: 73.97%, lr: 2.38e-04
Epoch 17: loss: 0.592, accuracy: 75.71%, lr: 2.18e-04
Epoch 18: loss: 0.571, accuracy: 76.56%, lr: 1.96e-04
Epoch 19: loss: 0.553, accuracy: 77.29%, lr: 1.73e-04
Epoch 20: loss: 0.538, accuracy: 77.87%, lr: 1.50e-04
Epoch 21: loss: 0.528, accuracy: 78.25%, lr: 1.27e-04
Epoch 22: loss: 0.520, accuracy: 78.57%, lr: 1.04e-04
Epoch 23: loss: 0.508, accuracy: 79.07%, lr: 8.19e-05
Epoch 24: loss: 0.501, accuracy: 79.35%, lr: 6.18e-05
Epoch 25: loss: 0.495, accuracy: 79.60%, lr: 4.39e-05
Epoch 26: loss: 0.490, accuracy: 79.86%, lr: 2.86e-05
Epoch 27: loss: 0.486, accuracy: 80.03%, lr: 1.63e-05
Epoch 28: loss: 0.484, accuracy: 80.12%, lr: 7.34e-06
Epoch 29: loss: 0.483, accuracy: 80.16%, lr: 1.85e-06
Epoch 30: loss: 0.482, accuracy: 80.18%, lr: 0.00e+00


===============================================
Starting run 1/2 with 3 layers

Config:
  num_train_samples: 100000
  num_eval_samples: 10000
  sequence_length: 128
  num_jumps: 15
  embed_dim: 512
  ffn_dim: 2048
  num_heads: 8
  reorder_and_upcast_attn: True
  scale_attn_by_inverse_layer_idx: False
  scale_attn_weights: True
  epochs: 30
  warmup_steps: 7820
  lr: 0.0003
  betas: (0.9, 0.98)
  batch_size: 128
  custom_attention: False
  layers: 3
Model has 9,597,440 parameters
Epoch 1: loss: 1.722, accuracy: 28.74%, lr: 3.00e-05
Epoch 2: loss: 1.251, accuracy: 48.53%, lr: 6.00e-05
Epoch 3: loss: 0.693, accuracy: 71.41%, lr: 9.00e-05
Epoch 4: loss: 0.607, accuracy: 74.62%, lr: 1.20e-04
Epoch 5: loss: 0.557, accuracy: 76.60%, lr: 1.50e-04
Epoch 6: loss: 0.453, accuracy: 81.10%, lr: 1.80e-04
Epoch 7: loss: 0.330, accuracy: 85.66%, lr: 2.10e-04
Epoch 8: loss: 0.262, accuracy: 88.62%, lr: 2.40e-04
Epoch 9: loss: 0.210, accuracy: 90.75%, lr: 2.70e-04
Epoch 10: loss: 0.131, accuracy: 94.54%, lr: 3.00e-04
Epoch 11: loss: 0.096, accuracy: 96.25%, lr: 2.98e-04
Epoch 12: loss: 0.090, accuracy: 96.51%, lr: 2.93e-04
Epoch 13: loss: 0.082, accuracy: 96.80%, lr: 2.84e-04
Epoch 14: loss: 0.078, accuracy: 97.10%, lr: 2.71e-04
Epoch 15: loss: 0.072, accuracy: 97.16%, lr: 2.56e-04
Epoch 16: loss: 0.059, accuracy: 97.71%, lr: 2.38e-04
Epoch 17: loss: 0.049, accuracy: 97.98%, lr: 2.18e-04
Epoch 18: loss: 0.045, accuracy: 98.05%, lr: 1.96e-04
Epoch 19: loss: 0.045, accuracy: 98.06%, lr: 1.73e-04
Epoch 20: loss: 0.042, accuracy: 98.12%, lr: 1.50e-04
Epoch 21: loss: 0.040, accuracy: 98.12%, lr: 1.27e-04
Epoch 22: loss: 0.039, accuracy: 98.17%, lr: 1.04e-04
Epoch 23: loss: 0.037, accuracy: 98.26%, lr: 8.19e-05
Epoch 24: loss: 0.035, accuracy: 98.34%, lr: 6.18e-05
Epoch 25: loss: 0.033, accuracy: 98.37%, lr: 4.39e-05
Epoch 26: loss: 0.032, accuracy: 98.40%, lr: 2.86e-05
Epoch 27: loss: 0.031, accuracy: 98.42%, lr: 1.63e-05
Epoch 28: loss: 0.031, accuracy: 98.43%, lr: 7.34e-06
Epoch 29: loss: 0.031, accuracy: 98.43%, lr: 1.85e-06
Epoch 30: loss: 0.031, accuracy: 98.45%, lr: 0.00e+00


===============================================
Starting run 2/2 with 3 layers

Config:
  num_train_samples: 100000
  num_eval_samples: 10000
  sequence_length: 128
  num_jumps: 15
  embed_dim: 512
  ffn_dim: 2048
  num_heads: 8
  reorder_and_upcast_attn: True
  scale_attn_by_inverse_layer_idx: False
  scale_attn_weights: True
  epochs: 30
  warmup_steps: 7820
  lr: 0.0003
  betas: (0.9, 0.98)
  batch_size: 128
  custom_attention: False
  layers: 3
Model has 9,597,440 parameters
Epoch 1: loss: 1.671, accuracy: 30.95%, lr: 3.00e-05
Epoch 2: loss: 1.270, accuracy: 47.41%, lr: 6.00e-05
Epoch 3: loss: 0.791, accuracy: 67.30%, lr: 9.00e-05
Epoch 4: loss: 0.626, accuracy: 73.24%, lr: 1.20e-04
Epoch 5: loss: 0.517, accuracy: 78.03%, lr: 1.50e-04
Epoch 6: loss: 0.444, accuracy: 81.33%, lr: 1.80e-04
Epoch 7: loss: 0.388, accuracy: 83.86%, lr: 2.10e-04
Epoch 8: loss: 0.320, accuracy: 86.54%, lr: 2.40e-04
Epoch 9: loss: 0.311, accuracy: 86.89%, lr: 2.70e-04
Epoch 10: loss: 0.301, accuracy: 87.21%, lr: 3.00e-04
Epoch 11: loss: 0.310, accuracy: 87.03%, lr: 2.98e-04
Epoch 12: loss: 0.308, accuracy: 87.08%, lr: 2.93e-04
Epoch 13: loss: 0.299, accuracy: 87.22%, lr: 2.84e-04
Epoch 14: loss: 0.296, accuracy: 87.28%, lr: 2.71e-04
Epoch 15: loss: 0.295, accuracy: 87.26%, lr: 2.56e-04
Epoch 16: loss: 0.293, accuracy: 87.29%, lr: 2.38e-04
Epoch 17: loss: 0.292, accuracy: 87.31%, lr: 2.18e-04
Epoch 18: loss: 0.290, accuracy: 87.32%, lr: 1.96e-04
Epoch 19: loss: 0.291, accuracy: 87.29%, lr: 1.73e-04
Epoch 20: loss: 0.290, accuracy: 87.31%, lr: 1.50e-04
Epoch 21: loss: 0.289, accuracy: 87.31%, lr: 1.27e-04
Epoch 22: loss: 0.289, accuracy: 87.31%, lr: 1.04e-04
Epoch 23: loss: 0.289, accuracy: 87.32%, lr: 8.19e-05
Epoch 24: loss: 0.289, accuracy: 87.31%, lr: 6.18e-05
Epoch 25: loss: 0.289, accuracy: 87.31%, lr: 4.39e-05
Epoch 26: loss: 0.288, accuracy: 87.32%, lr: 2.86e-05
Epoch 27: loss: 0.288, accuracy: 87.33%, lr: 1.63e-05
Epoch 28: loss: 0.288, accuracy: 87.31%, lr: 7.34e-06
Epoch 29: loss: 0.288, accuracy: 87.32%, lr: 1.85e-06
Epoch 30: loss: 0.288, accuracy: 87.31%, lr: 0.00e+00


===============================================
Starting run 1/2 with 4 layers

Config:
  num_train_samples: 100000
  num_eval_samples: 10000
  sequence_length: 128
  num_jumps: 15
  embed_dim: 512
  ffn_dim: 2048
  num_heads: 8
  reorder_and_upcast_attn: True
  scale_attn_by_inverse_layer_idx: False
  scale_attn_weights: True
  epochs: 30
  warmup_steps: 7820
  lr: 0.0003
  betas: (0.9, 0.98)
  batch_size: 128
  custom_attention: False
  layers: 4
Model has 12,749,824 parameters
Epoch 1: loss: 1.632, accuracy: 32.49%, lr: 3.00e-05
Epoch 2: loss: 1.147, accuracy: 52.47%, lr: 6.00e-05
Epoch 3: loss: 0.798, accuracy: 67.07%, lr: 9.00e-05
Epoch 4: loss: 0.352, accuracy: 85.48%, lr: 1.20e-04
Epoch 5: loss: 0.249, accuracy: 89.95%, lr: 1.50e-04
Epoch 6: loss: 0.238, accuracy: 90.22%, lr: 1.80e-04
Epoch 7: loss: 0.236, accuracy: 90.22%, lr: 2.10e-04
Epoch 8: loss: 0.236, accuracy: 90.22%, lr: 2.40e-04
Epoch 9: loss: 0.237, accuracy: 90.22%, lr: 2.70e-04
Epoch 10: loss: 0.236, accuracy: 90.23%, lr: 3.00e-04
Epoch 11: loss: 0.233, accuracy: 90.25%, lr: 2.98e-04
Epoch 12: loss: 0.036, accuracy: 98.93%, lr: 2.93e-04
Epoch 13: loss: 0.004, accuracy: 99.94%, lr: 2.84e-04
Epoch 14: loss: 0.005, accuracy: 99.92%, lr: 2.71e-04
Epoch 15: loss: 0.001, accuracy: 100.00%, lr: 2.56e-04
Epoch 16: loss: 0.000, accuracy: 100.00%, lr: 2.38e-04
Epoch 17: loss: 0.000, accuracy: 100.00%, lr: 2.18e-04
Epoch 18: loss: 0.000, accuracy: 100.00%, lr: 1.96e-04
Epoch 19: loss: 0.000, accuracy: 100.00%, lr: 1.73e-04
Epoch 20: loss: 0.000, accuracy: 100.00%, lr: 1.50e-04
Epoch 21: loss: 0.009, accuracy: 99.78%, lr: 1.27e-04
Epoch 22: loss: 0.000, accuracy: 100.00%, lr: 1.04e-04
Epoch 23: loss: 0.000, accuracy: 100.00%, lr: 8.19e-05
Epoch 24: loss: 0.000, accuracy: 100.00%, lr: 6.18e-05
Epoch 25: loss: 0.000, accuracy: 100.00%, lr: 4.39e-05
Epoch 26: loss: 0.000, accuracy: 100.00%, lr: 2.86e-05
Epoch 27: loss: 0.000, accuracy: 100.00%, lr: 1.63e-05
Epoch 28: loss: 0.000, accuracy: 100.00%, lr: 7.34e-06
Epoch 29: loss: 0.000, accuracy: 100.00%, lr: 1.85e-06
Epoch 30: loss: 0.000, accuracy: 100.00%, lr: 0.00e+00


===============================================
Starting run 2/2 with 4 layers

Config:
  num_train_samples: 100000
  num_eval_samples: 10000
  sequence_length: 128
  num_jumps: 15
  embed_dim: 512
  ffn_dim: 2048
  num_heads: 8
  reorder_and_upcast_attn: True
  scale_attn_by_inverse_layer_idx: False
  scale_attn_weights: True
  epochs: 30
  warmup_steps: 7820
  lr: 0.0003
  betas: (0.9, 0.98)
  batch_size: 128
  custom_attention: False
  layers: 4
Model has 12,749,824 parameters
Epoch 1: loss: 1.663, accuracy: 30.84%, lr: 3.00e-05
Epoch 2: loss: 0.917, accuracy: 62.75%, lr: 6.00e-05
Epoch 3: loss: 0.669, accuracy: 73.94%, lr: 9.00e-05
Epoch 4: loss: 0.509, accuracy: 79.05%, lr: 1.20e-04
Epoch 5: loss: 0.302, accuracy: 87.32%, lr: 1.50e-04
Epoch 6: loss: 0.936, accuracy: 79.62%, lr: 1.80e-04
Epoch 7: loss: 0.091, accuracy: 96.53%, lr: 2.10e-04
Epoch 8: loss: 0.350, accuracy: 91.06%, lr: 2.40e-04
Epoch 9: loss: 0.021, accuracy: 99.75%, lr: 2.70e-04
Epoch 10: loss: 0.045, accuracy: 98.85%, lr: 3.00e-04
Epoch 11: loss: 0.029, accuracy: 98.96%, lr: 2.98e-04
Epoch 12: loss: 0.014, accuracy: 99.74%, lr: 2.93e-04
Epoch 13: loss: 0.022, accuracy: 99.31%, lr: 2.84e-04
Epoch 14: loss: 0.079, accuracy: 99.18%, lr: 2.71e-04
Epoch 15: loss: 0.030, accuracy: 98.95%, lr: 2.56e-04
Epoch 16: loss: 0.009, accuracy: 99.85%, lr: 2.38e-04
Epoch 17: loss: 0.004, accuracy: 99.93%, lr: 2.18e-04
Epoch 18: loss: 0.015, accuracy: 99.49%, lr: 1.96e-04
Epoch 19: loss: 0.004, accuracy: 99.85%, lr: 1.73e-04
Epoch 20: loss: 0.001, accuracy: 99.99%, lr: 1.50e-04
Epoch 21: loss: 0.000, accuracy: 100.00%, lr: 1.27e-04
Epoch 22: loss: 0.000, accuracy: 100.00%, lr: 1.04e-04
Epoch 23: loss: 0.001, accuracy: 99.97%, lr: 8.19e-05
Epoch 24: loss: 0.000, accuracy: 99.99%, lr: 6.18e-05
Epoch 25: loss: 0.000, accuracy: 100.00%, lr: 4.39e-05
Epoch 26: loss: 0.000, accuracy: 100.00%, lr: 2.86e-05
Epoch 27: loss: 0.000, accuracy: 100.00%, lr: 1.63e-05
Epoch 28: loss: 0.000, accuracy: 100.00%, lr: 7.34e-06
Epoch 29: loss: 0.000, accuracy: 100.00%, lr: 1.85e-06
Epoch 30: loss: 0.000, accuracy: 100.00%, lr: 0.00e+00


===============================================
Starting run 1/2 with 5 layers

Config:
  num_train_samples: 100000
  num_eval_samples: 10000
  sequence_length: 128
  num_jumps: 15
  embed_dim: 512
  ffn_dim: 2048
  num_heads: 8
  reorder_and_upcast_attn: True
  scale_attn_by_inverse_layer_idx: False
  scale_attn_weights: True
  epochs: 30
  warmup_steps: 7820
  lr: 0.0003
  betas: (0.9, 0.98)
  batch_size: 128
  custom_attention: False
  layers: 5
Model has 15,902,208 parameters
Epoch 1: loss: 1.614, accuracy: 33.14%, lr: 3.00e-05
Epoch 2: loss: 0.900, accuracy: 63.42%, lr: 6.00e-05
Epoch 3: loss: 0.347, accuracy: 86.26%, lr: 9.00e-05
Epoch 4: loss: 0.308, accuracy: 87.33%, lr: 1.20e-04
Epoch 5: loss: 0.150, accuracy: 93.98%, lr: 1.50e-04
Epoch 6: loss: 0.147, accuracy: 93.96%, lr: 1.80e-04
Epoch 7: loss: 0.159, accuracy: 93.70%, lr: 2.10e-04
Epoch 8: loss: 0.149, accuracy: 93.85%, lr: 2.40e-04
Epoch 9: loss: 0.124, accuracy: 94.72%, lr: 2.70e-04
Epoch 10: loss: 0.040, accuracy: 98.99%, lr: 3.00e-04
Epoch 11: loss: 0.053, accuracy: 99.03%, lr: 2.98e-04
Epoch 12: loss: 0.003, accuracy: 99.97%, lr: 2.93e-04
Epoch 13: loss: 0.077, accuracy: 98.22%, lr: 2.84e-04
Epoch 14: loss: 0.005, accuracy: 99.90%, lr: 2.71e-04
Epoch 15: loss: 0.002, accuracy: 99.99%, lr: 2.56e-04
Epoch 16: loss: 0.001, accuracy: 99.98%, lr: 2.38e-04
Epoch 17: loss: 0.008, accuracy: 99.76%, lr: 2.18e-04
Epoch 18: loss: 0.000, accuracy: 99.99%, lr: 1.96e-04
Epoch 19: loss: 0.001, accuracy: 99.98%, lr: 1.73e-04
Epoch 20: loss: 0.000, accuracy: 100.00%, lr: 1.50e-04
Epoch 21: loss: 0.000, accuracy: 99.99%, lr: 1.27e-04
Epoch 22: loss: 0.000, accuracy: 100.00%, lr: 1.04e-04
Epoch 23: loss: 0.000, accuracy: 100.00%, lr: 8.19e-05
Epoch 24: loss: 0.000, accuracy: 100.00%, lr: 6.18e-05
Epoch 25: loss: 0.000, accuracy: 100.00%, lr: 4.39e-05
Epoch 26: loss: 0.000, accuracy: 100.00%, lr: 2.86e-05
Epoch 27: loss: 0.000, accuracy: 100.00%, lr: 1.63e-05
Epoch 28: loss: 0.000, accuracy: 100.00%, lr: 7.34e-06
Epoch 29: loss: 0.000, accuracy: 100.00%, lr: 1.85e-06
Epoch 30: loss: 0.000, accuracy: 100.00%, lr: 0.00e+00


===============================================
Starting run 2/2 with 5 layers

Config:
  num_train_samples: 100000
  num_eval_samples: 10000
  sequence_length: 128
  num_jumps: 15
  embed_dim: 512
  ffn_dim: 2048
  num_heads: 8
  reorder_and_upcast_attn: True
  scale_attn_by_inverse_layer_idx: False
  scale_attn_weights: True
  epochs: 30
  warmup_steps: 7820
  lr: 0.0003
  betas: (0.9, 0.98)
  batch_size: 128
  custom_attention: False
  layers: 5
Model has 15,902,208 parameters
Epoch 1: loss: 1.612, accuracy: 33.68%, lr: 3.00e-05
Epoch 2: loss: 0.848, accuracy: 65.58%, lr: 6.00e-05
Epoch 3: loss: 0.418, accuracy: 83.28%, lr: 9.00e-05
Epoch 4: loss: 0.392, accuracy: 83.94%, lr: 1.20e-04
Epoch 5: loss: 0.364, accuracy: 85.60%, lr: 1.50e-04
Epoch 6: loss: 0.153, accuracy: 93.10%, lr: 1.80e-04
Epoch 7: loss: 0.040, accuracy: 98.39%, lr: 2.10e-04
Epoch 8: loss: 0.001, accuracy: 100.00%, lr: 2.40e-04
Epoch 9: loss: 0.004, accuracy: 99.94%, lr: 2.70e-04
Epoch 10: loss: 0.013, accuracy: 99.81%, lr: 3.00e-04
Epoch 11: loss: 0.003, accuracy: 99.95%, lr: 2.98e-04
Epoch 12: loss: 0.000, accuracy: 100.00%, lr: 2.93e-04
Epoch 13: loss: 0.000, accuracy: 100.00%, lr: 2.84e-04
Epoch 14: loss: 0.002, accuracy: 99.98%, lr: 2.71e-04
Epoch 15: loss: 0.000, accuracy: 100.00%, lr: 2.56e-04
Epoch 16: loss: 0.003, accuracy: 99.96%, lr: 2.38e-04
Epoch 17: loss: 0.001, accuracy: 99.99%, lr: 2.18e-04
Epoch 18: loss: 0.001, accuracy: 99.99%, lr: 1.96e-04
Epoch 19: loss: 0.000, accuracy: 100.00%, lr: 1.73e-04
Epoch 20: loss: 0.001, accuracy: 99.99%, lr: 1.50e-04
Epoch 21: loss: 0.000, accuracy: 100.00%, lr: 1.27e-04
Epoch 22: loss: 0.000, accuracy: 100.00%, lr: 1.04e-04
Epoch 23: loss: 0.000, accuracy: 100.00%, lr: 8.19e-05
Epoch 24: loss: 0.000, accuracy: 100.00%, lr: 6.18e-05
Epoch 25: loss: 0.000, accuracy: 100.00%, lr: 4.39e-05
Epoch 26: loss: 0.000, accuracy: 100.00%, lr: 2.86e-05
Epoch 27: loss: 0.000, accuracy: 100.00%, lr: 1.63e-05
Epoch 28: loss: 0.000, accuracy: 100.00%, lr: 7.34e-06
Epoch 29: loss: 0.000, accuracy: 100.00%, lr: 1.85e-06
Epoch 30: loss: 0.000, accuracy: 100.00%, lr: 0.00e+00
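
Note on the parameter counts: across the sweep above, each additional layer adds exactly 3,152,384 parameters (for example 6,445,056 - 3,292,672). The sketch below shows that this increment matches a standard transformer block at embed_dim 512 and ffn_dim 2048, assuming biased query/key/value/output projections, a two-layer feed-forward network with biases, and two LayerNorms per block. The block layout is an assumption, since the log does not describe the architecture, and block_params is a hypothetical helper.

```python
def block_params(embed_dim: int, ffn_dim: int) -> int:
    """Parameter count of one transformer block under the assumed layout."""
    # Four square projections (Q, K, V, output), each with a bias vector.
    attention = 4 * (embed_dim * embed_dim + embed_dim)
    # Feed-forward: embed_dim -> ffn_dim -> embed_dim, both with biases.
    feed_forward = (embed_dim * ffn_dim + ffn_dim) + (ffn_dim * embed_dim + embed_dim)
    # Two LayerNorms, each with a weight and a bias of size embed_dim.
    layer_norms = 2 * (2 * embed_dim)
    return attention + feed_forward + layer_norms


# "Model has N parameters" lines from the log, keyed by layer count.
logged_totals = {
    1: 3_292_672,
    2: 6_445_056,
    3: 9_597_440,
    4: 12_749_824,
    5: 15_902_208,
}

per_block = block_params(embed_dim=512, ffn_dim=2048)
# Whatever is not inside the blocks (embeddings, final norm, and so on).
non_block = logged_totals[1] - per_block

for layers, total in logged_totals.items():
    predicted = non_block + layers * per_block
    print(f"{layers} layers: logged {total:,}, predicted {predicted:,}")
```

The remaining 140,288 parameters are shared by every run and sit outside the blocks; the log does not give enough detail to break that figure down further.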


===============================================
Starting run 1/2 with 1 layers

Config:
  num_train_samples: 100000
  num_eval_samples: 10000
  sequence_length: 128
  num_jumps: 15
  embed_dim: 512
  ffn_dim: 2048
  num_heads: 8
  reorder_and_upcast_attn: True
  scale_attn_by_inverse_layer_idx: False
  scale_attn_weights: True
  epochs: 30
  warmup_steps: 7820
  lr: 0.0003
  betas: (0.9, 0.98)
  batch_size: 128
  custom_attention: True
  layers: 1
Model has 3,292,672 parameters
Epoch 1: loss: 1.792, accuracy: 25.46%, lr: 3.00e-05
Epoch 2: loss: 1.080, accuracy: 56.36%, lr: 6.00e-05
Epoch 3: loss: 0.004, accuracy: 100.00%, lr: 9.00e-05
Epoch 4: loss: 0.000, accuracy: 100.00%, lr: 1.20e-04
Epoch 5: loss: 0.000, accuracy: 100.00%, lr: 1.50e-04
Epoch 6: loss: 0.000, accuracy: 100.00%, lr: 1.80e-04
Epoch 7: loss: 0.000, accuracy: 100.00%, lr: 2.10e-04
Epoch 8: loss: 0.000, accuracy: 100.00%, lr: 2.40e-04
Epoch 9: loss: 0.000, accuracy: 100.00%, lr: 2.70e-04
Epoch 10: loss: 0.000, accuracy: 100.00%, lr: 3.00e-04
Epoch 11: loss: 0.000, accuracy: 100.00%, lr: 2.98e-04
Epoch 12: loss: 0.000, accuracy: 100.00%, lr: 2.93e-04
Epoch 13: loss: 0.000, accuracy: 100.00%, lr: 2.84e-04
Epoch 14: loss: 0.000, accuracy: 100.00%, lr: 2.71e-04
Epoch 15: loss: 0.000, accuracy: 100.00%, lr: 2.56e-04
Epoch 16: loss: 0.000, accuracy: 100.00%, lr: 2.38e-04
Epoch 17: loss: 0.000, accuracy: 100.00%, lr: 2.18e-04
Epoch 18: loss: 0.000, accuracy: 100.00%, lr: 1.96e-04
Epoch 19: loss: 0.000, accuracy: 100.00%, lr: 1.73e-04
Epoch 20: loss: 0.000, accuracy: 100.00%, lr: 1.50e-04
Epoch 21: loss: 0.000, accuracy: 100.00%, lr: 1.27e-04
Epoch 22: loss: 0.000, accuracy: 100.00%, lr: 1.04e-04
Epoch 23: loss: 0.000, accuracy: 100.00%, lr: 8.19e-05
Epoch 24: loss: 0.000, accuracy: 100.00%, lr: 6.18e-05
Epoch 25: loss: 0.000, accuracy: 100.00%, lr: 4.39e-05
Epoch 26: loss: 0.000, accuracy: 100.00%, lr: 2.86e-05
Epoch 27: loss: 0.000, accuracy: 100.00%, lr: 1.63e-05
Epoch 28: loss: 0.000, accuracy: 100.00%, lr: 7.34e-06
Epoch 29: loss: 0.000, accuracy: 100.00%, lr: 1.85e-06
Epoch 30: loss: 0.000, accuracy: 100.00%, lr: 0.00e+00


===============================================
Starting run 2/2 with 1 layers

Config:
  num_train_samples: 100000
  num_eval_samples: 10000
  sequence_length: 128
  num_jumps: 15
  embed_dim: 512
  ffn_dim: 2048
  num_heads: 8
  reorder_and_upcast_attn: True
  scale_attn_by_inverse_layer_idx: False
  scale_attn_weights: True
  epochs: 30
  warmup_steps: 7820
  lr: 0.0003
  betas: (0.9, 0.98)
  batch_size: 128
  custom_attention: True
  layers: 1
Model has 3,292,672 parameters
Epoch 1: loss: 1.793, accuracy: 25.21%, lr: 3.00e-05
Epoch 2: loss: 1.204, accuracy: 51.05%, lr: 6.00e-05
Epoch 3: loss: 0.006, accuracy: 100.00%, lr: 9.00e-05
Epoch 4: loss: 0.001, accuracy: 100.00%, lr: 1.20e-04
Epoch 5: loss: 0.000, accuracy: 100.00%, lr: 1.50e-04
Epoch 6: loss: 0.000, accuracy: 100.00%, lr: 1.80e-04
Epoch 7: loss: 0.000, accuracy: 100.00%, lr: 2.10e-04
Epoch 8: loss: 0.000, accuracy: 100.00%, lr: 2.40e-04
Epoch 9: loss: 0.000, accuracy: 100.00%, lr: 2.70e-04
Epoch 10: loss: 0.000, accuracy: 100.00%, lr: 3.00e-04
Epoch 11: loss: 0.000, accuracy: 100.00%, lr: 2.98e-04
Epoch 12: loss: 0.000, accuracy: 100.00%, lr: 2.93e-04
Epoch 13: loss: 0.000, accuracy: 100.00%, lr: 2.84e-04
Epoch 14: loss: 0.000, accuracy: 100.00%, lr: 2.71e-04
Epoch 15: loss: 0.000, accuracy: 100.00%, lr: 2.56e-04
Epoch 16: loss: 0.000, accuracy: 100.00%, lr: 2.38e-04
Epoch 17: loss: 0.000, accuracy: 100.00%, lr: 2.18e-04
Epoch 18: loss: 0.000, accuracy: 100.00%, lr: 1.96e-04
Epoch 19: loss: 0.000, accuracy: 100.00%, lr: 1.73e-04
Epoch 20: loss: 0.000, accuracy: 100.00%, lr: 1.50e-04
Epoch 21: loss: 0.000, accuracy: 100.00%, lr: 1.27e-04
Epoch 22: loss: 0.000, accuracy: 100.00%, lr: 1.04e-04
Epoch 23: loss: 0.000, accuracy: 100.00%, lr: 8.19e-05
Epoch 24: loss: 0.000, accuracy: 100.00%, lr: 6.18e-05
Epoch 25: loss: 0.000, accuracy: 100.00%, lr: 4.39e-05
Epoch 26: loss: 0.000, accuracy: 100.00%, lr: 2.86e-05
Epoch 27: loss: 0.000, accuracy: 100.00%, lr: 1.63e-05
Epoch 28: loss: 0.000, accuracy: 100.00%, lr: 7.34e-06
Epoch 29: loss: 0.000, accuracy: 100.00%, lr: 1.85e-06
Epoch 30: loss: 0.000, accuracy: 100.00%, lr: 0.00e+00
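
Note on comparing runs: the log format is regular enough to be parsed with a few regular expressions. The sketch below collects each run's layer count, custom_attention flag, and per-epoch metrics. The function name parse_runs and the dictionary layout are illustrative choices, and the sample string is a short excerpt copied from the log above.

```python
import re

HEADER = re.compile(r"Starting run (\d+)/(\d+) with (\d+) layers")
CUSTOM = re.compile(r"custom_attention: (True|False)")
EPOCH = re.compile(r"Epoch (\d+): loss: ([\d.]+), accuracy: ([\d.]+)%, lr: ([\d.e+-]+)")


def parse_runs(text: str) -> list:
    """Split a log into runs; each run is a dict with its settings and epoch rows."""
    runs = []
    for line in text.splitlines():
        line = line.strip()
        header = HEADER.search(line)
        if header:
            # A header line starts a new run.
            runs.append({
                "run": int(header.group(1)),
                "layers": int(header.group(3)),
                "custom_attention": None,
                "epochs": [],
            })
            continue
        if not runs:
            continue  # ignore anything before the first header
        custom = CUSTOM.search(line)
        if custom:
            runs[-1]["custom_attention"] = custom.group(1) == "True"
            continue
        epoch = EPOCH.search(line)
        if epoch:
            runs[-1]["epochs"].append({
                "epoch": int(epoch.group(1)),
                "loss": float(epoch.group(2)),
                "accuracy": float(epoch.group(3)),
                "lr": float(epoch.group(4)),
            })
    return runs


sample = """\
Starting run 1/2 with 1 layers
Config:
  custom_attention: True
  layers: 1
Epoch 1: loss: 1.792, accuracy: 25.46%, lr: 3.00e-05
Epoch 2: loss: 1.080, accuracy: 56.36%, lr: 6.00e-05
"""

parsed = parse_runs(sample)
for run in parsed:
    last = run["epochs"][-1]
    print(run["layers"], run["custom_attention"], last["epoch"], last["accuracy"])
```

Applied to the full log, grouping the final-epoch accuracy by (layers, custom_attention) gives a compact view of the sweep.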
