

===============================================
Starting run 1/2 with 1 layers

Config:
  num_train_samples: 100000
  num_eval_samples: 10000
  sequence_length: 128
  num_jumps: 15
  embed_dim: 512
  ffn_dim: 2048
  num_heads: 8
  reorder_and_upcast_attn: True
  scale_attn_by_inverse_layer_idx: False
  scale_attn_weights: True
  epochs: 30
  warmup_steps: 7820
  lr: 0.0003
  betas: (0.9, 0.98)
  batch_size: 128
  custom_attention: False
  layers: 1
Model has 3,292,672 parameters
Epoch 1: loss: 1.757, accuracy: 27.64%, lr: 3.00e-05
Epoch 2: loss: 1.603, accuracy: 32.70%, lr: 6.00e-05
Epoch 3: loss: 1.564, accuracy: 34.21%, lr: 9.00e-05
Epoch 4: loss: 1.528, accuracy: 35.61%, lr: 1.20e-04
Epoch 5: loss: 1.510, accuracy: 36.40%, lr: 1.50e-04
Epoch 6: loss: 1.478, accuracy: 37.88%, lr: 1.80e-04
Epoch 7: loss: 1.447, accuracy: 39.34%, lr: 2.10e-04
Epoch 8: loss: 1.423, accuracy: 40.54%, lr: 2.40e-04
Epoch 9: loss: 1.397, accuracy: 41.82%, lr: 2.70e-04
Epoch 10: loss: 1.360, accuracy: 43.66%, lr: 3.00e-04
Epoch 11: loss: 1.334, accuracy: 44.70%, lr: 2.98e-04
Epoch 12: loss: 1.311, accuracy: 45.70%, lr: 2.93e-04
Epoch 13: loss: 1.291, accuracy: 46.52%, lr: 2.84e-04
Epoch 14: loss: 1.272, accuracy: 47.29%, lr: 2.71e-04
Epoch 15: loss: 1.258, accuracy: 47.92%, lr: 2.56e-04
Epoch 16: loss: 1.246, accuracy: 48.37%, lr: 2.38e-04
Epoch 17: loss: 1.238, accuracy: 48.73%, lr: 2.18e-04
Epoch 18: loss: 1.230, accuracy: 48.99%, lr: 1.96e-04
Epoch 19: loss: 1.224, accuracy: 49.22%, lr: 1.73e-04
Epoch 20: loss: 1.217, accuracy: 49.50%, lr: 1.50e-04
Epoch 21: loss: 1.211, accuracy: 49.72%, lr: 1.27e-04
Epoch 22: loss: 1.208, accuracy: 49.85%, lr: 1.04e-04
Epoch 23: loss: 1.204, accuracy: 49.98%, lr: 8.19e-05
Epoch 24: loss: 1.201, accuracy: 50.06%, lr: 6.18e-05
Epoch 25: loss: 1.198, accuracy: 50.13%, lr: 4.39e-05
Epoch 26: loss: 1.197, accuracy: 50.22%, lr: 2.86e-05
Epoch 27: loss: 1.195, accuracy: 50.24%, lr: 1.63e-05
Epoch 28: loss: 1.194, accuracy: 50.27%, lr: 7.34e-06
Epoch 29: loss: 1.194, accuracy: 50.29%, lr: 1.85e-06
Epoch 30: loss: 1.194, accuracy: 50.31%, lr: 0.00e+00


===============================================
Starting run 2/2 with 1 layers

Config:
  num_train_samples: 100000
  num_eval_samples: 10000
  sequence_length: 128
  num_jumps: 15
  embed_dim: 512
  ffn_dim: 2048
  num_heads: 8
  reorder_and_upcast_attn: True
  scale_attn_by_inverse_layer_idx: False
  scale_attn_weights: True
  epochs: 30
  warmup_steps: 7820
  lr: 0.0003
  betas: (0.9, 0.98)
  batch_size: 128
  custom_attention: False
  layers: 1
Model has 3,292,672 parameters
Epoch 1: loss: 1.750, accuracy: 27.19%, lr: 3.00e-05
Epoch 2: loss: 1.601, accuracy: 32.82%, lr: 6.00e-05
Epoch 3: loss: 1.562, accuracy: 34.18%, lr: 9.00e-05
Epoch 4: loss: 1.518, accuracy: 35.96%, lr: 1.20e-04
Epoch 5: loss: 1.480, accuracy: 37.83%, lr: 1.50e-04
Epoch 6: loss: 1.458, accuracy: 38.80%, lr: 1.80e-04
Epoch 7: loss: 1.432, accuracy: 40.09%, lr: 2.10e-04
Epoch 8: loss: 1.403, accuracy: 41.61%, lr: 2.40e-04
Epoch 9: loss: 1.373, accuracy: 43.08%, lr: 2.70e-04
Epoch 10: loss: 1.338, accuracy: 44.62%, lr: 3.00e-04
Epoch 11: loss: 1.310, accuracy: 45.76%, lr: 2.98e-04
Epoch 12: loss: 1.292, accuracy: 46.44%, lr: 2.93e-04
Epoch 13: loss: 1.276, accuracy: 47.07%, lr: 2.84e-04
Epoch 14: loss: 1.264, accuracy: 47.51%, lr: 2.71e-04
Epoch 15: loss: 1.254, accuracy: 47.90%, lr: 2.56e-04
Epoch 16: loss: 1.246, accuracy: 48.19%, lr: 2.38e-04
Epoch 17: loss: 1.236, accuracy: 48.61%, lr: 2.18e-04
Epoch 18: loss: 1.231, accuracy: 48.78%, lr: 1.96e-04
Epoch 19: loss: 1.226, accuracy: 48.98%, lr: 1.73e-04
Epoch 20: loss: 1.220, accuracy: 49.18%, lr: 1.50e-04
Epoch 21: loss: 1.216, accuracy: 49.33%, lr: 1.27e-04
Epoch 22: loss: 1.213, accuracy: 49.44%, lr: 1.04e-04
Epoch 23: loss: 1.209, accuracy: 49.57%, lr: 8.19e-05
Epoch 24: loss: 1.207, accuracy: 49.65%, lr: 6.18e-05
Epoch 25: loss: 1.204, accuracy: 49.73%, lr: 4.39e-05
Epoch 26: loss: 1.202, accuracy: 49.80%, lr: 2.86e-05
Epoch 27: loss: 1.201, accuracy: 49.84%, lr: 1.63e-05
Epoch 28: loss: 1.200, accuracy: 49.84%, lr: 7.34e-06
Epoch 29: loss: 1.200, accuracy: 49.86%, lr: 1.85e-06
Epoch 30: loss: 1.200, accuracy: 49.87%, lr: 0.00e+00


===============================================
Starting run 1/2 with 2 layers

Config:
  num_train_samples: 100000
  num_eval_samples: 10000
  sequence_length: 128
  num_jumps: 15
  embed_dim: 512
  ffn_dim: 2048
  num_heads: 8
  reorder_and_upcast_attn: True
  scale_attn_by_inverse_layer_idx: False
  scale_attn_weights: True
  epochs: 30
  warmup_steps: 7820
  lr: 0.0003
  betas: (0.9, 0.98)
  batch_size: 128
  custom_attention: False
  layers: 2
Model has 6,445,056 parameters
Epoch 1: loss: 1.690, accuracy: 30.36%, lr: 3.00e-05
Epoch 2: loss: 1.264, accuracy: 47.85%, lr: 6.00e-05
Epoch 3: loss: 1.142, accuracy: 52.35%, lr: 9.00e-05
Epoch 4: loss: 1.060, accuracy: 55.69%, lr: 1.20e-04
Epoch 5: loss: 1.003, accuracy: 58.13%, lr: 1.50e-04
Epoch 6: loss: 0.953, accuracy: 60.05%, lr: 1.80e-04
Epoch 7: loss: 0.889, accuracy: 62.96%, lr: 2.10e-04
Epoch 8: loss: 0.860, accuracy: 64.13%, lr: 2.40e-04
Epoch 9: loss: 0.831, accuracy: 65.39%, lr: 2.70e-04
Epoch 10: loss: 0.781, accuracy: 67.61%, lr: 3.00e-04
Epoch 11: loss: 0.742, accuracy: 69.22%, lr: 2.98e-04
Epoch 12: loss: 0.715, accuracy: 70.33%, lr: 2.93e-04
Epoch 13: loss: 0.693, accuracy: 71.22%, lr: 2.84e-04
Epoch 14: loss: 0.653, accuracy: 73.01%, lr: 2.71e-04
Epoch 15: loss: 0.617, accuracy: 74.62%, lr: 2.56e-04
Epoch 16: loss: 0.596, accuracy: 75.50%, lr: 2.38e-04
Epoch 17: loss: 0.579, accuracy: 76.19%, lr: 2.18e-04
Epoch 18: loss: 0.562, accuracy: 76.92%, lr: 1.96e-04
Epoch 19: loss: 0.547, accuracy: 77.50%, lr: 1.73e-04
Epoch 20: loss: 0.534, accuracy: 77.99%, lr: 1.50e-04
Epoch 21: loss: 0.525, accuracy: 78.34%, lr: 1.27e-04
Epoch 22: loss: 0.518, accuracy: 78.63%, lr: 1.04e-04
Epoch 23: loss: 0.510, accuracy: 78.94%, lr: 8.19e-05
Epoch 24: loss: 0.498, accuracy: 79.41%, lr: 6.18e-05
Epoch 25: loss: 0.489, accuracy: 79.78%, lr: 4.39e-05
Epoch 26: loss: 0.482, accuracy: 80.10%, lr: 2.86e-05
Epoch 27: loss: 0.477, accuracy: 80.37%, lr: 1.63e-05
Epoch 28: loss: 0.474, accuracy: 80.54%, lr: 7.34e-06
Epoch 29: loss: 0.472, accuracy: 80.64%, lr: 1.85e-06
Epoch 30: loss: 0.472, accuracy: 80.65%, lr: 0.00e+00


===============================================
Starting run 2/2 with 2 layers

Config:
  num_train_samples: 100000
  num_eval_samples: 10000
  sequence_length: 128
  num_jumps: 15
  embed_dim: 512
  ffn_dim: 2048
  num_heads: 8
  reorder_and_upcast_attn: True
  scale_attn_by_inverse_layer_idx: False
  scale_attn_weights: True
  epochs: 30
  warmup_steps: 7820
  lr: 0.0003
  betas: (0.9, 0.98)
  batch_size: 128
  custom_attention: False
  layers: 2
Model has 6,445,056 parameters
Epoch 1: loss: 1.703, accuracy: 29.99%, lr: 3.00e-05
Epoch 2: loss: 1.264, accuracy: 48.54%, lr: 6.00e-05
Epoch 3: loss: 1.147, accuracy: 52.13%, lr: 9.00e-05
Epoch 4: loss: 1.095, accuracy: 54.13%, lr: 1.20e-04
Epoch 5: loss: 1.029, accuracy: 56.89%, lr: 1.50e-04
Epoch 6: loss: 0.978, accuracy: 59.16%, lr: 1.80e-04
Epoch 7: loss: 0.906, accuracy: 62.57%, lr: 2.10e-04
Epoch 8: loss: 0.859, accuracy: 64.42%, lr: 2.40e-04
Epoch 9: loss: 0.828, accuracy: 65.83%, lr: 2.70e-04
Epoch 10: loss: 0.807, accuracy: 66.86%, lr: 3.00e-04
Epoch 11: loss: 0.768, accuracy: 68.38%, lr: 2.98e-04
Epoch 12: loss: 0.742, accuracy: 69.51%, lr: 2.93e-04
Epoch 13: loss: 0.720, accuracy: 70.28%, lr: 2.84e-04
Epoch 14: loss: 0.694, accuracy: 71.28%, lr: 2.71e-04
Epoch 15: loss: 0.674, accuracy: 72.10%, lr: 2.56e-04
Epoch 16: loss: 0.662, accuracy: 72.60%, lr: 2.38e-04
Epoch 17: loss: 0.653, accuracy: 72.99%, lr: 2.18e-04
Epoch 18: loss: 0.646, accuracy: 73.22%, lr: 1.96e-04
Epoch 19: loss: 0.641, accuracy: 73.39%, lr: 1.73e-04
Epoch 20: loss: 0.634, accuracy: 73.67%, lr: 1.50e-04
Epoch 21: loss: 0.629, accuracy: 73.80%, lr: 1.27e-04
Epoch 22: loss: 0.627, accuracy: 73.89%, lr: 1.04e-04
Epoch 23: loss: 0.624, accuracy: 74.00%, lr: 8.19e-05
Epoch 24: loss: 0.620, accuracy: 74.11%, lr: 6.18e-05
Epoch 25: loss: 0.618, accuracy: 74.21%, lr: 4.39e-05
Epoch 26: loss: 0.616, accuracy: 74.28%, lr: 2.86e-05
Epoch 27: loss: 0.614, accuracy: 74.36%, lr: 1.63e-05
Epoch 28: loss: 0.613, accuracy: 74.40%, lr: 7.34e-06
Epoch 29: loss: 0.613, accuracy: 74.42%, lr: 1.85e-06
Epoch 30: loss: 0.612, accuracy: 74.43%, lr: 0.00e+00


===============================================
Starting run 1/2 with 3 layers

Config:
  num_train_samples: 100000
  num_eval_samples: 10000
  sequence_length: 128
  num_jumps: 15
  embed_dim: 512
  ffn_dim: 2048
  num_heads: 8
  reorder_and_upcast_attn: True
  scale_attn_by_inverse_layer_idx: False
  scale_attn_weights: True
  epochs: 30
  warmup_steps: 7820
  lr: 0.0003
  betas: (0.9, 0.98)
  batch_size: 128
  custom_attention: False
  layers: 3
Model has 9,597,440 parameters
Epoch 1: loss: 1.701, accuracy: 29.60%, lr: 3.00e-05
Epoch 2: loss: 1.110, accuracy: 54.66%, lr: 6.00e-05
Epoch 3: loss: 0.757, accuracy: 68.64%, lr: 9.00e-05
Epoch 4: loss: 0.659, accuracy: 72.03%, lr: 1.20e-04
Epoch 5: loss: 0.605, accuracy: 73.79%, lr: 1.50e-04
Epoch 6: loss: 0.567, accuracy: 76.33%, lr: 1.80e-04
Epoch 7: loss: 0.449, accuracy: 81.63%, lr: 2.10e-04
Epoch 8: loss: 0.406, accuracy: 82.94%, lr: 2.40e-04
Epoch 9: loss: 0.399, accuracy: 83.17%, lr: 2.70e-04
Epoch 10: loss: 0.405, accuracy: 83.23%, lr: 3.00e-04
Epoch 11: loss: 0.380, accuracy: 83.84%, lr: 2.98e-04
Epoch 12: loss: 0.374, accuracy: 84.05%, lr: 2.93e-04
Epoch 13: loss: 0.374, accuracy: 84.01%, lr: 2.84e-04
Epoch 14: loss: 0.361, accuracy: 84.46%, lr: 2.71e-04
Epoch 15: loss: 0.399, accuracy: 83.67%, lr: 2.56e-04
Epoch 16: loss: 0.348, accuracy: 84.89%, lr: 2.38e-04
Epoch 17: loss: 0.345, accuracy: 84.98%, lr: 2.18e-04
Epoch 18: loss: 0.342, accuracy: 85.08%, lr: 1.96e-04
Epoch 19: loss: 0.341, accuracy: 85.09%, lr: 1.73e-04
Epoch 20: loss: 0.340, accuracy: 85.10%, lr: 1.50e-04
Epoch 21: loss: 0.340, accuracy: 85.11%, lr: 1.27e-04
Epoch 22: loss: 0.339, accuracy: 85.11%, lr: 1.04e-04
Epoch 23: loss: 0.339, accuracy: 85.10%, lr: 8.19e-05
Epoch 24: loss: 0.339, accuracy: 85.10%, lr: 6.18e-05
Epoch 25: loss: 0.339, accuracy: 85.11%, lr: 4.39e-05
Epoch 26: loss: 0.339, accuracy: 85.11%, lr: 2.86e-05
Epoch 27: loss: 0.338, accuracy: 85.10%, lr: 1.63e-05
Epoch 28: loss: 0.338, accuracy: 85.12%, lr: 7.34e-06
Epoch 29: loss: 0.338, accuracy: 85.11%, lr: 1.85e-06
Epoch 30: loss: 0.338, accuracy: 85.11%, lr: 0.00e+00


===============================================
Starting run 2/2 with 3 layers

Config:
  num_train_samples: 100000
  num_eval_samples: 10000
  sequence_length: 128
  num_jumps: 15
  embed_dim: 512
  ffn_dim: 2048
  num_heads: 8
  reorder_and_upcast_attn: True
  scale_attn_by_inverse_layer_idx: False
  scale_attn_weights: True
  epochs: 30
  warmup_steps: 7820
  lr: 0.0003
  betas: (0.9, 0.98)
  batch_size: 128
  custom_attention: False
  layers: 3
Model has 9,597,440 parameters
Epoch 1: loss: 1.716, accuracy: 28.72%, lr: 3.00e-05
Epoch 2: loss: 1.193, accuracy: 50.46%, lr: 6.00e-05
Epoch 3: loss: 0.792, accuracy: 67.00%, lr: 9.00e-05
Epoch 4: loss: 0.679, accuracy: 71.63%, lr: 1.20e-04
Epoch 5: loss: 0.572, accuracy: 76.06%, lr: 1.50e-04
Epoch 6: loss: 0.485, accuracy: 79.90%, lr: 1.80e-04
Epoch 7: loss: 0.470, accuracy: 80.34%, lr: 2.10e-04
Epoch 8: loss: 0.469, accuracy: 80.37%, lr: 2.40e-04
Epoch 9: loss: 0.453, accuracy: 80.91%, lr: 2.70e-04
Epoch 10: loss: 0.448, accuracy: 81.11%, lr: 3.00e-04
Epoch 11: loss: 0.444, accuracy: 81.15%, lr: 2.98e-04
Epoch 12: loss: 0.451, accuracy: 80.88%, lr: 2.93e-04
Epoch 13: loss: 0.442, accuracy: 81.14%, lr: 2.84e-04
Epoch 14: loss: 0.442, accuracy: 81.13%, lr: 2.71e-04
Epoch 15: loss: 0.449, accuracy: 81.02%, lr: 2.56e-04
Epoch 16: loss: 0.433, accuracy: 81.47%, lr: 2.38e-04
Epoch 17: loss: 0.427, accuracy: 81.86%, lr: 2.18e-04
Epoch 18: loss: 0.413, accuracy: 82.42%, lr: 1.96e-04
Epoch 19: loss: 0.395, accuracy: 83.25%, lr: 1.73e-04
Epoch 20: loss: 0.375, accuracy: 84.15%, lr: 1.50e-04
Epoch 21: loss: 0.337, accuracy: 85.87%, lr: 1.27e-04
Epoch 22: loss: 0.308, accuracy: 87.43%, lr: 1.04e-04
Epoch 23: loss: 0.280, accuracy: 88.74%, lr: 8.19e-05
Epoch 24: loss: 0.243, accuracy: 90.24%, lr: 6.18e-05
Epoch 25: loss: 0.217, accuracy: 91.64%, lr: 4.39e-05
Epoch 26: loss: 0.206, accuracy: 92.06%, lr: 2.86e-05
Epoch 27: loss: 0.199, accuracy: 92.32%, lr: 1.63e-05
Epoch 28: loss: 0.195, accuracy: 92.46%, lr: 7.34e-06
Epoch 29: loss: 0.193, accuracy: 92.53%, lr: 1.85e-06
Epoch 30: loss: 0.192, accuracy: 92.56%, lr: 0.00e+00


===============================================
Starting run 1/2 with 4 layers

Config:
  num_train_samples: 100000
  num_eval_samples: 10000
  sequence_length: 128
  num_jumps: 15
  embed_dim: 512
  ffn_dim: 2048
  num_heads: 8
  reorder_and_upcast_attn: True
  scale_attn_by_inverse_layer_idx: False
  scale_attn_weights: True
  epochs: 30
  warmup_steps: 7820
  lr: 0.0003
  betas: (0.9, 0.98)
  batch_size: 128
  custom_attention: False
  layers: 4
Model has 12,749,824 parameters
Epoch 1: loss: 1.725, accuracy: 28.37%, lr: 3.00e-05
Epoch 2: loss: 1.095, accuracy: 55.49%, lr: 6.00e-05
Epoch 3: loss: 0.489, accuracy: 80.84%, lr: 9.00e-05
Epoch 4: loss: 0.417, accuracy: 82.80%, lr: 1.20e-04
Epoch 5: loss: 0.203, accuracy: 91.34%, lr: 1.50e-04
Epoch 6: loss: 0.200, accuracy: 91.24%, lr: 1.80e-04
Epoch 7: loss: 0.193, accuracy: 91.32%, lr: 2.10e-04
Epoch 8: loss: 0.256, accuracy: 90.47%, lr: 2.40e-04
Epoch 9: loss: 0.114, accuracy: 95.31%, lr: 2.70e-04
Epoch 10: loss: 0.021, accuracy: 99.71%, lr: 3.00e-04
Epoch 11: loss: 0.037, accuracy: 98.76%, lr: 2.98e-04
Epoch 12: loss: 0.003, accuracy: 99.97%, lr: 2.93e-04
Epoch 13: loss: 0.133, accuracy: 95.14%, lr: 2.84e-04
Epoch 14: loss: 0.151, accuracy: 93.89%, lr: 2.71e-04
Epoch 15: loss: 0.179, accuracy: 92.47%, lr: 2.56e-04
Epoch 16: loss: 0.202, accuracy: 91.58%, lr: 2.38e-04
Epoch 17: loss: 0.186, accuracy: 92.17%, lr: 2.18e-04
Epoch 18: loss: 0.176, accuracy: 92.42%, lr: 1.96e-04
Epoch 19: loss: 0.174, accuracy: 92.44%, lr: 1.73e-04
Epoch 20: loss: 0.173, accuracy: 92.47%, lr: 1.50e-04
Epoch 21: loss: 0.171, accuracy: 92.49%, lr: 1.27e-04
Epoch 22: loss: 0.168, accuracy: 92.56%, lr: 1.04e-04
Epoch 23: loss: 0.160, accuracy: 92.75%, lr: 8.19e-05
Epoch 24: loss: 0.156, accuracy: 92.91%, lr: 6.18e-05
Epoch 25: loss: 0.150, accuracy: 93.24%, lr: 4.39e-05
Epoch 26: loss: 0.142, accuracy: 93.48%, lr: 2.86e-05
Epoch 27: loss: 0.133, accuracy: 93.77%, lr: 1.63e-05
Epoch 28: loss: 0.128, accuracy: 93.82%, lr: 7.34e-06
Epoch 29: loss: 0.127, accuracy: 93.82%, lr: 1.85e-06
Epoch 30: loss: 0.127, accuracy: 93.82%, lr: 0.00e+00


===============================================
Starting run 2/2 with 4 layers

Config:
  num_train_samples: 100000
  num_eval_samples: 10000
  sequence_length: 128
  num_jumps: 15
  embed_dim: 512
  ffn_dim: 2048
  num_heads: 8
  reorder_and_upcast_attn: True
  scale_attn_by_inverse_layer_idx: False
  scale_attn_weights: True
  epochs: 30
  warmup_steps: 7820
  lr: 0.0003
  betas: (0.9, 0.98)
  batch_size: 128
  custom_attention: False
  layers: 4
Model has 12,749,824 parameters
Epoch 1: loss: 1.719, accuracy: 28.28%, lr: 3.00e-05
Epoch 2: loss: 1.066, accuracy: 56.42%, lr: 6.00e-05
Epoch 3: loss: 0.415, accuracy: 83.28%, lr: 9.00e-05
Epoch 4: loss: 0.314, accuracy: 87.14%, lr: 1.20e-04
Epoch 5: loss: 0.287, accuracy: 88.10%, lr: 1.50e-04
Epoch 6: loss: 0.298, accuracy: 87.98%, lr: 1.80e-04
Epoch 7: loss: 0.294, accuracy: 87.96%, lr: 2.10e-04
Epoch 8: loss: 0.288, accuracy: 88.05%, lr: 2.40e-04
Epoch 9: loss: 0.286, accuracy: 88.09%, lr: 2.70e-04
Epoch 10: loss: 0.294, accuracy: 87.98%, lr: 3.00e-04
Epoch 11: loss: 0.286, accuracy: 88.08%, lr: 2.98e-04
Epoch 12: loss: 0.207, accuracy: 90.11%, lr: 2.93e-04
Epoch 13: loss: 0.008, accuracy: 99.99%, lr: 2.84e-04
Epoch 14: loss: 0.001, accuracy: 100.00%, lr: 2.71e-04
Epoch 15: loss: 0.001, accuracy: 100.00%, lr: 2.56e-04
Epoch 16: loss: 0.018, accuracy: 99.50%, lr: 2.38e-04
Epoch 17: loss: 0.000, accuracy: 100.00%, lr: 2.18e-04
Epoch 18: loss: 0.001, accuracy: 99.97%, lr: 1.96e-04
Epoch 19: loss: 0.000, accuracy: 100.00%, lr: 1.73e-04
Epoch 20: loss: 0.000, accuracy: 100.00%, lr: 1.50e-04
Epoch 21: loss: 0.000, accuracy: 100.00%, lr: 1.27e-04
Epoch 22: loss: 0.000, accuracy: 100.00%, lr: 1.04e-04
Epoch 23: loss: 0.000, accuracy: 100.00%, lr: 8.19e-05
Epoch 24: loss: 0.000, accuracy: 100.00%, lr: 6.18e-05
Epoch 25: loss: 0.000, accuracy: 100.00%, lr: 4.39e-05
Epoch 26: loss: 0.000, accuracy: 100.00%, lr: 2.86e-05
Epoch 27: loss: 0.000, accuracy: 100.00%, lr: 1.63e-05
Epoch 28: loss: 0.000, accuracy: 100.00%, lr: 7.34e-06
Epoch 29: loss: 0.000, accuracy: 100.00%, lr: 1.85e-06
Epoch 30: loss: 0.000, accuracy: 100.00%, lr: 0.00e+00


===============================================
Starting run 1/2 with 5 layers

Config:
  num_train_samples: 100000
  num_eval_samples: 10000
  sequence_length: 128
  num_jumps: 15
  embed_dim: 512
  ffn_dim: 2048
  num_heads: 8
  reorder_and_upcast_attn: True
  scale_attn_by_inverse_layer_idx: False
  scale_attn_weights: True
  epochs: 30
  warmup_steps: 7820
  lr: 0.0003
  betas: (0.9, 0.98)
  batch_size: 128
  custom_attention: False
  layers: 5
Model has 15,902,208 parameters
Epoch 1: loss: 1.653, accuracy: 31.26%, lr: 3.00e-05
Epoch 2: loss: 0.880, accuracy: 65.71%, lr: 6.00e-05
Epoch 3: loss: 0.637, accuracy: 73.64%, lr: 9.00e-05
Epoch 4: loss: 0.100, accuracy: 96.12%, lr: 1.20e-04
Epoch 5: loss: 0.003, accuracy: 100.00%, lr: 1.50e-04
Epoch 6: loss: 0.042, accuracy: 99.47%, lr: 1.80e-04
Epoch 7: loss: 0.008, accuracy: 99.91%, lr: 2.10e-04
Epoch 8: loss: 0.006, accuracy: 99.95%, lr: 2.40e-04
Epoch 9: loss: 0.068, accuracy: 98.42%, lr: 2.70e-04
Epoch 10: loss: 0.025, accuracy: 99.76%, lr: 3.00e-04
Epoch 11: loss: 0.004, accuracy: 99.96%, lr: 2.98e-04
Epoch 12: loss: 0.015, accuracy: 99.77%, lr: 2.93e-04
Epoch 13: loss: 0.000, accuracy: 100.00%, lr: 2.84e-04
Epoch 14: loss: 0.002, accuracy: 99.98%, lr: 2.71e-04
Epoch 15: loss: 0.000, accuracy: 100.00%, lr: 2.56e-04
Epoch 16: loss: 0.000, accuracy: 100.00%, lr: 2.38e-04
Epoch 17: loss: 0.000, accuracy: 100.00%, lr: 2.18e-04
Epoch 18: loss: 0.000, accuracy: 100.00%, lr: 1.96e-04
Epoch 19: loss: 0.000, accuracy: 100.00%, lr: 1.73e-04
Epoch 20: loss: 0.001, accuracy: 99.98%, lr: 1.50e-04
Epoch 21: loss: 0.000, accuracy: 100.00%, lr: 1.27e-04
Epoch 22: loss: 0.000, accuracy: 100.00%, lr: 1.04e-04
Epoch 23: loss: 0.000, accuracy: 100.00%, lr: 8.19e-05
Epoch 24: loss: 0.000, accuracy: 100.00%, lr: 6.18e-05
Epoch 25: loss: 0.000, accuracy: 100.00%, lr: 4.39e-05
Epoch 26: loss: 0.000, accuracy: 100.00%, lr: 2.86e-05
Epoch 27: loss: 0.000, accuracy: 100.00%, lr: 1.63e-05
Epoch 28: loss: 0.000, accuracy: 100.00%, lr: 7.34e-06
Epoch 29: loss: 0.000, accuracy: 100.00%, lr: 1.85e-06
Epoch 30: loss: 0.000, accuracy: 100.00%, lr: 0.00e+00


===============================================
Starting run 2/2 with 5 layers

Config:
  num_train_samples: 100000
  num_eval_samples: 10000
  sequence_length: 128
  num_jumps: 15
  embed_dim: 512
  ffn_dim: 2048
  num_heads: 8
  reorder_and_upcast_attn: True
  scale_attn_by_inverse_layer_idx: False
  scale_attn_weights: True
  epochs: 30
  warmup_steps: 7820
  lr: 0.0003
  betas: (0.9, 0.98)
  batch_size: 128
  custom_attention: False
  layers: 5
Model has 15,902,208 parameters
Epoch 1: loss: 1.682, accuracy: 30.39%, lr: 3.00e-05
Epoch 2: loss: 0.903, accuracy: 65.03%, lr: 6.00e-05
Epoch 3: loss: 0.728, accuracy: 69.92%, lr: 9.00e-05
Epoch 4: loss: 0.109, accuracy: 95.94%, lr: 1.20e-04
Epoch 5: loss: 0.098, accuracy: 96.11%, lr: 1.50e-04
Epoch 6: loss: 0.100, accuracy: 96.00%, lr: 1.80e-04
Epoch 7: loss: 0.100, accuracy: 95.97%, lr: 2.10e-04
Epoch 8: loss: 0.051, accuracy: 98.72%, lr: 2.40e-04
Epoch 9: loss: 0.005, accuracy: 99.91%, lr: 2.70e-04
Epoch 10: loss: 0.002, accuracy: 99.98%, lr: 3.00e-04
Epoch 11: loss: 0.010, accuracy: 99.86%, lr: 2.98e-04
Epoch 12: loss: 0.005, accuracy: 99.93%, lr: 2.93e-04
Epoch 13: loss: 0.004, accuracy: 99.94%, lr: 2.84e-04
Epoch 14: loss: 0.004, accuracy: 99.96%, lr: 2.71e-04
Epoch 15: loss: 0.000, accuracy: 99.99%, lr: 2.56e-04
Epoch 16: loss: 0.000, accuracy: 100.00%, lr: 2.38e-04
Epoch 17: loss: 0.001, accuracy: 99.98%, lr: 2.18e-04
Epoch 18: loss: 0.000, accuracy: 99.99%, lr: 1.96e-04
Epoch 19: loss: 0.000, accuracy: 99.99%, lr: 1.73e-04
Epoch 20: loss: 0.000, accuracy: 100.00%, lr: 1.50e-04
Epoch 21: loss: 0.000, accuracy: 100.00%, lr: 1.27e-04
Epoch 22: loss: 0.000, accuracy: 100.00%, lr: 1.04e-04
Epoch 23: loss: 0.000, accuracy: 100.00%, lr: 8.19e-05
Epoch 24: loss: 0.000, accuracy: 100.00%, lr: 6.18e-05
Epoch 25: loss: 0.000, accuracy: 100.00%, lr: 4.39e-05
Epoch 26: loss: 0.000, accuracy: 100.00%, lr: 2.86e-05
Epoch 27: loss: 0.000, accuracy: 100.00%, lr: 1.63e-05
Epoch 28: loss: 0.000, accuracy: 100.00%, lr: 7.34e-06
Epoch 29: loss: 0.000, accuracy: 100.00%, lr: 1.85e-06
Epoch 30: loss: 0.000, accuracy: 100.00%, lr: 0.00e+00


===============================================
Starting run 1/2 with 1 layers

Config:
  num_train_samples: 100000
  num_eval_samples: 10000
  sequence_length: 128
  num_jumps: 15
  embed_dim: 512
  ffn_dim: 2048
  num_heads: 8
  reorder_and_upcast_attn: True
  scale_attn_by_inverse_layer_idx: False
  scale_attn_weights: True
  epochs: 30
  warmup_steps: 7820
  lr: 0.0003
  betas: (0.9, 0.98)
  batch_size: 128
  custom_attention: True
  layers: 1
Model has 3,292,672 parameters
Epoch 1: loss: 1.793, accuracy: 25.30%, lr: 3.00e-05
Epoch 2: loss: 1.173, accuracy: 52.43%, lr: 6.00e-05
Epoch 3: loss: 0.008, accuracy: 100.00%, lr: 9.00e-05
Epoch 4: loss: 0.000, accuracy: 100.00%, lr: 1.20e-04
Epoch 5: loss: 0.000, accuracy: 100.00%, lr: 1.50e-04
Epoch 6: loss: 0.000, accuracy: 100.00%, lr: 1.80e-04
Epoch 7: loss: 0.000, accuracy: 100.00%, lr: 2.10e-04
Epoch 8: loss: 0.000, accuracy: 100.00%, lr: 2.40e-04
Epoch 9: loss: 0.000, accuracy: 100.00%, lr: 2.70e-04
Epoch 10: loss: 0.000, accuracy: 100.00%, lr: 3.00e-04
Epoch 11: loss: 0.000, accuracy: 100.00%, lr: 2.98e-04
Epoch 12: loss: 0.000, accuracy: 100.00%, lr: 2.93e-04
Epoch 13: loss: 0.000, accuracy: 100.00%, lr: 2.84e-04
Epoch 14: loss: 0.000, accuracy: 100.00%, lr: 2.71e-04
Epoch 15: loss: 0.000, accuracy: 100.00%, lr: 2.56e-04
Epoch 16: loss: 0.000, accuracy: 100.00%, lr: 2.38e-04
Epoch 17: loss: 0.000, accuracy: 100.00%, lr: 2.18e-04
Epoch 18: loss: 0.000, accuracy: 100.00%, lr: 1.96e-04
Epoch 19: loss: 0.000, accuracy: 100.00%, lr: 1.73e-04
Epoch 20: loss: 0.000, accuracy: 100.00%, lr: 1.50e-04
Epoch 21: loss: 0.000, accuracy: 100.00%, lr: 1.27e-04
Epoch 22: loss: 0.000, accuracy: 100.00%, lr: 1.04e-04
Epoch 23: loss: 0.000, accuracy: 100.00%, lr: 8.19e-05
Epoch 24: loss: 0.000, accuracy: 100.00%, lr: 6.18e-05
Epoch 25: loss: 0.000, accuracy: 100.00%, lr: 4.39e-05
Epoch 26: loss: 0.000, accuracy: 100.00%, lr: 2.86e-05
Epoch 27: loss: 0.000, accuracy: 100.00%, lr: 1.63e-05
Epoch 28: loss: 0.000, accuracy: 100.00%, lr: 7.34e-06
Epoch 29: loss: 0.000, accuracy: 100.00%, lr: 1.85e-06
Epoch 30: loss: 0.000, accuracy: 100.00%, lr: 0.00e+00


===============================================
Starting run 2/2 with 1 layers

Config:
  num_train_samples: 100000
  num_eval_samples: 10000
  sequence_length: 128
  num_jumps: 15
  embed_dim: 512
  ffn_dim: 2048
  num_heads: 8
  reorder_and_upcast_attn: True
  scale_attn_by_inverse_layer_idx: False
  scale_attn_weights: True
  epochs: 30
  warmup_steps: 7820
  lr: 0.0003
  betas: (0.9, 0.98)
  batch_size: 128
  custom_attention: True
  layers: 1
Model has 3,292,672 parameters
Epoch 1: loss: 1.785, accuracy: 25.75%, lr: 3.00e-05
Epoch 2: loss: 1.122, accuracy: 54.68%, lr: 6.00e-05
Epoch 3: loss: 0.006, accuracy: 100.00%, lr: 9.00e-05
Epoch 4: loss: 0.000, accuracy: 100.00%, lr: 1.20e-04
Epoch 5: loss: 0.000, accuracy: 100.00%, lr: 1.50e-04
Epoch 6: loss: 0.000, accuracy: 100.00%, lr: 1.80e-04
Epoch 7: loss: 0.000, accuracy: 100.00%, lr: 2.10e-04
Epoch 8: loss: 0.000, accuracy: 100.00%, lr: 2.40e-04
Epoch 9: loss: 0.000, accuracy: 100.00%, lr: 2.70e-04
Epoch 10: loss: 0.000, accuracy: 100.00%, lr: 3.00e-04
Epoch 11: loss: 0.000, accuracy: 100.00%, lr: 2.98e-04
Epoch 12: loss: 0.000, accuracy: 100.00%, lr: 2.93e-04
Epoch 13: loss: 0.000, accuracy: 100.00%, lr: 2.84e-04
Epoch 14: loss: 0.000, accuracy: 100.00%, lr: 2.71e-04
Epoch 15: loss: 0.000, accuracy: 100.00%, lr: 2.56e-04
Epoch 16: loss: 0.000, accuracy: 100.00%, lr: 2.38e-04
Epoch 17: loss: 0.000, accuracy: 100.00%, lr: 2.18e-04
Epoch 18: loss: 0.000, accuracy: 100.00%, lr: 1.96e-04
Epoch 19: loss: 0.000, accuracy: 100.00%, lr: 1.73e-04
Epoch 20: loss: 0.000, accuracy: 100.00%, lr: 1.50e-04
Epoch 21: loss: 0.000, accuracy: 100.00%, lr: 1.27e-04
Epoch 22: loss: 0.000, accuracy: 100.00%, lr: 1.04e-04
Epoch 23: loss: 0.000, accuracy: 100.00%, lr: 8.19e-05
Epoch 24: loss: 0.000, accuracy: 100.00%, lr: 6.18e-05
Epoch 25: loss: 0.000, accuracy: 100.00%, lr: 4.39e-05
Epoch 26: loss: 0.000, accuracy: 100.00%, lr: 2.86e-05
Epoch 27: loss: 0.000, accuracy: 100.00%, lr: 1.63e-05
Epoch 28: loss: 0.000, accuracy: 100.00%, lr: 7.34e-06
Epoch 29: loss: 0.000, accuracy: 100.00%, lr: 1.85e-06
Epoch 30: loss: 0.000, accuracy: 100.00%, lr: 0.00e+00
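
Note on the lr column: the logged values in every run above are consistent with a linear warmup over warmup_steps (7820) followed by cosine decay to zero at the end of epoch 30. The training script is not part of this log, so the following is only a minimal sketch that reproduces the logged numbers. It assumes 782 optimizer steps per epoch (ceil(100000 / 128), i.e. the last partial batch is kept) and that the logged lr is the value after the final step of each epoch. The names lr_at_step, lr_after_epoch, and STEPS_PER_EPOCH are hypothetical.

```python
import math

# Values taken from the Config blocks in the log.
BASE_LR = 3e-4
WARMUP_STEPS = 7820
EPOCHS = 30

# Assumption: 100000 samples / batch size 128, with the partial last batch kept.
STEPS_PER_EPOCH = math.ceil(100000 / 128)  # 782
TOTAL_STEPS = EPOCHS * STEPS_PER_EPOCH     # 23460


def lr_at_step(step: int) -> float:
    """Linear warmup to BASE_LR, then cosine decay to zero (inferred schedule)."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


def lr_after_epoch(epoch: int) -> float:
    """Learning rate after the last optimizer step of a 1-indexed epoch."""
    return lr_at_step(epoch * STEPS_PER_EPOCH)


if __name__ == "__main__":
    for epoch in (1, 10, 20, 29, 30):
        print(f"Epoch {epoch}: lr: {lr_after_epoch(epoch):.2e}")
```

Under these assumptions the warmup covers exactly the first 10 epochs (7820 / 782), which matches the peak of 3.00e-04 logged at epoch 10.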
